Staff Software Engineer
2 weeks ago
GEICO

## Staff Software Engineer - AI/ML Infra

locations: Chevy Chase, MD; New York City, NY; Palo Alto, CA
time type: Full time
posted on: Posted Yesterday
job requisition id: R0060146

**At GEICO, we offer a rewarding career where your ambitions are met with endless possibilities.**

**Every day we honor our iconic brand by offering quality coverage to millions of customers and being there when they need us most. We thrive through relentless innovation to exceed our customers’ expectations while making a real impact for our company through our shared purpose.**

**When you join our company, we want you to feel valued, supported and proud to work here. That’s why we offer The GEICO Pledge: Great Company, Great Culture, Great Rewards and Great Careers.**

The GEICO AI Platform and Infrastructure team is seeking an exceptional Senior ML Platform Engineer to build and scale our machine learning infrastructure with a focus on Large Language Models (LLMs) and AI applications. This role combines deep technical expertise in cloud platforms, container orchestration, and ML operations with strong leadership and mentoring capabilities. You will be responsible for designing, implementing, and maintaining scalable, reliable systems that enable our data science and engineering teams to deploy and operate LLMs efficiently at scale. The candidate must have excellent verbal and written communication skills with a proven ability to work independently and in a team environment.

KEY RESPONSIBILITIES

ML Platform & Infrastructure
* Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
* Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
* Design, implement, and maintain feature stores for ML model training and inference pipelines
* Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
* Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
* Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
* Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
* Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
* Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
* Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or for specialized use cases

DevOps & Platform Engineering
* Design and maintain robust CI/CD pipelines for ML model deployment using Azure DevOps, GitHub Actions, and MLOps tools
* Implement automated model training, validation, deployment, and monitoring workflows
* Set up comprehensive observability using Prometheus, Grafana, Azure Monitor, and custom dashboards
* Continuously optimize platform performance, reducing latency and improving throughput for ML workloads
* Design and implement backup, recovery, and business continuity plans for ML platforms

Technical Leadership & Mentoring
* Mentor junior engineers and data scientists on platform best practices, infrastructure design, and ML operations
* Lead comprehensive code reviews focusing on scalability, reliability, security, and maintainability
* Design and deliver technical onboarding programs for new team members joining the ML platform team
* Establish and champion engineering standards for ML infrastructure, deployment practices, and operational procedures
* Create technical documentation, runbooks, and deliver internal training sessions on platform capabilities

Cross-Functional Collaboration
* Work closely with data scientists to understand requirements and optimize workflows for model development and deployment
* Collaborate with product engineering teams to integrate ML capabilities into customer-facing applications
* Support research teams with infrastructure for experimenting with cutting-edge LLM techniques and architectures
* Present technical solutions and platform roadmaps to leadership and cross-functional stakeholders

REQUIRED QUALIFICATIONS

Experience & Education
* Bachelor’s degree in Computer Science, Engineering, or related technical field (or equivalent experience)
* 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
* 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
* 2+ years of experience working with Large Language Models and transformer architectures

Technical Skills - Core Requirements
* Proficient in Python; strong skills in Go, Rust, or Java preferred
* Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
* Proficient in Kubernetes including custom operators, Helm charts, and GPU scheduling
* Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
* Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
* Hands-on experience with inference optimization using vLLM, TensorRT-LLM, Triton Inference Server, or similar

DevOps & Platform Skills
* Advanced experience with Azure DevOps, GitHub Actions, Jenkins, or similar CI/CD platforms
* Proficiency with Terraform, ARM templates, Pulumi, or CloudFormation
* Deep understanding of Docker, container optimization, and multi-stage builds
* Experience with Prometheus, Grafana, ELK stack, Azure Monitor, and distributed tracing
* Knowledge of both SQL and NoSQL databases, data warehousing, and vector databases

Leadership & Soft Skills
* Demonstrated track record of mentoring engineers and leading technical initiatives
* Experience leading design reviews with focus on compliance, performance, and reliability
* Excellent ability to explain complex technical concepts to diverse audiences
* Strong analytical and troubleshooting skills for complex distributed systems
* Experience managing cross-functional technical projects and coordinating with multiple stakeholders

PREFERRED QUALIFICATIONS

Advanced Experience
* Master’s degree in Computer Science, Machine Learning, or related field
* 8+ years of platform engineering or infrastructure experience
* Experience with Staff Engineer or Tech Lead roles in ML/AI organizations
* Background in distributed systems and high-performance computing
* Open-source contributions to ML infrastructure projects or LLM frameworks

Specialized Skills
* Multi-Cloud Experience: Hands-on experience with Azure, AWS (SageMaker, EKS) and/or GCP (Vertex AI, GKE)
* Experience with specialized hardware (A100s, H100s, TPUs, TEEs) and optimization
* RLHF & Fine-tuning: Experience with Reinforcement Learning from Human Feedback and LLM fine-tuning workflows
* Experience with Milvus, Pinecone, Weaviate, Qdrant, or similar vector storage solutions
* Deep experience with MLflow, Kubeflow, DataRobot, or similar platforms

Industry Knowledge
* Understanding of AI safety principles, model governance, and regulatory compliance
* Background in regulated industries with understanding of data privacy requirements
* Experience supporting ML research teams and academic partnerships
* Deep understanding of GPU optimization, memory management, and high-throughput systems

Hybrid (2 days a week)

**Annual Salary**
$115,000.00 - $300,000.00

The above annual salary range is a general guideline. Multiple factors are taken into consideration to arrive at the final hourly rate/annual salary to be offered to the selected candidate. Factors include, but are not limited to, the scope and responsibilities of the role, the selected candidate’s work experience, education and training, the work location as
-
Senior Staff Engineer, Software Engineering
2 weeks ago
Palo Alto, United States ExecutivePlacements.com Full time
Senior Staff Engineer, Software Engineering (SRE Availability, Incident & Change Management)
Our Senior Staff Engineer works with our Staff and Sr. Engineers to innovate and build new systems, improve and enhance existing systems, and identify new opportunities to apply your knowledge to solve critical problems. You will lead the strategy and execution of a...
-
Staff Software Engineer
2 weeks ago
Palo Alto, California, United States Navan Full time $146,250 - $255,000
The Staff Full-stack Software Engineer in Security will be responsible for securing Navan products by identifying unaddressed areas of weakness and driving cleverly engineered, scalable solutions that improve our defense-in-depth. You will be responsible for design and development of core services related to authentication, authorization, encryption within...
-
Staff Software Engineer
2 weeks ago
Palo Alto, California, United States Navan Full time $146,250 - $255,000 per year
The Staff Full-stack Software Engineer in Security will be responsible for securing Navan products by identifying unaddressed areas of weakness and driving cleverly engineered, scalable solutions that improve our defense-in-depth. You will be responsible for design and development of core services related to authentication, authorization, encryption within...
-
Staff Software Engineer
1 week ago
Palo Alto, United States Broadcom Inc. Full time
Please Note:
1. If you are a first time user, please create your candidate login account before you apply for a job. (Click Sign In > Create Account)
2. If you already have a Candidate Account, please Sign-In before you apply.
Job Description:
The Tanzu division in Broadcom is seeking a Staff Software Engineer to join the Bosh Ecosystem team. The...
-
Staff Software Engineer
3 weeks ago
Palo Alto, United States Ford Full time
Ford Model E Platform Architecture Engineering is looking for a Staff embedded software engineer. In this role you will be responsible for the development of automotive software solutions and embedded software modules for electric vehicles developed by Ford. You will be expected to pivot quickly on new ideas. Failing fast will be normal. What you’ll...
-
Staff Fullstack Software Engineer
2 weeks ago
Palo Alto, United States Navan Full time
The Staff Fullstack Software Engineer in Security will be responsible for securing Navan products by identifying unaddressed areas of weakness and driving cleverly engineered, scalable solutions that improve our defense-in-depth. You will be responsible for design and development of core services related to authentication, authorization, encryption within...
-
Staff Software Engineer
3 weeks ago
Palo Alto, United States GEICO Full time
Base pay range: $100,000.00/yr - $230,000.00/yr
At GEICO, we offer a rewarding career where your ambitions are met with endless possibilities. Every day we honor our iconic brand by offering quality coverage to millions of customers and being there when they need us most. We thrive through relentless innovation to exceed our customers’ expectations while...
-
Staff Software Engineer
2 weeks ago
Palo Alto, California, United States Hippocratic AI Full time $150,000 - $250,000 per year
About Us
Hippocratic AI is developing the first safety-focused Large Language Model (LLM) for healthcare. Our mission is to dramatically improve healthcare accessibility and outcomes by bringing deep healthcare expertise to every person. No other technology has the potential for this level of global impact on health.
Why Join Our Team
Innovative mission: We...
-
Staff Software Engineer
4 days ago
Palo Alto, CA, United States Globality Inc Full time
Globality was founded with a simple yet ambitious goal: to use AI to transform enterprise spending into a smarter, fairer process, creating more efficient and inclusive markets worldwide. Nearly a decade later, our AI-driven solution is reshaping how enterprises spend, turning procurement into a guided, insight-led process that's easier for everyone, open to...
-
Staff Software Engineer
2 days ago
Palo Alto, CA, United States Globality Inc Full time
Globality was founded with a simple yet ambitious goal: to use AI to transform enterprise spending into a smarter, fairer process, creating more efficient and inclusive markets worldwide. Nearly a decade later, our AI-driven solution is reshaping how enterprises spend, turning procurement into a guided, insight-led process that's easier for everyone, open to...