Staff Machine Learning Engineer | Large Scale AI Infrastructure
4 days ago
This position will sit within a company that is pioneering a new era of biomedicine.
Role Overview:
- GPU Cluster Management: Architect, deploy, and sustain high-performance GPU clusters, ensuring they are stable, reliable, and scalable. Oversee and manage cluster resources to maximize efficiency and utilization.
- Distributed/Parallel Training: Apply distributed computing techniques to facilitate parallel training of extensive deep learning models across multiple GPUs and nodes. Optimize data distribution and synchronization for faster convergence and reduced training times.
- Performance Optimization: Enhance GPU clusters and deep learning frameworks to achieve peak performance for specific workloads. Identify and resolve performance bottlenecks through profiling and system analysis.
- Deep Learning Framework Integration: Work closely with data scientists and machine learning engineers to incorporate distributed training capabilities into the company's model development and deployment frameworks.
- Scalability and Resource Management: Ensure GPU clusters can scale effectively to meet growing computational demands. Develop strategies for resource management to prioritize and allocate computing resources based on project needs.
- Troubleshooting and Support: Diagnose and resolve issues related to GPU clusters, distributed training, and performance anomalies. Provide technical support to users and efficiently resolve technical challenges.
- Documentation: Develop and maintain documentation on GPU cluster configuration, distributed training workflows, and best practices to facilitate knowledge sharing and smooth onboarding of new team members.
Qualifications:
- Master's or Ph.D. in computer science or a related field, with a focus on High-Performance Computing, Distributed Systems, or Deep Learning.
- Over 2 years of proven experience in managing GPU clusters, including installation, configuration, and optimization.
- Strong expertise in distributed deep learning and parallel training techniques.
- Proficiency in popular deep learning frameworks such as PyTorch, Megatron-LM, and DeepSpeed.
- Programming skills in Python and experience with GPU-accelerated libraries (e.g., CUDA, cuDNN).
- Knowledge of performance profiling and optimization tools for HPC and deep learning.
- Familiarity with resource management and scheduling systems (e.g., SLURM, Kubernetes).
- Solid background in distributed systems, cloud computing (AWS, GCP), and containerization (Docker, Kubernetes).
- Currently or previously holding a Staff or equivalent title, or 3+ years in a Senior-level role.
The company will provide a relocation package for candidates open to relocating.
-
Machine Learning Infrastructure Engineer
2 weeks ago
Palo Alto, California, United States Lanai Full time The Role: We're looking for an ML and Data Science Engineer to help build the world's best enterprise AI platform that enables humans to do the extraordinary. You'll be working on exciting challenges such as LLM applications, Natural Language Understanding (NLU), domain adaptation, question answering, semantic search, and many more. Your expertise will be...
-
Machine Learning Infrastructure Specialist
2 weeks ago
Palo Alto, California, United States Tesla Full time Accelerate Innovation with Tesla's Autopilot AI Team: We are seeking a highly skilled Software Engineer - Model Scaling, Autopilot AI to join our team at Tesla. As a key member of our Autopilot AI team, you will play a crucial role in optimizing and scaling our neural network training infrastructure. You will work closely with a specialized team of...
-
Large Scale Data Processing Expert
6 days ago
Palo Alto, California, United States Luma AI Full time Data Infrastructure Specialist: We are seeking an experienced professional to fill the role of Senior Software Engineer - Data Infrastructure at Luma AI. In this position, you will be responsible for designing and building high-performance systems and pipelines for large-scale data processing. You will work closely with researchers to develop and implement...
-
Machine Learning Engineer
2 months ago
Palo Alto, United States Qualified Health Full time Job Summary: We are seeking a highly skilled and experienced Machine Learning Engineer to lead our technical team. This role is ideal for someone who has a deep understanding of deploying scalable AI systems and has a track record of innovation and excellence in a high-stakes environment, preferably with both hyperscaler and...
-
AI Infrastructure Architect
7 days ago
Palo Alto, California, United States xAI Full time Transforming AI with Scalable Infrastructure: We are seeking a highly skilled AI Infrastructure Architect to join our team at xAI. Located in the Bay Area, this role involves designing and developing cutting-edge AI infrastructure that enables our researchers to push the boundaries of what is possible. About the Role: The successful candidate will be responsible...
-
Principal Machine Learning Engineer
1 month ago
Palo Alto, United States ZipRecruiter Full time Job Summary: We are seeking a highly skilled and experienced Principal Machine Learning Engineer to lead our technical team. This role is ideal for someone who has a deep understanding of deploying scalable AI systems and has a track record of innovation and excellence in a high-stakes environment, preferably with both hyperscaler and startup experience. Key...
-
Principal Machine Learning Engineer
3 weeks ago
Palo Alto, United States Qualified Health Full time Job Summary: We are seeking a highly skilled and experienced Principal Machine Learning Engineer to lead our technical team. This role is ideal for someone who has a deep understanding of deploying scalable AI systems and has a track record of innovation and excellence in a high-stakes environment, preferably with both...
-
Principal Machine Learning Engineer
1 week ago
Palo Alto, United States ZipRecruiter Full time Job Summary: We are seeking a highly skilled and experienced Principal Machine Learning Engineer to lead our technical team. This role is ideal for someone who has a deep understanding of deploying scalable AI systems and has a track record of innovation and excellence in a high-stakes environment, preferably with both...
-
Principal Machine Learning Engineer
1 month ago
Palo Alto, United States ZipRecruiter Full time Job Summary: We are seeking a highly skilled and experienced Principal Machine Learning Engineer to lead our technical team. This role is ideal for someone who has a deep understanding of deploying scalable AI systems and has a track record of innovation and excellence in a high-stakes environment, preferably with both hyperscaler and startup experience. Key...
-
Machine Learning Infrastructure Engineer
2 weeks ago
Palo Alto, California, United States Qualified Health Full time Qualified Health is seeking an experienced MLOps Engineer to join our team and play a key role in designing, implementing, and maintaining infrastructure for deploying and managing advanced gen-AI agents and workflows powered by large language models. About the Role: This position requires collaboration with data scientists and engineers to translate...
-
Machine Learning Infrastructure Engineer
5 days ago
Palo Alto, California, United States AiDash Full time About Us: AiDash is making waves in the climate tech space, helping critical infrastructure industries transition to a more sustainable future. Our innovative approach combines satellite data and AI to provide actionable insights, empowering customers to reduce costs, improve reliability, and meet their sustainability goals. As a leading player in the industry,...
-
Palo Alto, California, United States xAI Full time About xAI: xAI is a cutting-edge artificial intelligence organization dedicated to creating sophisticated AI systems that can accurately understand the universe and contribute to humanity's pursuit of knowledge. We are a small, highly motivated team focused on engineering excellence. Our organization is ideal for individuals who appreciate challenging...
-
Machine Learning Infrastructure Engineer
5 days ago
Palo Alto, California, United States Tesla Full time Job Description: The role of a Software Engineer on Tesla's Autopilot AI team involves optimizing and scaling our neural network training infrastructure. This position requires expertise in designing, implementing, and maintaining high-performance applications for neural network training, evaluation, and data processing pipelines. Responsibilities: Data Pipeline...
-
Cloud Native Machine Learning Engineer
5 days ago
Palo Alto, California, United States Match Group Full time Job Description: We're looking for a talented Sr. Software Engineer to join our Machine Learning infrastructure team. As a key member of our Engineering team, you'll be responsible for designing and implementing scalable and robust infrastructure to support our machine learning engineers across all business units. Your work will enable teams to rapidly test...
-
Founding Engineer, Machine Learning
1 month ago
Palo Alto, United States Lanai Software Full time About Lanai: Lanai Software is at the forefront of the GenAI revolution, empowering humans to achieve the extraordinary in the age of AI. Lanai's AI platform empowers large organizations and employees to discover, protect, and accelerate AI adoption, fostering innovation and unlocking unprecedented productivity. Backed by top investors, we're creating a world...
-
Machine Learning Infrastructure Specialist
2 weeks ago
Palo Alto, California, United States Tesla Full time As a member of Tesla's cutting-edge team, you will play a pivotal role in optimizing and scaling our neural network training infrastructure. You will collaborate closely with world-class ML Researchers and Engineers to tackle unique challenges at the intersection of AI and ML training accelerators. Key Responsibilities: Work with machine learning Researchers...
-
Machine Learning Infrastructure Specialist
5 days ago
Palo Alto, California, United States Inflection AI Full time Job Description and Requirements: The Machine Learning Software Engineer role plays a crucial part in integrating ML frameworks and models into our platform for enterprise applications. This involves developing, deploying, and optimizing ML models, ensuring seamless integration with backend systems and APIs to deliver robust enterprise solutions. This position...
-
Machine Learning Engineer/Researcher
4 weeks ago
Palo Alto, United States PlayHT Full time About Us: PlayAI is at the forefront of generative voice and conversational LLMs. With our Speech Synthesis and Voice Cloning models, we are building the SOTA conversational AI products. We are building a platform and infrastructure for Conversational AI Voice Agents so that every business, developer, or tinkerer can easily build talking human-like AI agents...
-
Palo Alto, United States Glocomms Full time This position will sit within a company that is pioneering a new era of Biomedicine! Role Overview: GPU Cluster Management: Architect, deploy, and sustain high-performance GPU clusters, ensuring they are stable, reliable, and scalable. Oversee and manage cluster resources to maximize efficiency and utilization. Distributed/Parallel Training: Apply distributed...