Senior MLOps Engineer, GenAI Framework

4 days ago


Santa Clara, California, United States NVIDIA Full time

NVIDIA is seeking a senior build and continuous integration (CI/CD) engineer for its GenAI Frameworks (NeMo, Megatron Core) team.

NVIDIA NeMo is an open-source, scalable, and cloud-native framework built for researchers and developers working on Large Language Models (LLM), Multimodal (MM), and Speech AI.

NeMo provides end-to-end model training, including data curation, alignment, customization, evaluation, deployment, and tooling to optimize performance and user experience.

Building upon modern DevOps tools, your work will enable GenAI framework software engineers and deep learning algorithm engineers to work efficiently with a wide variety of deep learning algorithms and software stacks as they seek out opportunities for performance optimization and continuously deliver high-quality software.

Key Responsibilities:

  • Architect and lead the build, release, and continuous integration processes for our Generative AI framework and the libraries related to the NeMo framework and Megatron Core.
  • Propose, implement, and deploy efficient and scalable DevOps solutions that allow our fast-growing team to release software more frequently while maintaining high quality and top performance.
  • Work with industry-standard tools (Kubernetes, Docker, Slurm, Ansible, GitLab, GitHub Actions, Jenkins, Artifactory, Jira).
  • Assist with cluster operations and system administration (managing servers, team accounts, clusters).
  • Automate recurring tasks (e.g., detecting DL algorithm accuracy and performance regressions) and design and develop new quality control measures (e.g., code analysis) while employing and advancing best practices.
  • Work closely with the DL framework and library (CUDA, cuDNN, cuBLAS) teams and with other relevant teams within NVIDIA that provide software build, testing, and release-related infrastructure.

Requirements:

  • BS or MS degree in Computer Science, Computer Architecture, or a related technical field, or equivalent experience.
  • 5+ years of industry experience in infrastructure engineering or DevOps.
  • Strong system-level programming skills in languages like Python, as well as shell scripting.
  • Strong understanding of build/release systems and CI/CD, and experience with solutions like GitLab, GitHub, Jenkins, etc.
  • Experience with Linux system administration.
  • Proficient with containerization and cluster management technologies like Docker and Kubernetes.
  • Experience with build tools, including Make and CMake.
  • Experience using or deploying source code management (SCM) solutions such as GitLab, GitHub, Perforce, etc.
  • Excellent problem-solving and debugging skills.
  • Great teammate who can collaborate and influence in a dynamic environment with excellent interpersonal and written communication skills.

Preferred Qualifications:

  • Previous experience with GPU-accelerated systems.
  • Hands-on experience with DL frameworks (PyTorch, JAX, TensorFlow).
  • Experience with cluster/cloud technologies (Slurm, Lustre, Kubernetes).
  • Experience with HPC hardware systems such as compute clusters and HPC software performance benchmarking on such systems.

Compensation:

The base salary range is 180,000 USD - 339,250 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits.

NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer.


