Lead Software Engineer, Cloud Infrastructure

2 weeks ago


Santa Clara, California, United States NVIDIA Full time

NVIDIA is seeking talented engineers to enhance its AI Infrastructure. We are looking for individuals with a robust programming foundation, profound knowledge of distributed systems, and a strong grasp of software testing and deployment methodologies. Excellent communication and organizational skills are essential. We value innovative thinkers who can contribute fresh ideas while demonstrating a strong execution focus. You will be continuously challenged, fostering personal and professional growth. Collaborating with fellow engineers, you will play a pivotal role in advancing NVIDIA's capabilities to develop and implement leading infrastructure solutions for a wide array of AI-driven applications that impact core data science.

Your Responsibilities:

  • Design and architect a comprehensive platform that automates GPU asset provisioning, configuration, and lifecycle management across various cloud providers.
  • Implement monitoring and health management features that ensure industry-leading reliability, availability, and scalability of GPU assets. By leveraging multiple data streams, including GPU hardware diagnostics and cluster telemetry, you will be able to predict system failures and optimize workload success rates.
  • Collaborate with engineering teams across NVIDIA to ensure seamless integration of your software from hardware to AI training applications.

What We Expect:

  • A highly motivated individual with strong communication skills, capable of working effectively with cross-functional teams, principals, and architects while coordinating across organizational boundaries.
  • A minimum of 5 years of software engineering experience with large-scale production systems.
  • A Bachelor’s degree in Computer Science, Engineering, Physics, Mathematics, or a related field, or equivalent experience.
  • Expert-level proficiency in a systems programming language (such as Go or Python) and a solid understanding of data structures and algorithms.
  • Knowledge of performance, security, and reliability in complex distributed systems, including familiarity with system-level architecture, data synchronization, fault tolerance, and state management.

Ways to Distinguish Yourself:

  • Proficiency in architecting and managing large-scale distributed systems, regardless of cloud provider.
  • Advanced hands-on experience and in-depth understanding of cluster management systems (e.g., Kubernetes, Slurm, Bright Cluster Manager).
  • Proven track record of operational excellence in designing and maintaining AI infrastructure.

NVIDIA is recognized as one of the most desirable employers in the technology sector. We pride ourselves on having some of the most innovative and dedicated individuals in the industry. If you are creative and self-driven, we encourage you to explore this opportunity.

The compensation package includes a competitive salary, equity, and comprehensive benefits. NVIDIA is committed to fostering a diverse workplace and is proud to be an equal opportunity employer. We value diversity in our workforce and do not discriminate based on race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.


