Platform and Infrastructure Engineer

3 weeks ago


Santa Clara, California, United States NVIDIA Full time

NVIDIA is a leader in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization.

The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services.

Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars.

We're looking for highly motivated engineers to help us accelerate the next wave of artificial intelligence.

As a key member of our team, you will develop and maintain software facilitating GPU communication, driving groundbreaking solutions in High Performance Computing and Deep Learning.

You will implement modern DevOps tools to automate software updates, perform maintenance tasks, and monitor cluster availability, ensuring seamless operations.

Key responsibilities include:

  • Develop automated tools to efficiently deploy, provision, and maintain extensive GPU clusters interconnected via NVLink and InfiniBand
  • Implement modern DevOps tools to automate software updates, perform maintenance tasks, and monitor cluster availability
  • Take ownership of daily cluster failures and issues, troubleshooting them promptly to maintain optimal cluster availability and performance
  • Manage the rollout and rollback of cluster software and firmware updates, ensuring smooth transitions and minimal disruptions

Requirements include:

  • BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or a related field, or equivalent experience
  • 5+ years of hands-on experience in deploying and administering clusters, servers, switches, and related infrastructure
  • Automation expert with hands-on skills in Ansible, Python, and Shell Scripting
  • Deep understanding of operating systems, computer networks, and high-performance applications
  • Proven ability to work effectively with developers and test engineers across different teams and time zones
  • Proficient with Linux fundamentals

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer.

We highly value diversity in our current and future employees and do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.

The base salary range is 148,000 USD - 339,250 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits.



  • Santa Clara, California, United States Palo Alto Networks Full time

    About the Role: Palo Alto Networks is seeking a skilled Cloud Platform Engineer to join our team. As a key member of our Infrastructure team, you will be responsible for designing, building, and maintaining mission-critical infrastructure and tools as a platform. You will work closely with other engineering teams to provide technical vision and ensure that our...


  • Santa Clara, California, United States Astera Labs Full time

    Astera Labs: Transforming Data-Driven Applications. Astera Labs is a global leader in purpose-built connectivity solutions that unlock the full potential of AI and cloud infrastructure. Our Intelligent Connectivity Platform integrates PCIe, CXL, and Ethernet semiconductor-based solutions and the COSMOS software suite of system management and optimization tools...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Palo Alto Networks is seeking a skilled Cloud Platform Engineer to join our team. As a Cloud Platform Engineer, you will be responsible for designing, building, and maintaining mission-critical infrastructure and tools as a platform. You will work closely with other engineering teams to ensure microservices are designed with scale, operability, and...


  • Santa Clara, California, United States NVIDIA Full time

    Job Description: NVIDIA is seeking a Senior Site Reliability Engineer to join our AI Efficiency Team. As a key member of this team, you will contribute to the development of infrastructure that powers our innovative AI research. The AI Efficiency Team focuses on optimizing efficiency and resiliency of AI workloads, as well as developing scalable AI and Data...


  • Santa Clara, California, United States NVIDIA Full time

    The NVIDIA Operations organization is seeking an experienced software engineering professional for the position of System Data, Software Engineer. As a member of our team, you will be an integral part of building cloud-based data platforms. You will support initiatives for the Data Platform, Reporting, and Analytics. Your work will turn data into information...


  • Santa Clara, California, United States XPENG Motors Full time

    Job Title: Staff Data Platform Engineer. Job Summary: We are seeking a highly skilled Staff Data Platform Engineer to join our team at XPeng Motors. As a key member of our data platform development team, you will be responsible for designing and implementing a cutting-edge real-time data management platform for autonomous driving. Responsibilities: Design and...


  • Santa Clara, California, United States Apple Full time

    About the Role: We are seeking a highly skilled Staff Machine Learning Infrastructure Engineer to join our ML Compute Team at Apple. As a key member of our team, you will be responsible for designing and delivering critical features to facilitate ML compute workloads. Your Key Responsibilities: Collaborate with teams across Apple on ML workloads such as training,...


  • Santa Clara, California, United States NVIDIA Full time

    NVIDIA is a leader in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once...

  • Software Engineer

    4 weeks ago


    Santa Clara, California, United States Palo Alto Networks Full time

    Job Description: At Palo Alto Networks, we're seeking a talented Software Engineer to join our Cloud Management Platform team. As a key member of our engineering team, you'll be responsible for designing and developing scalable microservices that enable our cloud products. Our ideal candidate is a passionate engineer with a strong background in cloud platforms,...


  • Santa Clara, California, United States XPENG Motors Full time

    Job Title: AI Infrastructure Engineer - Scalable Solutions. Xpeng Motors is a leading smart electric vehicle company that designs, develops, and manufactures smart EVs with advanced Internet, AI, and autonomous driving technologies. We are committed to in-house R&D and intelligent manufacturing to create a better mobility experience for our customers. We are...


  • Santa Clara, California, United States NVIDIA Full time

    We are seeking a highly skilled Senior Systems Engineer to work on scaling our cloud compute platform for Autonomous Vehicles (AV). Our platform provides access to 100s of PBs of data and exa-scale GPU+CPU compute for various AV workloads including data ingestion, processing and model training. We are embarking on building the next generation of the platform...


  • Santa Clara, California, United States Telenav Full time

    We are seeking a highly motivated Senior Data Platform Engineer to join our growing Auto team at Telenav. Our team is responsible for building and maintaining the data infrastructure that powers our connected car and location-based platform services. The ideal candidate will have experience in Java development, Hadoop, Hive, Spark, and other big data...


  • Santa Clara, California, United States ServiceNow Full time

    Transforming How We Work: At ServiceNow, we're revolutionizing the way organizations work by harnessing the power of intelligent cloud-based technology. Our platform seamlessly connects people, systems, and processes to empower businesses to find smarter, faster, and better ways to work. Join Our Mission: We're seeking an experienced database architect with a...


  • Santa Clara, California, United States Diverse Lynx Full time

    Job Description: At Diverse Lynx LLC, we are seeking a skilled Cloud Engineer to join our team. As a key member of our infrastructure team, you will be responsible for designing, implementing, and maintaining our cloud infrastructure. Key Responsibilities: Design and implement cloud infrastructure solutions using AWS, Azure, or Google Cloud...


  • Santa Clara, California, United States Nvidia Full time

    NVIDIA is seeking a highly skilled and experienced engineer to join our growing team. The successful candidate will work at the intersection of GPU chip design and AI, responsible for the design, development, and maintenance of the infrastructure around Nvidia's internal large language model aimed at facilitating chip design. Key Responsibilities: Develop and...


  • Santa Clara, California, United States NVIDIA Full time

    NVIDIA is a leader in the field of high-performance computing, and we are seeking a skilled Senior Software Engineer to join our team. The ideal candidate will have a strong background in software development, with experience in designing and creating reliable distributed systems. They will also have the ability to implement well-thought-out long-term...


  • Santa Clara, California, United States XPENG Full time

    Job Title: Staff Data Platform Engineer. Xpeng Motors is a leading smart electric vehicle company that designs, develops, manufactures, and markets smart EVs with advanced Internet, AI, and autonomous driving technologies. We are committed to in-house R&D and intelligent manufacturing to create a better mobility experience for our customers. Job...


  • Santa Clara, California, United States Apple Full time

    Job Title: Senior Device Network Architect, Machine Learning Platform and Infrastructure. About the Role: We are seeking a highly skilled Senior Device Network Architect to join our team at Apple. As a key member of our Machine Learning Platform and Infrastructure team, you will be responsible for designing and implementing large-scale automation and monitoring...


  • Santa Clara, California, United States Diverse Lynx Full time

    Job Title: Senior IT Infrastructure Engineer. Job Summary: We are seeking a highly skilled Senior IT Infrastructure Engineer to join our team at Diverse Lynx LLC. The ideal candidate will have expertise in VMware and OpenShift, with a strong focus on capacity planning, migrations, architectural planning, and operational issue resolution. Key...


  • Santa Clara, California, United States IT Management Corp. dba 101 VOICE Full time

    Job Title: IT Infrastructure Specialist. IT Management Corp. dba 101 VOICE is seeking an experienced IT Infrastructure Specialist to join our team. As an IT Infrastructure Specialist, you will play a pivotal role in designing, implementing, and managing our network infrastructure and systems, with a strong focus on cloud platforms and virtualization...