Platform and Infrastructure Engineer
3 days ago
NVIDIA is a leader in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization.
The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services.
Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars.
We're looking for highly motivated engineers to help us accelerate the next wave of artificial intelligence.
As a key member of our team, you will develop and maintain software facilitating GPU communication, driving groundbreaking solutions in High Performance Computing and Deep Learning.
You will implement modern DevOps tools to automate software updates, perform maintenance tasks, and monitor cluster availability, ensuring seamless operations.
Key responsibilities include:
- Develop automated tools to efficiently deploy, provision, and maintain extensive GPU clusters interconnected via NVLink and InfiniBand
- Implement modern DevOps tools to automate software updates, perform maintenance tasks, and monitor cluster availability
- Take ownership of daily cluster failures and issues, troubleshooting them promptly to maintain optimal cluster availability and performance
- Manage the rollout and rollback of cluster software and firmware updates, ensuring smooth transitions and minimal disruptions
Requirements include:
- BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or a related field, or equivalent experience
- 5+ years of hands-on experience in deploying and administering clusters, servers, switches, and related infrastructure
- Automation expert with hands-on skills in Ansible, Python, and shell scripting
- Deep understanding of operating systems, computer networks, and high-performance applications
- Proven ability to work effectively with developers and test engineers across different teams and time zones
- Proficient with Linux fundamentals
NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer.
We highly value diversity in our current and future employees and do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.
The base salary range is 148,000 USD - 339,250 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.
You will also be eligible for equity and benefits.
-
Cloud Platform Engineer
2 weeks ago
Santa Clara, California, United States Palo Alto Networks Full time
About the Role: Palo Alto Networks is seeking a skilled Cloud Platform Engineer to join our team. As a key member of our Infrastructure team, you will be responsible for designing, building, and maintaining mission-critical infrastructure and tools as a platform. You will work closely with other engineering teams to provide technical vision and ensure that our...
-
Cloud Infrastructure Engineer
1 week ago
Santa Clara, California, United States Astera Labs Full time
Astera Labs: Transforming Data-Driven Applications. Astera Labs is a global leader in purpose-built connectivity solutions that unlock the full potential of AI and cloud infrastructure. Our Intelligent Connectivity Platform integrates PCIe, CXL, and Ethernet semiconductor-based solutions and the COSMOS software suite of system management and optimization tools...
-
Cloud Platform Engineer
5 days ago
Santa Clara, California, United States Palo Alto Networks Full time
Palo Alto Networks is seeking a skilled Cloud Platform Engineer to join our team. As a Cloud Platform Engineer, you will be responsible for designing, building, and maintaining mission-critical infrastructure and tools as a platform. You will work closely with other engineering teams to ensure microservices are designed with scale, operability, and...
-
Senior Infrastructure Performance Engineer
3 weeks ago
Santa Clara, California, United States NVIDIA Full time
Transform IT Compute Platform Architecture: NVIDIA is at the forefront of technological innovation, driving efficiency and optimizing the performance of our infrastructure both on-prem and cloud. We are seeking a highly skilled Senior Staff Infrastructure Performance Engineer to join our dynamic team. Key Responsibilities: Lead initiatives to transform IT...
-
Senior Staff Infrastructure Performance Engineer
1 month ago
Santa Clara, California, United States NVIDIA Full time
Transform IT Compute Platform Architecture: NVIDIA is seeking a highly skilled Senior Staff Infrastructure Performance Engineer to join our dynamic team. As a key member of our IT organization, you will be responsible for leading initiatives to transform our IT Compute platform architecture to build new service offerings across On-Prem & Cloud. Key...
-
Senior Infrastructure Engineer
4 weeks ago
Santa Clara, California, United States Sustainable Talent Full time
Job Overview: Sustainable Talent is seeking a highly skilled Senior Infrastructure Engineer to support the NVIDIA Cloud Infrastructure Team. As a key member of our team, you will be responsible for supporting infrastructure team operations, cloud infrastructure system enrollments, deployments, and troubleshooting. Key Responsibilities: Support Infrastructure...
-
Cloud Infrastructure Engineer
2 weeks ago
Santa Clara, California, United States Palo Alto Networks Full time
About the Role: Palo Alto Networks is seeking a highly skilled Senior Staff Site Reliability Engineer to join our CDL/SLS team. As a key member of our infrastructure platform team, you will be responsible for designing, building, and operating reliable and secure cloud infrastructure. Our infrastructure platform stack includes Terraform, Kubernetes, GitLab...
-
Senior Cloud Infrastructure Engineer
1 week ago
Santa Clara, California, United States NVIDIA Full time
Job Description: NVIDIA is seeking a Senior Site Reliability Engineer to join our AI Efficiency Team. As a key member of this team, you will contribute to the development of infrastructure that powers our innovative AI research. The AI Efficiency Team focuses on optimizing efficiency and resiliency of AI workloads, as well as developing scalable AI and Data...
-
Senior DevOps Engineer - Platform
3 weeks ago
Santa Clara, California, United States NVIDIA Full time
About the Role: NVIDIA is seeking a highly skilled and motivated Kubernetes Architect/Engineer to join its fast-paced Infrastructure, Planning and Processes organization. As a Principal DevOps & SRE Engineer, you will play a critical role in designing and implementing Kubernetes solutions for the company's Cloud Platform. Key Responsibilities: Architect, design,...
-
Senior Cloud Infrastructure Engineer
4 weeks ago
Santa Clara, California, United States NVIDIA Full time
Join NVIDIA's AI Efficiency Team: We are seeking a Senior Site Reliability Engineer to contribute to the infrastructure that powers our innovative AI research. About the Role: This team focuses on optimizing efficiency and resiliency of AI workloads, as well as developing scalable AI and Data infrastructure tools and services. Our objective is to deliver a stable,...
-
Senior Data Platform Engineer
4 days ago
Santa Clara, California, United States NVIDIA Full time
The NVIDIA Operations organization is seeking an experienced software engineering professional for the position of System Data, Software Engineer. As a member of our team, you will be an integral part of building cloud-based data platforms. You will support initiatives for the Data Platform, Reporting, and Analytics. Your work will turn data into information...
-
Staff Data Platform Engineer
1 week ago
Santa Clara, California, United States XPENG Motors Full time
Job Title: Staff Data Platform Engineer. Job Summary: We are seeking a highly skilled Staff Data Platform Engineer to join our team at XPeng Motors. As a key member of our data platform development team, you will be responsible for designing and implementing a cutting-edge real-time data management platform for autonomous driving. Responsibilities: * Design and...
-
Santa Clara, California, United States Apple Full time
About the Role: We are seeking a highly skilled Staff Machine Learning Infrastructure Engineer to join our ML Compute Team at Apple. As a key member of our team, you will be responsible for designing and delivering critical features to facilitate ML compute workloads. Your Key Responsibilities: Collaborate with teams across Apple on ML workloads such as training,...
-
Cloud Infrastructure Architect
2 weeks ago
Santa Clara, California, United States Astera Labs Full time
Astera Labs Job Description: Astera Labs is a global leader in purpose-built connectivity solutions that unlock the full potential of AI and cloud infrastructure. Our Intelligent Connectivity Platform integrates PCIe, CXL, and Ethernet semiconductor-based solutions and the COSMOS software suite of system management and optimization tools to deliver a...
-
Platform and EngOps Engineer
4 days ago
Santa Clara, California, United States NVIDIA Full time
NVIDIA is a leader in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once...
-
Cloud Infrastructure Engineer
2 weeks ago
Santa Clara, California, United States Palo Alto Networks Full time
About the Role: Palo Alto Networks is seeking a highly skilled Senior Staff DevOps Engineer to join our CDL/SLS team. As a key member of our team, you will be responsible for designing, building, and operating reliable and secure cloud infrastructure. Our infrastructure platform stack includes Terraform, Kubernetes, GitLab CI/CD, GitOps, Prometheus, Grafana,...
-
Software Engineer
1 week ago
Santa Clara, California, United States Palo Alto Networks Full time
Job Description: At Palo Alto Networks, we're seeking a talented Software Engineer to join our Cloud Management Platform team. As a key member of our engineering team, you'll be responsible for designing and developing scalable microservices that enable our cloud products. Our ideal candidate is a passionate engineer with a strong background in cloud platforms,...
-
Software Engineer
4 weeks ago
Santa Clara, California, United States Palo Alto Networks Full time
Job Description: Palo Alto Networks is seeking a highly skilled Software Engineer to join our Cloud Management Platform team. As a key member of our team, you will be responsible for designing, developing, and deploying scalable microservices used to activate all Palo Alto Networks cloud products. Key Responsibilities: Design and implement complex software...
-
Software Engineer
4 weeks ago
Santa Clara, California, United States Palo Alto Networks Full time
About the Role: We're seeking a talented Software Engineer to join our Internet Security Infrastructure Team at Palo Alto Networks. As a key member of our team, you will be responsible for designing and developing large-scale backend systems that drive our cybersecurity solutions. Key Responsibilities: Design and develop large-scale backend systems that meet the...
-
AI Infrastructure Engineer
4 days ago
Santa Clara, California, United States XPENG Motors Full time
Job Title: AI Infrastructure Engineer - Scalable Solutions. Xpeng Motors is a leading smart electric vehicle company that designs, develops, and manufactures smart EVs with advanced Internet, AI, and autonomous driving technologies. We are committed to in-house R&D and intelligent manufacturing to create a better mobility experience for our customers. We are...