Current jobs related to Senior Site Reliability Engineer - Santa Clara - Nvidia


  • Santa Clara, California, United States ServiceNow Full time

    Company Overview: At ServiceNow, we harness technology to create a better world for everyone, driven by our talented workforce. We prioritize speed and innovation to meet the demands of our customers and communities. Joining ServiceNow means becoming part of a dynamic team of innovators who possess a relentless curiosity and a commitment to creativity. We...


  • Santa Clara, California, United States ServiceNow Full time

    Company Overview: At ServiceNow, we harness technology to enhance global operations, and our dedicated workforce makes it all possible. We operate swiftly because the world demands it, innovating uniquely for our clients and communities. By becoming part of ServiceNow, you join a dynamic team of innovators who possess a relentless curiosity and a passion for...


  • Santa Clara, United States Nvidia Full time

    Senior Site Reliability Engineer - Storage. Locations: US, CA, Santa Clara. Time type: Full time. Job requisition ID: JR1979072. NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and...


  • Santa Clara, United States Veear Full time

    Position: Site Reliability Engineer Location: Remote role Duration: 12+ Months Contract with possible extension Job Description: We seek development-heavy Site Reliability Engineers to design, build, maintain, and scale production services and server farms within our FedRAMP SASE product portfolio. We want passionate engineers who bring new ideas to all...


  • Santa Clara, California, United States Nvidia Full time

    NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables unique creativity and discovery, and powers what were...


  • Santa Clara, United States VeeAR Projects Inc. Full time

    Position: Site Reliability Engineer Location: Remote role Duration: 12+ Months Contract with possible extension Job Description: We seek development-heavy Site Reliability Engineers to design, build, maintain, and scale production services and server farms within our FedRAMP SASE product portfolio. We want passionate engineers who bring new ideas to all facets...


  • Santa Clara, United States NVIDIA Full time

    NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s fueled by great technology—and outstanding people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers,...


  • Santa Clara, United States NVIDIA Full time

    Senior Site Reliability Engineer, Data Science and ML Platforms Are you passionate about building and maintaining large-scale production systems that support advanced data science and machine learning applications? Do you want to join a team at the heart of NVIDIA's data-driven decision-making culture? If so, we have a great opportunity for you! NVIDIA is...


  • Santa Clara, United States Centrify Corporation Full time

    Our software runs on public clouds with 99.9% or better uptime and is mission critical for our customers. Our cloud operations team is where the rubber meets the road and needs innovative Site Reliability Engineers. Join a professional team of smart and hard-working professionals building enterprise-class cloud-based services in the rapidly growing market of...


  • Santa Clara, United States Geospatial And Cloud Analytics Inc Full time

    Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline to design, build, and maintain large-scale production systems with high efficiency and availability using a combination of software and systems engineering practices. This is a highly specialized discipline that demands knowledge across different systems, networking, coding, database,...


  • Santa Clara, United States Sustainable Talent Full time

    Job Description: Join the Sustainable Talent team, supporting NVIDIA as a Senior Site Reliability Engineer supporting the Infrastructure, Planning, and Process organization. This is a W-2 full-time contract based in Santa Clara, CA, with Hybrid work options. We offer competitive pay $75 - $90/hr based on factors like experience, education,...


  • Santa Clara, United States NVIDIA Full time

    NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables unique creativity and discovery, and powers what were...


  • Santa Clara, United States DG Heating and Air Conditioning Inc Full time

    Are you passionate about building and maintaining large-scale production systems that support advanced data science and machine learning applications? Do you want to join a team at the heart of NVIDIA's data-driven decision-making culture? If so, we have a great opportunity for you! NVIDIA is seeking a Senior Site Reliability Engineer (SRE) for the Data...


  • Santa Clara, California, United States Promote Project Full time

    About Promote Project: Promote Project is a leader in innovative technology solutions, dedicated to pushing the boundaries of what is possible in the realm of artificial intelligence and cloud computing. Our commitment to excellence is reflected in our talented workforce and our pursuit of groundbreaking advancements. Position Overview: We are seeking a...


  • Santa Clara, California, United States Promote Project Full time

    About the Company: Promote Project is at the forefront of innovation, leveraging cutting-edge technology to redefine the landscape of AI and computing. Our mission is to harness the power of advanced computing to create transformative solutions that impact various industries. Position Overview: We are seeking a Manager of Site Reliability Engineering to...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Job Overview. Company Overview: To comply with U.S. federal government requirements, U.S. citizenship is required for this position. Our Mission: At Palo Alto Networks, our mission is clear: To be the cybersecurity partner of choice, safeguarding our digital existence. We envision a world where each day is safer and more secure than the last. Our foundation is built...


  • Santa Clara, United States Diverse Lynx Full time

    Skills: Site Reliability Engineering (SRE), Git (Bitbucket), Jenkins, AWS CodeBuild, AWS CodeDeploy. Job Description: AWS application and CI/CD pipelines, Microsoft Server admin and workload support (data center and AWS). • Initial responsibility is application platform promotion to controlled environments for test, staging, and production AWS accounts. o...


  • Santa Clara, United States NVIDIA Full time

    Senior System Reliability Engineer Locations: US, CA, Santa Clara Time Type: Full time Posted on: Posted 2 Days Ago Job Requisition ID: JR1980220 NVIDIA has continuously reinvented itself over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing —...


  • Santa Clara, California, United States Omnivision Technologies Full time

    Qualifications: Bachelor's degree in Physics, Electrical Engineering, Materials Science, or a related engineering field, with coursework focused on semiconductor physics and electronics. Familiarity with electronic component reliability standards such as JEDEC and AEC-Q100 is advantageous. Experience in wafer-level reliability testing is also beneficial. Key...

Senior Site Reliability Engineer

2 months ago


Santa Clara, United States Nvidia Full time

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables unique creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars. NVIDIA is looking for phenomenal people like you to help us accelerate the next wave of artificial intelligence.

Join our team at NVIDIA as a Senior Site Reliability Engineer focused on HPC storage and play a crucial role in designing, implementing, and optimizing on-prem High-Performance Computing (HPC) storage solutions while harnessing the power of cloud computing. You will be responsible for crafting and deploying distributed storage solutions, building automation tools, and ensuring the efficient operation of our growing IT ecosystem. You will collaborate closely with engineering teams to align infrastructure with their evolving needs, document best practices, and contribute to the success of groundbreaking projects.

What You'll Be Doing

• Design and implement an on-prem HPC infrastructure supplemented with cloud computing to support the growing IT needs of NVIDIA.
• Design and implement scalable and efficient storage solutions tailored for data-intensive applications, optimizing performance and cost-effectiveness.
• Develop tooling to automate deployment and management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources.
• Document general procedures and practices, and perform technology evaluations, related to distributed file systems.
• Collaborate across teams to better understand developers' workflows and gather their infrastructure requirements.
• Influence and guide methodologies for building, testing, and deploying applications to ensure optimal performance and resource utilization.


What We Need To See

• BS in Computer Science (or equivalent experience) with 8+ years of relevant experience, MS with 5+ years of experience or Ph.D. with 3 years of experience.
• 8+ years of experience crafting technology solutions and resolving performance bottlenecks for HPC applications.
• Experience with one or more parallel or distributed filesystems, such as Lustre or GPFS, is a must.
• Experience designing, deploying, and managing enterprise NAS solutions such as NetApp and Pure Storage, as well as S3 storage.
• Python/Bash/Golang programming/scripting experience.
• Strong experience operating services in any of the leading cloud environments (AWS, Azure, or GCP).
• Excellent communication and collaboration skills.


Ways To Stand Out From The Crowd

• Background with RDMA (InfiniBand or RoCE) fabrics.
• Experience with multiple monitoring stacks such as Prometheus+Grafana, Elasticsearch+Kibana, Splunk, Zabbix, etc. Familiarity with newer and emerging monitoring products.
• Prior experience with HPC cluster management tools such as Slurm, PBS, LSF, etc.
• Experience with containerization technologies, such as Docker, Mesosphere DC/OS, and Kubernetes (k8s).


NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you.

The base salary range is 164,000 USD - 310,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.