Senior Site Reliability Engineer

1 month ago


Santa Clara, California, United States Nvidia Full time
NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization.

The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services.

Our work opens up new universes to explore, enables unique creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars.

NVIDIA is looking for phenomenal people like you to help us accelerate the next wave of artificial intelligence.


Join our team at NVIDIA as a Senior Site Reliability Engineer focused on HPC storage and play a crucial role in designing, implementing, and optimizing on-prem High-Performance Computing (HPC) storage solutions while harnessing the power of cloud computing.

You will be responsible for crafting and deploying distributed storage solutions, building automation tools, and ensuring the efficient operation of our growing IT ecosystem.

You will collaborate closely with engineering teams to align infrastructure with their evolving needs, document best practices, and contribute to the success of groundbreaking projects.

What You'll Be Doing

  • Design and implement an on-prem HPC infrastructure supplemented with cloud computing to support the growing IT needs of NVIDIA.
  • Design and implement scalable and efficient storage solutions tailored for data-intensive applications, optimizing performance and cost-effectiveness.
  • Develop tooling to automate deployment and management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources.
  • Document general procedures and practices, and perform technology evaluations, related to distributed file systems.
  • Collaborate across teams to better understand developers' workflows and gather their infrastructure requirements.
  • Influence and guide methodologies for building, testing, and deploying applications to ensure optimal performance and resource utilization.
What We Need To See

  • BS in Computer Science (or equivalent experience) with 8+ years of relevant experience, MS with 5+ years of experience, or Ph.D. with 3 years of experience.
  • 8+ years of experience crafting technology solutions and resolving performance bottlenecks for HPC applications.
  • Experience with one or more parallel or distributed filesystems such as Lustre or GPFS is a must.
  • Experience designing, deploying, and managing enterprise NAS solutions such as NetApp and Pure Storage, as well as S3 storage.
  • Python/Bash/Golang programming/scripting experience.
  • Strong experience operating services in any of the leading cloud environments (AWS, Azure, or GCP).
  • Excellent communication and collaboration skills.
Ways To Stand Out From The Crowd

  • Background with RDMA (InfiniBand or RoCE) fabrics.
  • Experience with multiple monitoring stacks such as Prometheus+Grafana, Elasticsearch+Kibana, Splunk, Zabbix, etc. Familiarity with newer and emerging monitoring products.
  • Prior experience with HPC cluster management tools such as Slurm, PBS, LSF, etc.
  • Experience with containerization technologies, such as Docker, Mesosphere DCOS, Kubernetes (k8s).
NVIDIA is widely considered to be one of the technology world's most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you.

The base salary range is 164,000 USD - 310,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer.

As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.


