Senior Site Reliability Engineer

3 days ago


Santa Clara, California, United States Nvidia Full time

Senior Site Reliability Engineer - Storage

Locations: US, CA, Santa Clara
Time type: Full time
Job requisition ID: JR1979072

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables unique creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars. NVIDIA is looking for phenomenal people like you to help us accelerate the next wave of artificial intelligence.

Join our team at NVIDIA as a Senior Site Reliability Engineer focused on HPC storage and play a crucial role in designing, implementing, and optimizing on-prem High-Performance Computing (HPC) storage solutions while harnessing the power of cloud computing. You will be responsible for crafting and deploying distributed storage solutions, building automation tools, and ensuring the efficient operation of our growing IT ecosystem. You will collaborate closely with engineering teams to align infrastructure with their evolving needs, document best practices, and contribute to the success of groundbreaking projects.

What You'll Be Doing

Design and implement an on-prem HPC infrastructure, supplemented with cloud computing, to support the growing IT needs of NVIDIA.

Design and implement scalable and efficient storage solutions tailored for data-intensive applications, optimizing performance and cost-effectiveness.

Develop tooling to automate deployment and management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources.

Document general procedures and practices, and perform technology evaluations, related to distributed file systems.

Collaborate across teams to better understand developers' workflows and gather their infrastructure requirements.

Influence and guide methodologies for building, testing, and deploying applications to ensure optimal performance and resource utilization.

What We Need To See:

BS in Computer Science (or equivalent experience) with 8+ years of relevant experience, MS with 5+ years of experience, or Ph.D. with 3 years of experience.

8+ years of experience crafting technology solutions and resolving performance bottlenecks for HPC applications.

Experience with one or more parallel or distributed filesystems, such as Lustre or GPFS, is a must.

Experience designing, deploying, and managing enterprise NAS solutions such as NetApp and Pure Storage, as well as S3 storage.

Python/Bash/Golang programming/scripting experience.

Strong experience operating services in any of the leading cloud environments (AWS, Azure, or GCP).

Excellent communication and collaboration skills.

Ways To Stand Out From The Crowd:

Background with RDMA (InfiniBand or RoCE) fabrics.

Experience with multiple monitoring stacks such as Prometheus+Grafana, Elasticsearch+Kibana, Splunk, Zabbix, etc. Familiarity with newer and emerging monitoring products.

Prior experience with HPC cluster management tools such as Slurm, PBS, LSF, etc.

Experience with containerization technologies, such as Docker, Mesosphere DC/OS, and Kubernetes (k8s).

NVIDIA is widely considered to be one of the technology world's most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you.

The base salary range is 164,000 USD - 310,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.
You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.


