Senior Site Reliability Engineer, Data Science and ML Platforms

2 months ago


Santa Clara, United States NVIDIA Full time

Senior Site Reliability Engineer, Data Science and ML Platforms

Are you passionate about building and maintaining large-scale production systems that support advanced data science and machine learning applications? Do you want to join a team at the heart of NVIDIA's data-driven decision-making culture? If so, we have a great opportunity for you NVIDIA is seeking a Senior Site Reliability Engineer (SRE) for the Data Science & ML Platform(s) team. The role involves designing, building, and maintaining services that enable real-time data analytics, streaming, data lakes, observability and ML/AI training and inferencing. The responsibilities include implementing software and systems engineering practices to ensure high efficiency and availability of the platform, as well as applying SRE principles to improve production systems and optimize service SLOs. Additionally, collaboration with our customers to plan implement changes to the existing system, while monitoring capacity, latency, and performance is part of the role. What you’ll be doing: Develop software solutions to ensure reliability and operability of large-scale systems supporting machine-critical use cases. Gain a deep understanding of our system operations, scalability, interactions, and failures to identify improvement opportunities and risks. Create tools and automation to reduce operational overhead and eliminate manual tasks. Establish frameworks, processes, and standard methodologies to enhance operational maturity, team efficiency, and accelerate innovation. Define meaningful and actionable reliability metrics to track and improve system and service reliability. Oversee capacity and performance management to facilitate infrastructure scaling across public and private clouds globally. Build tools to improve our service observability for faster issue resolution. Practice sustainable incident response and blameless postmortems. What we need to see: Minimum of 5-8 years of experience in SRE, Cloud platforms, or DevOps with large-scale microservices in production environments. Master's or Bachelor's degree in Computer Science or Electrical Engineering or CE or equivalent experience. Strong understanding of SRE principles, including error budgets, SLOs, and SLAs. Proficiency in incident, change, and problem management processes. Skilled in problem-solving, root cause analysis, and optimization. Experience with streaming data infrastructure services, such as Kafka and Spark. Expertise in building and operating large-scale observability platforms for monitoring and logging (e.g., ELK, Prometheus). Proficiency in programming languages such as Python, Go, Perl, or Ruby. Hands-on experience with scaling distributed systems in public, private, or hybrid cloud environments. Experience in deploying, supporting, and supervising services, platforms, and application stacks. Ways to stand out from the crowd: Experience operating large-scale distributed systems with strong SLAs. Excellent coding skills in Python and Go and extensive experience in operating data platforms. Knowledge of CI/CD systems, such as Jenkins and GitHub Actions. Familiarity with Infrastructure as Code (IaC) methodologies and tools. Excellent interpersonal skills for identifying and communicating data-driven insights. The base salary range is 148,000 USD - 276,000 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. You will also be eligible for equity and benefits. NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

#J-18808-Ljbffr



  • Santa Clara, United States DG Heating and Air Conditioning Inc Full time

    Are you passionate about building and maintaining large-scale production systems that support advanced data science and machine learning applications? Do you want to join a team at the heart of NVIDIAs data-driven decision-making culture? If so, we have a great opportunity for you! NVIDIA is seeking a Senior Site Reliability Engineer (SRE) for the Data...


  • Santa Clara, California, United States Tenstorrent Inc. Full time

    Tenstorrent is leading the industry on cutting-edge AI technology, revolutionizing performance expectations, ease of use, and cost efficiency. With AI redefining the computing paradigm, solutions must evolve to unify innovations in software models, compilers, platforms, networking, and semiconductors. Our diverse team of technologists have developed a high...


  • Santa Clara, United States Intel Full time

    Job Description We are actively seeking a Senior Engineer specialized in ML Ops and DevSecOps! This role requires a comprehensive understanding of machine learning operations, DevSecOps practices, and the integration of security within the CI/CD pipeline. Ideal candidates will have a strong software engineering background, exceptional automation...


  • Santa Clara, California, United States Apple, Inc. Full time

    Apple is where individual imaginations gather together, committing to the values that lead to great work. Every new product we build, service we create, or Apple Store experience we deliver is the result of us making each other's ideas stronger. That happens because every one of us shares a belief that we can make something wonderful and share it with the...


  • Santa Clara, United States Apple, Inc. Full time

    Apple is where individual imaginations gather together, committing to the values that lead to great work. Every new product we build, service we create, or Apple Store experience we deliver is the result of us making each other's ideas stronger. That happens because every one of us shares a belief that we can make something wonderful and share it with the...


  • Santa Clara, United States Geospatial And Cloud Analytics Inc Full time

    Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline to design, build and maintain large scale production systems with high efficiency and availability using the combination of software and systems engineering practices. This is a highly specialized discipline which demand knowledge across different systems, networking, coding, database,...


  • Santa Clara, United States Nvidia Full time

    Senior Site Reliability Engineer - StoragelocationsUS, CA, Santa Claratime typeFull timejob requisition idJR1979072NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and...


  • Santa Clara, California, United States ServiceNow Full time

    Company OverviewAt ServiceNow, we harness technology to create a better world for everyone, driven by our talented workforce. We prioritize speed and innovation to meet the demands of our customers and communities.Joining ServiceNow means becoming part of a dynamic team of innovators who possess a relentless curiosity and a commitment to creativity.We...


  • Santa Clara, United States Laksan Technologies Full time

    Job DescriptionJob DescriptionExperience in designing and implementing scalable and secure Azure cloud solutions using Data Bricks ML platform. Model development , deployment and job scheduling using unstructured data. Certification in databricks ML Expert level in Designing and Architect solutions in Azure Data factory, Azure Databricks, Azure Datalake,...


  • Santa Clara, California, United States ServiceNow Full time

    Company OverviewAt ServiceNow, we harness technology to enhance global operations, and our dedicated workforce makes it all possible. We operate swiftly because the world demands it, innovating uniquely for our clients and communities.By becoming part of ServiceNow, you join a dynamic team of innovators who possess a relentless curiosity and a passion for...


  • Santa Clara, California, United States Nvidia Full time

    NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables unique creativity and discovery, and powers what were...


  • Santa Clara, California, United States Platform Ldn Full time

    About Platform LdnPlatform Ldn is a pioneering company in the field of robotics, dedicated to advancing the development of AI platforms that support industrial-grade robotics solutions.Job SummaryWe are seeking a highly skilled Senior Software Engineer to lead the design and development of our AI platform, enabling clients to run their AI workflows...


  • Santa Clara, United States d-Matrix Full time

    Software Engineer, Senior - AI/ML Workloadsd-Matrix - Santa Clara, CALocation Santa Clara, CaType Full timeDepartment R&D - SW Kernels & Workloads d-Matrix has fundamentally changed the physics of memory-compute integration with our digital in-memory compute (DIMC) engine. The “holy grail” of AI compute has been to break through the memory wall to...


  • Santa Clara, United States NVIDIA Full time

    Site Reliability Engineering (SRE) is an engineering discipline that involves designing, building, and maintaining large-scale production systems with high efficiency and availability. It encompasses various areas, including software and systems engineering practices, storage, data management, and services. SRE professionals are highly specialized and...


  • Santa Clara, United States Veear Full time

    Position: Site Reliability Engineer Location: Remote role Duration: 12+ Months Contract with possible extension Job Description: We seek development-heavy Site Reliability Engineers to design, build, maintain, and scale production services and server farms within our FedRAMP SASE product portfolio. We want passionate engineers who bring new ideas to all...


  • Santa Clara, United States NVIDIA Full time

    NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables unique creativity and discovery, and powers what were...

  • ML Data Engineer-I

    4 days ago


    Santa Clara, United States Zortech Solutions Full time

    Job DescriptionJob DescriptionRole: ML Data Engineer-ILocation: Santa Clara, CA/US-RemoteDuration: 6+ MonthsJob Description:Shape of Data, preparing curating data sets for training, fine tuning AI/ML modelsSpark, Hive, Kafka, RedshiftAWS Data Engineer Associate


  • Santa Clara, California, United States Palo Alto Networks Full time

    Job OverviewCompany OverviewTo comply with U.S. federal government requirements, U.S. citizenship is required for this position.Our MissionAt Palo Alto Networks, our mission is clear:To be the cybersecurity partner of choice, safeguarding our digital existence.We envision a world where each day is safer and more secure than the last. Our foundation is built...


  • Santa Clara, United States Palo Alto Networks Full time

    Our Mission At Palo Alto Networks everything starts and ends with our mission: Being the cybersecurity partner of choice, protecting our digital way of life. Our vision is a world where each day is safer and more secure than the one before. We are a company built on the foundation of challenging and disrupting the way things are done, and we’re looking...


  • Santa Clara, United States Sustainable Talent Full time

    Job DescriptionJob DescriptionJoin the Sustainable Talent team, supporting NVIDIA as a Senior Site Reliability Engineer supporting the Infrastructure, Planning, and Process organization. This is a W-2 full-time contract based in Santa Clara, CA, with Hybrid work options. We offer competitive pay $75 - $90/hr based on factors like experience, education,...