Senior Site Reliability Engineer, Data Science Platform

3 weeks ago


Olympia, United States NVIDIA Full time

We are now looking for a Sr. Site Reliability Engineer (SRE), Data Science Platform. At NVIDIA, we pride ourselves on data-driven decision-making, and the data science team is at the heart of this initiative. We are looking for an excellent Sr. Site Reliability Engineer with extensive data infrastructure experience for our data science platform supporting NVIDIA's cloud platform services. Our data science platform serves as the basis for advanced real-time data analytics, streaming, data lake, and sophisticated ML/AI training with offline/online inferencing for NVIDIA's cloud services.

Site Reliability Engineering is an engineering discipline for designing, building, and maintaining large-scale production systems with high efficiency and availability, using a combination of software and systems engineering practices. SRE at NVIDIA ensures the reliability and uptime promised to users while enabling developers to make changes to the existing system through careful preparation and planning, keeping an eye on capacity, latency, and performance. SRE is also a mindset and a set of engineering approaches to running better production systems and optimizations. The person in this position will be responsible for Service Response and Workflows and will drive tools/service development to maintain and improve service SLOs.

What You’ll Be Doing

  • Build tools to improve SRE observability and to rapidly debug and triage incidents and user-reported issues.
  • Make valuable contributions to the overall health, performance, and reliability of NVIDIA's Cloud Data Science platform and Infrastructure Services.
  • Take ownership of automating, scripting, and tooling of new/existing scripts to help the team achieve 100% automation of daily tasks.
  • Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity management, and launch reviews.
  • Apply a clear understanding of SRE observability and experience in building new tools and automation using Python/Go.
  • Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
  • Practice balanced incident response and blameless postmortems.

What We Need To See

  • MS or BS in Computer Science/Engineering or a related field, or equivalent experience.
  • 5+ years of site reliability engineering experience working on large-scale distributed microservices in a production environment, with a real passion for automation and tooling.
  • An SRE approach, with an understanding of error budgeting, SLOs, and SLAs.
  • Clear understanding of incident management, change management, and problem management processes.
  • Ability to detect all service-impacting issues and to handle accurate triage, partner communication, impact containment, service restoration, and post-incident follow-up.
  • Proven strengths in problem-solving and root-causing issues while continuously seeking ways to drive optimization, efficiency, and the bottom line.
  • Strong experience with streaming data infrastructure services involving web services, Kafka, Spark, etc.
  • Expert knowledge of building and operating large-scale observability platforms for monitoring and logging (ELK, Prometheus, etc.).
  • Excellent interpersonal skills, including the ability to identify and communicate data-driven insights.

Ways To Stand Out From The Crowd

  • Experience operating large-scale distributed systems with strong SLAs.
  • Excellent scripting: Python, Go.
  • Strong experience operating data platforms.

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars. NVIDIA is looking for great people like you to help us accelerate the next wave of artificial intelligence.

The base salary range is 144,000 USD - 270,250 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.





  • Olympia, United States Sift Full time

    ABOUT US: At Sift, we’re accelerating the development of next-generation machines with the world’s first end-to-end telemetry stack — and we’re expanding our team. Sift’s founders started this company in order to enable private and government entities to make big, technological strides. Our focus is centered on creating a better world for tomorrow,...

  • Data Science Analyst

    1 month ago


    Olympia, United States Washington Health Benefit Exchange Full time

    Job Description: The mission of Washington Health Benefit Exchange (Exchange) is to radically improve how Washington residents secure health insurance through innovative and practical solutions, an easy-to-use customer experience, our values of integrity, respect, equity and transparency, and by providing undeniable value to the health care...


  • Olympia, Washington, United States Oracle Full time

    Job Description: We are facing several engineering challenges in critical foundational data-plane services that power the next-gen OCI cloud. We need you to challenge existing engineering assumptions and boundaries, and to bring your expertise in highly performant, reliable, available system engineering to take OCI data-planes to the next level. This is your...


  • Olympia, United States myGwork - LGBTQ+ professionals & allies Full time

    This inclusive employer is a member of myGwork, the largest global platform for the LGBTQ+ business community. Summary: The Senior Software Developer is responsible for analysis, design, implementation, and unit testing to produce high-quality code for a project team responsible for supporting a number of cutting-edge assessment technology platforms. They...


  • Olympia, Washington, United States City Of Tumwater Full time

    Salary: $105,128,052.00 Annually. Location: Tumwater, WA. Job Type: Full-time. Department: Transportation and Engineering. General Overview: The City of Tumwater is a thriving community known for its rich history and vibrant opportunities. Situated at the base of Puget Sound, it serves as a gateway to the Seattle/Tacoma metropolitan area, offering easy access to...


  • Olympia, United States Epic Games Full time

    ENGINEERING - UNREAL ENGINE. What We Do: Unreal-powered projects have been on the bleeding edge of real-time entertainment for over 20 years. Our team of engineering experts is always innovating to improve the tools and technology that empower content developers worldwide. What You'll Do: We are looking for a Software Engineer passionate about creating...


  • Olympia, Washington, United States LDC, Inc. Full time

    LDC, Inc. Civil Engineer/Project Manager. Job Overview. Company Background: For over two decades, LDC has established a reputation for excellence in engineering solutions. Our commitment to delivering outstanding results has earned us the trust of our clients and the community. Founded on the principle of "Service Above the Standard," we have navigated challenges...


  • Olympia, United States iSeatz Full time $105,000 - $179,000

    Job Description. Our Mission: iSeatz provides digital commerce and loyalty tech solutions that enable travel and lifestyle bookings to global customers including American Express, Expedia, and IHG Hotels. Our proprietary platform processes more than $4B a year in transactions. We have a history of long-term trusted relationships and innovation that...


  • Olympia, Washington, United States LDC, Inc. Full time

    LDC, Inc. Civil Engineer/Project Manager. Job Overview. Company Background: With a legacy spanning over two decades, LDC has earned a reputation for excellence among clients and within the engineering community. Our mission is to deliver outstanding solutions that consistently exceed client expectations. Founded on the principle of "Service Above the Standard," we...


  • Olympia, United States State of Washington Full time

    DESCRIPTION: Enterprise Data Architect (ITAE/ETS). This recruitment is posted continuously. The hiring manager reserves the right to close the posting at any time once a selection has been made. This position is responsible for enterprise-level data architecture that has effects agencywide as well as with external partners. The volume and use of data between...


  • Olympia, United States Washington State Liquor and Cannabis Board Full time

    Your opportunity at a glance: The WSLCB Information Technology Services Division is announcing an exciting opportunity for a Data Science Developer (IT Data Management - Senior/Specialist) in Olympia, WA. This position reports to the LEEADS and Data Analytics Services Manager in the WSLCB's Information Technology Services Division (ITSD). In...


  • Olympia, United States Vets Hired Full time

    Responsibilities: Provide senior thought leadership and significant technical contributions to help develop target software-defined network architectures and recommendations for the evolution of the network technology stack, in close collaboration with stakeholders. The Senior Principal will focus at the outset on designing and developing the network...


  • Olympia, Washington, United States Autodesk Full time

    Job Requisition ID #24WD79935. Position Overview: We are seeking a dynamic and experienced Technical Leader to focus on building software that helps developers to be as productive as possible and with a platform mindset. Do you care deeply about alignment and clearing paths to get things done, doing so with an extensive background in software architecture? The...


  • Olympia, Washington, United States LDC, Inc. Full time

    LDC, Inc. Civil Engineer/Project Manager. Job Overview. Company Background: LDC, Inc. has established a strong reputation over the past two decades for delivering high-quality engineering solutions. Our mission revolves around providing exceptional service that exceeds client expectations. With a commitment to innovation and excellence, LDC has successfully...


  • Olympia, Washington, United States LDC, Inc. Full time

    LDC, Inc. Civil Engineer/Project Manager. Position Overview: With over two decades of excellence, LDC has established a reputation for delivering high-quality engineering solutions that exceed client expectations. Our foundational principle, "Service Above the Standard," has guided our growth and innovation, allowing us to thrive even during challenging economic...

  • Software engineer

    1 month ago


    Olympia, United States META Full time

    Summary: Meta is seeking an experienced Software Engineer to join the Software Engineering (Infrastructure) team. The Software Engineering (Infrastructure) team builds large distributed components that run Facebook. Our code serves millions of requests per second and it does so with sub-second latency and in a fault-tolerant manner. We handle everything...


  • Olympia, Washington, United States LDC, Inc. Full time

    LDC, Inc. Civil Engineer/Project Manager. Position Overview: With over two decades of excellence, LDC has established a reputation for delivering outstanding engineering solutions that consistently exceed client expectations. Our foundation is built on the principle of "Service Above the Standard," which drives our commitment to quality and innovation. As a...


  • Olympia, United States The College Board Full time

    About the Team: The Digital Assessment team is committed to making higher education accessible to every student through innovative technology, building cutting-edge applications to deliver College Board's suite of assessments. We are constantly seeking and experimenting with new technology, using cutting-edge tools to deliver world-class exam experiences to...

