Senior SRE Engineer, NIM Factory

1 week ago


Santa Clara, United States Sage Lake Senior Living Full time

NVIDIA is the platform upon which every new AI-powered application is built. We are seeking a senior SRE to monitor and operate both the factory automation for NVIDIA Inference Microservices (NIMs) and its deployed services. The right person for this role brings technical drive and creativity to change the way NVIDIA provides high-performance inferencing for every AI model.

Our NIM offerings are easy to use, optimized for performance, and developed using a highly automated software factory. We create containers available for download and hosted services. You will apply your expertise to operate highly available services that make effective use of the thousands of GPUs involved in this operation. Your services provide best-in-class performance, accuracy, and availability. We are looking for technical talent to design, build, operate, and improve our factory capabilities, including the underlying infrastructure, pipelines, backends, Docker build, test harness, metrics, performance engineering, log ingestion, and more.

What you'll be doing:

  • Operate a software factory that takes an AI model in and produces a deployable service that is validated across Cloud, On-prem and Kubernetes environments.
  • With the development team, define and deliver rapid iterations on the group's technical strategies and roadmaps to evolve the NIM factory for continuous delivery of packaged NIMs.
  • Take responsibility for the operation of the factory, including its availability, observability, and stability; track the deployment of our services into multiple cloud hosts and improve the efficiency, availability, and stability of these services.
  • Partner with internal and external SRE teams to provide the best experience for our developers and our users of the resulting services.
  • Ensure our operation is secure through the proper configuration and management of infrastructure, including containers, databases, and networking, while following and improving standard processes for security, scalability, and cost optimization. This requires working closely with our security teams tasked with responding to security threats.
  • Collaborate broadly with multiple AI model teams to understand their requirements and build an efficient infrastructure that supports and improves development and production execution of these models.
  • Define metrics and drive improvements based on user feedback.
  • Mentor and collaborate throughout the team and with other teams to grow your colleagues and yourself. You will have a history of learning and growing your own skills and those of the people around you.

What we need to see:

  • Demonstrated advanced system engineering skills operating and improving the observability and maintainability of distributed microservice cloud applications and services.
  • Effective experience working with multi-functional teams, principals and architects, and across organizational boundaries.
  • Mentorship, growing teams and team members, and the flexibility to adjust your direction and expectations given the needs of our customers.
  • Experience operating distributed containerized applications using technologies such as Docker, K8s, Cloud Endpoints, Helm, and Prometheus.
  • Use of infrastructure as code, such as Terraform, Puppet, Ansible or others.
  • Experience identifying the root cause of failures and performance bottlenecks in distributed microservices or cloud systems.
  • Understanding and application of good security practices for publicly facing cloud services.
  • BS or MS in Computer Science, Computer Engineering or equivalent experience.
  • 7+ years of demonstrated experience as an SRE or Developer working on high-performance microservices and cloud software.

Ways to stand out from the crowd:

  • Excellent communication and interpersonal skills and the ability to engage a multi-functional team.
  • Experience with event-driven applications using various services such as Temporal, Kafka, Redis or others.
  • A history of building and deploying containers for Microservices, Cloud and On-prem deployments, and their associated CI/CD pipelines.

We are widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and creative people in the world working for us. If you're creative and autonomous with a real passion for technology, we want to hear from you. We are an equal opportunity employer and value diversity at our company.

The base salary range is 180,000 USD - 339,250 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.



  • Santa Clara, United States NVIDIA Full time

    NVIDIA is the platform upon which every new AI-powered application is built. We are seeking an SRE Manager to build and manage SREs who monitor and operate both the factory automation for NVIDIA Inference Microservices (NIMs) and its deployed services. The right person for this role brings leadership that encourages the team's technical drive and creativity...


  • Santa Clara, California, United States Sage Lake Senior Living Full time

    About the Role: We are seeking a seasoned Senior SRE Engineer to join our team at Sage Lake Senior Living, where you will play a critical role in ensuring the high availability and performance of our AI-powered applications. Key Responsibilities: Operate and improve the observability and maintainability of our distributed microservice cloud applications and...


  • Santa Clara, California, United States Sage Lake Senior Living Full time

    About the Role: We are seeking a seasoned Senior SRE Engineer to join our team at Sage Lake Senior Living, where you will play a critical role in monitoring and operating our NVIDIA Inference Microservices (NIMs) factory automation and deployed services. Key Responsibilities: Operate a software factory that takes an AI model as input and produces a deployable...

  • Senior SRE Engineer

    17 hours ago


    Santa Clara, United States Trillium Staffing Full time

    Trillium Professional is now seeking Senior SRE Engineers in Santa Clara, CA! Pay rate is $75 - $90/hour, depending on experience. Our client is looking for a seasoned SRE to join its multifaceted and fast-paced Infrastructure, Planning and Processes organization where you will be working as a Senior SRE Engineer. The position will be part of a fast-paced...

  • Senior SRE Engineer

    3 weeks ago


    Santa Clara, United States NVIDIA Full time

    NVIDIA is looking for a seasoned SRE to join its complex and fast-paced Infrastructure, Planning and Processes organization where you will be working as a Senior SRE Engineer. The position will be part of a fast-paced crew that develops and maintains NVIDIA's sophisticated internal Jenkins-based CI/CD product for GPUs and Tegra systems. The team works with...

  • Senior SRE Engineer

    1 week ago


    Santa Clara, United States NVIDIA Full time

    NVIDIA is looking for a seasoned SRE to join its multifaceted and fast-paced Infrastructure, Planning and Processes organization where you will be working as a Senior SRE Engineer. The position will be part of a fast-paced crew that develops and maintains NVIDIA’s internal cloud provisioning product for GPUs and Tegra systems. The team works with various...

  • Senior SRE Engineer

    2 months ago


    Santa Clara, United States NVIDIA Full time

    NVIDIA is looking for a seasoned SRE to join its multifaceted and fast-paced Infrastructure, Planning and Processes organization where you will be working as a Senior SRE Engineer. The position will be part of a fast-paced crew that develops and maintains NVIDIA’s sophisticated internal cloud provisioning product for GPUs and Tegra systems. The team works...

  • Senior Network SRE

    1 week ago


    Santa Clara, United States TekWissen LLC Full time

    Job Description. Overview: TekWissen Group is a workforce management provider throughout the USA and many other countries in the world. Our client is an American multinational information technology services and consulting company and is a leading provider of information technology, consulting, and business process outsourcing services,...


  • Santa Clara, United States NVIDIA Full time

    Senior SWQA Test Development Engineer - NIM. Location: US, CA, Santa Clara. Time type: Full time. Posted: Yesterday. Job requisition ID: JR1986946. NVIDIA is the world leader in GPU Computing. We are passionate about markets including gaming, automotive, professional vision, HPC, datacenters, and networking in addition to our traditional OEM...

  • Senior Manager

    4 weeks ago


    Santa Clara, United States NVIDIA Full time

    As a Sr Manager in Site Reliability Engineering (SRE), you will lead a team dedicated to the design, construction, and maintenance of expansive production systems, emphasizing high efficiency and availability. This role spans various domains, including software and systems engineering, cloud-scale storage, data management, and services. SRE Senior Managers...

  • Sr. SRE Engineer

    3 months ago


    Santa Clara, United States TCWGlobal Full time

    Sr. SRE Engineer. W2 Contract to Possible Hire. Hybrid, Santa Clara, CA. $75-90/hr + PTO, Paid Holidays, Benefits. We are looking for a seasoned SRE to join our multifaceted and fast-paced Infrastructure, Planning and Processes organization where you will be working as a Senior SRE Engineer. The position will be part of a fast-paced crew that develops and...

  • Senior Manager

    3 months ago


    Santa Clara, United States NVIDIA Full time

    As a Sr Manager in Site Reliability Engineering (SRE), you will lead a team dedicated to the design, construction, and maintenance of expansive production systems, emphasizing high efficiency and availability. This role spans various domains, including software and systems engineering, cloud-scale storage, data management, and services. SRE Senior Managers...

  • Sr. SRE Engineer

    2 months ago


    Santa Clara, United States TCWGlobal Full time

    Job Description: Sr. SRE Engineer. W2 Contract to Possible Hire. Hybrid, Santa Clara, CA. $75-90/hr + PTO, Paid Holidays, Benefits. We are looking for a seasoned SRE to join our multifaceted and fast-paced Infrastructure, Planning and Processes organization where you will be working as a Senior SRE Engineer. The position will be part of a fast-paced...

  • Sr. SRE Engineer

    3 months ago


    Santa Clara, United States TCWGlobal Full time

    Job Description: Sr. SRE Engineer. W2 Contract to Possible Hire. Hybrid, Santa Clara, CA. $75-90/hr + PTO, Paid Holidays, Benefits. We are looking for a seasoned SRE to join our multifaceted and fast-paced Infrastructure, Planning and Processes organization where you will be working as a Senior SRE Engineer. The position will be part of a fast-paced...


  • Santa Clara, United States NVIDIA Full time

    Site Reliability Engineering (SRE) is an engineering discipline that involves designing, building, and maintaining large-scale production systems with high efficiency and availability. It encompasses various areas, including software and systems engineering practices, storage, data management, and services. SRE professionals are highly specialized and...


  • Santa Clara, United States Diverse Lynx Full time

    Skills: Site Reliability Engineering (SRE), Git (Bitbucket), Jenkins, AWS CodeBuild, AWS CodeDeploy. Job Description: AWS application and CI/CD pipelines, Microsoft Server admin and workload support (data center and AWS). Initial responsibility is application platform promotion to controlled environments for test, staging, and production AWS accounts...


  • Santa Clara, California, United States ServiceNow Full time

    Job Description. Overview: The ServiceNow SRE team is a group of highly technical engineers who are tasked with maintaining and developing the reliability, scalability, and performance of the ServiceNow cloud infrastructure. Key Responsibilities: Provide relief and sustainable resolution to issues within our infrastructure. Use expertise in software development,...


  • Santa Clara, California, United States ServiceNow Full time

    Company Overview: At ServiceNow, we harness technology to create a better world for everyone, driven by our talented workforce. We prioritize speed and innovation to meet the demands of our customers and communities. Joining ServiceNow means becoming part of a dynamic team of innovators who possess a relentless curiosity and a commitment to creativity. We...


  • Santa Clara, California, United States ServiceNow Full time

    Company Overview: At ServiceNow, we harness technology to enhance global operations, and our dedicated workforce makes it all possible. We operate swiftly because the world demands it, innovating uniquely for our clients and communities. By becoming part of ServiceNow, you join a dynamic team of innovators who possess a relentless curiosity and a passion for...


  • Santa Clara, United States NVIDIA Full time

    Senior Site Reliability Engineer, Data Science and ML Platforms. Are you passionate about building and maintaining large-scale production systems that support advanced data science and machine learning applications? Do you want to join a team at the heart of NVIDIA's data-driven decision-making culture? If so, we have a great opportunity for you! NVIDIA is...