Sr. Site Reliability Engineer

4 weeks ago


Santa Clara, United States TCWGlobal Full time

Sr. SRE Engineer

W2 Contract to Possible Hire

Hybrid, Santa Clara, CA

$75-90/hr + PTO, Paid Holidays, Benefits


We are looking for a seasoned SRE to join our multifaceted, fast-paced Infrastructure, Planning and Processes organization as a Senior SRE Engineer. You will be part of a team that develops and maintains our internal cloud provisioning product for GPUs and Tegra systems.


The team works with other business units, such as Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence, and Driverless Cars, to meet their infrastructure and systems needs.


What you’ll be doing:

  • Work on systems deployed in our internal cloud, keeping them available and reliable for our end users.
  • Monitor system performance and troubleshoot issues related to CPU, memory, disk, and network utilization.
  • Provide high-quality user support.
  • Monitor KPIs and make sure the team’s SLAs are met.
  • Manage and maintain production Kubernetes clusters.
  • Drive automation of monitoring to gain more insight into application and system health.
  • Craft and develop tools for automating workflows.
  • Develop, improve, and maintain our infrastructure codebase.
  • Craft and implement critical metrics using various analytics methods and dashboards.
  • Take part in prototyping, crafting, and developing cloud infrastructure.
  • Apply AI techniques to extract useful signals about machines and jobs from the data generated.


What we need to see:

  • Experience maintaining cloud infrastructure and highly available production environments.
  • Experience managing systems installed in data centers. Proficient with BMC (Redfish), KVM, and IPMI tools.
  • Working knowledge of OpenStack.
  • Background in SQL databases such as MySQL and time-series databases such as Prometheus.
  • Strong knowledge of networking principles and protocols, including TCP/IP, DNS, DHCP, and VLANs.
  • Experience with data analytics/visualization tools like Kibana, Grafana, Splunk etc.
  • Strong Ansible skills. Experience with Ansible AWX.
  • Strong background with Jenkins and/or other CI/CD systems.
  • Proficient with Kubernetes, Docker, and virtualization.
  • Proficient using source code management and binary repository systems like GitLab, GitHub, Artifactory, Perforce etc.
  • Knowledge of monitoring systems such as Zabbix, Prometheus, PagerDuty and/or similar systems.
  • Advanced knowledge of standard methodologies related to security.
  • 5+ years of proven experience.
  • Bachelor's degree in Computer Science, Information Technology, or related field, or equivalent experience.


Ways to stand out from the crowd:

  • Previous experience with SRE teams managing on-prem infrastructure.
  • Experience managing hardware like GPUs and Tegras.
  • Thrives in a multi-tasking environment with constantly evolving priorities.
  • Prior experience on a large-scale operations team.
  • Experience with Windows server infrastructure.
  • Outstanding interpersonal skills and communication with all levels of management.
  • Experience using and improving data centers.
  • Ability to break complex problems into simple subproblems and reuse available solutions to solve most of them.
  • Ability to design simple systems that can work efficiently without needing much support.




