Sr. Site Reliability Engineer

3 months ago


Santa Clara, United States TCWGlobal Full time

Sr. SRE Engineer

W2 Contract to Possible Hire

Hybrid, Santa Clara, CA

$75-90/hr + PTO, Paid Holidays, Benefits


We are looking for a seasoned SRE to join our multifaceted, fast-paced Infrastructure, Planning and Processes organization as a Senior SRE. You will be part of a team that develops and maintains our internal cloud provisioning product for GPUs and Tegra systems.


The team works with other business units, such as Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence and Driverless Cars, to meet their infrastructure and systems needs.


What you’ll be doing:

  • Work on systems deployed in our internal cloud, keeping them available and reliable for our end users.
  • Monitor system performance and troubleshoot issues related to CPU, memory, disk, and network utilization.
  • Provide high-quality user support.
  • Monitor KPIs and make sure the team’s SLAs are met.
  • Manage and maintain production Kubernetes clusters.
  • Drive automation of monitoring to gain more insight into application and system health.
  • Design and develop the tools needed to automate workflows.
  • Develop, improve and maintain our infrastructure codebase.
  • Define and implement critical metrics using various analytics methods and dashboards.
  • Take part in prototyping, designing, and developing cloud infrastructure.
  • Apply AI techniques to extract useful signals about machines and jobs from the data generated.


What we need to see:

  • Experience maintaining cloud infrastructure and highly available production environments.
  • Experience managing systems installed in data centers. Proficient with BMC (Redfish), KVM, and IPMI tools.
  • Working knowledge of OpenStack.
  • Background in SQL databases such as MySQL and time-series databases such as Prometheus.
  • Strong knowledge of networking principles and protocols, including TCP/IP, DNS, DHCP, and VLANs.
  • Experience with data analytics/visualization tools like Kibana, Grafana, Splunk etc.
  • Strong Ansible skills. Experience with Ansible AWX.
  • Strong background with Jenkins and/or other CI/CD systems.
  • Proficient with Kubernetes, Docker and virtualization.
  • Proficient using source code management and binary repository systems like GitLab, GitHub, Artifactory, Perforce etc.
  • Knowledge of monitoring systems such as Zabbix, Prometheus, PagerDuty and/or similar systems.
  • Advanced knowledge of security best practices.
  • 5+ years of proven experience.
  • Bachelor's degree in Computer Science, Information Technology, or related field, or equivalent experience.


Ways to stand out from the crowd:

  • Previous experience with SRE teams managing on-prem infrastructure.
  • Experience managing hardware like GPUs and Tegras.
  • Thrives in a multi-tasking environment with constantly evolving priorities.
  • Prior experience with a large-scale operations team.
  • Experience with Windows server infrastructure.
  • Outstanding interpersonal skills and communication with all levels of management.
  • Experience using and improving data centers.
  • Ability to break complex problems down into simple subproblems and reuse available solutions to solve most of them.
  • Ability to design simple systems that work efficiently without needing much support.