Site Reliability Engineering Manager

2 months ago


Santa Clara, California, United States Promote Project Full time

About the Company: Promote Project is at the forefront of innovation, leveraging cutting-edge technology to redefine the landscape of AI and computing. Our mission is to harness the power of advanced computing to create transformative solutions that impact various industries.

Position Overview: We are seeking a Manager of Site Reliability Engineering to spearhead our cloud service team, dedicated to supporting, diagnosing, and developing generative AI-driven visual applications. As a Site Reliability Engineer (SRE), you will oversee the interconnectedness of our systems, employing a diverse array of tools and methodologies to address a wide range of challenges. Our SRE practices are essential to maintaining product excellence, emphasizing proactive measures, thorough postmortems, and continuous enhancements, ensuring a dynamic and engaging work environment.

Key Responsibilities:
  1. Build and mentor a team of SREs, fostering an environment of collaboration and innovation.
  2. Promote a culture of continuous improvement within the SRE team.
  3. Lead the team that manages and supports groundbreaking generative AI workloads across a globally distributed network, encompassing numerous edge locations and major cloud service providers.
  4. Optimize performance and availability on both current and future GPU architectures.
  5. Collaborate with service owners, architecture, research, and tools teams to achieve optimal outcomes for AI challenges.
  6. Participate in an on-call rotation, monitoring and supporting critical high-performance services across multi-cloud environments.
  7. Communicate service KPIs, priorities, and issues to leadership while driving effective incident responses.
  8. Work alongside security teams to implement best practices and ensure compliance with relevant standards.
Qualifications:
  1. Master's or PhD in an engineering or computer science-related discipline, or equivalent experience.
  2. A minimum of 8 years of experience managing end-to-end availability and performance of significant services in a live production environment, either as an SRE or Service Owner.
  3. At least 6 years of technical leadership experience, including project scoping, requirements gathering, and influencing multiple engineering teams.
  4. Experience leading engineering initiatives with a focus on cloud technologies (AWS/Azure/GCP/OCI), coding, networking, operating systems, and storage.
  5. Strong understanding of containerization and microservices architecture, particularly Kubernetes.
  6. Extensive knowledge of the Kubernetes ecosystem and best practices.
  7. Experience leading significant production activities, including change management, postmortem reviews, and software automation across various programming languages (Python, Golang) and technologies (CI/CD auto-remediation, alert correlation).
  8. Proficiency with SLOs/SLIs, error budgeting, KPIs, and configurations for complex services.
Preferred Qualifications:
  1. Experience with containerization and cloud-based deployments for AI models.
  2. Strong coding skills in Python, Go, or similar languages.
  3. Prior experience managing production issues and providing on-call support.
  4. Knowledge of Deep Learning, Machine Learning, and AI.
  5. Familiarity with CUDA, PyTorch, TensorRT, TensorFlow, and/or Triton.

Compensation: Promote Project offers competitive salaries and a comprehensive benefits package, making it one of the most sought-after employers in the technology sector. Our teams consist of some of the most innovative minds in the industry, working in impactful fields such as Deep Learning, Artificial Intelligence, and Autonomous Vehicles.

If you are a creative engineer who thrives in an autonomous environment and shares our passion for technology, we encourage you to explore this opportunity. Your base salary will be determined based on your experience and the compensation of employees in similar roles. Equity and benefits are also part of the compensation package.

Promote Project is committed to fostering a diverse work environment and is proud to be an equal opportunity employer. We value diversity in our workforce and do not discriminate based on any characteristic protected by law.

At Promote Project, we are dedicated to pioneering accelerated computing to tackle challenges that others cannot. Our work in AI and the metaverse is transforming industries and making a profound impact on society.