Site Reliability Engineering Manager

2 weeks ago


Santa Clara, California, United States Promote Project Full time

About the Company: Promote Project is at the forefront of innovation, focusing on redefining technology and enhancing the capabilities of AI. We are dedicated to creating groundbreaking solutions that push the boundaries of what is possible in computing.

Position Overview: We are seeking a Manager for Site Reliability Engineering to lead our cloud service team. This role involves supporting, troubleshooting, and developing generative AI-driven visual applications. As an SRE, you will oversee how our systems interconnect, using a diverse range of tools and methodologies to address various challenges. Our SRE practices emphasize product quality through proactive measures: minimizing reactive operational tasks, conducting blameless postmortems, identifying potential outages, and implementing iterative enhancements.

Key Responsibilities:
  1. Build and mentor a team of SREs, guiding them toward achieving collective objectives.
  2. Foster a collaborative, innovative, and continuously improving culture within the SRE team.
  3. Lead the team that manages and supports pioneering Generative AI inferencing workloads across a globally distributed environment.
  4. Ensure optimal performance and availability on both current and future GPU architectures.
  5. Work closely with service owners, architecture, research, and tools teams to achieve optimal results for AI-related challenges.
  6. Participate in an on-call rotation to monitor and support critical high-performance services across multiple cloud platforms.
  7. Communicate service KPIs, priorities, and issues to leadership while facilitating effective incident responses.
  8. Collaborate with security teams to uphold security best practices and compliance with relevant standards.
Qualifications:
  1. Advanced degree (MS or PhD) in engineering or computer science, or equivalent experience.
  2. Over 8 years of experience managing the availability and performance of critical services in a live production environment.
  3. At least 6 years of technical leadership experience, including project scoping, requirements gathering, and leading multiple engineering teams.
  4. Experience in leading engineering projects with a focus on cloud technologies (AWS/Azure/GCP/OCI), coding, networking, and operating systems.
  5. Strong understanding of containerization and microservices architecture, particularly Kubernetes.
  6. In-depth knowledge of the Kubernetes ecosystem and best practices.
  7. Experience in managing production activities, including change management and software automation across various programming languages.
  8. Expertise in SLOs/SLIs, error budgeting, and configuring complex services.
Preferred Qualifications:
  1. Experience with containerization and cloud-based deployments for AI models.
  2. Proficient coding skills in Python, Go, or similar languages.
  3. Prior experience in addressing production issues and providing on-call support.
  4. Understanding of Deep Learning, Machine Learning, and AI technologies.
  5. Familiarity with CUDA, PyTorch, TensorRT, TensorFlow, and/or Triton.

Compensation: Promote Project offers competitive salaries and a comprehensive benefits package, making it one of the most sought-after employers in the technology sector. Our teams are composed of some of the most innovative minds in the industry, working in impactful fields such as Deep Learning, Artificial Intelligence, and Autonomous Vehicles.

If you are a creative engineer who values autonomy and shares our passion for technology, we encourage you to explore this opportunity with us.

Diversity Commitment: Promote Project is committed to fostering a diverse work environment and is proud to be an equal opportunity employer. We value diversity in our workforce and do not discriminate based on any characteristic protected by law.

Company Vision: Promote Project is a leader in accelerated computing, tackling challenges that others cannot. Our advancements in AI and technology are transforming industries and making a significant impact on society.



  • Santa Clara, California, United States Diverse Lynx Full time

    About the Role: We are seeking a highly skilled Site Reliability Engineer to join our team at Diverse Lynx LLC. As a Site Reliability Engineer, you will play a critical role in ensuring the reliability, scalability, and performance of our cloud-based applications and infrastructure. Key Responsibilities: Design, implement, and maintain cloud infrastructure on...


  • Santa Clara, California, United States Promote Project Full time

    About the Company: Promote Project is at the forefront of innovation, leveraging cutting-edge technology to redefine the landscape of AI and computing. Our mission is to harness the power of advanced computing to create transformative solutions that impact various industries. Position Overview: We are seeking a Manager of Site Reliability Engineering to...


  • Santa Clara, California, United States Promote Project Full time

    About Promote Project: Promote Project is a leader in innovative technology solutions, dedicated to pushing the boundaries of what is possible in the realm of artificial intelligence and cloud computing. Our commitment to excellence is reflected in our talented workforce and our pursuit of groundbreaking advancements. Position Overview: We are seeking a...


  • Santa Clara, California, United States Centrify Corporation Full time

    About Centrify Corporation: Centrify Corporation is a leading provider of cloud-based identity and access management solutions. Our software runs on public clouds with 99.9% or better uptime and is mission critical for our customers. Job Summary: We are seeking a highly skilled Cloud Site Reliability Engineer to join our Cloud DevOps team. As a Cloud Site...


  • Santa Clara, California, United States Veear Full time

    About the Role: We are seeking a highly skilled Site Reliability Engineer to join our team at Veear. As a key member of our infrastructure team, you will play a critical role in ensuring the reliability, scalability, and security of our cloud-based systems. Key Responsibilities: Collaboration and Partnership: Partner with cross-functional teams to ensure security...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Job Overview: Company Overview: To comply with U.S. federal government requirements, U.S. citizenship is required for this position. Our Mission: At Palo Alto Networks, our mission is clear: To be the cybersecurity partner of choice, safeguarding our digital existence. We envision a world where each day is safer and more secure than the last. Our foundation is built...


  • Santa Clara, California, United States ServiceNow Full time

    Company Overview: At ServiceNow, we harness technology to create a better world for everyone, driven by our talented workforce. We prioritize speed and innovation to meet the demands of our customers and communities. Joining ServiceNow means becoming part of a dynamic team of innovators who possess a relentless curiosity and a commitment to creativity. We...


  • Santa Clara, California, United States ServiceNow Full time

    Company Overview: At ServiceNow, we harness technology to enhance global operations, and our dedicated workforce makes it all possible. We operate swiftly because the world demands it, innovating uniquely for our clients and communities. By becoming part of ServiceNow, you join a dynamic team of innovators who possess a relentless curiosity and a passion for...


  • Santa Clara, California, United States Nvidia Full time

    NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables unique creativity and discovery, and powers what were...


  • Santa Clara, California, United States Nvidia Full time

    NVIDIA, a prominent player in the realms of Artificial Intelligence, High-Performance Computing, and Visualization, is on the lookout for a Lead Site Reliability Engineer specializing in HPC storage systems. This role involves collaborating with our team to architect, implement, and enhance on-premises HPC storage solutions while integrating cloud...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About the Role: Palo Alto Networks is seeking a highly skilled Principal Site Reliability Engineer to join our team. As a key member of our Global Customer Operation Team, you will be responsible for designing, building, maintaining, and scaling production services and server farms within our FedRAMP SASE product portfolio. Key Responsibilities: Design and...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Company Overview: Palo Alto Networks is dedicated to its mission of being the cybersecurity partner of choice, safeguarding our digital existence. Our vision is to create a world that is increasingly secure and safe. We are a company that thrives on innovation and challenges the conventional ways of operating. We seek forward-thinking individuals who are...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Company Overview: Palo Alto Networks is driven by a mission to be the cybersecurity partner of choice, safeguarding our digital lifestyle. Our vision is to create a world that is increasingly secure and safe. We are built on the principles of innovation and disruption, seeking individuals who are passionate about shaping the future of cybersecurity. Work...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Company Overview: Palo Alto Networks is dedicated to our mission of being the cybersecurity partner of choice, ensuring the safety of our digital lives. Our vision is to create a world that is increasingly secure and resilient. We pride ourselves on challenging the conventional approaches to cybersecurity and are in search of innovative thinkers who are eager...


  • Santa Clara, California, United States Omnivision Technologies Full time

    Qualifications: Bachelor's degree in Physics, Electrical Engineering, Materials Science, or a related engineering field, with coursework focused on semiconductor physics and electronics. Familiarity with electronic component reliability standards such as JEDEC and AEC-Q100 is advantageous. Experience in wafer-level reliability testing is also beneficial. Key...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Company Overview: Palo Alto Networks is dedicated to its mission of being the cybersecurity partner of choice, safeguarding our digital lifestyle. Our vision is to create a world that is increasingly safe and secure. We are built on the principles of challenging the norm and disrupting conventional practices. We seek innovators who are committed to shaping the...


  • Santa Clara, California, United States OMNIVISION Full time

    Job Overview We are seeking a Staff Reliability Engineer to join our team at OMNIVISION. The ideal candidate will possess a strong educational background and relevant experience in the field of reliability engineering. Qualifications: A Bachelor’s degree in Physics, Electrical Engineering, Materials Science, or a related engineering field, with...


  • Santa Clara, California, United States Omnivision Technologies Full time

    Qualifications: A Bachelor’s degree in Physics, Electrical Engineering, Materials Science, or a related engineering field, with coursework focused on semiconductor physics and electronics is required. Familiarity with electronic component reliability standards such as JEDEC and AEC-Q100 is advantageous. Experience in wafer-level reliability testing is also...


  • Santa Clara, California, United States Omnivision Technologies Full time

    Qualifications: A Bachelor’s degree in Physics, Electrical Engineering, Materials Science, or a related engineering field, with coursework in semiconductor physics and electronics is required. Familiarity with electronic component reliability standards such as JEDEC and AEC-Q100 is advantageous. Experience in wafer-level reliability testing is also...

