Site Reliability Engineering Manager, AI Platform

2 weeks ago


San Jose, California, United States Adobe Full time
About the Role

We are seeking an exceptional Site Reliability Engineering Manager to lead our team in driving reliability for Adobe's AI Inference Platform, Adobe Firefly. As a key member of our Engineering organization, you will be responsible for developing a team of Site Reliability Engineers who will work closely with our Engineering teams to build, scale, and secure the AI Platform.

Key Responsibilities
  • Guide the technical vision and roadmap for AI Platform Inference infrastructure.
  • Grow and lead a team of dedicated Site Reliability Engineers.
  • Engage with the Firefly Engineering and Firefly App Integrations teams to understand their needs and goals and drive the platform's reliability.
  • Identify and implement methodologies and solutions to increase reliability, scalability, security, and efficiency.
  • Ensure the highest uptime and Quality of Service (QoS) for Adobe's customers through operational excellence.
  • Define service level objectives (SLOs) and indicators (SLIs) to represent and measure service quality.
  • Support and maintain globally distributed, multi-cloud (public and/or private) environments.
  • Automate common, repeatable tasks at a large scale to streamline operational procedures.
  • Identify areas to improve service resiliency through techniques such as chaos engineering, performance/load testing, etc.
  • Coordinate with other Adobe platform teams and service providers (primarily AWS) to innovate on Generative AI as a Service.
  • Ensure inference services improve GPU utilization, scale models independently, and optimize cost of goods sold (COGS).
Requirements
  • A BS or MS degree in Computer Science, Electrical Engineering, a related field, or equivalent industry experience.
  • 3+ years of experience as an Engineering Manager.
  • Experience in building and scaling distributed systems, as well as experience with containerization and orchestration technologies like Kubernetes.
  • Strong communication and collaboration skills - building strong relationships with internal customers and external partners.
  • Dedication to teamwork, self-organization, and continuous improvement.
  • A track record of leading high-performance teams to deliver results in the fast-paced, dynamic environment of AI infrastructure.
  • Production-level expertise with container orchestration engines (e.g. Kubernetes) and demonstrated understanding of modern, continuous development techniques and pipelines (IaC, CI/CD, ArgoCD, Git).
  • Fundamental programming skills, ideally practical experience in one (and preferably more) of the following languages: Python, Go or Java.
  • An understanding of AI/ML, including ML frameworks, public cloud, and commercial AI/ML solutions; familiarity with PyTorch, SageMaker, Hugging Face, NVIDIA TensorRT or OpenAI Triton is a plus.
About Adobe

Adobe is proud to be an Equal Employment Opportunity and affirmative action employer. We do not discriminate based on gender, race or color, ethnicity or national origin, age, disability, religion, sexual orientation, gender identity or expression, veteran status, or any other applicable characteristics protected by law.

Adobe aims to make its website and application process accessible to any and all users. If you have a disability or special need that requires accommodation to navigate our website or complete the application process, please call us at [phone number].



  • San Jose, California, United States Adobe Full time

    About the Role: We're seeking an exceptional Site Reliability Engineering Manager to lead our AI Platform Inference Infrastructure team at Adobe. As a key member of our organization, you'll be responsible for driving reliability, scalability, and security for our AI Inference Platform, Adobe Firefly. Key Responsibilities: Develop and execute the technical vision...


  • San Jose, California, United States Adobe Full time

    About the Role: We are seeking an exceptional Site Reliability Engineer to join our team at Adobe, working on the AI Training Platform, Adobe Firefly. As a key member of our team, you will collaborate closely with Engineering teams to build, scale, and secure the AI Platform, enabling Firefly product teams to easily manage and deploy Machine Learning...


  • San Jose, California, United States Trianz Full time

    About Trianz: Trianz is a leading-edge technology platforms and services company that accelerates digital transformations at Fortune 100 and emerging companies worldwide in data & analytics, digital experiences, cloud infrastructure, and security. Our Vision: We believe that companies around the world face three challenges in their digital transformation journeys...


  • San Francisco, California, United States Snorkel AI, Inc. Full time

    About the Role: We are seeking an experienced Engineering Manager to lead our AI Platform team at Snorkel AI, Inc. This is a unique opportunity to join a cutting-edge technology company and contribute to the development of innovative AI solutions. Key Responsibilities: Lead a team of talented engineers to design, develop, and deploy large-scale data-focused AI...


  • San Francisco, California, United States Perplexity AI Full time

    Site Reliability Engineer: Perplexity AI is seeking a skilled Site Reliability Engineer to join our team and contribute to the development of our cutting-edge conversational answer engine. As a Site Reliability Engineer, you will be responsible for designing, implementing, and scaling the infrastructure and systems that support our web and mobile products. Key...


  • San Jose, California, United States TikTok Full time

    Job Title: Site Reliability Engineer, Data Platform. TikTok is a leading destination for short-form mobile video, and our mission is to inspire creativity and bring joy. Our platform is built to help imaginations thrive, and we're looking for a Site Reliability Engineer to join our Data Platform team. Responsibilities: Ensure the reliability of all TikTok's...


  • San Francisco, California, United States Mistral AI Full time

    About Mistral AI: Mistral AI is a cutting-edge technology company dedicated to making AI ubiquitous and open. Our mission is to drive innovation and excellence in the field of artificial intelligence. We are a tight-knit, nimble team that thrives in a competitive environment, and we're passionate about AI. Job Summary: We're seeking an experienced Cloud...


  • San Diego, California, United States Platform Science Full time

    About Us: At Platform Science, we're revolutionizing the way businesses connect and interact with the world around them. Our open IoT platform empowers innovative fleets, application developers, and equipment providers to deliver cutting-edge solutions to supply chain professionals globally. The Role: We're seeking a highly skilled Senior Site Reliability...


  • San Diego, California, United States Platform Science Full time

    About the Role: We are seeking a highly skilled Senior Site Reliability Engineer to join our team in San Diego, CA (or remote). As a key member of our SRE team, you will be responsible for ensuring the reliability and performance of our cloud-based platform. Key Responsibilities: Develop and enhance CI/CD pipelines to streamline application deployment and...


  • San Francisco, California, United States DataRobot Full time

    Job Title: Director of Site Reliability Engineering. DataRobot is the leader in Value-Driven AI, a unique and collaborative approach to generative and predictive AI that combines an open platform, deep expertise, and broad use-case experience to improve how organizations run, grow, and optimize their business. The DataRobot AI Platform is the only complete AI...


  • San Francisco, California, United States Descript Full time

    About Descript: Descript is a cutting-edge technology company that's revolutionizing the way we create and edit audio and video content. We're a team of innovators who are passionate about harnessing the power of AI to make content creation faster, easier, and more accessible. Job Title: Engineering Manager - AI Platform. We're seeking an experienced Engineering...

  • AI Engineering Lead

    1 month ago


    San Francisco, California, United States Snorkel AI, Inc. Full time

    Position Overview: We are seeking an Engineering Director to spearhead our AI Platform division. This team is responsible for developing cutting-edge software solutions that drive the Snorkel Flow platform. The focus includes creating services for training and deploying generative AI and machine learning models, utilizing innovative data-centric methodologies,...

  • AI Engineering Lead

    1 month ago


    San Francisco, California, United States Snorkel AI, Inc. Full time

    Position Overview: We are seeking a Director of Engineering to spearhead our AI Platform division. This team is responsible for developing cutting-edge software systems that enhance the Snorkel Flow platform. Responsibilities include creating services for training and deploying generative AI and machine learning models, utilizing innovative data-centric...

  • AI Engineering Lead

    1 month ago


    San Francisco, California, United States Snorkel AI, Inc. Full time

    Position Overview: We are seeking an experienced Director of Engineering to spearhead our AI Platform division. This team is responsible for developing cutting-edge software systems that drive the Snorkel Flow platform. Key responsibilities include creating services for training and deploying generative AI and machine learning models utilizing innovative...


  • San Jose, California, United States Adobe Full time

    Job Summary: We are seeking a highly skilled Senior Engineering Manager to lead the development of our AI Inference Platform at Adobe. As a key member of our team, you will be responsible for driving the architecture, design, development, and testing of the platform. Your primary goal will be to enable the Firefly Product Team to easily run and deploy ML...


  • San Francisco, California, United States Instabase Full time

    About Instabase: Instabase is a cutting-edge AI innovation company that empowers organizations to solve complex unstructured data problems. With a global presence and a customer-centric approach, we deliver top-tier solutions that provide unmatched advantages for everyday business operations. Job Title: Site Reliability Engineer. We are seeking a highly skilled...


  • San Francisco, California, United States Forsyth Barnes Full time

    Job Title: Site Reliability Engineering Manager. About the Role: Forsyth Barnes is seeking a highly skilled Site Reliability Engineering Manager to join our team. As a key member of our infrastructure team, you will be responsible for ensuring the reliability and performance of our cloud-based services. Key Responsibilities: Monitor and optimize cloud capacity...


  • San Francisco, California, United States Snorkel AI, Inc. Full time

    Position Overview: We are seeking a Director of Engineering to oversee our AI Platform division. This team is responsible for developing cutting-edge software systems that drive the Snorkel Flow platform. Responsibilities include creating services for training and deploying generative AI and machine learning models, utilizing innovative data-centric...