Site Reliability Engineer, AI Platform Training

6 days ago


San Jose, California, United States Adobe Full time
About the Role

We are seeking an exceptional Site Reliability Engineer to join our team at Adobe, working on the AI Training Platform, Adobe Firefly. As a key member of our team, you will collaborate closely with Engineering teams to build, scale, and secure the AI Platform, enabling Firefly product teams to easily manage and deploy Machine Learning capabilities used by Adobe client applications.

Key Responsibilities
  • Identify and implement methodologies and solutions to increase reliability, scalability, security, and efficiency.
  • Ensure the highest uptime and Quality of Service (QoS) for Adobe's customers through operational excellence.
  • Define service level objectives (SLOs) and indicators (SLIs) to represent and measure service quality.
  • Support and maintain globally distributed, multi-cloud (public and/or private) environments.
  • Automate common, repeatable tasks at a large scale to streamline operational procedures.
  • Identify areas to improve service resiliency through techniques such as chaos engineering, performance/load testing, etc.
  • Coordinate with other Adobe platform teams and service providers (primarily AWS) to innovate on Generative AI as a Service.
Requirements
  • A Bachelor's or Master's degree in Computer Science, Electrical Engineering, or a related field, and 5+ years of relevant industry experience.
  • Experience building and scaling distributed systems.
  • Production-level expertise with container orchestration engines (e.g., Kubernetes) and a proven understanding of modern continuous development techniques and pipelines (IaC, CI/CD, ArgoCD, Git).
  • Fundamental programming skills, ideally with practical experience in one (and preferably both) of the following languages: Python, Go.
  • Good knowledge of infrastructure configuration management tools like Ansible and Terraform.
  • Experience in using observability and tracing-related tools like InfluxDB, Prometheus, and Elastic Stack.
  • An understanding of AI/ML, including ML frameworks, public cloud, and commercial AI/ML solutions. Familiarity with PyTorch, SageMaker, Hugging Face, NVIDIA TensorRT, or OpenAI Triton is a plus.
What We Offer

At Adobe, we offer a competitive compensation package, including a base salary and short-term incentives. Certain roles may be eligible for long-term incentives in the form of a new hire equity award. We are an equal opportunity employer and welcome applicants from diverse backgrounds. If you have a disability or special need that requires accommodation to navigate our website or complete the application process, please email or call (408).

We are committed to creating a workplace where everyone is respected and has access to equal opportunity. We realize that new ideas can come from everywhere in the organization, and we know the next big idea could be yours.