Site Reliability Engineering Manager, AI Platform

4 weeks ago


San Jose, California, United States Adobe Full time

About the Role:

We are seeking an experienced Site Reliability Engineering Manager to lead our AI Inference Platform team at Adobe. As a key member of our Engineering organization, you will be responsible for developing and implementing strategies to ensure the reliability, scalability, and security of our AI Platform.

Key Responsibilities:

* Develop and execute technical vision and roadmap for AI Platform Inference infrastructure
* Grow and lead a team of dedicated site reliability engineers
* Engage with the Firefly Engineering and Firefly App Integrations teams to understand their needs and goals and drive the platform's reliability
* Identify and implement methodologies and solutions to increase reliability, scalability, security, and efficiency
* Ensure the highest uptime and Quality of Service (QoS) for Adobe's customers through operational excellence
* Define service level objectives (SLOs) and indicators (SLIs) to represent and measure service quality
* Support and maintain globally distributed, multi-cloud (public and/or private) environments
* Automate common, repeatable tasks at a large scale to streamline operational procedures
* Identify areas to improve service resiliency through techniques such as chaos engineering and performance/load testing
* Coordinate with other Adobe platform teams and service providers (primarily AWS) to innovate on Generative AI as a Service
* Ensure inference services improve GPU utilization, scale models independently, and optimize cost of goods sold (COGS)

Requirements:

* BS or MS degree in Computer Science, Electrical Engineering, a related field, or equivalent industry experience
* 3+ years of experience as an Engineering Manager
* Strong communication and collaboration skills, including building strong relationships with internal customers and external partners
* Dedication to teamwork, self-organization, and continuous improvement
* A track record of leading high-performance teams to deliver results in a fast-paced and dynamic environment of AI infrastructure
* Production-level expertise with container orchestration engines (e.g., Kubernetes) and a demonstrated understanding of modern continuous development techniques and pipelines (IaC, CI/CD, ArgoCD, Git)
* Fundamental programming skills, ideally practical experience in one (and preferably more) of the following languages: Python, Go or Java
* An understanding of AI/ML, including ML frameworks, public cloud, and commercial AI/ML solutions; familiarity with PyTorch, SageMaker, Hugging Face, NVIDIA TensorRT, or OpenAI Triton is a plus

What We Offer:

* Competitive compensation package
* Opportunity to work with a talented team of engineers
* Collaborative and dynamic work environment
* Professional growth and development opportunities
* Recognition and rewards for outstanding performance

Equal Employment Opportunity:

Adobe is an equal employment opportunity and affirmative action employer. We do not discriminate based on gender, race or color, ethnicity or national origin, age, disability, religion, sexual orientation, gender identity or expression, veteran status, or any other applicable characteristics protected by law.

Accessibility:

Adobe aims to make its website and application process accessible to any and all users. If you have a disability or special need that requires accommodation to navigate our website or complete the application process, please email or call
