Site Reliability Engineering Manager, AI Platform
4 weeks ago
About the Role:
We are seeking an experienced Site Reliability Engineering Manager to lead our AI Inference Platform team at Adobe. As a key member of our Engineering organization, you will be responsible for developing and implementing strategies to ensure the reliability, scalability, and security of our AI Platform.
Key Responsibilities:
* Develop and execute the technical vision and roadmap for the AI Platform Inference infrastructure
* Grow and lead a team of dedicated SRE engineers
* Engage with the Firefly Engineering and Firefly App Integrations teams to understand their needs and goals and to drive the platform's reliability
* Identify and implement methodologies and solutions to increase reliability, scalability, security, and efficiency
* Ensure the highest uptime and Quality of Service (QoS) for Adobe's customers through operational excellence
* Define service level objectives (SLOs) and indicators (SLIs) to represent and measure service quality
* Support and maintain globally distributed, multi-cloud (public and/or private) environments
* Automate common, repeatable tasks at a large scale to streamline operational procedures
* Identify areas to improve service resiliency through techniques such as chaos engineering, performance/load testing, etc.
* Coordinate with other Adobe platform teams and service providers (primarily AWS) to innovate on Generative AI as a Service
* Ensure inference services improve GPU utilization, scale models independently, and optimize cost of goods sold (COGS)
Requirements:
* BS or MS degree in Computer Science, Electrical Engineering, a related field, or equivalent industry experience
* 3+ years of experience as an Engineering Manager
* Strong communication and collaboration skills - building strong relationships with internal customers and external partners
* Dedication to teamwork, self-organization, and continuous improvement
* A track record of leading high-performance teams to deliver results in the fast-paced, dynamic environment of AI infrastructure
* Production-level expertise with container orchestration engines (e.g. Kubernetes) and a demonstrated understanding of modern continuous development techniques and pipelines (IaC, CI/CD, ArgoCD, Git)
* Fundamental programming skills, ideally with practical experience in one (and preferably more) of the following languages: Python, Go, or Java
* An understanding of AI/ML, including ML frameworks, public cloud, and commercial AI/ML solutions - familiarity with PyTorch, SageMaker, Hugging Face, NVIDIA TensorRT, or OpenAI Triton is a plus
What We Offer:
* Competitive compensation package
* Opportunity to work with a talented team of engineers
* Collaborative and dynamic work environment
* Professional growth and development opportunities
* Recognition and rewards for outstanding performance
Equal Employment Opportunity:
Adobe is an equal employment opportunity and affirmative action employer. We do not discriminate based on gender, race or color, ethnicity or national origin, age, disability, religion, sexual orientation, gender identity or expression, veteran status, or any other applicable characteristics protected by law.
Accessibility:
Adobe aims to make its website and application process accessible to any and all users. If you have a disability or special need that requires accommodation to navigate our website or complete the application process, please email or call