Site Reliability Engineer

6 days ago

San Diego, California, United States Apple Full time

The Video Computer Vision organization is working on exciting technologies for future Apple products. Our focus is on ML-based solutions for real-time image and video. We have contributed to the Face ID and FaceKit projects in the past and, more recently, to the new LiDAR iPad sensor. We are looking for the right Site Reliability Engineer to help us take our efforts to the next level.

In this role, you will help lead the cloud-based infrastructure team for Apple's Video Computer Vision organization. As a main contributor to our SRE team, you will develop and maintain infrastructure, tooling, and engineering services for cloud-based applications. You will be responsible for system bring-up, deployment, reliability, security, and service scalability. This role is highly cross-functional, and you will work closely with highly skilled software development and ML teams developing cutting-edge algorithms.

Description
Your core responsibility is to provide operational support for multiple cloud-based applications running on AWS and Apple infrastructure, with an emphasis on deployment, security, scalability, and reliability. Our technologies include Terraform, Argo, Docker, Python, Postgres, and Prometheus, in combination with custom Apple software and tooling. Common technologies you'll manage include Kubernetes (EKS), Elasticsearch, Redis, RDS, ELB, and other AWS-based services. This role will also help drive solutions for hybrid infrastructure (on and off premises) and drive infrastructure architecture for our AWS-based cloud platform.

Minimum Qualifications
Experience building systems both on-premises (data center) and on public cloud (AWS, GCP, or Azure welcome)
Have deployed and operated schedulers such as Kubernetes, AWS ECS, or EKS
Ability to write code in at least one high-level language (Python preferred)
BS and a minimum of 3 years of relevant industry experience

Preferred Qualifications
MS in Computer Science/Computer Engineering (or equivalent experience)
5+ years supporting large-scale production applications in an SRE role
3+ years managing SRE teams and supporting mission-critical applications
3+ years of hybrid cloud infrastructure management
Experience with AWS large-scale application deployment and service management through Terraform, Argo, or similar
Expert knowledge of Linux, Python, Docker, Kubernetes, Postgres, and Redis, along with operations and monitoring
Professorial approach to working with team members, teaching best practices and leveling up the engineers around you
Be seen as a leader among software development teams, championing collaboration and shared ownership in technology decisions and knowledge transfer within the team
Expertise in networking with an emphasis on security
Working knowledge of deploying microservices and working experience on strategies to support Apple's scale
Vast experience using Linux with knowledge of kernel/system tuning
Last but not least, you are battle-tested and have a few interesting production tales

Site Reliability Engineer

20 hours ago

San Diego, California, United States Apple Full time

At Apple, our Data Analytics team focuses on improving the user experience by improving operating system stability, gathering feature usage telemetry, and evaluating device performance. This requires capturing data from customers who have given consent, utilizing strong privacy-preserving techniques, and aggregating information, all to help inform...

Associate Site Reliability Engineer

2 weeks ago

San Diego, California, United States SHEIN Technology LLC Full time

About SHEIN: SHEIN is a global online fashion and lifestyle retailer, offering SHEIN branded apparel and products from a global network of vendors, all at affordable prices. Headquartered in Singapore, with more than 15,000 employees operating from offices around the world, SHEIN is committed to making the beauty of fashion accessible to all, promoting its...

Site Reliability Engineer

2 days ago

San Francisco, California, United States TalentPartners Full time

Temp Description: Role Details: CLIENT is seeking a Site Reliability Engineer for our online television and media-focused web properties. In this role, you will be building systems that support the lifecycle and visibility of sites which have global reach and scale. We aim to monitor everything and to automate everything. Your Day-to-Day: Write...

Site Reliability Engineer

21 hours ago

San Diego, California, United States ServiceNow Full time

Company Description: It all started in sunny San Diego, California in 2004 when a visionary engineer, Fred Luddy, saw the potential to transform how we work. Fast forward to today: ServiceNow stands as a global market leader, bringing innovative AI-enhanced technology to over 8,100 customers, including 85% of the Fortune 500. Our intelligent cloud-based...

Site Reliability Engineer

21 hours ago

San Francisco, California, United States Air Apps Full time

About Air Apps: At Air Apps, we believe in thinking bigger and moving faster. We're a family-founded company on a mission to create the world's first AI-powered Personal & Entrepreneurial Resource Planner (PRP), and we need your passion and ambition to help us change how people plan, work, and live. Born in Lisbon, Portugal, in 2018, and now with offices in...

Senior Site Reliability Engineer

1 week ago

San Francisco, California, United States LanceDB Full time

About LanceDB: LanceDB is a developer-friendly, open-source data lake for multimodal AI. From hyper-scalable vector search to advanced retrieval for RAG, from streaming training data to interactive exploration of large-scale AI datasets, LanceDB is the best foundation for your AI application, and powers some of the most groundbreaking applications and...

Senior Site Reliability Engineer

2 weeks ago

San Francisco, California, United States LanceDB Full time

About LanceDB: LanceDB is a developer-friendly, open-source data lake for multimodal AI. From hyper-scalable vector search to advanced retrieval for RAG, from streaming training data to interactive exploration of large-scale AI datasets, LanceDB is the best foundation for your AI application, and powers some of the most groundbreaking applications and...

Infrastructure Site Reliability Engineer

2 weeks ago

San Francisco, California, United States Maxonic Inc. Full time $120,000 - $180,000 per year

Maxonic maintains a close and long-term relationship with our direct client. In support of their needs, we are looking for an Infrastructure Site Reliability Engineer. Job Description: Job Title: Infrastructure Site Reliability Engineer. Job Type: Contract (4+ months) with strong possibility to convert to full-time. Job Location: San Francisco, CA. Work Schedule:...

Senior Site Reliability Engineer

2 weeks ago

San Francisco, California, United States Alembic Full time

About the Role: We're looking for an experienced Site Reliability Engineer (SRE) to help us scale our platform with reliability, observability, and operational excellence at the core. You'll partner with engineers and data scientists to build, automate, and maintain the infrastructure that powers our core platform, including data pipelines, ML workloads, and...

Principal Site Reliability Engineer

1 week ago

San Francisco, California, United States Virtasant Full time

Location/Time zone requirements: Must be based in the San Francisco Bay Area, with weekly visits to the client's headquarters. About Virtasant: Virtasant is a fast-growing global consultancy transforming how technology services are delivered. We are a diverse team of cloud experts, builders, and operators. Since 2006, we've helped large enterprises thrive in the...
