Cloud Infrastructure – Site Reliability Engineer

4 days ago

Sunnyvale, CA, United States | Alibaba Cloud | Full time

The Alibaba Cloud Native Message Middleware Team is responsible for messaging products, including RocketMQ. We are committed to building a more stable, user-friendly, streaming-capable, and large-scale messaging platform for the future.

Cloud Product Operations & Reliability

Oversee stability maintenance, performance tuning, and high-availability architecture design for cloud middleware, including messaging middleware (Kafka/RocketMQ).

Manage the containerized middleware lifecycle on Kubernetes clusters: implement deployments, auto-scaling, version upgrades, and resource optimization in K8s environments.

Incident Response & Root Cause Analysis

Lead the troubleshooting of middleware-related incidents (e.g., message backlog, service registration failures) through log analysis, distributed tracing, and monitoring systems.

Develop diagnostic tools using Java/Go to resolve production issues, performance bottlenecks, and compatibility challenges.

Automation & Operational Excellence

Build Python/Go/Shell automation tools to standardize middleware deployment, monitoring, and disaster recovery workflows.

Implement chaos engineering experiments, capacity planning strategies, and failover mechanisms to enhance system resilience.

Strong scripting skills in Shell/Python and experience with Infrastructure as Code (IaC) tools (Terraform preferred).

Minimum Qualifications:

Experience: Over 2 years of experience in distributed systems reliability engineering, familiarity with high-availability architecture design, and proficiency in at least one of Python, Go, or Java.

Messaging: Experience with cluster management, message reliability assurance, and performance optimization for Kafka/RocketMQ.

Kubernetes: Hands-on experience deploying middleware on Kubernetes (Helm/Operator preferred).

Automation: Ability to convert operations experience into automated solutions, and familiarity with a range of messaging middleware, such as Kafka and RocketMQ.

Preferred Qualification:

SRE Practices: Familiar with core SRE practices (incident review, error budgeting, chaos engineering) and experienced in building automated risk control systems.

The pay range for this position at commencement of employment is expected to be between $104,400 and $171,000/year. However, base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience.

If hired, the employee will be in an "at-will position," and the Company reserves the right to modify base salary (as well as any other discretionary payment or compensation program) at any time, including for reasons related to individual performance, Company or individual department/team performance, and market factors.

