Cloud Infrastructure
1 week ago
The Alibaba Cloud Native Message Middleware Team is responsible for messaging products, including RocketMQ. We are committed to building a more stable, user-friendly, large-scale streaming messaging platform for the future.
Cloud Product Operations & Reliability
Oversee stability maintenance, performance tuning, and high-availability architecture design for cloud middleware, including messaging middleware (Kafka/RocketMQ).
Manage the lifecycle of containerized middleware on Kubernetes clusters, including deployment, auto-scaling, version upgrades, and resource optimization.
Incident Response & Root Cause Analysis
Lead the troubleshooting of middleware-related incidents (e.g., message backlog, service registration failures) through log analysis, distributed tracing, and monitoring systems.
Develop diagnostic tools using Java/Go to resolve production issues, performance bottlenecks, and compatibility challenges.
Automation & Operational Excellence
Build Python/Go/Shell automation tools to standardize middleware deployment, monitoring, and disaster recovery workflows.
Implement chaos engineering experiments, capacity planning strategies, and failover mechanisms to enhance system resilience.
Strong scripting skills in Shell/Python and experience with Infrastructure as Code (IaC) tools (Terraform preferred).
Position Requirements
Minimum Qualifications:
Experience: More than 2 years of experience in distributed-systems reliability engineering, familiarity with high-availability architecture design, and proficiency in at least one of Python, Go, or Java.
Messaging: Experience with cluster management, message reliability assurance, and performance optimization for Kafka/RocketMQ.
Kubernetes: Hands-on experience deploying middleware on Kubernetes (Helm/Operator preferred).
Automation: Ability to turn operational experience into automated solutions, and familiarity with message middleware such as Kafka and RocketMQ.
Preferred Qualifications:
SRE Practices: Familiarity with core SRE practices (incident review, error budgets, chaos engineering) and experience building automated risk-control systems.
The pay range for this position at commencement of employment is expected to be between $104,400 and $171,000/year. However, base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience.
If hired, employee will be in an "at-will position" and the Company reserves the right to modify base salary (as well as any other discretionary payment or compensation program) at any time, including for reasons related to individual performance, Company or individual department/team performance, and market factors.
-
Cloud Infrastructure
3 weeks ago
Sunnyvale, California, United States | Alibaba Cloud | Full time
Job Description
Alibaba Cloud Native Observability Team: Responsible for observability products including Alibaba Cloud Log Service (SLS), Application Real-Time Monitoring Service (ARMS), and Cloud Monitoring Service (CMS). We are committed to creating a real-time, intelligent, and large-scale observation and analysis platform for the future. This platform...
-
Cloud Infrastructure SRE-Sunnyvale
3 weeks ago
Sunnyvale, California, United States | Alibaba Cloud | Full time
Job Description
Alibaba Cloud Native Observability Team: Responsible for observability products including Alibaba Cloud Log Service (SLS), Application Real-Time Monitoring Service (ARMS), and Cloud Monitoring Service (CMS). We are committed to creating a real-time, intelligent, and large-scale observation and analysis platform for the future. This platform...
-
Cloud Infrastructure Developer
20 hours ago
Sunnyvale, California, United States | beBee Careers | Full time
About the Job
This is an exciting opportunity for a Cloud Infrastructure Developer to join our team. As a Cloud Infrastructure Developer, you will be responsible for designing and developing cloud-based infrastructure that supports our mission-critical applications. Your expertise in cloud technologies, such as Microsoft Azure and Google Cloud (GCP), will be...
-
Cloud Infrastructure Architect
4 days ago
Sunnyvale, California, United States | beBee Careers | Full time
Cloud Infrastructure Architect Job Description:
We are looking for a highly experienced Cloud Infrastructure Architect to join our team. The successful candidate will have expertise in designing and implementing scalable and secure cloud-based systems.
Key Responsibilities:
Designing and implementing cloud-based infrastructure systems using GCP, AWS, and...
-
Cloud Infrastructure Specialist
6 days ago
Sunnyvale, California, United States | beBee Careers | Full time
Job Description:
We are seeking an experienced Cloud Infrastructure Specialist to join our team. As a key member of our infrastructure team, you will be responsible for designing, building, and maintaining large-scale cloud-based systems. Your expertise in cloud computing platforms like AWS will be instrumental in driving the success of our projects.
Key...
-
Cloud Infrastructure Specialist
2 days ago
Sunnyvale, California, United States | beBee Careers | Full time
AWS, Python
This role involves working as a cloud infrastructure specialist to design and deploy scalable cloud-based systems using AWS services. The ideal candidate will have hands-on experience with AWS Lambda, Glue, Athena, and Step Functions.
Key Responsibilities:
Designing and deploying cloud-based applications on AWS
Maintaining full application...
-
Cloud Native/Middleware Reliability Engineer
8 hours ago
Sunnyvale, California, United States | Alibaba Cloud | Full time
Job Description
The Alibaba Cloud Cloud-Native Middleware team is responsible for the research and development of distributed software infrastructure, delivering API Gateway and microservices solutions to enterprise customers, accelerating cloud...
-
Cloud Infrastructure Manager
23 hours ago
Sunnyvale, California, United States | beBee Careers | Full time
About the Role
We are seeking a highly skilled Cloud Infrastructure Manager to oversee the administration of our Microsoft Azure environment. As a key member of our IT team, you will be responsible for ensuring the security, reliability, and performance of our cloud infrastructure. The ideal candidate will have at least 8 years of experience in managing cloud...