Senior Site Reliability Engineer

4 weeks ago

Palo Alto, United States SHEIN Technology LLC Full time

Job Title: Senior Site Reliability Engineer I

Reports to: Senior Manager of Site Reliability Engineering

Job Location: Palo Alto, CA, USA

Job Status: Exempt, FT

Candidates should take the time to read all the elements of this job advert carefully. Please make your application promptly.

About SHEIN

SHEIN is a global online fashion and lifestyle retailer, offering SHEIN branded apparel and products from a global network of vendors, all at affordable prices. Headquartered in Singapore, with more than 15,000 employees operating from offices around the world, SHEIN is committed to making the beauty of fashion accessible to all, promoting its industry-leading, on-demand production methodology, for a smarter, future-ready industry.

Position Summary

We are looking for a Senior Site Reliability Engineer - Big Data (Official Title: Senior Site Reliability Engineer I) for our Palo Alto, CA-based office hub. Site Reliability Engineers work with the Technical Operations team at SHEIN and are hybrid software/systems engineers, whose overarching goal is to ensure that Production Services are "Always On." They strive to build the most reliable and performant systems on the planet.

SREs work closely with cross-functional teams to ensure we have the right set of tools to generate, collect, analyze, visualize and alert on operational data, so we know exactly what happens across the ecosystem and can see problems before they occur and address them as quickly as possible.

They are also responsible for improving Operational Efficiency, Utilization and System Resiliency of the Platform. They own Critical Open-Source Software that our platform relies on and are core participants in every significant engineering effort underway in the platform.

They are also tasked with driving forward the operability of the platform to drive down the number of incidents while reducing MTTR. To accomplish this, the team combines software development, networking and systems engineering expertise, and a strong desire to be challenged by problems of scale and complexity to make our service better for our customers.

Job Responsibilities

Participate in an on-call rotation to ensure 24/7/365 availability of SHEIN's production system
Supervise capacity & utilization and work closely with cross-functional teams to orchestrate scale-up/down of the services
Own & operate critical open-source services like Elasticsearch, Kafka, RabbitMQ, Redis
Build tools and design processes that help improve observability and system resiliency of the platform
Triage Site Availability Incidents and proactively work towards reducing MTTR for customer-impacting incidents
Partner with Service owners to implement Service Level Metrics & Service Level Objectives that act as service level health indicators
Establish design patterns for monitoring, benchmarking and deploying new features for the backend services
Develop and maintain technical documentation, network diagrams, runbooks, and procedures
Drive initiatives to evolve our current platform to increase efficiency and keep it in line with current standards and best practices
Respond to production incidents and use your experience in software development, systems engineering, and networking to proactively prevent repeatable issues
Provide relief and sustainable resolution to issues within our infrastructure
Drive initiatives with partner teams to improve the reliability and performance of the infrastructure through improved system design
Join a culture of intolerance of manual activity, which results in a highly automated environment delivering scalable solutions
Drive efficiencies through software improvement and root cause analysis resulting in service delivery, maturity, and scalability

Job Requirements

Bachelor's degree in Computer Science, Information Systems, or equivalent technical discipline is preferred
Experience with Big Data related component operation and maintenance, including Hadoop, Yarn, HBase, Hive, Spark, etc., is highly preferred
Experience with OSS technologies, like Elasticsearch, Kafka, and Redis, is highly preferred
Solid understanding of Linux systems is preferred
Minimum of 3 years of working experience in an enterprise 24/7 production environment supporting mission-critical, real-time, high-traffic applications, especially in cloud environments, is preferred
Systematic problem-solving approach, combined with a sense of ownership and drive
Full-stack debugging and performance optimization ability, including knowledge of Cloud systems (load balancing, caching, content distribution, etc.), continuous integration/build systems, Java, SQL and NoSQL databases
Track record of monitoring and analyzing system performance, isolating issues or bottlenecks that could impact reliability, performance and scalability
Strong experience with observability tools such as Grafana, Prometheus, Zabbix, etc.
Good experience in any of the scripting/programming languages: Python, Golang, etc.
Familiarity with container technologies such as Docker, Kubernetes, Mesos, etc.
Understanding of and experience with SRE concepts and practices, including advocating for the elimination of toil and driving simple solutions
Good verbal and written communication skills, and the ability to work effectively with geographically remote teams

Pay

$107,600.00 min - $180,200.00 max annually, Bonus & RSU offered.

Benefits and Perks

Healthcare (medical, dental, vision, prescription drugs)

Health Savings Account with Employer Funding

Flexible Spending Accounts (Healthcare and Dependent care)

Company-Paid Basic Life/AD&D insurance

Company-Paid Short-Term and Long-Term Disability

Voluntary Benefit Offerings (Voluntary Life/AD&D, Hospital Indemnity, Critical Illness, and Accident)

Employee Assistance Program

Business Travel Accident Insurance

401(k) Savings Plan with discretionary company match and access to a financial advisor

Vacation, paid holidays, floating holiday and sick days

Employee discounts

Free weekly catered lunch

Dog-friendly office (available at select locations)

Free gym access (available at select locations)

Free swag giveaways

Annual Holiday Party

Invitations to pop-ups and other company events

Complimentary daily office snacks and beverages

SHEIN Technology LLC is an equal opportunity employer committed to a diverse workplace environment.

