System Reliability Specialist

5 days ago


San Francisco, California, United States ESL FACEIT Group Full time

At ESL FACEIT Group, we strive to create immersive experiences that bring players and fans together. Our corporate social responsibility is centered around the idea of "GG for all," where everyone has an equal chance to succeed.

About Us

We're passionate about cultivating a culture that supports the growth of esports, gaming tournaments, leagues, events, and holistic ecosystems. With millions of players, fans, and heroes engaging with our platform, we aim to provide a world beyond gameplay where everyone feels welcome.

Job Description

As a System Reliability Specialist at ESL FACEIT Group, you will play a crucial role in designing, analyzing, and troubleshooting large-scale distributed systems. You will demonstrate a systematic problem-solving approach, debug and optimize code, and automate routine tasks to ensure our services and systems are reliable and meet user expectations.

You will work collaboratively with software engineering teams to deploy and operate our systems, helping to automate and streamline operations and processes. Within this role, you will be given real responsibilities and have the opportunity to drive change and make a significant impact on our products and platform.

Key Responsibilities
  1. Maintaining and improving monitoring and observability tools (Grafana/Prometheus/Thanos/Jaeger);
  2. Collaborating with cross-functional teams to design, maintain, and operate systems at scale;
  3. Developing and driving adoption of SRE best practices across the company;
  4. Leading the incident management process and its adoption;
  5. Using troubleshooting skills to identify and fix operational issues;
  6. Working with Cloud Native technologies such as Kubernetes, Envoy, Istio, Prometheus, and Helm;
  7. Experimenting with and introducing cutting-edge technologies.

Requirements
  1. Proven experience as a Site Reliability Engineer or Software Engineer, focusing on building and maintaining scalable infrastructure;
  2. Excellent working knowledge of major cloud providers (GCP/AWS/Azure);
  3. Experience with cluster management systems (Kubernetes);
  4. Knowledge of incident management: ability to investigate, troubleshoot, recover, and prevent recurrence of incidents;
  5. Proficiency in Go and some proficiency in at least one other language (Java, Python, or Rust);
  6. Knowledge of GitOps practices;
  7. Production-scale experience with one of the following: MongoDB, Redis, MySQL;
  8. Experience contributing to open-source technologies would be an added bonus.

The estimated salary range for this position is $120,000 to $180,000 per year, depending on location and experience.



  • San Diego, California, United States Qualcomm Full time

    About Us: Qualcomm is a leading technology company that develops innovative solutions for mobile devices, automotive, and IoT industries. Our team is passionate about delivering high-quality products and services that exceed customer expectations. Job Description: We are seeking a highly skilled System Reliability Specialist to join our team at Qualcomm. As a key...


  • San Francisco, California, United States Cloudflare Inc Full time

    About Us: Scaling with Cloudflare. At Cloudflare, we're scaling rapidly, and we need talented engineers to help us keep up. As a system reliability engineer, you'll play a critical role in ensuring the stability and performance of our global network. We protect and accelerate any internet application online without adding hardware, installing software, or...


  • San Francisco, California, United States Gridware Full time

    Job Summary: We are seeking a skilled Electronics Reliability Specialist to join our team at Gridware. As a Product Testing and Reliability Engineer, you will play a critical role in ensuring the reliability of our advanced sensing system that continuously analyzes both the electrical and mechanical behavior of grid assets. Responsibilities: Oversight...


  • San Francisco, California, United States OpenAI Full time

    We are seeking an experienced Reliability Systems Architect to join our team at OpenAI in San Francisco. This role involves designing and implementing scalable infrastructure solutions that meet the rapidly increasing demands of our users. As a key member of our engineering team, you will collaborate with cross-functional teams to ensure the reliability,...


  • San Francisco, California, United States Unreal Gigs Full time

    Job Title: System Reliability Manager. Company Overview: We're a forward-thinking company that values expertise and teamwork. Salary: $130,000 per year. Job Description: We're looking for a seasoned System Reliability Manager to oversee the reliability and scalability of our cloud infrastructure. Key Responsibilities: Cross-Functional...


  • San Francisco, California, United States workable - ATS Full time

    Protect the electrical grid from disruptions and ensure a reliable supply of power. As an Electronics Reliability Specialist at Gridware, you will be responsible for developing and executing comprehensive reliability testing plans to identify and mitigate potential issues in our sensing system. We are a privately held company backed by top climate-tech and...


  • San Francisco, California, United States WEX, Inc. Full time

    About the Role: The WEX Site Reliability Engineering team is seeking a technical leader with expertise in designing, implementing, and managing complex systems at scale. This Senior Staff SRE will work closely with engineering teams to ensure that our systems are reliable, performant, and secure. Key Responsibilities: Technical Leadership: Provide guidance and...


  • San Francisco, California, United States King Courier Full time

    About the Job: We are seeking a reliable delivery specialist to join our team at King Courier. As a delivery specialist, you will be responsible for delivering documents and packages using your own vehicle. Key Responsibilities: Deliver documents and packages efficiently and effectively; provide exceptional customer service to clients; maintain accurate records of...


  • San Francisco, California, United States Diverse Lynx Full time

    About the Role: We are seeking a skilled Reliability Engineering Specialist to join our team at Diverse Lynx LLC. This is an exciting opportunity for a motivated and experienced professional to contribute to the success of our organization.


  • San Francisco, California, United States Oven Full time

    About Our Company: Bun, an open-source JavaScript tooling company, seeks to make programming more accessible. Backed by significant investments from top investors in Silicon Valley, we've gained recognition as one of the top GitHub repositories, boasting a vibrant community of over 33,000 Discord members. As part of our team, you'll play a crucial role in...


  • San Francisco, California, United States Orb Full time

    Revolutionizing Billing Infrastructure: At Orb, we're on a mission to transform the way businesses bill and manage their revenue. By leveraging cutting-edge technology, we enable companies to automate their billing processes and adapt pricing strategies with ease. Our approach prioritizes collaboration, focus, and kindness, fostering a culture that values...


  • San Diego, California, United States Booz Allen Hamilton Full time

    Job Overview: We are seeking a skilled Reliability Systems Engineer to join our team at Booz Allen Hamilton. This role will involve leading reliability analysis and shaping Navy undersea systems. About the Position: This position requires 4+ years of experience in engineering, with a focus on reliability analysis and system design. You should have experience...


  • San Francisco, California, United States WEX, Inc. Full time

    About WEX, Inc.: WEX, Inc. is a leading provider of business and personal payment processing solutions. Our company has a strong commitment to innovation, customer service, and operational excellence. Job Summary: We are seeking an entry-level Software Development Engineer for System Reliability to join our team. As a member of our Benefits Reliability...


  • San Francisco, California, United States Insight Global Full time

    Job Summary: This position is open to an experienced equipment maintenance specialist with a proven track record of ensuring smooth production operations. As a key member of our client's team in San Francisco, CA, you will be responsible for the preventative maintenance of all production equipment at two locations. The ideal candidate will have 5+ years of...


  • San Francisco, California, United States Gridware Full time

    About Gridware: Gridware is a pioneering company that develops cutting-edge technologies to enhance and protect the electrical grid, which forms the backbone of our modern society. Our mission is to ensure the reliability and safety of this critical infrastructure. We are headquartered in the Bay Area, California, and backed by top climate-tech and Silicon...


  • San Francisco, California, United States Springshot Full time

    We are seeking a highly skilled Senior Site Reliability Engineer to join our team at Springshot. Based in the San Francisco Bay Area, this role will play a critical part in maintaining the reliability and performance of our SaaS platform. Job Overview: We are a passionate, tight-knit team who moves fast and is continuously innovating and improving our...


  • San Francisco, California, United States Unreal Gigs Full time

    Job Title: Reliability Guardian for Robotics. Estimated Salary: $140,000 - $200,000 per year. Job Description: We are seeking a skilled Robotics Quality Assurance Specialist to join our team at Unreal Gigs. In this role, you will oversee the testing, verification, and validation processes to guarantee our robots meet rigorous quality standards. Key...


  • San Diego, California, United States JobsRUs Full time

    Job Overview: JobsRUs.com is seeking to hire a Reliable Equipment Specialist for our client in San Diego, CA. Salary Information: $43.27 per hour, paid weekly. Job Responsibilities: The Reliable Equipment Specialist repairs and maintains assigned manufacturing related equipment throughout the facility in accordance with standard procedures, internal requirements,...


  • San Francisco, California, United States Cloudflare, Inc. Full time

    We are Cloudflare, a highly ambitious and large-scale technology company with a soul. Our mission is to help build a better Internet by protecting the free and open Internet. As a key member of our team, you will play a crucial role in building and operating our Edge platform running in over 320 cities across more than 120 countries. This is an exceptional...


  • San Jose, California, United States Zscaler Full time

    Zscaler, a pioneer in cloud security, is seeking a talented individual to join our Engineering team as a Site Reliability Engineering Intern. As a key member of our team, you'll be responsible for designing and implementing automation tools, developing CI/CD pipelines, creating dashboards, and working on incident management automation. We're looking for...