Service Reliability Analyst II

Los Angeles, United States | Riot Games | Full-time

The Process & Analytics team uses operational data to understand the player experience and make it visible across Riot. The team collects, audits, and uses data to improve our games’ operational health, empowering game leadership to make data-informed decisions that improve stability.

As a Service Reliability Analyst, you will work with teams across Riot to build and execute effective ITIL processes, measurements of service health, and a highly contextual picture of the player experience. Your tenacity and drive for continuous improvement will help you uncover problematic trends and push for their resolution, improving the quality of the player experience. You will be a craft master in operational process and in telling compelling visual stories with data. Live Ops can look to you to improve ITIL processes, answer tough operational questions through data, and uncover previously unknown anti-patterns harming the player experience.

Responsibilities:
  • Lead and facilitate weekly technical discussions on service reliability with key product teams, ensuring alignment on operational goals and performance metrics.
  • Conduct thorough audits of incident data in collaboration with service owners to validate accuracy and ensure comprehensive reporting and analysis.
  • Collect, synthesize, and report on system health metrics for Riot's diverse infrastructure, utilizing advanced data collection methods and monitoring tools.
  • Perform in-depth analysis of operational data trends to identify and address systemic issues and optimize service performance.
  • Participate in on-call rotations to provide critical support and ensure rapid response to incidents, minimizing downtime and service disruptions.
  • Assist in tracking and coordinating corrective actions for root cause analysis, ensuring thorough resolution of underlying issues and continuous improvement of operational processes.
  • Develop and maintain dashboards and reports that provide insights into key operational performance metrics, assisting leaders with making data-driven decisions.
Required Qualifications:
  • 2-4 years of hands-on experience in IT service management, data analysis, or technical operations, with a focus on maintaining and optimizing IT infrastructure.
  • Strong proficiency in incident, problem, change, and release management, with the ability to design and implement process flows using industry-standard methodologies.
  • Solid understanding of software development life cycles (SDLC) and how various components interact within larger ecosystems, ensuring seamless operation and scalability.
  • Clear awareness of system and service ownership within a multi-team environment, including the effective use of APIs/SDKs and adherence to SLAs.
  • Deep enthusiasm for operations and technology, with a proactive approach to continuous improvement in system reliability and performance.
  • Experience with the following tools and technologies:
    • Data Visualization Tools: Advanced skills in Tableau, Datawrapper, and Excel for creating actionable insights from complex datasets.
    • Query Languages: Proficiency in JQL, SQL, and XQuery for querying and manipulating data across various platforms.
    • Monitoring Solutions: Expertise in setting up and managing monitoring frameworks using tools like Datadog and New Relic to ensure system health and performance.
    • Event Management Tools: Skill in event correlation to improve incident response with tools such as Datadog, BigPanda, or PagerDuty.
    • ITIL-based Ticketing Systems: In-depth experience with ServiceNow, Jira, or similar platforms for tracking and managing IT service processes.
Desired Qualifications:
  • 2+ years of specialized experience in Service Reliability Engineering (SRE) or equivalent roles such as Technical Release Manager, Process Owner, Live Operations Engineer, or Network Administrator.
  • Bachelor’s degree in Computer Science, IT Systems, Information Technology, or a closely related field, or equivalent professional experience.
  • Advanced data analysis and data insights proficiency, with the ability to derive actionable intelligence from large datasets.
  • Relevant certifications such as AWS Certified Solutions Architect, CompTIA Linux+, or CompTIA Network+, or equivalent credentials, are highly valued.
  • Demonstrated expertise in deploying and managing monitoring solutions such as Datadog and New Relic to ensure system health and performance within complex environments.

For this role, you'll find success through craft expertise, a collaborative spirit, and decision-making that prioritizes your fellow Rioters, who are the customers of your work. Being a dedicated fan of games is not necessary for this position.

Our Perks:

Riot focuses on work/life balance, as shown by our open paid time off policy and other perks such as flexible work schedules. We offer medical, dental, and life insurance, parental leave for you, your spouse/domestic partner and children, and a 401(k) with company match. Check out our benefits pages for more information.

Riot Games fosters a player and workplace experience that values teamwork embodied by the Summoner's Code and Community Code. Our culture embraces differences as a strength, and our values are the guiding principles for how we approach work. We are committed to putting diversity and inclusion (D&I) at the center of everything we do, and promoting a fair and collaborative culture where Rioters treat one another with dignity and respect. We encourage you to read more about our value of thriving together and our ongoing work to build the most inclusive company in gaming.