Site Reliability Engineer

2 weeks ago


Los Angeles, California, United States CoSM Full time
About CoSM

CoSM is a global technology company that brings experiences to life in immersive environments. We help our partners create spaces and content that blur the line between real and virtual across three primary markets: Sports and Entertainment, Science and Education, and Parks and Attractions.

We were born from the fusion of some of the greatest innovators in the history of technology. Evans & Sutherland, Spitz, Inc., and CoSM Immersive combined forces to power the immersive experiences of the future as CoSM.

Job Summary

As a Site Reliability Engineer at CoSM, you will play a pivotal role in designing, implementing, automating, and maintaining the technology infrastructure that supports our organization's operations center.

You will be responsible for designing robust, scalable, and resilient platforms that facilitate real-time monitoring, analysis, and decision-making processes critical to our business and product operations.

You will liaise with product and engineering teams to ensure applications and microservices support telemetry ingestion for actionable alerting and historical data graphing, thus building a continuous feedback loop for platform and product reliability.

The ideal candidate is a solutions-oriented person who can learn new technologies quickly and who can become competent with all layers of the development platform.

They should be willing to roll up their sleeves and be familiar with various technologies but know how to choose the best technology for the job.

Responsibilities

Monitoring and Alerting

Design and automate robust monitoring and alerting mechanisms to ensure the health, performance, and availability of the operations center platform, products, and associated infrastructure components.

Application Monitoring

Work with software engineering and product teams to best understand how to monitor their applications and microservices.

Infrastructure Deployment

Collaborate with infrastructure teams to deploy and configure the necessary hardware and software components to support the operations center platform, including servers, networks, databases, and monitoring tools.

Documentation and Training

Create comprehensive documentation, diagrams, and guides to facilitate system understanding, troubleshooting, and knowledge transfer. Provide training and support to operations center staff on platform usage and best practices.

Collaboration and Stakeholder Management

Collaborate closely with cross-functional teams, including product, operations, IT, security, and business units, to understand requirements, gather feedback, and align observability platform architecture with organizational goals and priorities.

Incident Management

Participate in an on-call rotation to troubleshoot and resolve incidents, working closely with the support team to ensure prompt resolution.

Continuous Learning

Stay informed about industry trends and emerging technologies related to Windows Server, on-premises infrastructure, and Azure and AWS Cloud platforms.

Leadership

Provide technical guidance and mentorship to junior team members as needed.

Communication

Exemplify excellent written and verbal communication skills and the ability to deftly tailor technical communications to any audience.

Requirements

Bachelor's or master's degree in Computer Science, Information Technology, or a related field, or relevant work experience.

6 years of proven experience as a platform engineer, site reliability engineer, systems engineer, or a similar role, with a focus on designing, implementing, and monitoring the health of complex, distributed systems.

Expert-level knowledge of Grafana, Prometheus, Loki, and Tempo.

Familiarity with scripting languages for automation and configuration management. PowerShell and Bash are paramount.

Strong understanding of cloud computing concepts and hands-on experience with Azure and/or AWS.

Experience with virtualization/containerization technologies such as Hyper-V or VMware, Amazon EC2, Docker, and Kubernetes.

Experience using Pulumi, Terraform, and/or other IaC tools.

In-depth knowledge of Windows Server operating systems 2016/2019/2022, including installation, configuration, and troubleshooting.

Familiarity with Linux automation with tools such as Ansible or Puppet is a plus.

Expertise in data retrieval technologies, including constructing efficient PromQL, GraphQL, and LogQL queries.

Solid understanding of networking principles and protocols.

Excellent problem-solving and troubleshooting skills, with keen attention to detail.

Strong communication and interpersonal skills, with the ability to collaborate effectively with clients and team members.

Driven to automate your processes, test continually, and document your work.

Experience working with a cross-functional, distributed team from concept through completion and future iterations, including the use of agile methodologies.

Excellent time management skills.

Preferred: Certifications in cloud platforms (e.g., AWS Certified Solutions Architect, Azure Solutions Architect) or similar.

The annualized base salary range for this position in California is $105,000 to $140,000.

The base salary offered will factor in internal equity and may also vary depending on the candidate's geographic region, job-related knowledge, skills, and relevant experience, among other factors.

Cosm is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.


