Site Reliability Manager

1 day ago


Washington, DC, United States | Karsun Solutions | Full time
About Karsun Solutions

Karsun Solutions is a leading provider of innovative technology solutions to the US Government. Our team is dedicated to delivering high-quality services that transform the way our clients operate.

Job Summary

We are seeking a highly skilled Site Reliability Manager to join our team. The ideal candidate will be responsible for ensuring the reliability, scalability, and performance of our systems and services. They will lead a team of engineers in designing, implementing, and maintaining robust infrastructure and automation solutions.

Key Responsibilities
  • Lead a service delivery team of 8-20 people (Service Support specialists, DevSecOps engineers, and Site Reliability Engineers)
  • Define and implement best practices for infrastructure as code, deployment automation, and monitoring
  • Collaborate with cross-functional teams to design scalable and fault-tolerant architectures
  • Develop and maintain service level objectives (SLOs) and key performance indicators (KPIs) to measure system reliability and performance
  • Conduct post-mortems and root cause analyses for incidents and implement preventive measures to mitigate future incidents
  • Drive continuous improvement initiatives to enhance the reliability, scalability, and efficiency of our systems and services
  • Mentor and coach team members to foster a culture of learning and innovation
Requirements
  • Bachelor's degree in Computer Science, Engineering, or a related field; Master's degree preferred
  • 10+ years of experience in a similar role, managing a team of site reliability engineers and delivering on the AWS cloud platform
  • Proven track record of managing high-performance teams
  • 5+ years of experience supporting operations and maintenance for cloud-native production applications that are fault-tolerant, self-healing, scalable, and highly available
  • Deep understanding of cloud computing platforms (e.g., AWS, Azure, GCP) and containerization technologies (e.g., Docker, Kubernetes)
  • Strong knowledge of infrastructure as code tools (e.g., Terraform, Ansible, ArgoCD) and CI/CD pipelines
  • Experience with monitoring, logging, and observability tools such as Datadog, Amazon CloudWatch, ELK, Prometheus, and Splunk
  • Excellent communication and interpersonal skills, with the ability to collaborate effectively with cross-functional teams
  • Strong problem-solving and analytical skills, with a keen attention to detail
  • Certifications such as AWS Certified DevOps Engineer or Google Professional Cloud DevOps Engineer are a plus
  • Ability to obtain and maintain a Public Trust clearance
Preferred Qualifications
  • Understanding of modern architectures (e.g., microservices, event-driven architecture (EDA)), with a healthy caution against overcomplexity and overengineering
  • Experience with monitoring and metrics platforms (e.g., New Relic, Prometheus, InfluxDB, Grafana, Splunk)
  • Experience designing and operating distributed systems and cloud infrastructure at scale
What We Offer

We offer a competitive salary range of $140,000 to $180,000, depending on experience and qualifications. We also offer a comprehensive benefits package, including health, life, and disability insurance, paid parental leave, a 401(k) retirement plan, and more.

Karsun Solutions is an Equal Employment Opportunity (EEO) employer. We are committed to building an inclusive and diverse workplace culture.


