Site Reliability Engineer

1 month ago


Fairfax VA United States Apex Systems Full time

We are seeking talented professionals to join our successful and growing team in building the next-generation Continuous Diagnostics and Mitigation (CDM) Cyber data solution. The CDM Program is the Cybersecurity and Infrastructure Security Agency’s (CISA) dynamic approach to strengthening the cybersecurity of Federal networks and systems through better awareness and visibility into their security posture and cyber threats. The CDM Data Services product is an integrated suite of multiple Commercial Off the Shelf (COTS) products, software configuration packages, and custom code which work together to operate as an integrated solution tailored to meet Department of Homeland Security (DHS) requirements.


Seeking a talented Site Reliability Engineer (SRE) to play a key role in defining, implementing, and growing our SRE practice to ensure the reliability, availability, and performance of our critical production environments. The SRE will contribute to a culture of continuous improvement, identifying areas for enhancement, and driving initiatives to improve system reliability, scalability, and efficiency. The successful candidate will have demonstrated hands-on experience designing, implementing, and maintaining solutions to ensure that systems, including infrastructure and applications, are resilient, highly available, and performant.


The SRE will also play a critical role in defining and measuring the Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for our solution. The SRE will be responsible for setting up comprehensive logging, monitoring, and alerting solutions using the Elastic stack and other tools as necessary to ensure the continuous performance of services. Additionally, they will respond to incidents, perform root cause analyses, and implement solutions to prevent recurrences. The Journeyperson SRE will work in close collaboration with other SRE team members, developers, testers, infrastructure engineers, DevOps engineers, and other stakeholders to integrate reliability and observability into the software development lifecycle.


Required Skills

  • US citizenship with ability to obtain Public Trust Suitability
  • 4+ years of experience as a Site Reliability Engineer (SRE) or equivalent
  • 4+ years of demonstrated experience designing, implementing, and maintaining observability solutions to include logging, monitoring, and alerting
  • 4+ years of hands-on experience with SRE tools (e.g., Elastic, Prometheus, Grafana, Splunk, etc.)
  • 2+ years defining and measuring SLOs and SLIs
  • 2+ years of relevant experience using cloud platforms (AWS GovCloud preferred)
  • 2+ years of hands-on programming or scripting (e.g., Python, Bash, etc.)
  • Strong knowledge of microservices, containerization, and orchestration tools (Docker, Kubernetes)
  • Proven ability to collaborate with cross-functional teams (development, testing, and product) to integrate reliability and observability into the software development lifecycle
  • Strong problem-solving and analytical skills
  • Proactive, detail-oriented approach to identifying inefficiencies and implementing improvements


Desired Skills

  • Bachelor's degree in Computer Science, Engineering, or a related field (or 4 additional years of related experience)
  • Experience working in an Agile/SAFe environment using ALM tools (Jira, Confluence, or similar)
  • Strong understanding of CI/CD principles and platforms (Jenkins, CircleCI, GitLab, GitHub Actions, Argo, Travis CI, etc.)
  • Expertise in configuration management tools (Ansible, Puppet, Chef)
  • Experience with infrastructure as code (Terraform, CloudFormation)
  • In-depth understanding of networking, security, and system administration of Linux operating systems
  • Knowledge of version control platforms and branching strategies
  • Knowledge of disaster recovery planning, backup strategies, and data replication
  • Experience supporting large Federal programs ($200M+)


  • Fairfax, United States Apex Systems Full time

    We are seeking talented professionals to join our successful and growing team in building the next-generation Continuous Diagnostics and Mitigation (CDM) Cyber data solution. The CDM Program is the Cybersecurity and Infrastructure Security Agency’s (CISA) dynamic approach to strengthening the cybersecurity of Federal networks and systems through better...


  • Chicago, IL, United States WEX, Inc. Full time

    The WEX Site Reliability Engineering (SRE) team is seeking an entry-level Site Reliability Engineer Level 1 who is passionate about learning and growing in the field of software development and solutions focused on observability, incident response, reliability and performance, operational excellence, and compliance. The team will be part of the Benefits...


  • Sunnyvale, CA, United States Natcast, Inc. Full time

    Natcast (short for The National Center for the Advancement of Semiconductor Technology) is a new, purpose-built, non-profit entity created to operate the National Semiconductor Technology Center (NSTC) consortium, established by the CHIPS Act of the U.S. government. Working at Natcast represents an opportunity to help extend America’s leadership in...


  • Annapolis Junction, MD, United States Maximus Full time

    General information ...


  • Duluth, GA, United States BlueSky Resource Solutions Full time

    Job Title: Site Reliability Engineer – ObservabilityOverview:We are seeking a Site Reliability Engineer III to develop and maintain our observability platform. This role focuses on ensuring the reliability, performance, and scalability of microservices, Kubernetes clusters, and cloud infrastructure. You'll collaborate with cross-functional teams to deliver...


  • Miami, FL, United States Royal Caribbean Group Full time

    Site Reliability Engineer Journey with us! Combine your career goals and sense of adventure by joining our incredible team of employees at Royal Caribbean Group . We are proud to offer a competitive compensation and benefits package, and excellent career development opportunities, each offering unique ways to explore the world. We are proud to be the...


  • Redwood City, CA, United States C3 AI Full time

    We are looking for an Associate Site Reliability Engineer / Site Reliability Engineer to join our team at our HQ in Redwood City, CA. Responsibilities: Maximize system uptime and availability, ensuring functional and performance SLAs. Establish end-to-end monitoring and alerting on all critical aspects. Solve complex problems for critical services...


  • McLean, VA, United States CV Library Full time

    Role: Site Reliability Engineer Location: Mclean or Richmond VA Type: Contract to hire Nice to have skills: Experience in Financial Domain Roles & Responsibilities: Experience with at least one of the following: Java, Python, or Go Experience working with AWS tools and services, DevOps environments Experience with agile practices 4+ years of site...


  • Washington, DC, United States Alldus International Consulting Ltd Full time

    Our client is a Series A startup within the Generative AI space and they are hiring a Site Reliability Engineer to join the team. Backed by one of the leading venture capital firms in the industry, this is an exciting opportunity to join a SaaS company that is revolutionizing their industry. Responsibilities: As the Site Reliability Engineer, you will...


  • Portland, OR, United States Matlen Silver Full time

    Compensation: $70 - $75/HourHybrid: 2 Days Onsite Portland, OregonDomain: Retail/Supply ChainJob Title: Site Reliability EngineerPosition SummaryAs a Site Reliability Engineer/DevOps Engineer, you will be responsible for ensuring the availability, performance, and reliability of Fulfillment Technology solutions for our client to support omni-channel...


  • McLean, VA, United States Capital One Full time

    Center 3 (19075), United States of America, McLean, Virginia Lead Platform Engineer, Site Reliability Engineering (SRE) Do you love building and pioneering in the technology space? Do you enjoy solving complex technical problems in a fast-paced, collaborative, inclusive, and iterative delivery environment? At Capital One, you'll be part of a big group of...


  • Indianapolis, IN, United States BCforward Full time

    Site Reliability EngineerBCforward is currently seeking a highly motivated Site Reliability Engineer for an opportunity in Remote!Position Title: Site Reliability EngineerLocation: RemoteAnticipated Start Date: 12/10/2024Please note this is the target date and is subject to change. BCforward will send official notice ahead of a confirmed start date.Expected...


  • Aiea, HI, United States Smxtech Full time

    SMX is seeking a Site Reliability Engineer to support the USINDOPACOM J6 portfolio of programs. This position is a hybrid between Camp H.M. Smith Marine Corps Base and Joint Base Pearl Harbor-Hickam in Hawaii. This position requires a DoD TS/SCI security clearance which requires US citizenship for work on DoD contracts. Responsibilities Independently manage...


  • Sunnyvale, CA, United States Apple Inc. Full time

    To view your favorites, sign in with your Apple Account. Imagine what you could do here. At Apple, new ideas have a way of becoming extraordinary products, services, and customer experiences very quickly. Bring passion and dedication to your job and there's no telling what you could accomplish. The people here at Apple don’t just create products —...


  • Indianapolis, IN, United States BCforward Full time

    Site Reliability EngineerBCforward is currently seeking a highly motivated Site Reliability Engineer for an opportunity in Remote!Position Title: Site Reliability EngineerLocation: RemoteAnticipated Start Date: 12/10/2024Please note this is the target date and is subject to change. BCforward will send official notice ahead of a confirmed start date.Expected...


  • Sunnyvale, CA, United States Microsoft Full time

    There has never been a more exciting time to be working in healthcare at Microsoft. Our Health & Life Sciences Solutions organization is an interdisciplinary team of product managers, designers, engineers, and clinicians who are designing, developing and deploying next-generation healthcare solutions powered by the Microsoft Cloud for healthcare...


  • Miami, FL, United States INSPYR Solutions Full time

    Title: Site Reliability Engineer Make sure to apply quickly in order to maximise your chances of being considered for an interview Read the complete job description below. Location: Miami, FL Duration: 6+ months Compensation: $55.00 -60.00 Work Requirements: US Citizen, GC Holders or Authorized to Work in the U.S. Site Reliability...


  • Redmond, WA, United States Microsoft Full time

    OverviewSecurity represents the most critical priorities for our customers in a world awash in digital threats, regulatory scrutiny, and estate complexity. Microsoft Security aspires to make the world a safer place for all. We want to reshape security and empower every user, customer, and developer with a security cloud that protects them with end to end,...


  • Columbia, MD, United States Geon Technologies, LLC Full time

    Geon Technologies is a rapidly growing small business that provides signal processing and sensor system integration services to the United States Government (USG) and the industry base that supports them. Geon seeks to be known for “signals, sensors, and systems”. Geon has expertise in the science and development of signal processing techniques and...


  • San Francisco, CA, United States Earnest Current Job Openings Full time

    The Site Reliability Engineer II position will report to the Lead Cloud Engineer. As an SRE II Engineer, you will: Set up and maintain comprehensive monitoring, create and refine playbooks, build dashboards, and adopt industry-standard practices to enhance the reliability and resilience of our site and systems. Develop and manage IaC to ensure reliable,...