Site Reliability Engineer
1 month ago
We are seeking talented professionals to join our successful and growing team in building the next-generation Continuous Diagnostics and Mitigation (CDM) Cyber data solution. The CDM Program is the Cybersecurity and Infrastructure Security Agency’s (CISA) dynamic approach to strengthening the cybersecurity of Federal networks and systems through better awareness and visibility into their security posture and cyber threats. The CDM Data Services product is an integrated suite of multiple Commercial Off-the-Shelf (COTS) products, software configuration packages, and custom code that work together as a single solution tailored to meet Department of Homeland Security (DHS) requirements.
We are looking for a talented Site Reliability Engineer (SRE) to play a key role in defining, implementing, and growing our SRE practice to ensure the reliability, availability, and performance of our critical production environments. The SRE will contribute to a culture of continuous improvement by identifying areas for enhancement and driving initiatives to improve system reliability, scalability, and efficiency. The successful candidate will have demonstrated hands-on experience designing, implementing, and maintaining solutions to ensure that systems, including infrastructure and applications, are resilient, highly available, and performant.
The SRE will also play a critical role in defining and measuring the Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for our solution. The SRE will be responsible for setting up comprehensive logging, monitoring, and alerting solutions using the Elastic stack and other tools as necessary to ensure the continuous performance of services. Additionally, they will respond to incidents, perform root cause analyses, and implement solutions to prevent recurrences. The Journeyperson SRE will work in close collaboration with other SRE team members, developers, testers, infrastructure engineers, DevOps engineers, and other stakeholders to integrate reliability and observability into the software development lifecycle.
Required Skills
- US citizenship with ability to obtain Public Trust Suitability
- 4+ years of experience as a Site Reliability Engineer (SRE) or equivalent
- 4+ years of demonstrated experience designing, implementing, and maintaining observability solutions, including logging, monitoring, and alerting
- 4+ years of hands-on experience with SRE tools (e.g., Elastic, Prometheus, Grafana, Splunk)
- 2+ years defining and measuring SLOs and SLIs
- 2+ years of relevant experience using cloud platforms (AWS GovCloud preferred)
- 2+ years of hands-on programming or scripting (e.g., Python, Bash)
- Strong knowledge of microservices, containerization, and orchestration tools (Docker, Kubernetes)
- Proven ability to collaborate with cross-functional teams (development, testing, and product) to integrate reliability and observability into the software development lifecycle
- Strong problem-solving and analytical skills
- Proactive, detail-oriented approach to identifying inefficiencies and implementing improvements
Desired Skills
- Bachelor's degree in Computer Science, Engineering, or a related field (or 4 additional years of related experience)
- Experience working in an Agile/SAFe environment using ALM tools (Jira, Confluence, or similar)
- Strong understanding of CI/CD principles and platforms (Jenkins, CircleCI, GitLab, GitHub Actions, Argo, Travis CI, etc.)
- Expertise in configuration management tools (Ansible, Puppet, Chef)
- Experience with infrastructure as code (Terraform, CloudFormation)
- In-depth understanding of networking, security, and system administration of Linux operating systems
- Knowledge of version control platforms and branching strategies
- Knowledge of disaster recovery planning, backup strategies, and data replication
- Experience supporting large Federal programs ($200M+)
-
Site Reliability Engineer
1 month ago
Fairfax, United States Apex Systems Full time
We are seeking talented professionals to join our successful and growing team in building the next-generation Continuous Diagnostics and Mitigation (CDM) Cyber data solution. The CDM Program is the Cybersecurity and Infrastructure Security Agency’s (CISA) dynamic approach to strengthening the cybersecurity of Federal networks and systems through better...
-
Site Reliability Engineer
6 days ago
Chicago, IL, United States WEX, Inc. Full time
The WEX Site Reliability Engineering (SRE) team is seeking an entry-level Site Reliability Engineer Level 1 who is passionate about learning and growing in the field of software development and solutions focused on observability, incident response, reliability and performance, operational excellence, and compliance. The team will be part of the Benefits...
-
Site Reliability Engineer
7 days ago
Sunnyvale, CA, United States Natcast, Inc. Full time
Natcast (short for The National Center for the Advancement of Semiconductor Technology) is a new, purpose-built, non-profit entity created to operate the National Semiconductor Technology Center (NSTC) consortium, established by the CHIPS Act of the U.S. government. Working at Natcast represents an opportunity to help extend America’s leadership in...
-
Site Reliability Engineer
1 month ago
Annapolis Junction, MD, United States Maximus Full time
General information ...
-
Site Reliability Engineer
1 month ago
Duluth, GA, United States BlueSky Resource Solutions Full time
Job Title: Site Reliability Engineer – Observability
Overview: We are seeking a Site Reliability Engineer III to develop and maintain our observability platform. This role focuses on ensuring the reliability, performance, and scalability of microservices, Kubernetes clusters, and cloud infrastructure. You'll collaborate with cross-functional teams to deliver...
-
Site Reliability Engineer
6 days ago
Miami, FL, United States Royal Caribbean Group Full time
Site Reliability Engineer
Journey with us! Combine your career goals and sense of adventure by joining our incredible team of employees at Royal Caribbean Group. We are proud to offer a competitive compensation and benefits package, and excellent career development opportunities, each offering unique ways to explore the world. We are proud to be the...
-
Redwood City, CA, United States C3 AI Full time
We are looking for an Associate Site Reliability Engineer / Site Reliability Engineer to join our team at our HQ in Redwood City, CA.
Responsibilities: Maximize system uptime and availability, ensuring functional and performance SLAs. Establish end-to-end monitoring and alerting on all critical aspects. Solve complex problems for critical services...
-
Site Reliability Engineer @ Mclean, VA
6 days ago
McLean, VA, United States CV Library Full time
Role: Site Reliability Engineer
Location: McLean or Richmond, VA
Type: Contract to hire
Nice to have skills: Experience in Financial Domain
Roles & Responsibilities:
- Experience with at least one of the following: Java, Python, or Go
- Experience working with AWS tools and services, DevOps environments
- Experience with agile practices
- 4+ years of site...
-
Site Reliability Engineer
6 days ago
Washington, DC, United States Alldus International Consulting Ltd Full time
Our client is a Series A startup within the Generative AI space and they are hiring a Site Reliability Engineer to join the team. Backed by one of the leading venture capital firms in the industry, this is an exciting opportunity to join a SaaS company that is revolutionizing their industry.
Responsibilities: As the Site Reliability Engineer, you will...
-
Site Reliability Engineer
1 month ago
Portland, OR, United States Matlen Silver Full time
Compensation: $70 - $75/Hour
Hybrid: 2 Days Onsite Portland, Oregon
Domain: Retail/Supply Chain
Job Title: Site Reliability Engineer
Position Summary
As a Site Reliability Engineer/DevOps Engineer, you will be responsible for ensuring the availability, performance, and reliability of Fulfillment Technology solutions for our client to support omni-channel...
-
McLean, VA, United States Capital One Full time
Center 3 (19075), United States of America, McLean, Virginia
Lead Platform Engineer, Site Reliability Engineering (SRE)
Do you love building and pioneering in the technology space? Do you enjoy solving complex technical problems in a fast-paced, collaborative, inclusive, and iterative delivery environment? At Capital One, you'll be part of a big group of...
-
Site Reliability Engineer IN
3 weeks ago
Indianapolis, IN, United States BCforward Full time
Site Reliability Engineer
BCforward is currently seeking a highly motivated Site Reliability Engineer for an opportunity in Remote!
Position Title: Site Reliability Engineer
Location: Remote
Anticipated Start Date: 12/10/2024
Please note this is the target date and is subject to change. BCforward will send official notice ahead of a confirmed start date.
Expected...
-
Site Reliability Engineer
6 days ago
Aiea, HI, United States Smxtech Full time
SMX is seeking a Site Reliability Engineer to support the USINDOPACOM J6 portfolio of programs. This position is a hybrid between Camp H.M. Smith Marine Corps Base and Joint Base Pearl Harbor-Hickam in Hawaii. This position requires a DoD TS/SCI security clearance which requires US citizenship for work on DoD contracts.
Responsibilities
Independently manage...
-
Site Reliability Engineer
6 days ago
Sunnyvale, CA, United States Apple Inc. Full time
Imagine what you could do here. At Apple, new ideas have a way of becoming extraordinary products, services, and customer experiences very quickly. Bring passion and dedication to your job and there's no telling what you could accomplish. The people here at Apple don’t just create products —...
-
Principal Site Reliability Engineer
6 days ago
Sunnyvale, CA, United States Microsoft Full time
There has never been a more exciting time to be working in healthcare at Microsoft. Our Health & Life Sciences Solutions organization is an interdisciplinary team of product managers, designers, engineers, and clinicians who are designing, developing and deploying next-generation healthcare solutions powered by the Microsoft Cloud for healthcare...
-
Site Reliability Engineer
3 weeks ago
Miami, FL, United States INSPYR Solutions Full time
Title: Site Reliability Engineer
Location: Miami, FL
Duration: 6+ months
Compensation: $55.00 - $60.00
Work Requirements: US Citizen, GC Holders or Authorized to Work in the U.S.
Site Reliability...
-
Site Reliability Engineer II
6 days ago
Redmond, WA, United States Microsoft Full time
Overview
Security represents the most critical priorities for our customers in a world awash in digital threats, regulatory scrutiny, and estate complexity. Microsoft Security aspires to make the world a safer place for all. We want to reshape security and empower every user, customer, and developer with a security cloud that protects them with end to end,...
-
Site Reliability Engineer
6 days ago
Columbia, MD, United States Geon Technologies, LLC Full time
Geon Technologies is a rapidly growing small business that provides signal processing and sensor system integration services to the United States Government (USG) and the industry base that supports them. Geon seeks to be known for “signals, sensors, and systems”. Geon has expertise in the science and development of signal processing techniques and...
-
Site Reliability Engineer II
7 days ago
San Francisco, CA, United States Earnest Current Job Openings Full time
The Site Reliability Engineer II position will report to the Lead Cloud Engineer. As an SRE II Engineer, you will: Set up and maintain comprehensive monitoring, create and refine playbooks, build dashboards, and adopt industry-standard practices to enhance the reliability and resilience of our site and systems. Develop and manage IaC to ensure reliable,...