Sr. Site Reliability Engineer

19 hours ago

McLean, VA, United States Root Center For Advanced Recovery Full time

Overview

Design. Disrupt. Repeat.
Be an agent of change on a team committed to achieving client-focused, mission-driven excellence. Steampunk is looking for an experienced Site Reliability Engineer with an appetite for taking on new challenges.

Who We Are
Steampunk is the explosive collision of human-centered design and traditional government contracting. An employee-owned company with a startup mindset and time-tested approaches tailored for the federal government, we’re passionate about creating solutions that are impactful, practical, scalable, and most importantly, that meet our clients’ ever-changing needs.

At Steampunk, we believe in disrupting the status quo and setting the pace in the ecosystem of government contractors, while repurposing tried-and-true methodologies. We believe in empowering our people to find creative solutions to intractable problems. We believe the best environment in which to grow and thrive is outside our comfort zone. While good design makes for a good product, we believe human-centered design makes for an excellent one.

We also believe effective teams are powered by diverse perspectives, backgrounds, and experiences. To that end, Steampunk is an equal opportunity employer committed to promoting diversity of race, gender, sexual orientation, religion, ethnicity, national origin, disability status, and protected veteran status, amongst our ranks. Additionally, we participate in the E-Verify program.

Why Steampunk?
Our people are the very core of what we do; their expertise and hunger for new and exciting challenges fuel our relentless pursuit of mission success. As part of our team of “Punks,” you’ll test the status quo, explore new boundaries, and set the bar high for how government clients expect to engage with contractors.

Because we value our employees’ work/life balance (and believe those who work hard deserve to play hard), we offer a very competitive benefits package, including telework/flex scheduling, health/dental with orthodontics/vision insurance upon hire, paid time off with a sell-back benefit and carryover option, 11 Federal Holidays, 100% paid military leave, 100% 401(k) plan match upon hire, professional development/education reimbursement, all flexible spending accounts, and more.

Contributions

As a Sr. Steampunk Site Reliability Engineer (SRE), you will work with program development teams, infrastructure and platform services teams, and traditional operations and maintenance teams to embrace and embody a shared responsibility for the reliability of an organization’s applications and infrastructure. As an SRE, your primary responsibility is to combine aspects of software engineering with traditional operations to maintain and improve the reliability, availability, and performance of cloud, infrastructure, and large-scale software systems and services while minimizing downtime and mitigating potential failures.

This role carries a wide variety of responsibilities:

Infrastructure Optimization: Conduct in-depth analyses of infrastructure, identifying areas for improvement in terms of performance, scalability, and resource utilization. Collaborate with development and operations teams to implement enhancements, utilizing software engineering and/or infrastructure-as-code principles to streamline deployment processes and ensure consistency across environments.
Reliability Metrics and Reporting: Define and implement key reliability metrics, service-level objectives (SLOs), and service-level indicators (SLIs) to measure and report on the health of our systems. Establish monitoring and alerting mechanisms to proactively identify potential issues before they impact users.
Automation and Tooling: Design and implement automation tools to reduce manual toil, streamline repetitive tasks, and enhance overall operational efficiency. Leverage software development techniques to create robust, scalable tooling that supports our reliability goals, and collaborate with development teams to integrate reliability features into the development lifecycle.
Performance Optimization using Software Development Techniques: Collaborate with software development teams to optimize the performance and resilience of services through code improvements, architectural enhancements, and performance tuning. Integrate automated testing and profiling into the development pipeline to identify and address performance bottlenecks early in the development lifecycle.
Capacity Planning and Scaling: Collaborate with infrastructure teams to forecast capacity requirements, ensuring our systems can seamlessly scale to meet growing user demands. Implement strategies for auto-scaling and load balancing to optimize resource utilization and enhance overall system stability.
Collaboration and Training: Work closely with development teams to embed reliability best practices into the software development process. Provide mentorship and training to cross-functional teams on SRE principles, encouraging a shared responsibility for the reliability of our services.
Incident Management: Lead the development and implementation of incident response procedures, ensuring timely and effective resolution of issues to minimize impact on users. Foster a culture of continuous improvement by conducting thorough post-incident reviews, identifying root causes, and implementing preventative measures.
Infrastructure and Systems Monitoring: Observe and monitor systems to gain insight into their performance, health, availability, and internal behavior. Understand what to monitor based on the system(s) you are managing, where to store the monitoring data, who can access historical monitoring data, and how to analyze that data to decide on future actions.

Qualifications

Required:

Bachelor's degree and at least 10 years of IT experience
Eligible to obtain and maintain a government security clearance
Knowledge and experience with Agile and DevSecOps methodologies
Experience in systems engineering in one or more areas, including telecommunications concepts, computer languages, operating systems, databases/database management systems (DBMS), and middleware
Experience with the following software/tools:
Source code and binary repository products and techniques (GitHub, GitLab, Bitbucket, Artifactory, Nexus, etc.)
Infrastructure and Cloud Management tools such as AWS CloudWatch
Log Management and Analysis tools such as Splunk
Automation and Configuration Management tools such as Terraform or Puppet

Preferred:

Knowledge and experience with New Relic and/or other AIOps platforms
Programming skills in JavaScript, Ruby, and/or Go
Experience with Nginx, HAProxy, Docker, Kubernetes or similar technologies
Experience with messaging systems, collaboration software, application-based firewall and proxy server(s), and operating systems
Experience with Linux and Windows operating systems, along with scripting tools and techniques such as Bash, CSH, KSH, ZSH, etc., and/or PowerShell
Experience with Monitoring and Alerting tools such as Prometheus, Grafana and Datadog

About Steampunk

Steampunk is a Change Agent in the Federal contracting industry, bringing new thinking to clients in the Homeland, Federal Civilian, Health and DoD sectors. Through our Human-Centered delivery methodology, we are fundamentally changing the expectations our Federal clients have for true shared accountability in solving their toughest mission challenges. As an employee-owned company, we focus on investing in our employees to enable them to do the greatest work of their careers – and rewarding them for outstanding contributions to our growth. If you want to learn more about our story, visit .

We are an equal opportunity employer and all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, disability status, protected veteran status, or any other characteristic protected by law. Steampunk participates in the E-Verify program.


Senior Site Reliability Engineer

2 months ago

McLean, United States Zachary Piper Solutions Full time

Piper Companies is seeking a Site Reliability Engineer to support a world-leading data analytics product and service provider. The Site Reliability Engineer will be expected to provide automation, cloud optimization, security implementation, and compliance support. Responsibilities of the Site Reliability Engineer include: Take on the...
Site Reliability Engineer @ McLean, VA

22 hours ago

McLean, VA, United States CV Library Full time

Role: Site Reliability Engineer. Location: McLean or Richmond, VA. Type: Contract-to-hire. Nice-to-have skills: Experience in the financial domain. Roles & Responsibilities: Experience with at least one of the following: Java, Python, or Go. Experience working with AWS tools and services and DevOps environments. Experience with agile practices. 4+ years of site...
Lead Platform Engineer, Site Reliability Engineering

22 hours ago

McLean, VA, United States Capital One Full time

Center 3 (19075), United States of America, McLean, Virginia. Lead Platform Engineer, Site Reliability Engineering (SRE). Do you love building and pioneering in the technology space? Do you enjoy solving complex technical problems in a fast-paced, collaborative, inclusive, and iterative delivery environment? At Capital One, you'll be part of a big group of...
Sr. Site Reliability Engineer

21 hours ago

Hawthorne, CA, United States SPACE EXPLORATION TECHNOLOGIES CORP Full time

SR. SITE RELIABILITY ENGINEER - TOP SECRET CLEARANCE As a Senior Site Reliability Engineer, you will architect, develop, and test key aspects of the infrastructure for an in-house solution for analysis, simulation, prototyping, and operation of software in support of all SpaceX flight systems. You will have full ownership of the automation and technical...
Site Reliability Engineer

1 month ago

Fairfax, VA, United States Apex Systems Full time

We are seeking talented professionals to join our successful and growing team in building the next-generation Continuous Diagnostics and Mitigation (CDM) Cyber data solution. The CDM Program is the Cybersecurity and Infrastructure Security Agency’s (CISA) dynamic approach to strengthening the cybersecurity of Federal networks and systems through better...
Sr. Reliability Engineer

2 days ago

Plainsboro Township, NJ, United States Integra LifeSciences Full time

Changing lives. Building Careers. Joining us is a chance for you to do important work that creates change and shapes the future of healthcare. Thinking differently is what we do best. To us, change equals opportunity. Every day, more than 4,000 of us are challenging what’s possible and making headway to help improve outcomes. Position: Sr. Reliability...
Sr. Site Reliability Engineer

2 weeks ago

Dallas, TX, United States Sygna LLC Full time

Job Title: Sr. Site Reliability Engineer. Contract Type: Contract-to-hire. Location: Hybrid (Dallas, TX). Must Have and Metrics Technical Skills: Years of experience: 7+. Ability to collaborate with cross-functional teams, troubleshoot...
Senior Site Reliability Engineer

5 days ago

McLean, United States Mindlance Full time

We are seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our team. The ideal candidate will be responsible for ensuring the availability, performance, and scalability of our systems and infrastructure. With expertise in AWS and proficiency in Go or Python, you will collaborate closely with development and operations teams to...
AWS Site Reliability Engineer

1 day ago

McLean, United States Booz Allen Hamilton Full time

Job Number: R0195476. AWS Site Reliability Engineer. The Opportunity: Engineering to make a system more resilient and efficient frees up time and money to build more capabilities. Whether you come from a background in network engineering, systems administration, or software development, if you have a passion for making systems better, we need you! As a site...
Sr. Site Reliability Engineer

2 days ago

Chicago, IL, United States Datamaxis Full time

Location: Chicago, IL. Position Type: Full-time (onsite 3 days a week, Tue, Wed, and Thu, or more if needed). Salary: $125,000 to $140,000 (10% yearly bonus). Responsibilities: Manage and monitor systems and infrastructure hosted on-premises and in the cloud. Good understanding of different layers of an application and system design - networking concepts, cloud...
Site Reliability Engineer

18 hours ago

Chicago, IL, United States WEX, Inc. Full time

The WEX Site Reliability Engineering (SRE) team is seeking an entry-level Site Reliability Engineer Level 1 who is passionate about learning and growing in the field of software development and solutions focused on observability, incident response, reliability and performance, operational excellence, and compliance. The team will be part of the Benefits...
Site Reliability Engineer @ McLean, VA

1 week ago

McLean, United States Diverse Lynx Full time

Role: Site Reliability Engineer. Location: McLean or Richmond, VA. Type: Contract-to-hire. Nice-to-have skills: Experience in the financial domain. Roles & Responsibilities: Experience with at least one of the following: Java, Python, or Go. Experience working with AWS tools and services and DevOps environments. Experience with agile practices. 4+ years of site...
Site Reliability Engineer

2 days ago

Sunnyvale, CA, United States Natcast, Inc. Full time

Natcast (short for The National Center for the Advancement of Semiconductor Technology) is a new, purpose-built, non-profit entity created to operate the National Semiconductor Technology Center (NSTC) consortium, established by the CHIPS Act of the U.S. government. Working at Natcast represents an opportunity to help extend America’s leadership in...
Site Reliability Engineer

4 weeks ago

Annapolis Junction, MD, United States Maximus Full time

General information. Job Posting Title: Site Reliability Engineer. Date: Wednesday, October 16, 2024. City: Annapolis Junction. State: MD. Country: United States. Working time: Full-time. Description & Requirements: Maximus is seeking a Site Reliability Engineer to provide expertise to a federal client in support of their mission critical systems in defense of our...
Site Reliability Engineer

1 month ago

Duluth, GA, United States BlueSky Resource Solutions Full time

Job Title: Site Reliability Engineer – Observability. Overview: We are seeking a Site Reliability Engineer III to develop and maintain our observability platform. This role focuses on ensuring the reliability, performance, and scalability of microservices, Kubernetes clusters, and cloud infrastructure. You'll collaborate with cross-functional teams to deliver...
Site Reliability Engineer

21 hours ago

Miami, FL, United States Royal Caribbean Group Full time

Site Reliability Engineer. Journey with us! Combine your career goals and sense of adventure by joining our incredible team of employees at Royal Caribbean Group. We are proud to offer a competitive compensation and benefits package, and excellent career development opportunities, each offering unique ways to explore the world. We are proud to be the...
Associate Site Reliability Engineer/Site Reliability Engineer

2 days ago

Redwood City, CA, United States C3 AI Full time

We are looking for an Associate Site Reliability Engineer / Site Reliability Engineer to join our team at our HQ in Redwood City, CA. Responsibilities: Maximize system uptime and availability, ensuring functional and performance SLAs. Establish end-to-end monitoring and alerting on all critical aspects. Solve complex problems for critical services...
Site Reliability Engineer

4 weeks ago

Newton, MA, United States Intelliswift Software Full time

Title: Site Reliability Engineer. Location: Newton, MA (Hybrid). Duration: 6 months. Pay rate: $38.73 per hour on W2. We are seeking a skilled Site Reliability Engineer (SRE) Level 2 to join our dynamic team. The ideal candidate will have a strong technical background, excellent problem-solving skills, and a passion for enhancing system reliability and...
