Lead Site Reliability Engineer

2 days ago

Washington, DC, United States | Talent Vine | Full time

About Our Client

Our client is redefining how modern defense technology is delivered. Based in Washington, D.C., they are built for the dynamic mission environment facing the Department of Defense, the Intelligence Community, and federal law enforcement agencies. They provide full-spectrum national security solutions that combine secure infrastructure, cleared talent, and mission-ready software to meet evolving defense challenges.

Their services include secure software development in classified environments and the design and implementation of advanced IT and cybersecurity capabilities, ranging from secure cloud architectures and enterprise infrastructure to data center operations, scientific analysis, and cutting-edge cyber defense.

They are led by technologists and veterans with firsthand mission experience, which gives them a deep understanding of both operational realities and the innovation required to succeed. Their approach is agile and outcome-based, delivering results in weeks rather than months whenever possible.

At our client, people, integrity, and excellence come first. They foster an environment where innovation thrives in support of mission-critical requirements. Team members receive competitive compensation, robust benefits, professional development and certification opportunities, and clear paths for growth while working on some of the nation's most critical projects.

Core Values

Innovation & Responsiveness: Pushing beyond legacy models with efficient, tech-led solutions built to scale and evolve
Trusted Performance: Security, compliance, and deep experience delivering in demanding environments guide all work
Mission-Focused Expertise: From veteran leadership to cleared engineers, the team understands both the technology and the mission

About the Role

As the Lead Site Reliability Engineer for a major compute and AI infrastructure engagement, you will be responsible for the reliability, scalability, and performance of one of the largest hardware and AI infrastructure efforts in the U.S. defense sector.

You will lead the deployment, management, and automation of a high-performance computing mesh across multiple secure environments, ensuring operational excellence and mission continuity for a nine-figure government program.

This is a hands-on engineering leadership role that bridges physical infrastructure and modern DevOps automation. It is ideal for someone who thrives at the intersection of hardware systems, distributed computing, and AI/ML workflows.

What You'll Do

Lead infrastructure design, deployment, and operations for large-scale hardware clusters across secure and distributed environments
Install and configure physical systems, including high-density GPU servers, networking gear, and storage arrays
Build and deploy secure Linux images and containerized workloads using OpenShift and other orchestration platforms
Develop and manage automation pipelines for provisioning, configuration management, and monitoring using modern DevOps toolchains (Ansible, Terraform, etc.)
Operate and maintain distributed networking meshes across classified and unclassified domains
Implement and manage out-of-band management tools (IPMI, iDRAC, BMC, etc.) for remote troubleshooting and control
Integrate and optimize NVIDIA GPU infrastructure for AI/ML training and inference workloads
Collaborate with mission engineers, software teams, and government operators to ensure system readiness and performance
Provide on-site technical leadership for deployments, troubleshooting, and continuous improvement
Mentor junior engineers and establish operational best practices as the program scales

What You'll Bring

3+ years of experience in site reliability, systems engineering, or hardware operations roles
Deep expertise with physical infrastructure: server racking, cabling, diagnostics, and troubleshooting
Strong Linux systems administration experience, including imaging and automated deployment
Hands-on experience managing large-scale clusters or distributed systems in OpenShift or Kubernetes
Familiarity with DevOps automation (Ansible, Terraform, CI/CD pipelines)
Experience configuring and managing networking and mesh architectures
Direct experience with NVIDIA GPUs, CUDA, and AI/ML frameworks
Proficiency with out-of-band management tools (IPMI/iDRAC)
Certifications: Linux+ and Security+ (required or in progress)
Excellent communication, documentation, and problem-solving skills
Clearance: Active TS/SCI required

Bonus Points

Experience operating in secure DoD or intelligence environments
Familiarity with Palantir platforms or other government data systems
Experience supporting AI/ML infrastructure in production or tactical settings
Experience tuning and monitoring HPC or GPU-accelerated clusters
