Principal Site Reliability Engineer

3 weeks ago

California, United States InStride Full time

Principal Site Reliability Engineer (SRE)

At InStride, people are our purpose. We believe that investing in people is the most powerful way to drive success for individuals and organizations alike. As a public benefit corporation, we partner with leading employers to unlock opportunities for their employees, providing access to top-tier education programs that align with their employees’ career goals and the company’s business goals. Our mission goes beyond skill-building; we’re here to empower our partners’ employees to advance their careers, elevate their expertise, and achieve meaningful personal and professional growth. No matter the team you’re on, our dedication to the success of our partners and their employees is what drives us.

Candidates must be located in one of the following states to be considered eligible for employment: AZ, CA, CO, CT, FL, GA, IL, IN, KS, LA, MD, MA, MI, MO, NV, NH, NJ, NY, PA, OH, OR, TX, VA, WA, WI.

What we’re looking for

This is a highly technical role for an individual contributor who thrives at the intersection of cloud architecture, automation, and reliability engineering. You will be the go-to AWS expert for complex initiatives, setting technical direction and raising the bar for operational excellence across our platform.

Skills we’d love to see you show off

Cloud Architecture & Strategy: Design and optimize AWS environments that balance scalability, resilience, and cost efficiency for enterprise workloads.
Technical Leadership & Mentorship: Serve as a trusted technical advisor, guiding engineers on best practices in Kubernetes, DevSecOps, and AWS-native design patterns.
Infrastructure as Code Mastery: Build reusable, version-controlled IaC libraries with AWS CDK, Terraform, or CloudFormation to standardize deployments.
Security & Compliance by Design: Enforce least-privilege IAM, encryption-by-default, and policy-as-code guardrails to meet security and regulatory standards.
Observability & Reliability Engineering: Define SLIs/SLOs, manage error budgets, and implement monitoring strategies with Prometheus, Grafana, and AWS-native tools.
CI/CD Excellence: Optimize automated pipelines with Harness and GitHub, enabling faster, safer, and more reliable software delivery.
Networking & Resilience: Architect secure, performant VPCs, load balancing, and multi-region failover strategies with AWS networking services.
Automation & Self-Service Enablement: Deliver developer-friendly automation and Internal Developer Portal (IDP) capabilities that empower teams to provision infrastructure without SRE intervention.

Who you are

10+ years of experience in SRE, DevOps, or Platform Engineering roles operating production AWS workloads.
Hands-on expertise with AWS EKS, Kubernetes networking, Helm, autoscaling frameworks (Karpenter/Cluster Autoscaler), serverless architectures, and API Gateways.
Proven delivery of service mesh solutions (Istio, Linkerd, or AWS App Mesh) for secure and observable service-to-service communication.
Proficiency with Infrastructure as Code (IaC) using AWS CDK (TypeScript preferred, or Python), Terraform, or CloudFormation.
Strong programming and automation skills in Go, Python, or TypeScript, with additional proficiency in Bash.
Demonstrated experience implementing policy-as-code with OPA/Rego or similar tooling integrated into CI/CD pipelines.
Solid understanding of SLI/SLO/error-budget methodologies and hands-on experience with monitoring and alerting stacks (Prometheus, Grafana, CloudWatch, Groundcover).
Deep knowledge of AWS security best practices, including IAM policies, encryption, OS hardening, and compliance enforcement.
Excellent communication skills with the ability to translate reliability metrics into business impact and guide incident/post-mortem discussions.
Experience mentoring engineers and influencing enterprise AWS and DevOps strategies without direct management responsibilities.
Familiarity with Internal Developer Portals (Backstage, Port, Cortex) and self-service automation is a strong plus.

How you will create impact

Elevate platform reliability: Design and operate multi-region, fault-tolerant systems that ensure InStride’s learning platform is always available for learners and partners.
Advance automation at scale: Deliver Infrastructure as Code libraries, CI/CD pipelines, and self-service capabilities that reduce operational toil and accelerate developer productivity.
Champion security and compliance: Implement defense-in-depth strategies, policy-as-code guardrails, and proactive monitoring to protect sensitive data and maintain trust.
Drive observability maturity: Define and enforce SLIs/SLOs, establish error-budget policies, and build monitoring frameworks that inform release readiness and operational decisions.
Enable seamless service connectivity: Deploy and manage service mesh solutions that secure, monitor, and optimize service-to-service communication across Kubernetes workloads.
Influence technical direction: Partner with engineering and security stakeholders to shape InStride’s AWS strategy, ensuring scalability, resilience, and cost efficiency.
Mentor and uplift engineers: Share expertise, lead design reviews, and guide teams toward modern DevOps and SRE practices, raising the technical bar across the organization.

Compensation

At InStride, final offer amounts are dependent on multiple factors including location, depth of experience, interview performance, and equity with other team members. We encourage you to talk with your recruiter to learn more about the total compensation and benefits available for this role.

Compensation range: $165,000 to $195,000 USD.

Benefits

401(k) plan with company match
Flexible vacation policy
Paid family leave
Best-in-class health care benefits
And more

Employees are eligible to enroll in 2,800+ online certificate and degree programs through our Step Forward program. We cover tuition upfront, regardless of course of study, degree type, or school, and the benefit is available to employees starting Day 1.

InStride Diversity and Inclusion Statement

At InStride, we foster a culture of belonging, support authenticity and intersectionality, and embrace our differences. We build a diverse pipeline of talent and ensure equitable access to opportunities, information and leadership. We celebrate diversity and are committed to creating an inclusive environment for all employees. If you have a disability or special need that requires accommodation, please let your recruiter know.

Policies & Disclosure

InStride recommends employees have their COVID vaccinations. InStride may require employees to have COVID vaccination before entering the office or attending any InStride-related event in the future. However, we do not require this at this time.

About InStride

InStride is a human capital management company that helps organizations retain talent, upskill employees, and fill critical workforce roles through education programs. By breaking down barriers to learning, fostering career growth aligned with organizational goals, and simplifying program management, InStride delivers lasting impact. Partnering with forward-thinking companies like Labcorp, Adidas, and SSM Health, InStride drives meaningful social and business outcomes by providing access to life-changing education. Visit instride.com or follow InStride on LinkedIn for more information and up-to-date news.

Principal Reliability Engineer — Ground Systems

2 weeks ago

California, United States Blue Origin LLC Full time

A private aerospace company is seeking a Principal Reliability Engineer to oversee reliability for tooling and ground systems. The ideal candidate will possess significant engineering experience in safety-critical environments and lead cross-functional teams to foster improvements. Key responsibilities include driving reliability strategies and leading root...
Principal Reliability Engineer – Maintenance

2 weeks ago

California, United States Blue Origin LLC Full time

Principal Reliability Engineer – Maintenance & Ground Systems. Locations: Space Coast, FL. Time type: Full time. Posted on: Posted Yesterday. Job requisition ID: R56814. Application close date: Applications will be accepted on an ongoing basis until the requisition is closed. At...
Site Reliability Engineer

5 days ago

California, United States Booz Allen Hamilton Full time

Site Reliability Engineer. The Opportunity: Engineering to make a system more resilient and efficient frees up time and money to build more capabilities. Whether you come from a background in network engineering, systems administration, or software development, if you have a passion for making systems better, we need you! As a Site Reliability Engineer on...
Site Reliability Engineer

3 weeks ago

California, United States Reliable Robotics Full time

We're building safety-enhancing technology for aviation that will save lives. Automated aviation systems will enable a future where air transportation is safer, more convenient and fundamentally transformative to the way goods — and eventually people — move around the planet. We are a team of mission-driven engineers with experience across aerospace,...
Principal Engineer

2 weeks ago

California, United States AlixPartners Full time

At AlixPartners, we solve the most complex and critical challenges by moving quickly from analysis to action when it really matters; creating value that has a lasting impact on companies, their people, and the communities they serve. By understanding, respecting, and honoring the needs of our employees, clients, and communities, AlixPartners actively...
Principal or Sr. Principal Field Engineer

3 weeks ago

California, United States Northrop Grumman Corp. (AU) Full time

RELOCATION ASSISTANCE: Relocation assistance may be available. CLEARANCE TYPE: Secret. TRAVEL: Yes, 10% of the Time. Description: At Northrop Grumman, our employees have incredible opportunities to work on revolutionary systems that impact people's lives around the world today, and for generations to come. Our pioneering and inventive spirit has enabled us to be at...
Reliability Engineer

3 weeks ago

California, United States Lunar Energy Full time

Sr. Reliability Engineer. Reliability Engineers at Lunar Energy will be responsible for ensuring product reliability throughout the entire lifecycle of our revolutionary home energy products. This includes providing input during the design phases, developing test plans, and assessing ongoing field performance. We are looking for people who are passionate...
Principal Architect – Infrastructure Engineering

2 weeks ago

California, United States DDN Full time

Principal Architect – Infrastructure Engineering & DevOps. This is an...
Senior Software Engineer

1 week ago

California, United States Jobgether Full time

Senior Software Engineer - Reliability (Remote). This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Software Engineer - Reliability (Remote) in California (USA). We are seeking a Senior Software Engineer specializing in Reliability to help design, implement, and operate systems that ensure cloud-based...
Principal Systems Engineer

3 weeks ago

California, United States Insight Global Full time

Pay Rate: $176k - $240k (estimate). Job Description: An employer in Simi, CA is looking to hire a Principal Systems Engineer. The Principal Systems Engineer is considered a subject matter expert in the discipline and demonstrates visionary and professional concepts in developing resolutions to critical issues and broad design matters pertaining to the...
