Staff SRE Engineer
5 days ago
Recognized as the No. 1 site trusted by real estate professionals, the company has been at the forefront of online real estate for over 25 years, connecting buyers, sellers, and renters with trusted insights and expert guidance to find their perfect home. Through its robust suite of tools, it makes a significant impact not only on the real estate industry at large but also on consumers navigating the biggest purchase of their lives, by providing a user experience that is easy to use, easy to understand, and, most of all, easy to make decisions with.
Join us on our mission to empower more people to find their way home by breaking barriers to entry, making the right connections, and building confidence through expert guidance.
About the Role
We are seeking a Staff Site Reliability Engineer to join our newly formed Operations Excellence
organization, reporting to the Director, Operations Excellence. This foundational role will shape
the reliability, observability, and operational excellence of our platform infrastructure serving
millions of users. As a Staff SRE, you will be a technical leader and mentor who establishes
best practices, drives architectural decisions, and enables our 600+ engineers to deliver
exceptional customer experiences.
You will work on critical platform systems including EKS infrastructure, Skyway (CI/CD),
Frontdoor (Tyk API Gateway), Pantheon (Apollo GraphQL Federation), and our observability
stack, while establishing chaos engineering practices and driving cost optimization initiatives
with measurable ROI.
What You'll Do
Platform Reliability & Infrastructure
- Design and maintain highly available AWS infrastructure including EKS clusters, Fargate
(ECS), and multi-region architectures
- Own reliability of critical services: Skyway (CI/CD), Frontdoor (Tyk), Pantheon (Apollo
GraphQL), and supporting infrastructure
- Establish SLIs, SLOs, and error budgets for Tier 1/2/3 systems; lead architectural
reviews for reliability and cost-efficiency
- Drive adoption of reliability patterns including circuit breakers, graceful degradation, and
automated failover
Observability & Cost Optimization
- Build comprehensive observability using New Relic for APM, distributed tracing, metrics,
and logging to enable rapid troubleshooting
- Create actionable dashboards and alerts that reduce MTTD and MTTR; establish
observability standards across teams
- Analyze infrastructure spend and implement FinOps practices including rightsizing,
reserved capacity, and resource lifecycle management
- Drive cost-conscious architecture decisions and optimize CI/CD spend (CircleCI,
Argo CD)
Chaos Engineering & Incident Response
- Design chaos engineering experiments to identify system weaknesses; build frameworks
for safe production testing
- Lead game day exercises and disaster recovery simulations; create runbooks and
automation for resilience
- Participate in on-call rotation for critical systems; lead post-incident reviews and drive
systemic improvements
- Mentor engineers on incident response, communication, and escalation; contribute to
the System Health Scorecard
Technical Leadership
- Serve as a technical leader and mentor for the growing Operations Excellence team;
establish SRE principles and culture
- Partner with Platform Engineering, Quality Engineering, and product teams on reliability
initiatives
- Support security initiatives including AWS Secrets Manager migration and compliance
requirements (SOC 2, PCI, GDPR)
- Contribute to Developer Experience metrics and platform adoption goals
What You'll Bring
Experience & Expertise
- 8+ years in Site Reliability Engineering, DevOps, or Infrastructure Engineering with
a proven track record of improving system reliability
- Bachelor's degree or equivalent experience
- 5+ years hands-on experience with AWS (EKS, EC2, RDS, S3, CloudWatch, IAM) and
Kubernetes including multi-cluster management
- Strong programming skills (Python, Go, or Java) with infrastructure automation and
Infrastructure as Code experience (Terraform, CloudFormation)
- Production experience with observability tools (New Relic, Datadog, Prometheus,
Grafana, Splunk) and distributed systems architecture
- Experience with CI/CD platforms and GitOps workflows (CircleCI, Argo CD, Jenkins);
on-call rotation and high-severity incident response
- Preferred: Chaos engineering tools, API Gateway technologies (Tyk/Kong), GraphQL
federation (Apollo), cost optimization initiatives with measurable ROI, FinOps principles
Technical Skills
- Cloud & Infrastructure: AWS (EKS, Fargate, Lambda, VPC, Route 53, CloudFront),
Kubernetes, Docker, Istio Service Mesh
- CI/CD & GitOps: Argo CD, CircleCI, Jenkins, GitHub Actions
- Observability: New Relic - APM, distributed tracing, metrics & logging; Splunk - logging
- IaC & Automation: Terraform, CloudFormation, Helm, Kustomize, Python/Go/Bash
- Platform Services: Tyk Gateway, Apollo GraphQL, AWS Secrets Manager, Vault
- Incident Management: OpsGenie, PagerDuty, ServiceNow
Leadership Qualities
- Excellent communication with the ability to explain complex technical concepts to diverse
audiences
- Proven mentorship and collaboration skills across engineering, product, and business
teams
- Self-motivated and autonomous with a systems-thinking mindset focused on long-term
sustainability
- Data-driven decision making with a customer-centric approach and empathy for developer
experience
Do the best work of your life
Here, you'll partner with a diverse team of experts as you use leading-edge tech to empower everyone to meet a crucial goal: finding their way home. And you'll find your way home too. With us, you'll bring your full self to work as you innovate with speed, serve our consumers, and champion your teammates. In return, we'll provide you with a warm, welcoming, and inclusive culture; intellectual challenges; and the development opportunities you need to grow.
Diversity is important to us; therefore, we are an Equal Opportunity Employer regardless of age, color, national origin, race, religion, creed, gender, sex, sexual orientation, gender identity and/or expression, marital status, status as a disabled veteran and/or veteran of the Vietnam Era, or any other characteristic protected by federal, state, or local law. In addition, we will provide reasonable accommodations for otherwise qualified disabled individuals.