Site Reliability Engineer

4 weeks ago

Seattle, Washington, United States | Phaidra | Full time

About Phaidra

Phaidra is a pioneering company in the field of industrial automation, leveraging AI-powered control systems to enable facilities to automatically learn and improve over time.

Our team has a proven track record of applying AI to complex problems, including reaching superhuman performance with DeepMind's AlphaGo and reducing the energy required to cool Google's data centers by 40%.

We are driven by our core values of Transparency, Collaboration, Operational Excellence, Ownership, and Empathy, and we seek individuals who embody these values to join our team.

Job Description

We are seeking a highly skilled Site Reliability Engineer to join our Infrastructure Engineering team. As a Site Reliability Engineer, you will be responsible for building and maintaining world-class infrastructure, working with cloud services like AWS, Azure, and GCP, and applying SRE principles for observability, SLOs, automation, and change management.

You will have the opportunity to make an immediate impact with your work and guide the product and team as we grow. Our team is currently located throughout the USA, Canada, UK, Norway, Italy, Sweden, Spain, Portugal, Japan, Singapore, and India.

Responsibilities

Help build and maintain infrastructure for large-scale data ingestion and processing; distributed model training, evaluation, and inference; end-to-end automation for continuous improvement and deployment; developer environments; and build systems.
Work with cloud services like AWS, Azure, and GCP, and cloud-native technologies like Kubernetes, Prometheus, and gRPC.
Help build CI/CD infrastructure and pipelines, and take part in DevOps duties.
Apply SRE principles for observability, SLOs, automation, and change management.
Write and maintain tooling and documentation for infrastructure, supported applications, and processes.
Build and maintain cross-functional relationships with internal teams to drive initiatives.

Key Qualifications

5+ years of work experience.
Bachelor's or Master's in Computer Science, or equivalent experience.
Proven experience automating cloud and networking infrastructure on AWS, GCP, or Azure.
Good understanding of Linux-based operating systems, containerization, and orchestration technologies like Docker and Kubernetes.
Experience with Terraform or other configuration management tools like Jsonnet, Kapitan, Helm, or Kustomize.
Experience with monitoring stacks such as Prometheus, Influx, Stackdriver, or Zabbix.
Programming experience, ideally with Python, Go, or Bash scripting.
Experience with writing Kubernetes Operators.
Good understanding of DevOps, SRE principles, and platform engineering.
Share our company values: curiosity, ownership, transparency & directness, outcome-based performance, and customer empathy.

Preferred Skills & Experience

Expertise with multi-cloud and hybrid-cloud environments.
Experience with software engineering.
Expertise with some parts of our tech stack is a big plus.
Experience in automating scalable multi-tenant systems architectures with high availability, fault tolerance, performance tuning, monitoring, and statistics/metrics collection.

Our Stack

Languages - (Backend) Python, Go; (Frontend) JavaScript/TypeScript, React; Customer SDK & Clients - C# .NET
PyTorch
Cypress
Docker, Kubernetes, Terraform & Kapitan
Custom Kubernetes Operators (with kopf)
GitLab CI, ArgoCD, Atlantis, Vercel
GCP - GKE, PubSub, CloudSQL, BigTable, Postgres, etc.
REST & gRPC micro-services
Poetry, Pantsbuild

Onboarding

In your first 30 days...

You will be immersed in an onboarding program that introduces you to Phaidra and our product.
You will spend time in the Engineering org, learning how the teams operate, interact, and approach problems.
You will read various parts of our handbook and familiarize yourself with the documentation culture at Phaidra.
You will set up your development environment and start working on an onboarding exercise that will introduce you to various parts of our code and infrastructure base.
You will learn about how we use agile and be able to navigate our sprint boards and backlogs.
You will learn about various team standards and development & release processes.
You will start to learn about our system architecture and infrastructure.

By the end of your first 60 days...

You will have a solid understanding of what Phaidra does and how we do it.
You will have met with team members across Phaidra and started building relationships that will help you be successful at your job.
You will have completed the onboarding exercise and will be on your way to completing your first production task.

By the end of your first 90 days...

You will be fully integrated into the team and connected with team members across the company.
You will have gained a more in-depth understanding of our system architecture and infrastructure.
You will have completed your first on-call experience helping monitor and improve our production environments.
You will have become an expert with our tooling.
You will have started to contribute to knowledge sharing throughout Phaidra.

General Interview Process

All of our interviews are held via Google Meet, and an active camera connection is required.

Initial screening interview with a People Operations team member (30 minutes)
Meeting with Director, Infrastructure Engineering (30 minutes)
Take-home exercise
Meeting with Site Reliability Engineers (60 minutes)
Meeting with VP of Engineering (60 minutes)
Culture fit interview with Phaidra's co-founders (30 minutes)

Base Salary Range

US Residents: $92,800-$178,000/year
Canada Residents: CA$113,600-CA$180,000/year

This position will also include equity.

These are good-faith estimates of the base salary range for this position. The salary bands cover multiple levels, and the actual candidate level will be determined during the interview process. Other factors such as experience, education, and location will also be taken into consideration when deciding final compensation.

Benefits & Perks

Fast-paced and team-oriented environment where you will be instrumental in the direction of the company.
Phaidra is a 100% remote company with a digital nomad policy.
Competitive compensation & equity.
Outsized responsibilities & professional development.
Training is foundational: functional, customer immersion, and development training.
Medical, dental, and vision insurance (exact benefits vary by region).
Unlimited paid time off, with a required minimum of 20 days off per year.
Paid parental leave (exact benefits vary by region).
Home office setup allowance, coworking space stipend, and company MacBook.

*Please note that Phaidra's benefits and perks listed above do not apply to temporary employees such as interns.

On being Remote

We are thoughtful about remote collaboration. We look to pioneers like GitLab for inspiration and best practices to create a stellar remote work environment. We have a documentation-first culture and actively practice asynchronous communication in everything we do. Our team stays connected through tools like Slack and video chat. Most teams meet daily, and we hold dedicated all-hands meetings weekly to build strong relationships. We hold virtual team-building events once per quarter, and even hold virtual socials to watch rocket launches. We have had all-company summits in locations like Seattle, Athens, Goa, and Barcelona.

Equal Opportunity Employment

Phaidra is an Equal Opportunity Employer; employment with Phaidra is governed on the basis of merit, competence, and qualifications and will not be influenced in any manner by race, color, religion, gender, national origin/ethnicity, veteran status, disability status, age, sexual orientation, gender identity, marital status, mental or physical disability, or any other legally protected status. We welcome diversity and strive to maintain an inclusive environment for all employees. If you need assistance with completing the application process, please contact us at

E-Verify Notice

Phaidra participates in E-Verify, an employment authorization database provided through the U.S. Department of Homeland Security (DHS) and Social Security Administration (SSA). As required by law, we will provide the SSA and, if necessary, the DHS, with information from each new employee's Form I-9 to confirm work authorization for those residing in the United States.

Additional information about E-Verify can be found here.

WE DO NOT ACCEPT APPLICATIONS FROM RECRUITERS.
