Site Reliability Engineer
4 weeks ago
Phaidra is a pioneering company in the field of industrial automation, leveraging AI-powered control systems to enable facilities to automatically learn and improve over time.
Our team has a proven track record of applying AI to complex problems, including achieving superhuman performance with DeepMind's AlphaGo and reducing the energy required to cool Google's Data Centers by 40%.
We are driven by our core values of Transparency, Collaboration, Operational Excellence, Ownership, and Empathy, and we seek individuals who embody these values to join our team.
Job Description
We are seeking a highly skilled Site Reliability Engineer to join our Infrastructure Engineering team. As a Site Reliability Engineer, you will be responsible for building and maintaining world-class infrastructure, working with cloud services like AWS, Azure, and GCP, and applying SRE principles for observability, SLOs, automation, and change management.
You will have the opportunity to make an immediate impact with your work and guide the product and team as we grow. Our team is currently located throughout the USA, Canada, UK, Norway, Italy, Sweden, Spain, Portugal, Japan, Singapore, and India.
Responsibilities
- Help build and maintain infrastructure for large-scale data ingestion and processing; distributed model training, evaluation, and inference; automation of the end-to-end system for continuous improvement and deployment; developer environments; and build systems.
- Work with cloud services like AWS, Azure, GCP, and cloud native technologies like Kubernetes, Prometheus, and gRPC.
- Help build CI/CD infrastructure and pipelines, and take part in DevOps duties.
- Apply SRE principles for observability, SLOs, automation, and change management.
- Write and maintain tooling and documentation for infrastructure, supported applications, and processes.
- Build and maintain cross-functional relationships with internal teams to drive initiatives.
Qualifications
- 5+ years of work experience.
- Bachelor's or Master's in Computer Science, or equivalent experience.
- Proven experience automating cloud and networking infrastructure on AWS, GCP, or Azure.
- Good understanding of Linux-based operating systems, containerization, and orchestration technologies like Docker and Kubernetes.
- Experience with Terraform or other configuration management tools like Jsonnet, Kapitan, Helm, or Kustomize.
- Experience with monitoring stacks such as Prometheus, Influx, Stackdriver, or Zabbix.
- Programming experience, ideally with Python, Go, or Bash scripting.
- Experience with writing Kubernetes Operators.
- Good understanding of DevOps, SRE principles, and platform engineering.
- Share our company values: curiosity, ownership, transparency & directness, outcome-based performance, and customer empathy.
Nice to Have
- Expertise with multi-cloud and hybrid cloud environments.
- Experience with software engineering.
- Expertise with some parts of our tech stack is a big plus.
- Experience in automating scalable multi-tenant systems architectures with high availability, fault tolerance, performance tuning, monitoring, and statistics/metrics collection.
Our Tech Stack
- Languages - (Backend) Python, Go; (Frontend) JavaScript/TypeScript, React; Customer SDK & Clients - C# .NET
- PyTorch
- Cypress
- Docker, Kubernetes, Terraform & Kapitan
- Custom Kubernetes Operators (with kopf)
- GitLab CI, ArgoCD, Atlantis, Vercel
- GCP - GKE, Pub/Sub, Cloud SQL, Bigtable, Postgres, etc.
- REST & gRPC micro-services
- Poetry, Pantsbuild
In your first 30 days...
- You will be immersed in an onboarding program that introduces you to Phaidra and our product.
- You will spend time in the Engineering org, learning how the teams operate, interact, and approach problems.
- You will read various parts of our handbook and familiarize yourself with the documentation culture at Phaidra.
- You will set up your development environment and start working on an onboarding exercise that will introduce you to various parts of our code and infrastructure base.
- You will learn about how we use agile and be able to navigate our sprint boards and backlogs.
- You will learn about various team standards and development & release processes.
- You will start to learn about our system architecture and infrastructure.
By your first 60 days...
- You will have a solid understanding of what Phaidra does and how we do it.
- You will have met with team members across Phaidra and started building relationships that will help you be successful at your job.
- You will have completed the onboarding exercise and will be on your way to completing your first production task.
By your first 90 days...
- You will have been fully integrated into the team and with team members across the company.
- You will have gained a more in-depth understanding of our system architecture and infrastructure.
- You will have completed your first on-call experience helping monitor and improve our production environments.
- You will have become an expert with our tooling.
- You will have started to contribute to knowledge sharing throughout Phaidra.
All of our interviews are held via Google Meet, and an active camera connection is required.
- Initial screening interview with a People Operations team member (30 minutes)
- Meeting with Director, Infrastructure Engineering (30 minutes)
- Take Home Exercise
- Meeting with Site Reliability Engineers (60 minutes)
- Meeting with VP of Engineering (60 minutes)
- Culture fit interview with Phaidra's co-founders (30 minutes)
Base Salary
- US Residents: $92,800-$178,000/year
- Canada Residents: CA$113,600-CA$180,000/year
This position will also include equity.
These are good-faith estimates of the base salary range for this position. Note that the salary bands cover multiple levels, and the actual candidate level will be determined during the interview process. In addition, factors such as experience, education, and location will be taken into consideration when deciding final compensation.
Benefits & Perks
- Fast-paced and team-oriented environment where you will be instrumental in the direction of the company.
- Phaidra is a 100% remote company with a digital nomad policy.
- Competitive compensation & equity.
- Outsized responsibilities & professional development.
- Training is foundational: functional, customer immersion, and development training.
- Medical, dental, and vision insurance (exact benefits vary by region).
- Unlimited paid time off, with a required minimum of 20 days off per year.
- Paid parental leave (exact benefits vary by region).
- Home office setup allowance, coworking space stipend, and company MacBook.
*Please note that Phaidra's benefits and perks listed above do not apply to temporary employees such as interns.
On being Remote
We are thoughtful about remote collaboration. We look to the pioneers - like GitLab - for inspiration and best practices to create a stellar remote work environment. We have a documentation-first culture and actively practice asynchronous communication in everything we do. Our team stays connected through tools like Slack and video chat. Most teams meet daily, and we have dedicated all-hands meetings weekly to build strong relationships. We hold virtual team building events once per quarter - and even hold virtual socials to watch rocket launches. We have had all-company summits in locations like Seattle, Athens, Goa, and Barcelona.
Equal Opportunity Employment
Phaidra is an Equal Opportunity Employer; employment with Phaidra is governed on the basis of merit, competence, and qualifications and will not be influenced in any manner by race, color, religion, gender, national origin/ethnicity, veteran status, disability status, age, sexual orientation, gender identity, marital status, mental or physical disability, or any other legally protected status. We welcome diversity and strive to maintain an inclusive environment for all employees. If you need assistance with completing the application process, please contact us at
E-Verify Notice
Phaidra participates in E-Verify, an employment authorization database provided through the U.S. Department of Homeland Security (DHS) and Social Security Administration (SSA). As required by law, we will provide the SSA and, if necessary, the DHS, with information from each new employee's Form I-9 to confirm work authorization for those residing in the United States.
Additional information about E-Verify can be found here.
WE DO NOT ACCEPT APPLICATIONS FROM RECRUITERS.