SRE/Performance Engineering
3 weeks ago
About Runloop
Runloop is building the foundational infrastructure for the next generation of AI development. We provide AI engineers and data scientists with lightning-fast, secure, and reproducible code sandboxes. Our platform eliminates friction in environment setup and dependencies, enabling teams to experiment, iterate, and deploy seamlessly. We're a small but dedicated team working to deliver a rock-solid platform that empowers innovation.
The Role
We're looking for a skilled Site Reliability Engineer (SRE) to ensure the reliability, observability, performance, and security of our core platform, the foundation upon which our users build. You'll work closely with engineering to maintain resilient systems that power our code sandboxes, while mentoring peers on reliability practices. This role blends deep operational expertise with a software engineering mindset.
What You'll Do
Design, operate, and improve production infrastructure on AWS, GCP, or Azure.
Define and monitor SLIs/SLOs, manage error budgets, and maintain observability with Prometheus, Grafana, and logging/tracing frameworks.
Build automation for deployments, scaling, and recovery, reducing toil and creating self-healing systems.
Lead incident response, root-cause analysis, and blameless postmortems.
Collaborate with developers to design scalable, reliable services.
Optimize distributed systems, networking, and sandbox performance.
Plan for capacity growth and support safe release/change management.
Mentor engineers on reliability and frontend distributed systems (CDNs, caching, client observability).
Qualifications
Proven experience as an SRE, DevOps Engineer, or similar role.
Strong programming skills (Python or Go preferred).
Deep knowledge of containerization (Docker, Kubernetes).
Expertise in infrastructure-as-code (Terraform or Pulumi).
Strong understanding of networking, Linux, and system security.
Hands-on experience with distributed systems and observability (metrics, logs, tracing).
Skilled in incident management, on-call rotations, and postmortem processes.
Ability to mentor and influence best practices across teams.
Bonus Points
Experience with chaos engineering, CI/CD for frontend delivery, or observability tools like Sentry, RUM, or synthetic monitoring.
Benefits
Competitive salary and equity.
Comprehensive health, dental, and vision insurance for you and your dependents.
Free lunch and snacks.
Opportunity to shape the future of AI-driven software engineering in a high-impact role.
Location
Onsite in San Francisco, CA (in office 4 days/week, optional 1 day WFH).
Join Us
If you're passionate about building resilient systems that empower developers and want to shape the future of AI-driven software engineering, we'd love to hear from you. Join Runloop and help build the infrastructure that powers tomorrow's AI.
Runloop is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, disability status, protected veteran status, sexual orientation, gender identity, or any other characteristic protected by law.