SRE/Performance Engineering

3 weeks ago


San Francisco CA, United States Rethink recruit Full time

About Runloop
Runloop is building the foundational infrastructure for the next generation of AI development. We provide AI engineers and data scientists with lightning-fast, secure, and reproducible code sandboxes. Our platform eliminates friction in environment setup and dependencies, enabling teams to experiment, iterate, and deploy seamlessly. We're a small but dedicated team working to deliver a rock-solid platform that empowers innovation.
The Role
We're looking for a skilled Site Reliability Engineer (SRE) to ensure the reliability, observability, performance, and security of our core platform, the foundation upon which our users build. You'll work closely with engineering to maintain resilient systems that power our code sandboxes, while mentoring peers on reliability practices. This role blends deep operational expertise with a software engineering mindset.
What You'll Do
Design, operate, and improve production infrastructure on AWS, GCP, or Azure.
Define and monitor SLIs/SLOs, manage error budgets, and maintain observability with Prometheus, Grafana, and logging/tracing frameworks.
Build automation for deployments, scaling, and recovery, reducing toil and creating self-healing systems.
Lead incident response, root-cause analysis, and blameless postmortems.
Collaborate with developers to design scalable, reliable services.
Optimize distributed systems, networking, and sandbox performance.
Plan for capacity growth and support safe release/change management.
Mentor engineers on reliability and frontend distributed systems (CDNs, caching, client observability).
Qualifications
Proven experience as an SRE, DevOps Engineer, or similar role.
Strong programming skills (Python or Go preferred).
Deep knowledge of containerization (Docker, Kubernetes).
Expertise in infrastructure-as-code (Terraform or Pulumi).
Strong understanding of networking, Linux, and system security.
Hands-on experience with distributed systems and observability (metrics, logs, tracing).
Skilled in incident management, on-call rotations, and postmortem processes.
Ability to mentor and influence best practices across teams.
Bonus Points
Experience with chaos engineering, CI/CD for frontend delivery, or observability tools like Sentry, RUM, or synthetic monitoring.

Benefits
Competitive salary and equity.
Comprehensive health, dental, and vision insurance for you and your dependents.
Free lunch and snacks.
Opportunity to shape the future of AI-driven software engineering in a high-impact role.
Location
Onsite in San Francisco, CA (in office 4 days/week, optional 1 day WFH).
Join Us
If you're passionate about building resilient systems that empower developers and want to shape the future of AI-driven software engineering, we'd love to hear from you. Join Runloop and help build the infrastructure that powers tomorrow's AI.
Runloop is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, disability status, protected veteran status, sexual orientation, gender identity, or any other characteristic protected by law.


