Senior Site Reliability Engineer GPU Infrastructure

4 weeks ago

San Francisco, United States Genmo Full time

We are Genmo, a research lab dedicated to building open, state-of-the-art models for video generation towards unlocking the right brain of AGI. Join us in shaping the future of AI and pushing the boundaries of what's possible in video generation.

What You’ll Do

Own the design and day-to-day operation of GPU clusters that train and serve frontier generative models.
Lead production Kubernetes operations: GPU scheduling, cluster upgrades, multi-cluster federation.
Define and implement Infrastructure-as-Code (Terraform, Helm, Ansible) and GitOps workflows with Argo CD or Flux.
Build CI/CD pipelines, automated testing, and rollout strategies for infra changes.
Develop an observability stack (Prometheus, Grafana, OpenTelemetry, eBPF) plus GPU telemetry with NVIDIA DCGM.
Optimize high-performance networking (InfiniBand/RDMA) and debug perf bottlenecks.
Run and continuously improve the 24×7 on-call rotation; lead post-incident reviews.
Partner with researchers and engineers, communicate crisply, and ship with a high-ownership mindset.

Minimum Qualifications

BS/MS/PhD in CS, EE, or related field.
3+ yrs SRE/DevOps in production; 2+ yrs managing large Kubernetes fleets.
Expert-level Kubernetes experience.
Proficient in Python and Bash and IaC tools (Terraform, Helm, Ansible).
Track record of shipping and operating large-scale infrastructure with high reliability and clear communication.

Nice to Have

Multi-cluster / multi-cloud (AWS, GCP, Azure, bare-metal) production experience.
Hands-on with containerized GPU stacks (nvidia-container-toolkit, GPU Operator).
GPU schedulers such as Slurm or Kueue.
Familiarity with CI/CD tooling (GitHub Actions, BuildKit).
Prior work with distributed training, model-serving patterns, or other ML/GPU workloads.

Machine-learning depth is a plus, not a prerequisite. We’ll help you level up if needed.

Genmo is an Equal Opportunity Employer. Candidates are evaluated without regard to age, race, color, religion, sex, disability, national origin, sexual orientation, veteran status, or any other characteristic protected by federal or state law. Genmo, Inc. is an E-Verify company and you may review the Notice of E-Verify Participation and the Right to Work posters in English and Spanish.

Site Reliability Engineer — GPU Infrastructure

4 days ago

San Francisco, United States Genmo Full time

Site Reliability Engineer — GPU Infrastructure Join Genmo, a research lab dedicated to building open, state‑of‑the‑art models for video generation. We are looking for a Site Reliability Engineer to build and operate GPU infrastructure that powers our generative models. This is a contract‑to‑hire position. What You’ll Do Own design and...
Site Reliability Engineer GPU Infrastructure

4 weeks ago

San Francisco, United States Genmo Full time

Site Reliability Engineer GPU Infrastructure Join Genmo, a research lab dedicated to building open, state-of-the-art models for video generation. We are looking for a Site Reliability Engineer to build and operate GPU infrastructure that powers our generative models. This is a contract-to-hire position. What You’ll Do Own design and day-to-day operation of...
Senior HPC

4 days ago

San Francisco, United States Sciforium Full time

Join to apply for the Senior HPC & GPU Infrastructure Engineer role at Sciforium Role Overview We are seeking a Senior HPC & GPU Infrastructure Engineer to take full ownership of the health, reliability, and performance of our GPU compute cluster. You will be the primary PyTorch custodian of our high‑density accelerator environment and the linchpin between...
Senior Site Reliability Engineer

3 weeks ago

San Francisco, United States Hamilton Barnes Associates Limited Full time

Join a stealth-mode hyperscale data center startup building an AI and cloud platform, powered by thousands of H100s, H200s, and B200s, ready to go for experimentation, full-scale model training, or inference. As a Senior Site Reliability Engineer, you’ll own the reliability, performance, and automation of this GPU-powered infrastructure, ensuring seamless...
Senior / Principal Site Reliability Engineer

4 weeks ago

San Francisco, United States Datacrunch Full time

Imagine a future where everyone has instant, low-cost access to intelligence. We’re building a fully featured European AI cloud - with everything one needs to train, experiment with, and deploy AI models. In addition, our GPUs run on 100% renewable energy. We’re ambitious, curious, and gutsy doers. We practice a low hierarchy across the company and high...
Senior / Staff Site Reliability Engineer (SRE)

3 weeks ago

San Francisco, United States DevOps projects Full time

2025-10-25 Senior / Staff Site Reliability Engineer (SRE) Fluidstack is building GPU supercomputers for top AI labs, governments, and enterprises. Our customers include Mistral, Poolside, Black Forest Labs, Meta, and more. Our team is small, highly motivated, and focused on providing a world-class supercomputing experience. We put our customers first in...
Principal Site Reliability Engineer

4 days ago

San Francisco, United States Epoch Biodesign Full time

Location: San Francisco, CA - US. Employment Type: Full time. Location Type: On-site. Department: Cloud Engineering. Crusoe's mission is to accelerate the abundance of energy and intelligence. We’re crafting the engine that powers a world where people can create ambitiously with AI - without sacrificing scale, speed, or sustainability. Be a part of the AI...
Senior+ Site Reliability Engineer

4 days ago

San Francisco, CA, United States Crusoe Full time

Crusoe's mission is to accelerate the abundance of energy and intelligence. We're crafting the engine that powers a world where people can create ambitiously with AI - without sacrificing scale, speed, or sustainability. Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and...
Senior+ Site Reliability Engineer

2 weeks ago

San Francisco, CA, United States Crusoe Full time

Crusoe's mission is to accelerate the abundance of energy and intelligence. We're crafting the engine that powers a world where people can create ambitiously with AI - without sacrificing scale, speed, or sustainability. Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and...
Senior+ Site Reliability Engineer

6 days ago

San Francisco, CA, United States Crusoe Full time

Crusoe's mission is to accelerate the abundance of energy and intelligence. We're crafting the engine that powers a world where people can create ambitiously with AI - without sacrificing scale, speed, or sustainability. Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and...
