Site Reliability Engineer, AI/ML Infrastructure

2 weeks ago


Palo Alto, United States Boson AI Full time

About The Role

We're looking for a Senior Site Reliability Engineer to help us run one of the most exciting GPU clusters around: our Toronto datacenter, packed with NVIDIA H100 and A100 GPUs, over 20PB of Ceph storage, terabit networking, and hundreds of servers. You'll be hands-on with the full lifecycle of HPC infrastructure: planning, building, testing, deploying, and keeping everything running smoothly. That means troubleshooting issues as they arise, monitoring performance, developing automation to make our lives easier, and working closely with engineering and science teams to ensure they have what they need. You'll also help us plan for future capacity and evaluate new technologies as we continue to scale.

Responsibilities

  • Manage and optimize HPC cluster operations
  • Deploy and maintain infrastructure-as-code solutions
  • Support ML/research teams with cluster usage optimization
  • Operate, troubleshoot, and optimize Ceph storage clusters
  • Develop automation and tooling

Minimum Qualifications

  • 5+ years of experience in SRE or HPC operations
  • Proficiency in Linux systems administration (Ubuntu/Debian)
  • Experience with Kubernetes and container orchestration
  • Experience deploying and maintaining Ceph clusters larger than 1PB
  • Knowledge of security best practices in multi-tenant environments
  • Understanding of L2/L3 networking fundamentals
  • Skilled in Python and Bash scripting

Preferred Qualifications

  • Experience with infrastructure-as-code tools (Ansible/Terraform)
  • Experience with GitOps (Helm, ArgoCD)
  • Strong grasp of RDMA, InfiniBand, and GPUDirect technologies
  • Familiarity with deep learning frameworks such as PyTorch and TensorFlow
  • Familiarity with at least one cloud platform: AWS, Azure, or GCP

$150,000 - $250,000 a year

If you're a natural problem-solver with a passion for continuous learning, we'd love to hear from you.



  • Palo Alto, United States Archetype AI Full time

    About Archetype AI: Archetype AI is developing the world's first AI platform to bring AI into the real world. Formed by an exceptionally high-caliber team from Google, Archetype AI is building a foundation model for the physical world, a real-time multimodal LLM for real life, transforming...


  • Palo Alto, California, United States Archetype AI Full time $100,000 - $120,000 per year

    About Archetype AI: Archetype AI is developing the world's first AI platform to bring AI into the real world. Formed by an exceptionally high-caliber team from Google, Archetype AI is building a foundation model for the physical world, a real-time multimodal LLM for real life, transforming real-world data into valuable insights and knowledge that people will...


  • Palo Alto, United States Zyphra Full time

    Zyphra is an artificial intelligence company based in Palo Alto, California. The Role: As an Infrastructure Engineer - Site Reliability, you’ll be responsible for designing and maintaining the systems that keep Zyphra’s infrastructure robust, observable, secure, and scalable. Your work will be essential to ensuring the reliability and reproducibility of ML...


  • Palo Alto, United States Boson AI Full time

    About The Role: We're seeking an experienced Network Engineer to design, build, and optimize the high-performance networking infrastructure powering our AI/ML operations in Toronto. You'll work at the cutting edge of network technology, managing InfiniBand and ultra-high-speed Ethernet fabrics that connect NVIDIA H100 and A100 GPUs, over 20PB of Ceph storage,...


  • Palo Alto, United States Zyphra Full time

    Zyphra is an artificial intelligence company based in Palo Alto, California. The Role: As a Machine Learning Infrastructure Engineer - Site Reliability, you'll be responsible for designing and maintaining the systems that keep Zyphra's infrastructure robust, observable, secure, and scalable. Your work will be essential to ensuring the reliability and...


  • Palo Alto, United States Tesla Full time

    Site Reliability Engineer, HPC Infrastructure. What To Expect: Tesla's Supercomputing/AI infrastructure team works directly with the high-performance computing and machine learning infrastructure on which our ML algorithms run; this includes virtual simulations, Autopilot hardware...


  • Palo Alto, United States FLUIX Full time

    FLUIX is building the AI operating system that plans, designs, and optimizes AI infrastructure. We are based in Silicon Valley. We specialize in providing AI-driven solutions for data centers and power providers, leveraging cutting-edge Machine Learning (ML) and Artificial Intelligence (AI) technologies. Our mission is to double America’s compute capacity...


  • Palo Alto, United States Tesla Motors, Inc. Full time

    What to Expect: Tesla's Supercomputing/AI infrastructure team works directly with the high-performance computing and machine learning infrastructure on which our ML algorithms run; this includes virtual simulations, Autopilot hardware & silicon design. With the rapidly-growing need for more data and optimized compute resources, cluster builds are getting...


  • Palo Alto, United States xAI Full time

    About xAI: xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational...