Senior Cluster Site Reliability Engineer

7 days ago


Berkeley, California, United States | The Voleon Group | Full time | $205,000 - $235,000 per year

Voleon is a technology company that applies state-of-the-art machine learning techniques to real-world problems in finance. For nearly two decades, we have led our industry and worked at the frontier of applying machine learning to investment management. We have become a multibillion-dollar asset manager, and we have ambitious goals for the future.

As a Senior Cluster Site Reliability Engineer (SRE), you will help scale our research compute cluster to meet our growing needs, and you will apply engineering skills to ensure high uptime, reliability, and robustness. Our research clusters are at the core of our R&D, and you will be directly responsible for keeping this key resource available and performant. Your work will provide a world-class HPC platform for researchers to focus on cutting-edge machine learning problems at scale. You will support both on-prem and cloud infrastructure, and work to provide the best experience to our technical staff. You will leverage IaC, automation, and SRE principles to refine a product that operates 24/7 to support Voleon.

The Cluster Operations team works on the frontline to triage and mitigate real-time operational issues. You will be an integral member of this team, solving day-to-day issues with high urgency, while also engineering systemic improvements and architectural fixes to prevent recurring issues. You will collaborate with engineering teams to develop improvements to monitoring/telemetry. You will help design and oversee operational frameworks to ensure the cluster operates within a set of rigorous SLAs.

Responsibilities

  • Be a first responder in the event of cluster outages or issues. Triage and resolve urgent issues as they arise
  • Ensure a high degree of cluster uptime (measured in multiple nines), and define and track SLAs to quantify reliability
  • Diagnose systemic/recurring patterns of problems, and engineer precision solutions to them in collaboration with engineering teams
  • Develop robust metrics and observability for cluster health and use those metrics to inform your work. Build out custom observability mechanisms when off-the-shelf ones won't do
  • Help software and research teams design policies around fair cluster usage, and help develop enforcement mechanisms for said policies
  • Assist in forecasting cluster growth, and help select appropriate scale-up strategies. Help optimize operations across dimensions of cost and usability

Requirements

  • 5+ years of experience in SRE or DevOps roles, preferably working as a senior engineer or tech lead
  • Knowledge of HPC/batch compute frameworks (Slurm, Kueue, AWS/GCP Batch) and/or machine learning training systems (Kubeflow, MLflow, Horovod)
  • Ability to develop scripts and utilities of moderate complexity in a common scripting language (Python, Ruby, etc.)
  • Familiarity with infrastructure-as-code and configuration management tools (Terraform, Ansible)
  • Experience with cloud infrastructure (AWS or GCP)
  • Familiarity designing and implementing modern observability stacks (Prometheus, Grafana, Loki, ELK, OpenTelemetry)
  • Experience with distributed storage technologies (Lustre, Ceph, S3)
  • Embodies a "system engineer" rather than "system administrator" mindset, thinking systematically and leveraging automation
  • Bachelor's degree in computer science

Preferred Qualifications

  • Hands-on experience with HPC frameworks (Slurm, Grid Engine) and Kubernetes-based job orchestrators (Airflow, Kueue, Kubeflow Pipelines), along with other distributed computing frameworks (Ray, Modin, Dask, Spark)
  • Familiarity with ML frameworks (PyTorch/TensorFlow, JAX, Horovod, DeepSpeed)
  • Familiarity with hybrid/on-prem environments
  • Experience with containerization (Docker, Podman, Singularity), particularly for HPC/batch compute environments
  • Experience with HPC networking (InfiniBand, RDMA)
  • Solid security/IAM foundations (Identity management systems, AWS/GCP IAM, Zero Trust)

The base salary range for this position is $205,000 to $235,000 in the location(s) of this posting. Individual salaries are determined through a variety of factors, including, but not limited to, education, experience, knowledge, skills, and geography. Base salary does not include other forms of total compensation such as bonus compensation and other benefits. Our benefits package includes medical, dental and vision coverage, life and AD&D insurance, 20 days of paid time off, 9 sick days, and a 401(k) plan with a company match.

"Friends of Voleon" Candidate Referral Program
If you have a great candidate in mind for this role, you could earn $15,000 if your referred candidate is successfully hired and employed by The Voleon Group. Please use this form to submit your referral. For more details regarding eligibility, terms, and conditions, please review the Voleon Referral Bonus Program.

Equal Opportunity Employer
The Voleon Group is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.

We may use artificial intelligence (AI) tools to support parts of the hiring process. These tools assist our recruitment team but do not replace human judgement. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.


