Site Reliability Engineer, HPC Infrastructure

4 weeks ago

Palo Alto, United States Tesla Full time

What To Expect

Tesla's Supercomputing/AI infrastructure team works directly with the high-performance computing and machine learning infrastructure on which our ML algorithms run; this includes virtual simulations, Autopilot hardware & silicon design. With the rapidly-growing need for more data and optimized compute resources, cluster builds are getting larger and increasingly complex. Continued development/automation of deployment, monitoring, self-healing and alerting processes is imperative to the success of our engineering groups. As the scope and impact of our Optimus, Full-Self-Driving (FSD) & Robotaxi efforts continue to scale, so does the value of this team and its work.

As a Site Reliability Engineer, you will be responsible for maintaining and improving our platform to ensure our FSD & Optimus engineering teams have the necessary tools and resources to be productive. This includes managing/operating our AI infrastructure, monitoring compute/GPU/network metrics, Linux troubleshooting & performance tuning, and security. Your work will directly facilitate neural network training at scale & streamline FSD development.

What You'll Do

- Support the AI/ML cluster infrastructure on GPU platforms, focusing on systems automation, configuration management and deployment at scale
- Improve our monitoring & self-healing pipelines, as well as security posture
- Optimize our server, storage and network performance
- Develop new tools in Python, Golang or Bash/Shell
- Use Infrastructure as Code best practices
- Participate in 24x7 on-call rotation

What You'll Bring

- Proficiency in Python, Golang and/or Bash
- Proficiency with Linux fundamentals and performance optimizations
- Experience with configuration management software (Ansible, etc.), systems monitoring & alerting (Prometheus, Grafana, Telegraf, Splunk, etc.)
- Experience with containerization technologies such as Kubernetes
- Experience with high-throughput low-latency networks, GPU-based computing systems, and/or high-performance storage systems is a plus
- Experience with Slurm, LSF and storage management of parallel file systems is a plus
- Bachelor's Degree in Computer Science, Computer Engineering, Electrical Engineering, Physics or proof of exceptional skills in related field
- 3+ years of additional equivalent experience or evidence of exceptional ability related to the position

Compensation and Benefits

As a part-time Tesla employee, you will be eligible for:

- 401(k) with employer match
- Employee Assistance Program
- Sick and Vacation time
- Tesla Babies program
- Back-up childcare and parenting support resources
- Pet Insurance

Expected Compensation

$164,480 - $246,720/annual salary + benefits

Pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. The total compensation package for this position may also include other elements dependent on the position offered. Details of participation in these benefit plans will be provided if an employee receives an offer of employment.

Seniority level: Mid-Senior level
Employment type: Part-time
Job function: Engineering and Information Technology
Industries: Motor Vehicle Manufacturing, Renewable Energy Semiconductor Manufacturing, and Utilities

Site Reliability Engineer, AI/ML Infrastructure

2 weeks ago

Palo Alto, United States Boson AI Full time

About The Role: We're looking for a Senior Site Reliability Engineer to help us run one of the most exciting GPU clusters around: our Toronto datacenter packed with NVIDIA H100 and A100 GPUs, over 20PB of Ceph storage, terabit networking, and hundreds of servers. You'll be hands-on with the full lifecycle of HPC infrastructure: planning, building, testing,...
Site Reliability Engineer

4 weeks ago

Palo Alto, United States xAI Full time

About xAI: xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational...
SRE for AI HPC Systems

2 days ago

Palo Alto, United States Pantera Capital Full time

A forward-thinking technology firm in Palo Alto seeks a Site Reliability Engineer to ensure the reliability and performance of their HPC infrastructure powering AI research. The role demands collaboration with cross-functional teams and responsibilities include designing scalable systems and troubleshooting complex issues. Ideal candidates should have 3+...
Staff HPC Infrastructure Engineer

2 weeks ago

Palo Alto, CA, United States Guardant Health Full time

Company Description: Guardant Health is a leading precision oncology company focused on guarding wellness and giving every person more time free from cancer. Founded in 2012, Guardant is transforming patient care and accelerating new cancer therapies by providing critical insights into what drives disease through its advanced blood and tissue tests,...
Staff HPC Infrastructure Engineer

1 week ago

Palo Alto, CA, United States Guardant Health Full time

Company Description: Guardant Health is a leading precision oncology company focused on guarding wellness and giving every person more time free from cancer. Founded in 2012, Guardant is transforming patient care and accelerating new cancer therapies by providing critical insights into what drives disease through its advanced blood and tissue tests,...
Staff HPC Infrastructure Engineer

7 days ago

Palo Alto, CA, United States Guardant Health Full time

Company Description: Guardant Health is a leading precision oncology company focused on guarding wellness and giving every person more time free from cancer. Founded in 2012, Guardant is transforming patient care and accelerating new cancer therapies by providing critical insights into what drives disease through its advanced blood and tissue tests,...
Product Infrastructure Engineer

4 weeks ago

Palo Alto, United States Zyphra Full time

Zyphra is an artificial intelligence company based in Palo Alto, California. The Role: As an Infrastructure Engineer - Site Reliability, you’ll be responsible for designing and maintaining the systems that keep Zyphra’s infrastructure robust, observable, secure, and scalable. Your work will be essential to ensuring the reliability and reproducibility of ML...
Senior Site Reliability Engineer

2 weeks ago

Palo Alto, United States Mumba Technologies, Inc. Full time

About the Role: We are seeking a highly skilled Senior Site Reliability Engineer to join our team. In this role, responsibilities will include designing and implementing infrastructure automation, continuous integration and delivery pipelines, and monitoring and scaling the infrastructure that powers our healthcare AI platform. You will work closely with...
Senior Site Reliability Engineer

4 weeks ago

Palo Alto, United States Mumba Technologies, Inc. Full time

We are seeking a highly skilled Senior Site Reliability Engineer to join our team. In this role, responsibilities will include designing and implementing infrastructure automation, continuous integration and delivery pipelines, and monitoring and scaling the infrastructure that powers our healthcare AI platform. You will work closely with software engineers,...
