Current jobs related to Site Reliability Engineer, AI Infrastructure - Palo Alto, California - Tesla


  • Palo Alto, California, United States Foundry Technologies, Inc. Full time

    About Foundry: Foundry Technologies, Inc. is a leading provider of AI infrastructure solutions. We are seeking a highly skilled Senior Infrastructure Reliability Engineer to join our team. Job Summary: We are looking for a talented engineer to design, deploy, and maintain our AI infrastructure. The ideal candidate will have a strong background in cloud...


  • Palo Alto, California, United States lever - ATS Full time

    Job Summary: We are seeking a highly skilled Principal Cloud Reliability Engineer to join our team at Luma AI. As a key member of our Infrastructure and Research teams, you will be responsible for ensuring the health and scalability of our GPU clusters. Key Responsibilities: Collaborate with researchers and engineers to specify the availability, performance,...

  • DevOps Engineer

    4 weeks ago


    Palo Alto, California, United States OpenTeams Full time

    Job Overview: OpenTeams is seeking a talented DevOps Engineer to join our dynamic team. As a DevOps Engineer, you will play a critical role in managing infrastructure for our AI/ML workloads, ensuring seamless operations for our internal teams and external customers. You will be responsible for setting up, maintaining, and optimizing cloud and on-premise...

  • Advanced AI Engineer

    3 weeks ago


    Palo Alto, California, United States Biostate AI Full time

    Job Summary: We are seeking a highly skilled AI Engineer with expertise in Generative AI to join our team at Biostate AI. The ideal candidate will have a strong background in artificial intelligence, machine learning, and deep learning, with a focus on developing and deploying advanced AI models. You will work closely with our cross-functional teams to design...


  • Palo Alto, California, United States General Motors Full time

    Job Description: At General Motors, we are pioneering next-generation software solutions for commercial fleet owners and their drivers. As a Site Reliability Engineer, you will play a critical role in improving the reliability, scalability, and operability of our production system. Responsibilities: Lead the Site Reliability engineering effort to improve anomaly...

  • Technical Manager

    3 weeks ago


    Palo Alto, California, United States Plume Full time

    Job Overview: At Plume, we're seeking a seasoned Technical Manager to lead our Site Reliability Engineering Team. This team is responsible for ensuring the smooth operation of our cloud infrastructure, deploying new features, and resolving production issues. The ideal candidate will have a strong technical background, experience managing teams, and excellent...


  • Palo Alto, California, United States Foundry Technologies, Inc. Full time

    About the Role: We are seeking a highly skilled Senior Cloud Infrastructure Engineer to join our team at Foundry Technologies, Inc. As a key member of our infrastructure team, you will be responsible for designing, deploying, and maintaining our cloud infrastructure to support our AI workloads. Your primary focus will be on ensuring the reliability,...


  • Palo Alto, California, United States Criteo Full time

    About the Role: Criteo is seeking a highly skilled Site Reliability Engineer to join our team. As a Site Reliability Engineer, you will be responsible for designing, implementing, and maintaining scalable and highly available systems that support our business-critical applications. You will work closely with our engineering teams to identify and resolve...


  • Palo Alto, California, United States Plume Design Inc Full time

    Job Title: Technical Manager, Site Reliability Engineering. We're seeking a seasoned Technical Manager with expertise in customer-facing environments to lead our Site Reliability Engineering Team. This team focuses on deployments, fixes, and sustainability. The ideal candidate will have strong technical knowledge in key areas while prioritizing customer...


  • Palo Alto, California, United States Harnham Full time

    About the Role: We are seeking a highly skilled Technical Lead Platform Engineer to join our team at Harnham. As a key member of our engineering team, you will be responsible for leading the architecture and development of our AI platform, ensuring it scales effectively and remains secure. Key Responsibilities: Design and architect the platform, ensuring...


  • Palo Alto, California, United States Rubrik Full time

    About the Role: As a Site Reliability Engineer at Rubrik, you will play a critical role in ensuring the smooth operation of our infrastructure services. You will work closely with product managers, designers, and other engineers to define the next generation of products for Rubrik. Key Responsibilities: Ensure high availability and durability of our...


  • Palo Alto, California, United States Woven by Toyota Full time

    Job Description: Woven by Toyota is a leading mobility technology subsidiary of Toyota Motor Corporation. Our mission is to deliver safe, intelligent, and human-centered mobility for all. We achieve this through our Arene mobility software platform, safety-first automated driving technology, and Toyota Woven City - our test course for advanced mobility. Our...


  • Palo Alto, California, United States Lutra AI Full time

    Lutra AI is a pioneering technology company that empowers individuals to harness the full potential of AI, freeing up time for what truly matters. We're a tight-knit team based in the San Francisco Bay Area, renowned for our expertise in AI. If you're passionate about learning and applying the latest AI technologies to create innovative products, you'll thrive...


  • Palo Alto, California, United States Tesla Full time

    Job Title: HPC Engineer, AI Infrastructure. Tesla's AI Infrastructure team is responsible for designing and maintaining the high-performance computing systems that power our machine learning algorithms. As an HPC Engineer, you will play a critical role in ensuring the smooth operation of our AI infrastructure, including virtual simulations, Autopilot hardware,...


  • Palo Alto, California, United States Tesla Full time

    Job Summary: We are seeking a highly skilled Site Reliability Engineer to join our PLM Operations team at Tesla. As a key member of our team, you will be responsible for ensuring the reliability and performance of our PLM systems, which are critical to the success of our engineering design tools. As a Site Reliability Engineer, you will work closely with our...


  • Palo Alto, California, United States Inflection AI Full time

    At Inflection AI, we're building a cutting-edge AI platform for enterprise applications, and we're looking for a talented Machine Learning Software Engineer to join our team. About the Role: This is a critical role in integrating ML frameworks and models into our platform for enterprise applications. As a Machine Learning Software Engineer, you will develop,...


  • Palo Alto, California, United States Luma AI Full time

    Job Description: Luma AI is seeking a highly skilled Senior Backend Engineer to join our team. As a key member of our engineering team, you will be responsible for designing and building the development and production platforms that power our new products, enabling reliability and security at scale. Responsibilities: Design and build the development and...


  • Palo Alto, California, United States Foundry Technologies, Inc. Full time

    About Foundry Technologies, Inc.: Foundry Technologies, Inc. is revolutionizing the way AI companies access compute power. Our mission is to orchestrate the world's compute capacity, making it easier to use and optimized for AI workloads. We're building a new type of public cloud, one designed specifically for AI, where accessing high-performance compute is as...


  • Palo Alto, California, United States Luma AI Full time

    About the Role: We are seeking a highly skilled Backend Engineer to join our team at Luma AI. As a Backend Engineer, you will be responsible for designing and building the development and production platforms that power our new products, enabling reliability and security at scale. Key Responsibilities: Design and build the development and production platforms...


  • Palo Alto, California, United States Tesla Full time

    Job Summary: We are seeking a highly skilled Software Engineer to join our Autonomy team at Tesla. As a Software Engineer, you will contribute to the development of our AI inference and runtime stack, working closely with AI Engineers and Hardware Engineers to build the frameworks and infrastructure that enable the seamless deployment, integration, and...

Site Reliability Engineer, AI Infrastructure

4 weeks ago


Palo Alto, California, United States Tesla Full time
About the Role

We are seeking a highly skilled Site Reliability Engineer to join our AI Infrastructure team at Tesla. As a key member of our team, you will be responsible for maintaining and improving our platform so that our Full Self-Driving (FSD), Tesla Bot and Dojo engineering teams have the tools and resources they need to be productive.

Key Responsibilities
  • Manage and operate our AI infrastructure, including monitoring of compute, GPU and network metrics, Linux troubleshooting and performance tuning, and security.
  • Support the AI/ML cluster infrastructure on both GPU and Dojo platforms, focusing on systems automation, configuration management and deployment at scale.
  • Improve our monitoring and self-healing pipelines, as well as our security posture.
  • Optimize our server, storage and network performance.
  • Develop new tools in Python, Golang or Bash/Shell.
  • Use Infrastructure as Code best practices.
  • Participate in a 24x7 on-call rotation.
Requirements
  • Proficiency in Python, Golang and/or Bash.
  • Proficiency with Linux fundamentals and performance optimization.
  • Experience with configuration management software (Ansible, etc.) and with systems monitoring and alerting (Prometheus, Grafana, Telegraf, Splunk, etc.).
  • Experience with container technologies such as Kubernetes.
  • Experience with high-throughput low-latency networks, GPU-based computing systems, and/or high-performance storage systems is a plus.
  • Experience with Slurm, LSF and storage management of parallel file systems is a plus.
  • Bachelor's Degree in Computer Science, Computer Engineering, Electrical Engineering, Physics or proof of exceptional skills in a related field.
  • 3+ years of additional equivalent experience or evidence of exceptional ability related to the position.
What We Offer
  • Competitive salary and benefits package.
  • Opportunity to work on cutting-edge AI and machine learning projects.
  • Collaborative and dynamic work environment.
  • Professional growth and development opportunities.