Sr. Site Reliability Engineer, Dojo
6 days ago
We are seeking an experienced Site Reliability Engineer (SRE) to join our team responsible for ensuring the reliability and performance of our Dojo cluster infrastructure. The successful candidate will be responsible for providing exceptional customer response and support, managing third-party systems, and collaborating with various teams to ensure seamless operations. If you have a passion for troubleshooting, automation, and collaboration, we encourage you to apply.
What You’ll Do
- Respond to customer inquiries and resolve issues in a timely and professional manner
- Manage and prioritize change requests, ensuring minimal disruption to cluster operations
- Collaborate with third-party storage vendors to resolve issues and outages
- Troubleshoot and debug storage-related problems, ensuring prompt resolution and minimal downtime
- Work with network vendors to debug and resolve issues, improving overall network reliability
- Create visibility into network issues, developing and implementing monitoring and reporting tools to enhance transparency
- Collaborate with facility and operations teams to plan and execute maintenance, upgrades, and shutdowns
- Ensure seamless communication and coordination during planned and unplanned outages
- Troubleshoot and debug hardware issues through automation, identifying root causes and implementing fixes
- Develop and implement automation scripts to improve hardware monitoring and maintenance
What You’ll Bring
- 3+ years of experience in a similar SRE or infrastructure engineering role
- Strong understanding of Linux, networking, and storage systems
- Excellent problem-solving and troubleshooting skills, with the ability to debug complex issues
- Experience with automation tools, such as Ansible, Python, or similar
- Strong communication and collaboration skills, with the ability to work with various teams and vendors
- Ability to work in a fast-paced environment, with a focus on delivering high-quality results
- Familiarity with monitoring and logging tools, such as Prometheus, Grafana, or ELK preferred
- Experience with cloud-based infrastructure preferred
Along with competitive pay, as a full-time Tesla employee, you are eligible for the following benefits at day 1 of hire:
- Aetna PPO and HSA plans > 2 medical plan options with $0 payroll deduction
- Family-building, fertility, adoption and surrogacy benefits
- Dental (including orthodontic coverage) and vision plans, both have options with a $0 paycheck contribution
- Company-paid Health Savings Account (HSA) contribution when enrolled in the High Deductible Aetna medical plan with HSA
- Healthcare and Dependent Care Flexible Spending Accounts (FSA)
- 401(k) with employer match, Employee Stock Purchase Plans, and other financial benefits
- Company paid Basic Life, AD&D, short-term and long-term disability insurance
- Employee Assistance Program
- Sick and Vacation time (Flex time for salary positions), and Paid Holidays
- Back-up childcare and parenting support resources
- Voluntary benefits to include: critical illness, hospital indemnity, accident insurance, theft & legal services, and pet insurance
- Weight Loss and Tobacco Cessation Programs
- Tesla Babies program
- Commuter benefits
- Employee discounts and perks program
$120,000 - $228,000/annual salary + cash and stock awards + benefits
Pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. The total compensation package for this position may also include other elements dependent on the position offered. Details of participation in these benefit plans will be provided if an employee receives an offer of employment.
Tesla is an Equal Opportunity / Affirmative Action employer committed to diversity in the workplace. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, age, national origin, disability, protected veteran status, gender identity or any other factor protected by applicable federal, state or local laws.
Tesla is also committed to working with and providing reasonable accommodations to individuals with disabilities. Please let your recruiter know if you need an accommodation at any point during the interview process.
Privacy is a top priority for Tesla. We build it into our products and view it as an essential part of our business. To understand more about the data we collect and process as part of your application, please view our Tesla Talent Privacy Notice.