Site Reliability Engineer
2 weeks ago
About Us
Rivian and Volkswagen Group Technologies is a joint venture between two industry leaders with a clear vision for automotive’s next chapter. From operating systems to zonal controllers to cloud and connectivity solutions, we’re addressing the challenges of electric vehicles through technology that will set the standards for software-defined vehicles around the world. The road to the future is uncharted. By combining our expertise across connectivity, AI, security and more, we’ll map a new way forward. Working together, we’ll create a future that’s more connected, more intelligent, more sustainable for everyone.
Role Summary
We are seeking a Senior Site Reliability Engineer (SRE) specializing in Observability to join RivianVW's Data Platform - Production Engineering team. In this role, you will design, implement, and scale robust observability systems to ensure the health, performance, and reliability of our production environment. You will collaborate closely with cross-functional teams to create telemetry solutions that provide actionable insights into our distributed systems.
Responsibilities
Observability Platform Design: Architect, implement, and maintain observability systems, leveraging tools like Datadog, LGTM stack, OpenTelemetry, and Vector to enable real-time performance monitoring, logging, and alerting.
Telemetry Optimization: Evolve and scale telemetry pipelines to ensure low latency and high availability for metrics, logs, and traces across multi-cloud environments.
Performance Engineering: Proactively identify performance bottlenecks, optimize systems, and provide recommendations for reliability improvements.
Scalable Automation: Implement automation solutions to scale systems sustainably while driving improvements in reliability and deployment velocity.
Incident Management: Collaborate with the incident response team to establish data-driven debugging and troubleshooting processes using observability data.
Tooling Development: Create and maintain self-service observability tools and dashboards to empower teams across the organization.
Cross-functional Collaboration: Partner with development, DevOps, and infrastructure teams to define SLOs/SLIs and ensure observability is embedded throughout the software lifecycle.
Qualifications
Educational Background: Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.
Experience: 5+ years in Site Reliability Engineering or a related role with a strong emphasis on observability.
Technical Expertise: Proficiency in designing and operating observability platforms with tools like Prometheus, Grafana, Loki, Jaeger, or Datadog. Experience with OpenTelemetry and distributed tracing in microservices architectures. Deep knowledge of Kubernetes (e.g., EKS), ArgoCD, and Crossplane.
Programming Skills: Strong proficiency in Python, Go, or similar languages for building automation and custom telemetry solutions.
Cloud & Systems: Familiarity with multi-cloud setups, containerization (Docker), and Linux system fundamentals.
Soft Skills: Exceptional problem-solving, communication, and a data-driven approach to decision-making.
Pay Disclosure
Salary Range/Hourly Rate for California Based Applicants: $146,900 - $194,610 USD. Actual compensation will be determined based on experience, location, and other factors permitted by law.
Benefits Summary: Rivian and Volkswagen Group Technologies provides robust medical, prescription, dental and vision insurance packages for full-time employees, their spouse or domestic partner, and their children up to age 26. Coverage is effective on the first day of employment.
Equal Opportunity
Rivian and Volkswagen Group Technologies is committed to creating a diverse environment and is proud to be an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, national origin, ancestry, sex, sexual orientation, gender, gender expression, gender identity, genetic information or characteristics, physical or mental disability, marital/domestic partner status, age, military/veteran status, medical condition, or any other characteristic protected by law. We are also committed to ensuring compliance with all applicable fair employment practice laws regarding citizenship and immigration status. Rivian and Volkswagen Group Technologies is committed to ensuring that our hiring process is accessible for persons with disabilities. If you have a disability or limitation, such as those covered by the Americans with Disabilities Act, that requires accommodations to assist you in the search and application process, please email us at
Candidate Data Privacy
Rivian and VW Group Technologies ("Rivian and Volkswagen Group Technologies") may collect, use and disclose your personal information or personal data (within the meaning of the applicable data protection laws) when you apply for employment and/or participate in our recruitment processes ("Candidate Personal Data"). This data includes contact, demographic, communications, educational, professional, employment, social media/website, network/device, recruiting system usage/interaction, security and preference information.
Rivian and Volkswagen Group Technologies may use your Candidate Personal Data for the purposes of (i) tracking interactions with our recruiting system; (ii) carrying out, analyzing and improving our application and recruitment process, including assessing you and your application and conducting employment, background and reference checks; (iii) establishing an employment relationship or entering into an employment contract with you; (iv) complying with our legal, regulatory and corporate governance obligations; (v) recordkeeping; (vi) ensuring network and information security and preventing fraud; and (vii) as otherwise required or permitted by applicable law. Rivian and Volkswagen Group Technologies may share your Candidate Personal Data with (i) internal personnel who have a need to know such information in order to perform their duties, including individuals on our People Team, Finance, Legal, and the team(s) with the position(s) for which you are applying; (ii) Rivian and Volkswagen Group Technologies affiliates; and (iii) Rivian and Volkswagen Group Technologies’ service providers, including providers of background checks, staffing services, and cloud services. Rivian and Volkswagen Group Technologies may transfer or store internationally your Candidate Personal Data, including to or in the United States, Canada, and the European Union and in the cloud, and this data may be subject to the laws and accessible to the courts, law enforcement and national security authorities of such jurisdictions. Please see our Candidate Data Privacy Notice (English) and Candidate Data Privacy Notice (Serbian) for more information. Please note that we are currently not accepting applications from third party application services.
-
Site Reliability Engineer
5 hours ago
Palo Alto, United States FLUIX Full time
FLUIX is building the AI operating system that plans, designs, and optimizes AI infrastructure. We are based in Silicon Valley. We specialize in providing AI-driven solutions for data centers and power providers, leveraging cutting-edge Machine Learning (ML) and Artificial Intelligence (AI) technologies. Our mission is to double America’s compute capacity...
-
Site Reliability Engineer – Kubernetes
5 hours ago
Palo Alto, United States Theklicker Full time
Company Description: theklicker is an online platform specializing in electronic product price comparison, enabling users to browse prices across multiple booking sites effortlessly. We are dedicated to being a one-stop solution for purchasing electronic products. With a focus on delivering the best user experience, theklicker empowers users to make informed...
-
Site Reliability Engineer
1 hour ago
Palo Alto, California, United States xAI Full time
About xAI: xAI's mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure....
-
Site Reliability Engineer
5 hours ago
Palo Alto, United States Archetype AI Full time
About Archetype AI: Archetype AI is developing the world's first AI platform to bring AI into the real world. Formed by an exceptionally high-caliber team from Google, Archetype AI is building a foundation model for the physical world, a real-time multimodal LLM for real life, transforming...
-
Head of Site Reliability Engineering
6 days ago
Palo Alto, United States Iopa Solutions Full time
Overview: Do you thrive at the intersection of engineering, reliability, and leadership? Want to shape the reliability strategy of a high-growth SaaS company that operates at true global scale? We’re searching for a Head of Site Reliability Engineering to take ownership of our reliability vision, build and mentor a high-performing SRE organisation, and...
-
Site Reliability Engineer
4 weeks ago
Palo Alto, CA, United States xAI Full time
xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All...
-
Site Reliability Engineer
3 weeks ago
Palo Alto, CA, United States xAI Full time
xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All...
-
Product Infrastructure Engineer
4 weeks ago
Palo Alto, United States Zyphra Full time
Zyphra is an artificial intelligence company based in Palo Alto, California. The Role: As an Infrastructure Engineer - Site Reliability, you’ll be responsible for designing and maintaining the systems that keep Zyphra’s infrastructure robust, observable, secure, and scalable. Your work will be essential to ensuring the reliability and reproducibility of ML...
-
Head of Site Reliability Engineering
1 week ago
Palo Alto, CA, United States Iopa Solutions Full time
Do you thrive at the intersection of engineering, reliability, and leadership? We’re searching for a Head of Site Reliability Engineering to take ownership of our reliability vision, build and mentor a high-performing SRE organisation, and ensure our cloud-native platform remains fast, resilient, secure, and obsessively customer-focused. Own and evolve...
-
Site Reliability Engineer
5 hours ago
Palo Alto, United States Pantera Capital Full time
About xAI: xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational...