Site Reliability Engineer

4 weeks ago


Palo Alto, United States Ario Full time

Our Mission at Ario

You generate enormous amounts of personal data when you use the internet. This data is extremely powerful and could make your life easier, better, more magical. So why aren't you using it?

At Ario, we've developed a product that effortlessly enables you to consolidate your digital world - from your Twitter likes to your Kindle highlights - with a single click. Thanks to our unique data access approach, we're pioneering the definitive personal AI assistant.

It seamlessly merges GPT's problem-solving prowess with deep context about your life. Whether it's acting as a memory aid, providing insights about your life, or anticipating your future needs, Ario's AI intuitively understands you from the moment you two meet.

Our team consists of individuals who embody a big vision, show a lot of hustle, and share lots of laughter. The office exudes palpable energy, and we are eager to welcome the next team member. Join us at Ario and play a key role in building a future where personal data and AI intersect to empower the individual. We are based in-person in Palo Alto and offer relocation assistance as needed to new employees.

About the Role

We are currently seeking an exceptional DevOps/SRE to become a valuable member of our team. In this role, you will play a pivotal part in overseeing our application stack and cloud infrastructure, ensuring seamless orchestration and management of our services. Your responsibilities will encompass the design, development, and maintenance of our internal automation tools, which are crucial for efficiently managing our service lifecycle. Additionally, you'll be tasked with diagnosing and resolving runtime issues spanning the various tiers of our hosting stack.

Responsibilities

  • Manage containerized applications using technologies like Docker and Kubernetes
  • Implement monitoring, logging, and proactive issue identification
  • Architect and manage cloud-based infrastructure on GCP
  • Design resilient infrastructure for high availability and disaster recovery
  • Automate infrastructure setup and configuration using tools like Terraform, Ansible, or Puppet
  • Handle incident response and contribute to problem-solving efforts when necessary
  • Foster collaboration between development and operations teams

Qualifications

  • Bachelor's degree in Computer Science, Engineering, or a related field
  • 3+ years of experience in building and maintaining cloud-native production infrastructure
  • Strong passion for meticulously documenting and automating intricate data systems
  • Proficiency in Infrastructure as Code (IaC) practices
  • A solid grasp of cutting-edge monitoring solutions and techniques
  • Expertise in at least one modern programming language such as Python, Go, or similar
  • Team player who is driven by ensuring the highest level of product quality
  • Desirable: Previous exposure to troubleshooting and enhancing production infrastructure

Required Skills

Establish and optimize CI/CD pipelines for automated software delivery



  • Palo Alto, California, United States Plume Full time

    About the Job: The Technical Manager will lead a team of Site Reliability Engineers, providing technical guidance and oversight. Key responsibilities include: Supervise a team of Site Reliability Engineers who provide first-line support to Customer Clouds. Attend and conduct customer meetings for project and roadmap specification. Manage growth and performance of...


  • Palo Alto, United States JPMorgan Chase Full time

    Description: Duties: Design, build and operate large-scale production systems. Debug complex problems across the whole stack. Develop tools for application engineering teams based on operations requirements for microservices. Improve alerting and monitoring for the existing services. Assist with onboarding and mentoring new engineers. Collaborate with the...


  • Palo Alto, California, United States Plume Full time

    About the Company: Plume is a leader in the smart home and small business market, delivering services to over 50 million locations globally. Our software-defined network platform allows CSPs to decouple their service offerings from hardware and rapidly curate and deliver new services over a multi-vendor, open-platform architecture. We're looking for a seasoned...


  • Palo Alto, United States Navan Group Full time

    At Navan, “It’s all about the user. All of them.” We’re passionate about providing a seamless one-stop experience for business travelers, no matter how they travel, where they stay, or where they’re going. We are committed to building the most reliable, scalable, and efficient infrastructure to ensure our services are always available when...


  • Palo Alto, California, United States Tesla Full time

    Role Description: This is a challenging opportunity to work with cutting-edge technology and contribute to the development of automation tools. As a Site Reliability Engineer, you will drive root cause analysis of system failures, manage containerization technology, and maintain site performance using various tools. Expected Compensation: The estimated annual...


  • Palo Alto, California, United States Assured Full time

    About Assured: At Assured, we modernize insurance by providing software solutions to large insurers. We empower them to win in a technology-driven world with self-service claim filing software and backend fraud detection. Job Overview: We are looking for a Site Reliability Engineer to join our team. The ideal candidate will have experience working in a...


  • Palo Alto, United States Plume Design, Inc. Full time

    We’re looking for a seasoned Technical Manager, experienced with Customer Facing environments, to Captain our Site Reliability Engineering Team. This team is focused on deployments, fixes, and sustainability. The right candidate needs to have strong technical knowledge in key areas while focusing on customer satisfaction. What You’ll Do: Supervise a...


  • Palo Alto, United States Plume Full time

    Job Description. Life at Plume: At Plume, we believe that technology isn't about moving faster, it's about making life's moments better. Which is why we've built the world's first, and only, open and hardware-independent service delivery platform for smart homes, small businesses, enterprises, and beyond. Our SaaS platform uses...


  • Palo Alto, United States criteo Full time

    At Criteo we face some of the most challenging but interesting problems in the IT industry. We work at a scale of speed, performance and complexity that few others in the industry can compete with. Our data is not big, it’s absolutely HUGE. We have about 40 petabytes in our Hadoop storage (more than 30 TB extra per day), we take less than 10 ms to respond...


  • Palo Alto, United States Tesla, Inc. Full time

    We are seeking an experienced Site Reliability Engineer (SRE) to join our team responsible for ensuring the reliability and performance of our Dojo cluster infrastructure. The successful candidate will be responsible for providing exceptional customer response and support, managing third-party systems, and collaborating with various teams to ensure seamless...


  • Palo Alto, California, United States Tesla Full time

    Company Overview: Tesla is a leading electric vehicle manufacturer accelerating the world's transition to sustainable energy. Our mission-critical systems enable our engineers to design and develop innovative solutions. Job Summary: We are seeking a highly skilled Site Reliability Engineer to join our Design Technology Operations team. This position will be...


  • Palo Alto, United States jobs.lever.co - ATS Full time

    The SRE role at Luma AI sits with the Infrastructure and Research teams and is responsible for our GPU clusters. Luma runs on '000s of H100 GPUs across multiple providers and clusters for Training, Data Processing and Inference. We need a highly skilled SRE to ensure those clusters are healthy and to build the monitoring and management tools we need to make...


  • Palo Alto, California, United States Navan Group Full time

    At Navan, our vision is centered around providing a seamless user experience. We are passionate about delivering a one-stop-shop for business travelers, catering to their diverse needs and preferences. We are committed to building robust, scalable, and efficient infrastructure that ensures our services are always available when needed most. As we continue to...


  • Palo Alto, California, United States Wing Inflatables, Inc. Full time

    Role Overview: Wing is seeking a highly experienced Design Reliability Engineer to join our Design for Excellence team in Palo Alto, California. As a key contributor to ensuring the reliability and robustness of our hardware designs, you will leverage your deep understanding of testing methodologies and reliability engineering principles to drive significant...


  • Palo Alto, California, United States Tesla Full time

    About the Role: Tesla is looking for a highly motivated Reliability Engineering Professional to join our team. As a key member of our engineering group, you will play a crucial role in ensuring the reliability of our innovative products. This position offers an exciting opportunity to contribute to the development of cutting-edge technology and shape the...


  • Palo Alto, California, United States Testing Solutions GmbH Full time

    Unlock the Future of Multimodal AI. Luma AI is revolutionizing the field of artificial intelligence by pushing beyond language models and developing more aware, capable, and useful systems. As a Senior Software Engineer in our Reliability team, you will play a critical role in defining, measuring, and improving the reliability of our GPU clusters. Our SRE team...


  • Palo Alto, United States Wing Inflatables, Inc. Full time

    About Wing: Wing offers drone delivery as a safe, fast, and sustainable solution for last mile logistics. Consumer appetites for on-demand services are increasing, but current delivery methods are inefficient, costly, and contribute to road accidents and air pollution. Wing’s fleet of highly automated delivery drones can transport small packages directly...


  • Palo Alto, California, United States Tesla Full time

    About the Job: We are looking for an experienced Site Reliability Engineer to join our team. Your responsibilities will include building release processes, managing Kubernetes infrastructure, and maintaining site performance. You will also participate in on-call rotations and facilitate production and security incidents. Required Skills: To succeed in this role,...


  • Palo Alto, United States Tesla Full time

    As a Sr. Mechanical Reliability Engineer focusing on Tesla Megapack, you will play a key role in designing reliability into Tesla's industrial energy storage systems ensuring the products meet the highest standards of reliability. This role follows the reliability lifecycle of the product from concept to design, validation testing/analysis, manufacturing,...


  • Palo Alto, California, United States Luma AI Full time

    Job Overview: Luma AI is seeking a highly skilled Reliability Solutions Engineer to join our team. As a key member of our Infrastructure and Research teams, you will be responsible for ensuring the health and reliability of our GPU clusters. We are looking for someone with a strong background in cloud infrastructure, containerization, and...