Staff Software Engineer - Reliability

1 week ago


Palo Alto, United States Luma AI Full time

The SRE role at Luma AI sits with the Infrastructure and Research teams and is responsible for our GPU clusters. Luma runs on thousands of H100 GPUs across multiple providers and clusters for Training, Data Processing, and Inference. We need a highly skilled SRE to keep those clusters healthy and to build the monitoring and management tools we need to make full use of them. Successful candidates will want to get deep in the weeds solving performance and maintenance problems in our clusters.


Responsibilities
  • Collaborate with researchers and engineers to specify the availability, performance, correctness, and efficiency requirements of the current and future versions of our GPU infrastructure.
  • Work with multiple GPU cloud providers to scale up, scale down, maintain, and monitor our thousands of GPUs across many clusters.
  • Design and implement solutions to ensure the scalability of our infrastructure to meet rapidly increasing demands.
  • Implement and manage monitoring systems to proactively identify issues and anomalies in our production environment.
  • Implement fault-tolerant and resilient design patterns to minimize service disruptions.
  • Build and maintain automation tools to streamline repetitive tasks and improve system reliability.
  • Participate in an on-call rotation alongside other infrastructure developers to respond to critical incidents and ensure 24/7 system availability.
  • Develop and maintain service level objectives (SLOs) and service level indicators (SLIs) to measure and ensure system reliability.
Experience
  • 10+ years of proven work experience as a reliability engineer, production engineer, infrastructure software engineer, or in a similar role at a fast-paced, rapidly scaling company.
  • Strong proficiency in GPU cloud infrastructure, including the underlying concepts of scheduling, scaling, cloud storage, networking, and security.
  • Proficiency in programming/scripting languages.
  • Experience with containerization technologies and container orchestration platforms like Kubernetes or equivalent.
  • Knowledge of IaC tools such as Terraform, CloudFormation, or equivalent.
  • Excellent problem-solving and troubleshooting skills.
  • Strong communication and collaboration skills.
  • Experience with observability tools such as Datadog, Prometheus, Grafana, Splunk, the ELK stack, or similar.
  • Knowledge of security best practices in cloud environments.
  • Experience as an SRE within the AI/ML space is strongly preferred.

$200,000 - $250,000 a year

In addition to cash base pay, you'll also receive a sizable grant of Luma's equity. The pay range for this position is $200,000-$250,000/yr for the Bay Area. Base pay offered will vary depending on job-related knowledge, skills, candidate location, and experience.

Your application is reviewed by real people.

