Senior Software Engineer - Reliability

1 week ago


Palo Alto, United States | Luma AI | Full time

The SRE role at Luma AI sits with the Infrastructure and Research teams and is responsible for our GPU clusters. Luma runs on thousands of H100 GPUs across multiple providers and clusters for Training, Data Processing and Inference. We need a highly skilled SRE to keep those clusters healthy and to build the monitoring and management tools we need to make full use of them. Successful candidates will want to get deep in the weeds solving performance and maintenance problems in our clusters.

Responsibilities

    • Collaborate with researchers and engineers to specify the availability, performance, correctness, and efficiency requirements of the current and future versions of our GPU infrastructure.
    • Work with multiple GPU cloud providers to scale up, scale down, maintain and monitor our thousands of GPUs across many clusters.
    • Design and implement solutions to ensure the scalability of our infrastructure to meet rapidly increasing demands.
    • Implement and manage monitoring systems to proactively identify issues and anomalies in our production environment.
    • Implement fault-tolerant and resilient design patterns to minimize service disruptions.
    • Build and maintain automation tools to streamline repetitive tasks and improve system reliability.
    • Participate in an on-call rotation alongside other infrastructure developers to respond to critical incidents and ensure 24/7 system availability.
    • Develop and maintain service level objectives (SLOs) and service level indicators (SLIs) to measure and ensure system reliability.
Experience
    • 5+ years of proven work experience as a reliability engineer, production engineer, infrastructure software engineer or in a similar role at a fast-paced, rapidly scaling company.
    • Strong proficiency in GPU cloud infrastructure, including the underlying concepts of scheduling, scaling, cloud storage, networking and security.
    • Proficiency in programming/scripting languages.
    • Experience with containerization technologies and container orchestration platforms like Kubernetes or equivalent.
    • Knowledge of IaC tools such as Terraform or CloudFormation or equivalent.
    • Excellent problem-solving and troubleshooting skills.
    • Strong communication and collaboration skills.
    • Experience with observability tools; examples include DataDog, Prometheus, Grafana, Splunk and ELK stack or similar.
    • Knowledge of security best practices in cloud environments.
    • Experience as an SRE within the AI/ML space is strongly preferred.
    • Please note this role is not meant for recent grads.


$180,000 - $250,000 a year

In addition to cash base pay, you'll also receive a sizable grant of Luma's equity.

The pay range for this position is $180,000 - $250,000/yr for the Bay Area. Base pay offered will vary depending on job-related knowledge, skills, candidate location, and experience.

Your application is reviewed by real people.
