Senior Site Reliability Engineer

4 weeks ago


Palo Alto CA, United States Mumba Technologies, Inc. Full time

About the Role
We are seeking a highly skilled Senior Site Reliability Engineer to join our team. In this role, your responsibilities will include designing and implementing infrastructure automation and continuous integration and delivery pipelines, and monitoring and scaling the infrastructure that powers our healthcare AI platform. You will work closely with software engineers, research scientists, and other cross-functional teams to develop and maintain reliable and scalable infrastructure that enables rapid iteration and deployment of our products.
Key Responsibilities
Design and implement infrastructure automation and deployment pipelines using tools such as Terraform
Implement and maintain monitoring and logging systems to ensure the reliability and performance of our healthcare AI platform
Work closely with software engineers to design and deploy scalable, fault-tolerant, and secure production systems on cloud platforms such as AWS, GCP, or Azure
Develop and maintain security and compliance policies and procedures for our healthcare AI platform
Collaborate with cross-functional teams to troubleshoot and resolve complex issues related to infrastructure, deployment, and operations
Implement and maintain disaster recovery and business continuity plans
Develop and maintain documentation related to infrastructure, deployment, and operations
Mentor and provide technical guidance to junior engineers
Qualifications
Bachelor's or Master's degree in Computer Science, Computer Engineering, or a related field
At least 5 years of professional experience as an SRE
Strong skills in building cloud infrastructure orchestration systems (Operators) using Python or Go
Expertise in infrastructure automation and deployment tools such as Terraform or GitLab CI/CD
Experience with cloud platforms such as AWS, GCP, or Azure
Strong knowledge of containerization technologies such as Docker and Kubernetes
Experience with monitoring and logging tools such as ELK, Grafana, or Datadog
Familiarity with security and compliance best practices and tools such as HashiCorp Vault, AWS KMS, or Azure Key Vault
Strong problem-solving skills and ability to work independently and collaboratively in a team environment
Excellent communication and interpersonal skills
Experience implementing HIPAA and SOC 2 compliance is a plus
Experience working in an HPC environment is a plus



  • Palo Alto, United States Signify Technology Full time

    Senior Site Reliability Engineer. Job Title: Senior Site Reliability Engineer. Job Type: Permanent. Salary: Dependent on experience. Role Location: On-site, Palo Alto, CA. The Company: A well‑established tech organization building advanced AI products for healthcare and clinical research. The team focuses on secure, reliable platforms that process sensitive...


  • Palo Alto, California, United States Glean Full time

    About Glean: Glean is the Work AI platform that helps everyone work smarter with AI. What began as the industry's most advanced enterprise search has evolved into a full-scale Work AI ecosystem, powering intelligent Search, an AI Assistant, and scalable AI agents on one secure, open platform. With over 100 enterprise SaaS connectors, flexible LLM choice, and...


  • Palo Alto, United States xAI Full time

    About xAI: xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational...


  • Palo Alto, United States Tesla Full time

    Site Reliability Engineer, HPC Infrastructure. Join to apply for the Site Reliability Engineer, HPC Infrastructure role at Tesla. What To Expect: Tesla's Supercomputing/AI infrastructure team works directly with the high-performance computing and machine learning infrastructure on which our ML algorithms run; this includes virtual simulations, Autopilot hardware...


  • Palo Alto, United States FLUIX Full time

    FLUIX is building the AI operating system that plans, designs, and optimizes AI infrastructure. We are based in Silicon Valley. We specialize in providing AI-driven solutions for data centers and power providers, leveraging cutting-edge Machine Learning (ML) and Artificial Intelligence (AI) technologies. Our mission is to double America’s compute capacity...


  • Palo Alto, CA, United States Energy Jobline ZR Full time

    Energy Jobline is the largest and fastest growing global Energy Job Board and Energy Hub. We have an audience reach of over 7 million energy professionals, 400,000+ monthly advertised global energy and engineering jobs, and work with the leading energy companies worldwide. We focus on the Oil & Gas, Renewables, Engineering, Power, and Nuclear markets as well...


  • Palo Alto, United States ASSURED Full time

    Join to apply for the Staff Site Reliability Engineer role at Assured. This range is provided by Assured. Your actual pay will be based on your skills and experience; talk with your recruiter to learn more. Base pay range: $180,000.00/yr - $210,000.00/yr. Assured is on a mission to modernize insurance. Claims processing (i.e. should we pay this claim?), while...


  • Palo Alto, United States Rivian and Volkswagen Group Technologies Full time

    Overview: We are seeking a Senior Site Reliability Engineer (SRE) specializing in Observability to join RivianVW's Data Platform - Production Engineering team. In this role, you will design, implement, and scale robust observability systems to ensure the health, performance, and reliability of our production environment. You will collaborate closely with...