Infrastructure Reliability Systems Lead

3 days ago


San Mateo, California, United States Roblox Full time

At Roblox, we are building the tools and platform that empower our global community of developers and creators to bring any experience they can imagine to life.

We're on a mission to connect a billion people with optimism and civility, and looking for talented individuals to help us achieve this vision.

A career at Roblox means you'll be working to shape the future of human interaction, solving unique technical challenges at scale, and helping to create safer, more civil shared experiences for everyone.

About This Role
  • This is a highly cross-functional role where collaboration is key, not only with Networking teams but also with Infrastructure teams more broadly.

You will lead new initiatives contributing to our mission of driving reliability enhancements and efficiency optimizations for Roblox's global physical network infrastructure.

Key Responsibilities:
  • Design and build cutting-edge network automation, reliability, and self-healing systems.
  • Drive projects from inception to execution to support critical Network projects, including High Performance Computing (HPC), self-healing, and network fault correlation.
  • Lead and collaborate with other engineers on the team and cross-functionally to build scalable automation and reliability systems.
  • Participate in a periodic on-call rotation for our systems.

Requirements:
  • Bachelor's degree in a relevant engineering field or equivalent experience.
  • Minimum 8 years of experience in developing network software in Golang and/or Python.
  • Experience building network software for large-scale Production Network Infrastructure, including expertise in HPC Networking.
  • A proven track record of building software for configuration management, device lifecycle management, or network monitoring.
  • Experience in Machine Learning and/or Kubernetes is considered a plus.

The estimated annual salary for this position is $308,710. As a full-time employee, you'll also be eligible for excellent medical, dental, and vision coverage, a rewarding 401(k) program, flexible vacation policy, Roflex – Flexible and supportive work policy, Roblox Admin badge for your avatar, and many other benefits.



  • San Mateo, California, United States Roblox Full time

    Transforming the Future of Human Interaction: We are seeking a highly skilled Technical Lead to join our team and shape the future of Roblox's infrastructure deployment. As a Senior Software Engineer, Continuous Deployment, you will play a critical role in designing and implementing seamless deployment strategies that empower our community to bring any...


  • San Francisco, California, United States Informal Systems Full time

    About the Role: We are looking for a Senior Manager to lead our Informal Staking team. As the Technical Engineering Manager, you will oversee the team's performance and development, driving growth and profitability. You will work closely with the technical and project delivery teams to ensure seamless operation and assign people to projects based on their...


  • San Mateo, California, United States Snowflake Computing Full time

    Key Responsibilities: The successful candidate will have a deep understanding of cloud computing technologies and a strong background in software development. Key responsibilities will include: designing and implementing scalable and secure cloud-based infrastructure; collaborating with cross-functional teams to drive innovation and growth; ensuring the reliability...


  • San Francisco, California, United States Unreal Gigs Full time

    Job Title: System Reliability Manager. Company Overview: We're a forward-thinking company that values expertise and teamwork. Salary: $130,000 per year. Job Description: We're looking for a seasoned System Reliability Manager to oversee the reliability and scalability of our cloud infrastructure. Key Responsibilities: Cross-Functional...


  • San Francisco, California, United States Crusoe Full time

    About Crusoe: Crusoe is a pioneering company in the field of AI-first cloud infrastructure. Our mission is to align the future of computing with the future of the climate. We're redefining AI cloud infrastructure and are recognized as the 'gold standard' for reliability and performance. About the Role: We're seeking an experienced SRE Manager to lead our 24/7 Site...


  • San Francisco, California, United States Focal Systems Full time

    About Us: Focal Systems is a leading retail AI solutions company based in Silicon Valley, dedicated to automating and optimizing brick-and-mortar retail using deep learning computer vision. We are a rapidly growing startup that has more than doubled in size every year since inception. Salary and Benefits: This role comes with an estimated salary of...


  • San Mateo, California, United States Snowflake Computing Full time

    About This Opportunity: We're seeking a Senior Software Engineer to join our backend team, focusing on building and maintaining infrastructure that supports Snowsight, our cutting-edge UI. This role involves designing, developing, and deploying backend services, features, and tools that ensure a seamless user experience. Your Key Responsibilities: Services:...


  • San Francisco, California, United States OpenAI Full time

    We are seeking an experienced Reliability Systems Architect to join our team at OpenAI in San Francisco. This role involves designing and implementing scalable infrastructure solutions that meet the rapidly increasing demands of our users. As a key member of our engineering team, you will collaborate with cross-functional teams to ensure the reliability,...


  • San Mateo, California, United States Verkada Full time

    Role Overview: We are seeking an experienced Infrastructure Systems Developer to join our team. In this role, you will design, implement, and troubleshoot custom-developed frontend and backend tools and platforms for business teams. You will work closely with our developers to enable them to easily put quality features into the hands of users.


  • San Francisco, California, United States Oven Full time

    About Our Company: Bun, an open-source JavaScript tooling company, seeks to make programming more accessible. Backed by significant investments from top investors in Silicon Valley, we've gained recognition as one of the top GitHub repositories, boasting a vibrant community of over 33,000 Discord members. As part of our team, you'll play a crucial role in...


  • San Diego, California, United States BAE Systems USA Full time

    About the Role: This is an exciting opportunity to join a dynamic team at BAE Systems USA as a System Infrastructure Specialist. In this role, you will be responsible for maintaining and developing all Linux infrastructure technology to maintain 24x7x365 uptime service. You will work proactively to engineer systems administration-related solutions for various...

  • Infrastructure Lead

    7 hours ago


    San Francisco, California, United States Naptha AI Full time

    Naptha AI is looking for a talented Cloud-Scale Distributed Systems Engineer to lead the development of our AI infrastructure. You will be responsible for designing and implementing scalable infrastructure for massive agent networks, architecting systems for efficient agent communication and coordination, and building robust, distributed systems for agent...


  • San Mateo, California, United States Roblox Full time

    At Roblox, we are building a cutting-edge platform that enables our community of developers and creators to bring their imagination to life. Our vision is to create a world where people can come together, interact, and have fun in a safe and civil environment. We are seeking an experienced Infrastructure Engineer to join our Machine Learning Platform team. As...


  • San Mateo, California, United States Snowflake Computing Full time

    Our Mission: At Snowflake Computing, our mission is to empower every organization on the planet to be data-driven. We believe that data has the power to transform businesses and societies, and we are committed to helping our customers unlock the full potential of their data. As a Cloud Infrastructure Engineer, you will play a critical role in helping us...


  • San Diego, California, United States Qualcomm Full time

    About Us: Qualcomm is a leading technology company that develops innovative solutions for mobile devices, automotive, and IoT industries. Our team is passionate about delivering high-quality products and services that exceed customer expectations. Job Description: We are seeking a highly skilled System Reliability Specialist to join our team at Qualcomm. As a key...


  • San Francisco, California, United States Airtable Full time

    Job Responsibilities: • Proactively identify and lead significant improvements to Airtable's infrastructure, working across teams and product areas to maximize business and engineering impact. • Work on systems-level problems in a complex design space where scalability, efficiency, reliability, and security really matter. • Build clean, reusable, and...


  • San Mateo, California, United States Verkada Full time

    About the Role: You will be at the forefront of developing and sustaining Verkada's growth infrastructure, designing robust systems that underpin our most critical strategies and initiatives. Key responsibilities include: designing and implementing reliable infrastructure for our growth engine; maintaining and enhancing backend systems for the purchase journey...


  • San Mateo, California, United States Verkada Full time

    Overview: Verkada is a leading provider of cloud-based B2B physical security solutions. Our platform integrates video security cameras, access control, environmental sensors, alarms, workplace and intercoms to provide real-time insights into an organization's physical environment. We are committed to empowering our customers with the tools they need to minimize...


  • San Mateo, California, United States IXL Learning Full time

    Senior Software Engineer - Site Reliability: We are seeking experienced Senior Software Engineers to join our Site Reliability team at IXL Learning, a leading developer of personalized learning products used by millions globally. This is an opportunity to work with various production technology stacks and take responsibility for site performance, uptime, and...


  • San Francisco, California, United States Federal Reserve Bank of San Francisco Full time

    We are the Federal Reserve Bank of San Francisco, a public servant with a mission to advance the nation's monetary, financial, and payment systems. The position of Sr./Lead Site Reliability Engineer at the Federal Reserve Bank of San Francisco involves working closely with Cash Application Delivery Services (ADS) development, QA, DevOps, and National IT...