Senior Infrastructure Engineer

2 days ago


Santa Clara, California, United States Boson AI Full time
About Boson AI

Boson AI is a pioneering startup dedicated to developing cutting-edge language tools for global use. Our team of visionary scientists and engineers, led by Alex Smola and Mu Li, is pushing the boundaries of generative AI models for language and beyond.

The Role

We are seeking a highly skilled Senior System Administrator to join our team in Toronto and help us operate our data center deployment. The ideal candidate will possess strong problem-solving skills and the ability to learn new tools quickly.

The successful candidate will have experience with Slurm, MAAS, Ceph, OPNsense, and related technologies. They will deploy and operate a broad range of infrastructure technologies and hardware systems, which includes managing large, private, high-end GPU clusters, configuring and maintaining network switches, and configuring and maintaining MAAS, Ceph, and Slurm.

The role offers a unique opportunity to work with the latest NVIDIA H100 GPUs, extensive storage, terabit networking, and hundreds of computers. The successful candidate will own the full lifecycle of physical systems, including deployment of new hardware, operations, triage, and troubleshooting.

Key Responsibilities
  • Manage large, private, high-end GPU clusters
  • Own the full lifecycle of physical systems, including deployment of new hardware, operations, triage, and troubleshooting
  • Configure and maintain network switches (Tomahawk TH3, Mellanox InfiniBand)
  • Configure and maintain MAAS (Metal as a Service), Ceph, and Slurm
  • Configure and automate on-premises Linux-based systems at scale using infrastructure-as-code practices
  • Configure and maintain network and security tools, including VPN, VLAN, DHCP, SSO, MFA
  • Learn about new tools and deploy them
Requirements
  • Strong background in system operations, including Slurm, Ansible, MAAS, Ceph, OPNsense, and Kubernetes
  • Experience with on-premises Data Center operations and technologies
  • Experience in managing a large hardware cluster
  • Proficiency in at least one programming language (e.g. Python) and ability to write clean, maintainable code
  • Experience in designing, deploying, and maintaining production-grade machine learning systems at scale
  • Familiarity with GPU utilization for machine learning workloads and optimization techniques
  • Experience managing firmware and system updates, e.g. on Supermicro hardware
Compensation

$125,000 - $250,000 a year

The ability to solve problems and to learn new techniques is key.


