Infrastructure Engineer

1 month ago


Santa Clara, United States TEKsystems c/o Allegis Group Full time
This is a 6-month contract position. It could be extended, although this is not guaranteed. No C2C or subcontractors.
Top Skills Details:
Bare Metal GPU Provisioning
OS Installation
Infrastructure Automation
Scripting
IPMI
MUST BE IN PACIFIC OR MOUNTAIN TIME ZONE OR WILLING TO WORK THESE HOURS
Description:
The objective for each contractor is to bring up and maintain our client's DGX Cloud infrastructure services running on top of bare metal.
We are seeking a contractor to deploy our client's infrastructure services on bare metal.
In this organization, you will deploy our infrastructure services atop our accelerated-computing hardware and ensure that they run as reliably as needed.
What you'll do:
Deploy and run the cloud infrastructure services in scope to meet our business goals, performing migrations and decommissions as necessary.
Eliminate toil or automate it where the ROI of building and maintaining automation is worth it.
Practice sustainable, blameless incident prevention and incident response as a member of an on-call rotation.
Prior experience on a team of any particular name, or on an ML/AI-focused team, is not required but is a nice-to-have.
Skills:
Bare Metal, IPMI, BMC, GPU, Ansible, Python, NCCL, Slurm, Docker, Kubernetes, Go, Perl, Ruby, Bash, SRE, DevOps, CRE
Top Skills Details:
Bare Metal, IPMI, BMC, GPU
Additional Skills & Qualifications:
Ways to stand out from the crowd:
Experience working with GPUs and ancillary services and hardware on bare metal.
Experience working with or developing bare metal as a service (BMaaS) associated systems.
Experience working with or developing multi-cloud infrastructure services.
Experience teaching reliability (e.g., SRE) or more general good practices for cloud systems to peers or to other companies (e.g., CRE).
Experience in running private or public cloud systems based on one or more of Kubernetes, OpenStack, Docker or Slurm.
Experience with our client's Collective Communication Library (NCCL).
Experience Level:
Intermediate Level
Eligibility requirements apply to some benefits and may depend on your job classification and length of employment. Benefits are subject to change and may be subject to specific elections, plan, or program terms. If eligible, the benefits available for this temporary role may include the following:
Medical, dental & vision
Critical Illness, Accident, and Hospital
401(k) Retirement Plan - Pre-tax and Roth post-tax contributions available
Life Insurance (Voluntary Life & AD&D for the employee and dependents)
Short and long-term disability
Health Spending Account (HSA)
Transportation benefits
Employee Assistance Program
Time Off/Leave (PTO, Vacation or Sick Leave)

About TEKsystems:

We're partners in transformation. We help clients activate ideas and solutions to take advantage of a new world of opportunity. We are a team of 80,000 strong, working with over 6,000 clients, including 80% of the Fortune 500, across North America, Europe and Asia. As an industry leader in Full-Stack Technology Services, Talent Services, and real-world application, we work with progressive leaders to drive change. That's the power of true partnership. TEKsystems is an Allegis Group company.

The company is an equal opportunity employer and will consider all applications without regard to race, sex, age, color, religion, national origin, veteran status, disability, sexual orientation, gender identity, genetic information or any characteristic protected by law.
