Support Operations Engineer

4 weeks ago


New York, United States CoreWeave Full time
Job Description

CoreWeave is a specialized cloud provider, delivering a massive scale of GPU compute resources on top of the industry's fastest and most flexible infrastructure. CoreWeave builds cloud solutions for compute-intensive use cases (VFX and rendering, machine learning and AI, batch processing, and Pixel Streaming) that are up to 35 times faster and 80% less expensive than the large, generalized public clouds. Learn more at www.coreweave.com.

About the Team

CoreWeave's Support Operations team ensures peak performance and reliability across thousands of nodes in multiple supercomputer clusters, each with tens of thousands of GPUs.

Collaborate with pioneering generative AI labs, world-renowned VFX organizations, and visionary developers and artists. These innovators leverage our cutting-edge GPU cloud infrastructure to power their mission-critical workflows and achieve unprecedented capabilities.

About the role:

As a Support Operations Engineer, you will deploy, configure, and maintain CoreWeave's GPU fleet across our growing number of data centers in the U.S., Europe, and beyond.

What You'll Do:

  • You'll monitor our fleet's health, performance, and reliability using our observability stack: Grafana, Prometheus, and VictoriaMetrics.
  • You'll use CoreWeave Kubernetes to troubleshoot customer support requests and act as a technical escalation point for the Cloud Support Engineers.
  • You'll learn from your fellow Support Operations Engineer teammates and mentor junior engineers and new hires.
  • You'll leverage your knowledge of Linux (Ubuntu) to diagnose, troubleshoot, and rectify bugs across the fabric.
  • You'll assist and collaborate with other teams involved in the management and operation of CoreWeave infrastructure.
  • You'll offer expertise, guidance, and troubleshooting support to ensure the smooth functioning and optimal performance of the clusters.
  • You'll support some of the world's largest bare metal fleets of dedicated servers running the latest NVIDIA H100 GPU technology on InfiniBand deployments
  • You'll have a front-row seat to the deployment of new CoreWeave supercomputing clusters for unprecedented customer workloads in AI/HPC
  • You'll work hand in hand with our Data Center Technicians to install, configure, and troubleshoot all aspects of data center infrastructure
  • You'll liaise with Cloud Operations to ensure that the CoreWeave platform is scalable, reliable, and stable
  • You'll partner with our network engineers and software developers to collect failure logs, reproduce issues, and ultimately solve the world's hardest problems
  • You'll work with our Technical Writing team to identify, create, and maintain documentation of troubleshooting workflows, corner-case scenarios, and new discoveries
  • You'll serve as a technical liaison on incidents and escalations, communicating with all stakeholders
  • You'll participate in a 24/7 on-call rotation every few months, ensuring that mission-critical alerts are addressed for infrastructure resiliency.
  • You'll develop alerting, telemetry, and new metrics to proactively prevent issues across the fleet and reduce the need for reactive support

What we look for:

  • A working knowledge of cloud computing, virtualization, and container technologies
  • A working knowledge of Linux - tell us about your favorite Linux distro
  • A working knowledge of Kubernetes and Docker
  • A prior role in Sysadmin, Site Reliability Engineering, DevOps, or Infrastructure Operations
  • A prior role in HPC/AI
  • A knack for solving problems - recognizing technical issues, developing appropriate solutions, and following through to completion
  • A love for creating documentation and processes to better your team's internal knowledge base
  • An interest in building the world's largest bespoke supercomputers for leading AI labs
  • A solid understanding of distributed computing environments and methodologies, such as storage volumes, private networks, load balancers, and virtual machines
  • Excellent communication skills (both written and verbal)
  • A willingness to work in a very fast-paced environment with dynamic priorities and ever-changing developments
  • An ability to work highly independently while collaborating well as part of a team
  • Willingness and interest to travel to CoreWeave data centers as needed

Plus Points:

  • Prior experience with computer hardware or server hardware - did you build your own PC at home?
  • Prior experience in a data center as an engineer or a technician - what kind of servers did you work on?
  • Prior experience with NVIDIA GPUs and CUDA technologies
  • Prior experience with SuperMicro, Dell, HP Enterprise, and Gigabyte systems
  • Prior experience with HPC systems
  • Prior experience with AI / ML

Our compensation reflects the cost of labor across several US geographic markets. The base pay for this position ranges from $75,000/year to $110,000/year. Pay is based on a number of factors including market location and may vary depending on job-related knowledge, skills, and experience.

Hybrid Workplace

If you reside within a 30-mile radius of our New Jersey, New York, or Philadelphia offices, we're excited for you to join us at the office at least three times a week, recognizing the significance we place on fostering connections, collaboration, and creativity within our office culture. Our commitment to operating as a hybrid workplace underscores our dedication to enabling our employees to tailor their work-life balance to their individual preferences.

Why CoreWeave?

At CoreWeave, we work hard, have fun, and move fast. We're in an exciting stage of hyper-growth that you will not want to miss out on. We're not afraid of a little chaos, and we're constantly learning. Our team cares deeply about how we build our product and how we work together, which is represented through our core values:

  • Be Curious at your Core
  • Act like an Owner
  • Empower Employees
  • Deliver Best-in-Class Client Experience
  • Achieve More Together

We support and encourage an entrepreneurial outlook and independent thinking. We foster an environment that encourages collaboration and provides the opportunity to develop innovative solutions to complex problems. As we get set for takeoff, the growth opportunities within the organization are constantly expanding. You will be surrounded by some of the best talent in the industry, who will want to learn from you, too. Come join us.

Benefits

We offer a competitive salary and benefits, including:

  • Medical, dental and vision insurance - 100% paid for the employee
  • Company paid Life Insurance
  • Voluntary supplemental life insurance
  • Short and long-term disability insurance
  • Flexible Spending Account
  • Tuition Reimbursement
  • Mental Wellness Benefits through Spring Health
  • Family-Forming support provided by Carrot
  • Paid Parental Leave
  • Flexible, full-service childcare support with Kinside
  • 401(k) with a generous employer match
  • Flexible PTO
  • Catered lunch each day in our offices
  • Weekly massages in NJ office
  • A casual work environment
  • Work culture focused on innovative disruption

California Consumer Privacy Act - California applicants only

CoreWeave is an equal opportunity employer, committed to fostering an inclusive and supportive workplace. All qualified applicants and candidates will receive consideration for employment without regard to race, color, religion, sex, disability, age, sexual orientation, gender identity, national origin, veteran status, or genetic information.

As part of this commitment and consistent with the Americans with Disabilities Act (ADA), CoreWeave will ensure that qualified applicants and candidates with disabilities are provided reasonable accommodations for the hiring process, unless such accommodation would cause an undue hardship. If reasonable accommodation is needed, please contact: careers@coreweave.com.


