Support Operations Engineer

3 weeks ago


New York, New York, United States CoreWeave Full time

CoreWeave is a specialized cloud provider, delivering a massive scale of GPU compute resources on top of the industry's fastest and most flexible infrastructure. CoreWeave builds cloud solutions for compute-intensive use cases (VFX and rendering, machine learning and AI, batch processing, and Pixel Streaming) that are up to 35 times faster and 80% less expensive than the large, generalized public clouds. Learn more at


CoreWeave's Support Operations team ensures peak performance and reliability across thousands of nodes in multiple supercomputer clusters, each with tens of thousands of GPUs. Collaborate with pioneering generative AI labs, world-renowned VFX organizations, and visionary developers and artists. These innovators leverage our cutting-edge GPU cloud infrastructure to power their mission-critical workflows and achieve unprecedented capabilities.

As a Support Operations Engineer, you will be responsible for deploying, configuring, and maintaining CoreWeave's GPU fleet across our growing number of data centers in the U.S., Europe, and beyond.

You'll monitor our fleet's health, performance, and reliability through our observability stack: Grafana, Prometheus, and VictoriaMetrics.
You'll use CoreWeave Kubernetes to troubleshoot customer support requests and act as a technical escalation point on Infrastructure issues for the Customer Success organization.
You'll learn from your fellow Support Operations Engineer teammates and mentor junior engineers and new hires.
You'll leverage your knowledge of Linux (Ubuntu) to diagnose, troubleshoot, and rectify bugs across the fabric.
You'll assist and collaborate with other teams involved in the management and operation of CoreWeave infrastructure.
You'll offer expertise, guidance, and troubleshooting support to ensure the smooth functioning and optimal performance of the clusters.
You'll support some of the world's largest bare metal fleets of dedicated servers running the latest NVIDIA H100 GPU technology on InfiniBand deployments.
You'll have a front row seat at the deployment of new CoreWeave supercomputing clusters for unprecedented customer workloads in AI/HPC.
You'll work hand in hand with our Data Center Technicians to install, configure, and troubleshoot all aspects of data center infrastructure.
You'll liaise with Cloud Operations to ensure that the CoreWeave platform is scalable, reliable, and stable.
You'll partner with our network engineers and software developers to collect failure logs, reproduce issues, and collaborate to resolve complex client and internal problems.
You'll work with our Technical Writing team to identify, create, and maintain documentation of troubleshooting workflows, corner-case scenarios, and new discoveries.
You'll serve as a technical liaison on incidents and escalations, communicating with all stakeholders.
You'll participate in a 24/7 on-call rotation every few months ensuring that mission-critical alerts are addressed for infrastructure resiliency.
You'll develop alerting, telemetry, and new metrics to proactively prevent issues across the fleet and reduce the need for reactive support.

A knack for solving problems - recognizing technical issues, developing appropriate solutions, and following through to completion
A love for creating documentation and processes to better your team's internal knowledge base
An interest in building the world's largest bespoke supercomputers for leading AI labs
A solid understanding of distributed computing environments and methodologies, such as storage volumes, private networks, load balancers, and virtual machines
Excellent communication skills (both written and verbal)
A willingness to work in a very fast-paced environment with dynamic priorities and ever-changing developments
The ability to work independently while collaborating well as part of a team
A working knowledge of cloud computing, virtualization, and container technologies
A working knowledge of Linux - tell us about your favorite Linux distro
A working knowledge of Kubernetes and Docker
A prior role in Sysadmin, Site Reliability Engineering, DevOps, or Infrastructure Operations
A prior role in HPC/AI
Willingness and interest to travel to CoreWeave data centers as needed

Prior experience with computer hardware or server hardware - did you build your own PC at home?
Prior experience in a data center as an engineer or a technician - what kind of servers did you work on?
Prior experience with NVIDIA GPUs and CUDA technologies
Prior experience with SuperMicro, Dell, HP Enterprise, and Gigabyte systems
Prior experience with HPC systems
Prior experience with AI / ML

Our compensation reflects the cost of labor across several US geographic markets. The base pay for this position ranges from $75,000/year to $110,000/year. Pay is based on a number of factors including market location and may vary depending on job-related knowledge, skills, and experience.

If you reside within a 30-mile radius of our New Jersey, New York, or Philadelphia offices, we're excited for you to join us at the office at least three times a week, recognizing the significance we place on fostering connections, collaboration, and creativity within our office culture. Our commitment to operating as a hybrid workplace underscores our dedication to enabling our employees to tailor their work-life balance to their individual preferences.

Why CoreWeave?

At CoreWeave, we work hard, have fun, and move fast. We're in an exciting stage of hyper-growth that you will not want to miss out on. We're not afraid of a little chaos, and we're constantly learning. Our team cares deeply about how we build our product and how we work together, which is represented through our core values:

Be Curious at your Core
Act like an Owner
Empower Employees
Deliver Best In-Class Client Experience
Achieve More Together

We support and encourage an entrepreneurial outlook and independent thinking. We foster an environment that encourages collaboration and provides the opportunity to develop innovative solutions to complex problems. As we get set for takeoff, the growth opportunities within the organization are constantly expanding. You will be surrounded by some of the best talent in the industry, who will want to learn from you, too. Come join us.

We offer a competitive salary and benefits, including:

Medical, dental and vision insurance - 100% paid for the employee
Company paid Life Insurance
Voluntary supplemental life insurance
Short and long-term disability insurance
Flexible Spending Account
Tuition Reimbursement
Mental Wellness Benefits through Spring Health
Family-Forming support provided by Carrot
Paid Parental Leave
Flexible, full-service childcare support with Kinside
401(k) with a generous employer match
Flexible PTO
Catered lunch each day in our offices
Weekly massages in NJ office
A casual work environment
Work culture focused on innovative disruption
