Engineering Manager, HPC Kubernetes Platform

2 weeks ago


Dallas, TX, United States | NorthMark Strategies | Full time
The Company

NorthMark Compute & Cloud (NMC²) is backed by dedicated leadership and investment, with a clear mission as it operates at the bleeding edge of technology. Its goal is to scale and enhance the high-performance computing (HPC) and cloud infrastructure that supports its clients' research, production, and delivery, enabling breakthroughs that shape the industries of tomorrow. Its engineers build critical infrastructure to eliminate friction in scientific research, simulations, analysis, and decision-making, accelerating discovery and driving faster innovation.

The Position

We are seeking an experienced Engineering Manager, HPC Kubernetes Platform to lead the team responsible for designing and scaling our bare-metal Kubernetes environment, the orchestration layer powering GPU- and CPU-intensive machine-learning and HPC workloads across global datacenters.

This is a hands-on leadership role focused on platform performance, reliability, and automation. You will define the technical roadmap, guide system architecture and optimization, and ensure our Kubernetes platform delivers top-tier reliability and throughput for distributed ML and HPC environments. The ideal candidate is a strong technical leader who thrives at the intersection of infrastructure engineering, AI systems, and high-performance computing.

Responsibilities:
  • Lead and mentor engineers designing and scaling NMC²'s bare-metal Kubernetes platform for HPC and ML workloads.
  • Architect and optimize GPU/CPU scheduling, resource management, and performance across multi-tenant compute clusters.
  • Drive automation and observability using Infrastructure-as-Code, CI/CD, and SRE best practices.
  • Collaborate with Research, Storage, and Network teams to integrate distributed filesystems, high-speed interconnects (InfiniBand, RoCE), and custom runtimes.
  • Partner with hardware and software vendors to improve tooling, influence product roadmaps, and streamline deployment.
  • Oversee platform reliability, capacity forecasting, and performance KPIs across thousands of nodes.
Requirements:
  • 7+ years in infrastructure, platform, or SRE engineering, including 2+ in technical leadership.
  • Proven experience operating Kubernetes environments tailored for HPC or ML training workloads, including GPU scheduling, resource isolation, and workload optimization.
  • Deep knowledge of Linux systems, networking, and performance engineering on bare-metal hardware.
  • Experience managing large-scale, multi-tenant clusters and integrating distributed storage or high-speed networking.
  • Strong automation experience (Terraform, Ansible, or similar) and familiarity with observability tools (Prometheus, Grafana, Loki).
  • Excellent communication and stakeholder management skills; ability to translate complex technical direction into clear, actionable plans.
  • Bachelor's degree or equivalent experience.
Nice-to-Haves:
  • Familiarity with HPC schedulers (Slurm, Flux) and container runtimes (containerd, CRI-O).
  • Contributions to open-source Kubernetes or ML infrastructure projects.
It is impossible to list every requirement for, or responsibility of, any position. Similarly, we cannot identify all the skills a position may require since job responsibilities and the Company's needs may change over time. Therefore, the above job description is not comprehensive or exhaustive. The Company reserves the right to adjust, add to or eliminate any aspect of the above description. The Company also retains the right to require all employees to undertake additional or different job responsibilities when necessary to meet business needs.

Must be legally authorized to work in the United States without the need for employer sponsorship, now or at any time in the future.

Benefits & Perks:
  • Company-Paid Lunch Stipend: Lunch is provided via GrubHub
  • Company-Paid Benefits: 100% Employer-Paid Medical in our High Deductible Health Plan, Dental and Vision benefits for employees and their families, 16 weeks of Paid Parental Leave, Employee Assistance Program, Life insurance, Short-Term Disability and Long-Term Disability
  • 401(k): Company will match 100% of your contributions up to 6%
  • Optional Employee-Paid Benefits: Medical insurance in our PPO plan and a variety of other benefits such as Health Savings Accounts (with Company Contribution), Flexible Spending Accounts, Supplemental Life Insurance, Wellhub and more.
  • Time Off: 25 days of Paid Time Off plus 12 company holidays


EQUAL OPPORTUNITY EMPLOYER

NORTHMARK STRATEGIES LLC IS AN EQUAL EMPLOYMENT OPPORTUNITY EMPLOYER. THE COMPANY'S POLICY IS NOT TO DISCRIMINATE AGAINST ANY APPLICANT OR EMPLOYEE BASED ON RACE, COLOR, RELIGION, NATIONAL ORIGIN, GENDER, AGE, SEXUAL ORIENTATION, GENDER IDENTITY OR EXPRESSION, MARITAL STATUS, MENTAL OR PHYSICAL DISABILITY, GENETIC INFORMATION, OR ANY OTHER BASIS PROTECTED BY APPLICABLE LAW. THE FIRM ALSO PROHIBITS HARASSMENT OF APPLICANTS OR EMPLOYEES BASED ON ANY OF THESE PROTECTED CATEGORIES.
