Lead HPC Systems Engineer

2 weeks ago


Remote, Oregon, United States | St. Jude Children's Research Hospital | Full time

Overview:

As a Senior HPC Infrastructure Engineer at St. Jude Children's Research Hospital, you will be instrumental in advancing our high-performance computing (HPC) and artificial intelligence (AI) infrastructure. Your role will focus on the design, implementation, and optimization of our sophisticated HPC clusters and servers, ensuring that our research computing environment achieves exceptional scalability, redundancy, and performance.

Key Responsibilities:

  • Architect and implement cutting-edge HPC/AI systems to facilitate innovative research initiatives.
  • Supervise the continuous monitoring, support, and maintenance of HPC/AI clusters to guarantee optimal performance and reliability.
  • Lead system upgrades and customizations, collaborating with database administrators, software developers, network operations, and data center teams.
  • Manage a diverse array of computer systems and application software, ensuring adherence to the highest standards of functionality and efficiency.
  • Provide ongoing support and monitoring of our research computing infrastructure, delivering outstanding service around the clock.

What We Offer:

  • An opportunity to engage with state-of-the-art technology in a collaborative and dynamic environment.
  • A role that significantly contributes to the success of pioneering research projects.
  • The chance to work alongside leading professionals across various fields.

If you are passionate about HPC technology and excel in a fast-paced, innovative atmosphere, we encourage you to consider this opportunity.

Job Responsibilities:

  • Oversee the configuration and management of IT infrastructure to meet various requirements, including data retention, security, business continuity, and disaster recovery.
  • Assess the efficiency and effectiveness of infrastructure service delivery methods and procedures.
  • Manage internal infrastructure in accordance with established regulations and standards.
  • Implement and monitor incident/problem management and disaster recovery for infrastructure support.
  • Provide current systems usage statistics and future growth estimates based on user demand.
  • Collaborate with internal teams to develop prioritization, metrics, and processes related to capacity planning and infrastructure availability.
  • Present capacity planning and performance reports to senior leadership during meetings.
  • Benchmark, analyze, and recommend improvements for IT infrastructure.
  • Perform additional duties as assigned to achieve departmental and institutional goals.
  • Maintain regular and predictable attendance.

Minimum Education and/or Training:

  • Bachelor's degree in Computer Science, Engineering, Business, or a related field is required.
  • A Master's degree is preferred.

Minimum Experience:

  • A minimum of four (4) years of IT experience in infrastructure operations and engineering environments is required.
  • Experience with Red Hat Enterprise Linux (RHEL) is highly preferred.
  • Experience in managing an HPC cluster is essential.
  • Familiarity with Slurm and/or LSF is highly preferred.
  • Experience with Kubernetes (e.g., Rancher, OpenShift) is a plus.
  • Experience with HPC cluster management tools such as Base Command Manager or Bright Cluster Manager is highly preferred.
  • Experience with IBM Spectrum Scale (GPFS) is required; knowledge of Lustre is a plus.
  • Experience with Message Passing Interface (MPI) is highly preferred.
  • Knowledge of InfiniBand, Ethernet, and TCP/IP networking is highly preferred.
  • Experience with NVIDIA GPUs is required; familiarity with AMD GPUs is a plus.
  • Advanced knowledge of HPC technologies and principles is essential.
  • Strong understanding of Linux security and shell scripting is required.
  • Proven performance in a similar role is necessary.

Compensation:

In accordance with U.S. state and municipal pay transparency laws, St. Jude provides a reasonable estimate of the compensation range for this role. The estimated salary range for the Senior HPC Infrastructure Engineer position is $94,640 - $169,520 per year.

Diversity, Equity, and Inclusion:

St. Jude Children's Research Hospital is committed to diversity, equity, and inclusion, ensuring that our workforce reflects the global community we serve. Our founder envisioned a hospital that treats children from all backgrounds, and we continue to uphold this mission through our research and treatment efforts.

No Search Firms:

St. Jude Children's Research Hospital does not accept unsolicited assistance from search firms for employment opportunities.




