Staff HPC Engineer

6 months ago


Mountain View, United States ASRC Federal Holding Company Full time

Job Title

Staff HPC Engineer

Location

NASA/AMES, MOFFETT FIELD-CA026

Job Description

ASRC Federal is searching for a Staff HPC Engineer to support InuTeq LLC out of NASA Ames, CA.

ASRC Federal InuTeq provides High Performance Computing services throughout the HPC lifecycle for computational requirements, architecture, acquisition, and operations to federal government customers. Our employees embrace innovation and are committed to a culture of continuous, standards-driven process improvement, and assimilation of industry best practices. We are seeking to fill a role that primarily provides development for Supercomputing Batch Scheduling with Supercomputing Systems Administration secondary support for our NASA NACS High Performance Computing (HPC) contract.

Summary: The successful candidate will be an active supporting member of the ASRC Federal team reporting directly to the Manager of the Application Performance and Productivity (APP) group and matrixed directly to the Supercomputing Systems Team Manager.

An individual at this skill level should have demonstrated extensive experience working with common HPC batch schedulers (e.g., PBS, Slurm, or Moab/Torque) while helping users of HPC resources resolve the issues that keep their applications from running efficiently. This individual should demonstrate experience installing, maintaining, and upgrading HPC systems. The individual, along with the entire HPC team, will be engaged in the day-to-day operations and support of the HPC resources. Activities may include system patching, OS upgrades, deploying new systems, writing scripts, and troubleshooting system issues on the HPC system. The ability to interact with users to determine symptoms, and then reproduce their issues to isolate the causes, is a critical skill for this work. There will also be activities in testing, benchmarking, user tool scripting, and analyzing trouble tickets to find patterns indicating system or user education issues.

Duties and Responsibilities:

  • Designs, deploys and maintains HPC clusters of 2000+ nodes with InfiniBand and 100+ petabytes of data storage in production.
  • Writes and shepherds scalable feature designs through the entire software development process, from requirements and use cases to release.
  • Designs and develops scripts for system administration, monitoring and usage reporting.
  • Modifies existing software to correct errors and/or improve performance.
  • Designs and develops scripts for system regression testing and performance testing (file systems (Lustre), scheduler (PBS), interconnect (HDR/NDR, Slingshot), high availability, etc.).
  • Troubleshoots, isolates and resolves application, system and other technical problems (hardware, software, and network).
  • Understands research use cases, researches and deploys new technologies, and defines cost, performance and other trade-offs.
  • Manages and maintains tools for configuration management (HPCM, Ansible and Git), resource management, scheduling and all other necessary aspects of HPC in accordance with best practices.
  • Researches, deploys and manages networking and security infrastructure, including development of policies and procedures.
  • Assists in developing and writing proposals and publications.
  • Creates and provides clear documentation.
  • Mentors junior staff and cross-trains peers.
  • Provides after-hours/weekend support as required.
  • Performs moderate Supercomputing System Administration that contributes to:
      • Day-to-day operations of the Linux HPC clusters and storage systems
      • Proactively monitoring, analyzing, and correcting system issues
      • Development of scripts to automate repetitive tasks or tools to enhance support of the HPC systems
      • System performance analysis and tuning
      • Building, installing, and supporting user-requested software
      • Supporting evaluation and assessment of new HPC technology
      • Resolving user-reported issues and managing support ticket requests in Remedy

Requirements

  • Bachelor’s degree in computer science or a related field
  • Strong computer science background with in-depth systems-level knowledge of operating systems and networking
  • A minimum of 5 years of experience administering HPC systems and scheduling software (PBS, Slurm, or Moab/Torque)
  • A minimum of 5 years of experience in systems programming in heterogeneous, multi-platform HPC environments
  • Strong ability to analyze, debug and maintain the integrity of an existing code base
  • Demonstrated equivalent of 5 years of Linux/UNIX user support experience and hands-on experience with administration of Linux systems
  • Experience working with HPC applications and proficiency in at least one of C, C++, or Fortran
  • Superior scripting skills and excellent attention to detail; proficiency in at least one of Python, Perl, or Bash
  • Strong ability to interact with customers to understand needs, elicit requirements, and get feedback on prototype solutions
  • Excellent communication and people skills; excellent time management and organizational skills
  • Experience with system configuration management tools (e.g., Puppet, Chef, Ansible)
  • Experience with revision control software (e.g., CVS, SVN, Git)
  • Track record of delivering commercial-quality software on schedule through multiple release cycles
  • Proficiency at technical writing

Preferred Skills:

  • Proficiency with analysis and problem-solving skills for debugging and optimization of applications
  • Familiarity/proficiency with OpenMP and Message Passing Interface (MPI) programming
  • Experience with Lustre and InfiniBand
  • Experience with cloud technologies (AWS, Azure, GCP), OpenStack, or Kubernetes is a plus