Staff HPC Engineer

2 weeks ago


Mountain View, United States ASRC Federal Holding Company Full time

Job Title

Staff HPC Engineer

Location

NASA/AMES, MOFFETT FIELD-CA026

Job Description

ASRC Federal InuTeq provides High Performance Computing services throughout the HPC lifecycle for computational requirements, architecture, acquisition, and operations to federal government customers. Our employees embrace innovation and are committed to a culture of continuous, standards-driven process improvement and assimilation of industry best practices. We are seeking to fill a role on our NASA NACS High Performance Computing (HPC) contract that focuses primarily on development for supercomputing batch scheduling, with secondary support for supercomputing systems administration.

Summary: The successful candidate will be an active supporting member of the ASRC Federal team reporting directly to the Manager of the Application Performance and Productivity (APP) group and matrixed directly to the Supercomputing Systems Team Manager.

An individual at this skill level should have demonstrated extensive experience working with common HPC batch schedulers (e.g., PBS, Slurm, or Moab/Torque) while supporting users of HPC resources on the various issues they might have getting applications to run efficiently. This individual should demonstrate experience installing, maintaining, and upgrading HPC systems. The individual, along with the entire HPC team, will be engaged in the day-to-day operations and support of the HPC resources. Activities may include system patching, OS upgrades, deploying new systems, writing scripts, and troubleshooting system issues on the HPC system. The ability to interact with users to determine symptoms, and then reproduce their issues to isolate the causes, is a critical skill for this work. There will also be activities in testing, benchmarking, user tool scripting, and analyzing trouble tickets to find patterns indicating system or user education issues.

Duties and Responsibilities:

  • Designs, deploys, and maintains production HPC clusters with 2000+ nodes, InfiniBand, and 100+ petabytes of data storage.
  • Writes and shepherds scalable feature designs through the entire software development process, from requirements and use cases to release.
  • Designs and develops scripts for system administration, monitoring, and usage reporting.
  • Modifies existing software to correct errors and/or improve performance.
  • Designs and develops scripts for system regression and performance testing (file systems (Lustre), scheduler (PBS), interconnect (HDR/NDR, Slingshot), high availability, etc.).
  • Troubleshoots, isolates, and resolves application, system, and other technical problems (hardware, software, and network).
  • Understands research use cases, researches and deploys new technologies, and defines cost, performance, and other trade-offs.
  • Manages and maintains tools for configuration management (HPCM, Ansible, and Git), resource management, scheduling, and all necessary aspects of HPC in accordance with best practices.
  • Researches, deploys, and manages networking and security infrastructure, including development of policies and procedures.
  • Assists in developing and writing proposals and publications.
  • Creates and provides clear documentation.
  • Mentors junior staff and cross-trains peers.
  • Provides after-hours/weekend support as required.
  • Performs moderate supercomputing system administration that contributes to:
      • Day-to-day operations of the Linux HPC clusters and storage systems
      • Proactively monitoring, analyzing, and correcting system issues
      • Development of scripts to automate repetitive tasks or tools to enhance support of the HPC systems
      • System performance analysis and tuning
      • Building, installing, and supporting user-requested software
      • Supporting evaluation and assessment of new HPC technology
      • Resolving user-reported issues and managing support ticket requests in Remedy

Requirements

  • Bachelor’s degree in computer science or a related field
  • Strong computer science background with in-depth systems-level knowledge of operating systems and networking
  • A minimum of 5 years of experience administering HPC systems and scheduling software (PBS, Slurm, or Moab/Torque)
  • A minimum of 5 years of experience in systems programming in heterogeneous, multi-platform HPC environments
  • Strong ability to analyze, debug, and maintain the integrity of an existing code base
  • Demonstrated equivalent of 5 years of Linux/UNIX user support experience and hands-on experience administering Linux systems
  • Experience working with HPC applications and proficiency in at least one of C, C++, or Fortran
  • Superior scripting skills and excellent attention to detail; proficiency in at least one of Python, Perl, or Bash
  • Strong ability to interact with customers to understand needs, elicit requirements, and get feedback on prototype solutions
  • Excellent communication and people skills; excellent time management and organizational skills
  • Experience with system configuration management tools (e.g., Puppet, Chef, Ansible)
  • Experience with revision control software (e.g., CVS, SVN, Git)
  • Track record of delivering commercial-quality software on schedule through multiple release cycles
  • Proficiency in technical writing

Preferred Skills:

  • Proficiency with analysis and problem-solving skills for debugging and optimization of applications
  • Familiarity/proficiency with OpenMP and Message Passing Interface (MPI) programming
  • Experience with Lustre and InfiniBand
  • Experience with cloud technologies (AWS, Azure, GCP), OpenStack, or Kubernetes is a plus