Lead Systems Engineer

3 weeks ago


Palo Alto, United States CareerBuilder Full time

At Hippocratic AI, we are at the forefront of technological innovation, leveraging advanced computing resources to solve complex problems. Our dedicated GPU clusters, including high-end NVIDIA A100 and H100 models, are crucial for our data processing, machine learning, and computational tasks, including the development and optimization of Large Language Models (LLMs).
Position Overview:
As Lead System Administrator specializing in Slurm, HPC, and GPUs, you will play a crucial role in designing, implementing, and maintaining our advanced computing infrastructure. Your in-depth knowledge of Slurm, HPC principles, and GPU utilization will enable you to optimize our system performance, ensure reliable operation, and support our growing computational needs.
Responsibilities:
GPU Cluster Management:
Run high-performance compute services in public cloud environments (AWS, GCP, and Azure), including managed offerings such as Amazon SageMaker.

Knowledge of hardware components, such as GPUs (including high-end models like NVIDIA A100 and H100), and familiarity with NVIDIA Container Toolkit.

Experience in managing GPU nodes in cloud environments, ensuring optimal performance and reliability.

Orchestration and Automation:
Proficiency in Kubernetes for container orchestration and Slurm for workload management to efficiently distribute tasks across the GPU cluster.

Experience in setting up and configuring these orchestration tools to ensure high availability and scalability of cluster resources.

Troubleshooting and Debugging:
Ability to provide in-depth technical support for complex issues, including debugging and troubleshooting high-end GPUs.

Familiarity with debugging tools and techniques specific to GPU hardware and software.

Performance Optimization:
Continuous monitoring of system performance to identify bottlenecks and implement solutions to optimize resource utilization and throughput.

Knowledge of performance tuning techniques for GPU clusters and the ability to apply them effectively.

Security and Compliance:
Ensure adherence to security best practices and compliance requirements for GPU cluster infrastructure.

Implementation and management of security protocols and disaster recovery strategies to safeguard cluster resources and data.

Collaboration and Support:
Work closely with other engineering, research and applied science teams to understand and support their computational needs.

Offer guidance and expertise on utilizing the GPU cluster efficiently for various tasks and applications.

Participate in planning and executing future expansion or enhancement of cluster capabilities to meet evolving computational requirements.

Requirements:
Education:
Bachelor's degree in Computer Science, Electrical Engineering, or a related field. Master's degree preferred.

Experience:
At least 3 years of experience in managing and maintaining GPU clusters, preferably in the cloud, with hands-on experience with NVIDIA A100 and H100 GPUs or similar high-end models.

Technical Skills:
Proficiency in Kubernetes for container orchestration and management, with experience in deploying, scaling, and managing containerized applications within Kubernetes clusters, including familiarity with AWS Kubernetes services for cloud deployment and management.

Experience with Slurm for workload management in GPU cluster environments.

Deep understanding of GPU hardware, including experience with debugging and troubleshooting GPU issues.

Strong background in Linux/Unix administration, scripting (e.g., Bash, Python), and automation tools, with expertise in Ansible for configuration management and automation tasks.

Familiarity with network configuration, storage systems, and security protocols relevant to GPU clusters.

Problem-Solving:
Exceptional analytical and problem-solving skills, with the ability to handle complex technical challenges effectively.

Communication:
Excellent communication and documentation skills, capable of collaborating effectively across diverse teams.

About Hippocratic AI
Hippocratic AI is dedicated to developing a safety-focused large language model (LLM) tailored for the healthcare sector. We firmly believe in the potential of generative AI to significantly enhance global healthcare accessibility, provided it is developed and tested responsibly. Mirroring the principles of the Hippocratic oath that guides medical professionals, our model is designed with the ethos of "Do No Harm."


