IT InfiniBand/GPU

7 days ago


San Jose, United States Cadence Design Systems Full time

At Cadence, we hire and develop leaders and innovators who want to make an impact on the world of technology.

Cadence is looking for a Sr Staff Systems Engineer to accelerate strategic customer deployments, ensure on-time bring-up, deployment, and troubleshooting of HPC infrastructure, and support technical roles covering HPC, InfiniBand, and GPU at our San Jose location.

The successful candidate will be hands-on within the infrastructure team and will be exposed to customer interfaces involving the Windows and Linux operating systems. The Systems Engineer will need experience in Linux environments and proficiency in tasks such as shell scripting.

Role: IT - Sr Staff Systems Engineer
Location: San Jose, CA

Must Haves

  • 15+ years of experience in system administration and engineering.
  • Minimum five years of overall experience in technical roles supporting HPC, InfiniBand, and GPU.
  • Strong knowledge of Linux operating systems and of networking and security concepts.
  • Document and drive acceptance and qualification test plans, procedures, and reports.
  • Drive customer deployments and ensure on-time bring-up of GPU servers.
  • InfiniBand fabric bring-up, configuration, and subnet management on the IB switch.
  • Participate in engagements with various SW and FW (BMC/SBIOS/OS/drivers etc.) teams to develop best-in-class practices and tools; you will be analyzing, debugging, and resolving critical firmware and software issues affecting workload performance at scale.
  • Provide engineering solutions to enable large-scale performance strategies for Datacenter GPU Computing products and software stacks, maintain technical relationships with internal and external engineering teams, and assist systems engineers in building creative solutions.

Requirements

  • Accelerate strategic customer deployments and ensure on-time bring-up and deployment of HPC infrastructure.
  • Develop and implement server and rack-level telemetry; collaborate on and establish continuous improvements in our design flows.
  • Recent experience in critical data center technologies such as server architectures, software containers, job schedulers, and parallel computing.
  • Deployment and operation of large-scale systems, resilient system design, and clustering of computing resources.
  • Cluster management for HPC; actively connect with management regarding any problems with the equipment and propose a resolution.
  • Establish and maintain IT infrastructure and procedures for customer-facing and internal systems.
  • Actively establish the technical relationship with our customers' engineers, management, and architects at focus accounts.
  • Create and develop test plans for new features on each product. Recommend improvements to enable automated scripting for testing and archiving of results.
  • Develop HPC computing strategies for cloud-based computing, GPU-accelerated computing, etc.
  • Provide remote cluster support to large environments, including scalability/flexibility and troubleshooting end-user issues involving job submission, runtime, and resource access.
  • InfiniBand fabric configuration and administration on Red Hat/CentOS/Linux; experience in configuring PKeys and troubleshooting the end-to-end InfiniBand environment.
  • InfiniBand fabric bring-up, configuration, subnet management, and monitoring on the IB switch and client side for multi-tenancy setups; understanding of IPoIB communication modes.
  • Performance comparison of the InfiniBand network with other cluster interconnects, and debugging of InfiniBand performance-related issues.
  • Automate configuration management, software updates, and system availability maintenance and monitoring using modern DevOps tools (Ansible, GitLab, etc.).
  • Be a technical specialist on GPU computing and networking products, directly supporting GPU customers.
  • Direct experience and strong knowledge of parallel programming, GPU CUDA/ROCm development, and applications.
  • Actively partner with the R&D teams delivering services on our infrastructure to gather their service requirements for living within this infrastructure.
  • Automate repetitive tasks and implement custom solutions using scripting/programming languages such as Bash or Python.
  • Configure and troubleshoot a heterogeneous (QDR, FDR, EDR) InfiniBand network and associated subnet manager.
  • Experience with high-performance computer interconnects (e.g. 10 and 40 Gigabit Ethernet, InfiniBand).
  • Able to move 50+ pounds.

The annual salary range for California is $130,200 to $241,800. You may also be eligible to receive incentive compensation: bonus, equity, and benefits. Sales positions generally offer a competitive On Target Earnings (OTE) incentive compensation structure. Please note that the salary range is a guideline and compensation may vary based on factors such as qualifications, skill level, competencies and work location. Our benefits programs include: paid vacation and paid holidays, 401(k) plan with employer match, employee stock purchase plan, a variety of medical, dental and vision plan options, and more.

We’re doing work that matters. Help us solve what others can’t.

Summary
Location: SAN JOSE 07
Type: Full time


  • Network Engineer

    4 weeks ago


    San Jose, United States Calsoft Pvt. Ltd. Full time

    What you'll be doing: Develop features and tools as part of solution engineering efforts to support all Enterprise Service offerings including, but not limited to, InfiniBand/Ethernet switching products. Work with CLIENT Enterprise customers and internal users to improve the availability, reliability, and overall experience of working with CLIENT Networking...


  • San Jose, United States LIGHTELLIGENCE Co., Ltd Full time

    Lightelligence is a venture-backed AI hardware company founded by MIT alumni, developing cutting-edge technology and products at the forefront of photonic computing and optical connectivity. The company has raised over $200M in pursuit of solving one of today’s most complex engineering challenges. With a culture of internal mobility, opportunities...


  • San Jose, California, United States LIGHTELLIGENCE Co., Ltd Full time

    Lightelligence is a venture-backed AI hardware company founded by MIT alumni, developing cutting-edge technology and products at the forefront of photonic computing and optical connectivity. The company has raised over $200M in pursuit of solving one of today's most complex engineering challenges. With a culture of internal mobility, opportunities abound to...


  • San Jose, United States Lightelligence Full time

    Lightelligence is a venture-backed AI hardware company founded by MIT alumni, developing cutting-edge technology and products at the forefront of photonic computing and optical connectivity. The company has raised over $200M in pursuit of solving one of today's most complex engineering challenges. With a culture of internal mobility, opportunities abound to...


  • Network Engineer

    4 weeks ago


    San Francisco, United States OpenAI Full time

    About the Team: You'll join the team responsible for scaling out the compute fleet that supports the models for ChatGPT and the API. The systems we support include GPU clusters, datacenter networking, hardware health, InfiniBand performance, node lifecycle, and more. About the Role: The ML compute team builds and maintains infrastructure abstractions allowing...

  • Network Engineer

    2 weeks ago


    San Francisco, California, United States OpenAI Full time

    About the Team: You'll join the team responsible for scaling out the compute fleet that supports the models for ChatGPT and the API. The systems we support include GPU clusters, datacenter networking, hardware health, InfiniBand performance, node lifecycle, and more. About the Role: The ML compute team builds and maintains infrastructure abstractions allowing...


  • San Francisco, United States OpenAI Full time

    The Platform ML team builds the ML side of our state-of-the-art internal training framework used to train our cutting-edge models. We work on distributed model execution as well as the interfaces and implementation for model code, training, and inference. Our priorities are to maximize training throughput (how quickly we can train a new model) and researcher...


  • San Francisco, California, United States Genai Works Full time

    About the Team: The Applied AI team safely brings OpenAI's technology to the world. We released ChatGPT, Plugins, DALL·E, and the APIs for GPT-4, GPT-3, embeddings, and fine-tuning. We also operate inference infrastructure at scale. There's a lot more on the immediate horizon. We seek to learn from deployment and distribute the benefits of AI, while ensuring...


  • San Francisco, United States Crusoe Full time

    Crusoe Energy is on a mission to unlock value in stranded energy resources through the power of computation. We aim to align the long term interests of the climate with the future of global computing infrastructure. As data centers consume an exponentially growing power footprint to deliver technology to all connected devices, we are inspired by making sure...


  • San Francisco, United States Canonical Full time

    You will work across the full Linux stack from kernel through networking, virtualization and graphics to optimise Ubuntu, the world's most widely used Linux desktop and server, for the latest silicon. Our teams partner with specialist engineers from major silicon companies to integrate next-generation features and performance enhancements for upcoming...


  • San Francisco, California, United States Crusoe Energy Systems Full time

    Crusoe Energy is on a mission to unlock value in stranded energy resources through the power of computation. We aim to align the long term interests of the climate with the future of global computing infrastructure. As data centers consume an exponentially growing power footprint to deliver technology to all connected devices, we...


  • San Mateo, United States RCM Life Sciences and IT Full time

    Job Title: Field Application Engineer Job Function: Machine Vision Job Type: Full Time/Perm Location: San Mateo, CA (Hybrid) Salary: $130K/yr plus equity About our Client RCM's client builds intelligent systems that autonomously interact with the real world and knows that poor vision limits how efficiently they learn, and how effectively they perform. Scope...


  • San Mateo, United States RCM Life Sciences and IT Full time

    Job Title: Field Application Engineer Job Function: Machine Vision Job Type: Full Time/Perm Location: San Mateo, CA (Hybrid) Salary: $130K/yr plus equity About our Client RCM's client builds intelligent systems that autonomously interact with the real world and knows that poor vision limits how efficiently they learn, and how effectively they perform. ...