Staff AI Infrastructure Site Reliability Engineer

2 months ago


Santa Clara, California, United States | XPENG Motors | Full time
Job Title: Senior Staff AI Infrastructure SRE

XPENG Motors is a leading smart electric vehicle company that designs, develops, and manufactures smart EVs with advanced Internet, AI, and autonomous driving technologies. We are committed to in-house R&D and intelligent manufacturing to create a better mobility experience for our customers.

About the Role

We are seeking a Senior Staff AI Infrastructure SRE to lead the design and implementation of robust, cloud-native AI infrastructure that supports our autonomous driving initiatives. As a key technical leader, you will architect and develop scalable, secure infrastructure on cloud-native platforms.

Responsibilities
  • Architect and lead the development of scalable, secure AI infrastructure on cloud-native platforms to support autonomous driving technologies
  • Collaborate closely with ML teams to facilitate seamless integration and optimal performance of AI algorithms
  • Identify and address system bottlenecks and instabilities, applying innovative solutions to enhance system reliability and efficiency
  • Foster technological advancements through research and implementation of state-of-the-art AI tools and methodologies
  • Act as a key technical leader and mentor, promoting a culture of technical excellence and collaborative innovation within the AI infrastructure team
Requirements
  • Bachelor's or Master's in Computer Science, Engineering, or related technical field
  • 5+ years of experience in designing, deploying, and managing GPU clusters for high-performance computing in AI applications, particularly within cloud environments
  • Proficiency with cloud services (AWS, Azure, Alibaba Cloud) and in building containerized applications using Kubernetes and Docker
  • Strong programming skills in Python and Golang, plus experience with AI/ML frameworks (TensorFlow, PyTorch)
What We Offer
  • A fun, supportive, and engaging environment
  • Opportunity to make a significant impact on the transportation revolution by advancing autonomous driving
  • Opportunity to work on cutting-edge technologies with top talent in the field
  • Competitive compensation package
  • Snacks, lunches, and fun activities

The base salary range for this full-time position is $220,000-$370,000, in addition to bonus, equity, and benefits. Our salary ranges are determined by role, level, and location. Within the range, individual pay is determined by work location and additional factors, including job-related skills, experience, and relevant education or training.

We are an Equal Opportunity Employer. It is our policy to provide equal employment opportunities to all qualified persons without regard to race, age, color, sex, sexual orientation, religion, national origin, disability, veteran status, marital status, or any other prescribed category set forth in federal or state regulations.