Site Reliability Engineer

7 days ago


Memphis, Tennessee, United States | xAI | Full time
About xAI

xAI's mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company's mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

About the Role

As an SRE - Hardware Specialist, you will serve as a hardware reliability expert focused on firmware, hardware specifications, vendor relations, and failure analysis. You will proactively identify and resolve hardware issues, manage RMA processes, and stay ahead of emerging hardware technologies to support xAI's datacenter operations. This role demands deep technical expertise in hardware diagnostics, vendor negotiations, and forward-looking hardware evaluation.

Responsibilities
  • Analyze firmware packages and hardware specifications for upcoming releases to ensure compatibility, performance, and reliability in xAI's datacenter environment.
  • Investigate and diagnose hardware failures, including "grey failures" (ambiguous or intermittent issues), and confirm them as true hardware defects through rigorous testing and data analysis.
  • Manage vendor relationships, including initiating RMA (Return Merchandise Authorization) claims, negotiating beyond standard processes when necessary, and holding vendors accountable for resolutions.
  • Collaborate with Datacenter Operations Technicians to troubleshoot, repair, and optimize hardware systems in real time.
  • Research and evaluate next-generation hardware technologies that are not yet released, providing insights and recommendations to inform xAI's infrastructure roadmap.
  • Develop and implement monitoring tools, scripts, and processes to detect hardware anomalies early and minimize downtime.
  • Document failure modes, RMA outcomes, and hardware evaluations to build a knowledge base for the team.
  • Participate in on-call rotations and incident response for hardware-related issues in the Memphis datacenter.
Required Qualifications
  • Bachelor's degree in Systems Engineering, Electrical Engineering, Computer Science, or a related field (or equivalent experience).
  • 5+ years of experience in hardware reliability engineering, preferably in high-performance computing or datacenter environments.
  • Proven expertise in firmware analysis, hardware specifications review, and release validation.
  • Strong experience with RMA processes, including filing claims, vendor negotiations, and pushing for resolutions outside standard protocols.
  • Demonstrated ability to diagnose and prove complex hardware failures, including grey or intermittent issues, using tools such as logic analyzers or diagnostic software.
  • Familiarity with datacenter hardware components (e.g., servers, GPUs, networking equipment) and emerging technologies.
  • Proficiency in scripting languages (e.g., Python, Bash) for automation and analysis.
  • Excellent problem-solving skills with a data-driven approach to reliability engineering.
  • Ability to work collaboratively with cross-functional teams, including operations technicians.
Preferred Qualifications
  • Experience in AI/ML infrastructure or supercomputing environments.
  • Knowledge of vendor ecosystems (e.g., NVIDIA, Dell, HP, Supermicro) and supply chain management.
  • Certifications in hardware engineering or reliability (e.g., CRE, CompTIA Server+).
  • Prior work in a fast-paced startup or tech company like xAI.

xAI is an equal opportunity employer.
