Current jobs related to Infrastructure Observability Specialist - Fremont, California - Tesla


  • Fremont, California, United States Jobleads-US Full time

    Observability Software Engineer, AI Infrastructure. Location: Fremont, California. Req. ID: 237249. Job Type: Full-time. What to Expect: As a member of Tesla's "Insane Visibility" team, you will design, implement & maintain end-to-end observability across our AI Infrastructure stack and develop the framework to benchmark performance & processing of...


  • Fremont, California, United States Tesla Full time

    As a member of Tesla's "Insane Visibility" team, you will design, implement & maintain end-to-end observability across our AI Infrastructure stack and develop the framework to benchmark performance & processing of pipelines. You'll be responsible for building dashboards, alerts & monitoring necessary for Autopilot & AI teams to address observability issues...


  • Fremont, California, United States Jobleads-US Full time

    Job Overview: We are seeking a highly skilled Observability Solutions Developer to join our AI Infrastructure team at Jobleads-US. This is an exciting opportunity to design, implement, and maintain end-to-end observability solutions across our AI Infrastructure stack.


  • Fremont, California, United States PRI Global Full time

    **About the Job:** PRI Global is seeking an IT Infrastructure Specialist to join our team. As a System Engineer, you will be responsible for designing, implementing, and maintaining large-scale IT systems. **Key Responsibilities:** Design and implement large-scale IT systems. Collaborate with project teams to ensure successful system implementation and...


  • Fremont, California, United States Info Way Solutions Full time

    Job Title: Windows Infrastructure Specialist. We are seeking a highly skilled Windows Infrastructure Specialist to join our team at Info Way Solutions. As a key member of our infrastructure team, you will be responsible for developing PowerShell scripts to configure Windows Domain Controllers and jumphost servers. This is an exciting opportunity to work with...


  • Fremont, California, United States Tesla Full time

    About the Role: As a Mechanical Infrastructure Specialist, you will be responsible for applying your knowledge of engineering fundamentals and mechanical tools to solve technical problems and create novel designs for a wide range of mechanical and process systems. You will work closely with various stakeholders to develop optimized infrastructure upgrades that...


  • Fremont, California, United States AMAX Full time

    Job Summary: We are seeking an experienced Computer Infrastructure Specialist to join our team. This is a unique opportunity to design and implement complex IT infrastructure solutions based on customer requirements. The ideal candidate will have a strong background in computer science, excellent problem-solving skills, and the ability to work independently...


  • Fremont, California, United States Jobleads-US Full time

    Job Summary: We are seeking a highly skilled Senior Infrastructure Specialist to join our team at Jobleads-US. In this role, you will be responsible for leading the design, implementation, and maintenance of our network infrastructure. About the Role: You will have a minimum of 7 years of experience in Infrastructure Skills, with a strong background in Juniper...


  • Fremont, California, United States Info Way Solutions Full time

    At Info Way Solutions, we are seeking a highly skilled IT Infrastructure Support Specialist to join our team. This role is responsible for the effective installation, configuration, and maintenance of systems hardware and software and related infrastructure. Key Responsibilities: Install new/rebuild existing servers and configure hardware, peripherals,...


  • Fremont, California, United States AMAX Full time

    About the Role: As a Senior IT Infrastructure Specialist at AMAX, you will play a critical part in designing and implementing high-performance computing solutions for our clients. We are looking for a highly skilled professional with expertise in business systems analysis and software development lifecycles. Key Responsibilities: Conduct comprehensive business...


  • Fremont, California, United States Tesla Full time

    **Job Summary** Tesla is seeking a Cloud Platform Reliability Specialist to join our team in Fremont, California. As a key member of our cloud platform reliability team, you will play a critical role in ensuring the reliability and efficiency of our cloud infrastructure. **Key Responsibilities** Design and implement reliability and efficiency enhancements for...


  • Fremont, California, United States HTC Global Services Full time

    HTC Global Services wants you. Come build new things with us and advance your career. At HTC Global you'll collaborate with experts. You'll join successful teams contributing to our clients' success. You'll work side by side with our clients and have long-term opportunities to advance your career with the latest emerging technologies. At HTC Global Services...


  • Fremont, California, United States Jobleads-US Full time

    About the Role: We are seeking a skilled DevOps Engineer to join our dynamic team. As a key member of our infrastructure team, you will be responsible for designing, developing, and managing CI/CD pipelines to facilitate fast and efficient development and deployment cycles. You will also be responsible for automating infrastructure provisioning, configuration,...


  • Fremont, California, United States AMAX Full time

    Job Description: We're looking for a talented Cloud Engineer to join our team at AMAX. Your primary responsibility will be designing and managing cloud infrastructure for GPU hosting. This includes optimizing GPU performance and ensuring that systems run efficiently and securely. You'll work closely with cross-functional teams to deliver scalable cloud...


  • Fremont, California, United States Verrus Full time

    About Our Team: Verrus is a growing company that values innovation and collaboration. Our data center electrical engineering function is expanding, and we are seeking an experienced Data Center Infrastructure Specialist to join our team. You will be part of a highly cross-functional role, responsible for the design of the electrical plant through the entire...


  • Fremont, California, United States Western Digital Full time

    Key Responsibilities: This Sr. Electrical Engineer - Power Systems Specialist will manage electrical power infrastructure projects from planning to completion, collaborating with internal teams and external vendors as needed. The ideal candidate will have experience in project management, electrical engineering, and facility operations, as well as excellent...


  • Fremont, California, United States Crystal Equation Corporation Full time

    Job Overview: Crystal Equation Corporation is seeking a highly skilled Space and Power Network Deployment Engineer to join our team. As a key member of our engineering department, you will be responsible for managing and supporting one of the world's largest and most complex networks. You will have a unique opportunity to be involved in POP Infrastructure...


  • Fremont, California, United States Info Way Solutions Full time

    Job Description: We are seeking an experienced AWS Solution Architect to join our team at Info Way Solutions. As a key member of our cloud infrastructure team, you will be responsible for designing and implementing scalable, secure, and high-performance AWS architectures using best practices and industry standards. Key Responsibilities: • Collaborate with...


  • Fremont, California, United States Lam Research Full time

    Job Overview: The Global Operations Group at Lam Research brings together information systems, facilities, supply chain, logistics, and high-volume manufacturing to drive the engine of our global business operations. We help deliver industry-leading solutions with speed and efficiency, while actively supporting the resilient and profitable growth of our...


  • Fremont, California, United States Everest Consultants, Inc. Full time

    About the Role: Everest Consultants, Inc. is seeking a skilled Network Operations Engineer to join our team. As a key member of our IT infrastructure group, you will be responsible for ensuring the stability, security, and performance of our network services. Job Summary: This position requires a strong understanding of network hardware and software...

Infrastructure Observability Specialist

1 week ago


Fremont, California, United States Tesla Full time

Job Summary

The Infrastructure Observability Specialist at Tesla designs, implements, and maintains end-to-end observability across our AI infrastructure stack. The successful candidate will develop the framework to benchmark pipeline performance and processing, ensuring these pipelines run smoothly throughout the full infrastructure stack.

Key Responsibilities

  • Design and develop observability solutions and tools, including monitoring, logging, and alerting systems, to improve system visibility and performance.
  • Create dashboards and automated alerts using tools such as Grafana, Prometheus, Splunk, and Catchpoint to enhance monitoring frameworks and ensure proactive issue detection and resolution.
  • Analyze system metrics and logs to identify bottlenecks, optimize application performance, and ensure end-to-end system reliability as the infrastructure scales.
  • Partner with developers, DevOps engineers, and AI Infra teams to integrate observability best practices into the development and deployment lifecycle.
  • Assist in troubleshooting and resolving production issues by leveraging observability data to identify root causes and implement preventative measures.
  • Develop scripts or workflows to automate routine tasks and improve observability tool integrations.
  • Create and maintain documentation for observability tools, processes, and workflows to ensure knowledge sharing and accessibility.

Requirements

  • 3+ years of experience in software engineering, DevOps, or SRE roles with a focus on observability or monitoring.
  • Proficiency in monitoring and visualization tools (e.g., Prometheus, Grafana, Splunk, Catchpoint).
  • Strong analytical and troubleshooting skills with a focus on system performance and reliability.
  • Working knowledge of high-performance computing, Slurm, GPU architecture, and networking.
  • Working knowledge of logging systems and distributed tracing frameworks such as OpenTelemetry.
  • Expertise in scripting languages (e.g., Python, Bash) and familiarity with infrastructure-as-code and configuration management tools (e.g., Terraform, Ansible).
  • Experience with containerized environments (e.g., Docker, Kubernetes) and cloud platforms (e.g., AWS, Azure).
  • Excellent verbal and written communication skills, with the ability to collaborate effectively across teams.
  • Bachelor's Degree in Computer Science, Software Engineering, or a related field, or equivalent experience.