Principal Site Reliability Engineer

11 hours ago


Mountain View, California, United States | Groq | Full time
About Groq

Groq is a company that believes in an AI economy powered by human agency. We envision a world where AI is accessible to all, and we're working towards making that a reality.

Job Description

We're looking for a Principal Site Reliability Engineer to join our team. In this role, you will play a critical part in ensuring the reliability and performance of our AI infrastructure.

Responsibilities
  • Enhance system reliability by refining operational practices to increase uptime and resilience.
  • Lead investigations to determine root causes of system failures and develop scripts to repair and automate the upkeep of infrastructure components.
  • Implement comprehensive monitoring (tracing, metrics, logging, alerting) to swiftly pinpoint, diagnose, and resolve system issues.
  • Collaborate closely with teams to facilitate their applications' transition to Kubernetes, ensuring seamless migrations and optimal performance.
  • Develop tools and strategies to enhance application robustness against network and dependency issues.
  • Help teams keep system response times consistent to improve the customer experience.
  • Give engineering teams a clear understanding of how the infrastructure and their systems are performing.
  • Write tooling (Go, Rust, C++, Python) that helps our infrastructure self-diagnose and self-heal.
  • Uphold the highest security standards to safeguard customer data, aligning with stringent compliance protocols.
  • Manage critical system incidents as a first responder, ensuring swift resolution and comprehensive post-incident analyses with implemented remediations.
Requirements
  • Proven SRE experience managing high-availability systems across multiple cloud or data center environments.
  • Technical versatility in modern programming languages (Go, Rust, Python, C++) and infrastructure-as-code and GitOps tooling (Terraform, Flux).
  • Advanced understanding of Linux, Kubernetes, Cilium (eBPF), Prometheus, PostgreSQL, BigQuery.
  • Robust knowledge of networking, high-performance computing, and observability.
  • Experience developing software that orchestrates large, complex systems and infrastructure.
  • Strong familiarity with Git and modern GitOps-based software development lifecycle practices.
  • Demonstrated ability to lead initiatives and collaborate effectively with dispersed teams to achieve rapid results.
  • Skilled in dissecting and resolving multifaceted system and hardware issues.
  • Passion for performance and for highly reliable software and hardware.
Compensation and Benefits

At Groq, a competitive base salary is part of our comprehensive compensation package, which includes equity and benefits. For this role, the base salary range is $165,200 to $332,300, determined by your skills, qualifications, experience, and internal benchmarks.

Location

Groq is a geo-agnostic company, meaning you work where you are. Exceptional candidates will thrive in asynchronous partnerships and remote collaboration methods. Some roles may require being located near our primary sites, as indicated in the job description.

About Us

Groq is an equal opportunity employer committed to diversity, inclusion, and belonging in all aspects of our organization. We value and celebrate diversity in thought, beliefs, talent, expression, and backgrounds. We know that our individual differences make us better.

Qualified applicants will receive consideration for employment without regard to race, color, religion, national origin, gender, sexual orientation, gender identity, disability, or protected veteran status. We also take affirmative action to offer employment opportunities to minorities, women, individuals with disabilities, and protected veterans.


