Site Reliability Engineer
4 weeks ago
We are seeking a highly skilled Principal Site Reliability Engineer to join our team at Groq. As a Principal Site Reliability Engineer, you will be responsible for ensuring the reliability of our APIs as customers route their AI workloads through our insanely fast, purpose-built hardware and software systems.
Key Responsibilities:
- Enhance system reliability by refining operational practices to increase uptime and resilience.
- Lead investigations to determine root causes of system failures and develop scripts to repair and automate the upkeep of infrastructure components.
- Implement comprehensive monitoring (tracing, metrics, logging, alerting) to swiftly pinpoint, diagnose, and resolve system issues.
- Collaborate closely with teams to facilitate their applications' transition to Kubernetes, ensuring seamless migrations and optimal performance.
- Develop tools and strategies to enhance application robustness against network and dependency issues.
- Help teams keep system response times consistent to improve the customer experience.
- Give engineering teams a clear understanding of how the infrastructure and their systems are performing.
- Write tooling (Go, Rust, C++, Python) that helps our infrastructure self-diagnose and self-heal.
- Uphold the highest security standards to safeguard customer data, aligning with stringent compliance protocols.
- Manage critical system incidents as a first responder, ensuring swift resolution and comprehensive post-incident analyses with implemented remediations.
Requirements:
- Proven SRE experience in managing high-availability systems across multiple cloud or data center environments.
- Technical versatility in modern programming languages and infrastructure tooling, including Go, Rust, Python, C++, Terraform, and Flux.
- Advanced understanding of Linux, Kubernetes, Cilium (eBPF), Prometheus, PostgreSQL, BigQuery.
- Strong knowledge of networking, high-performance computing, and observability.
- Experience developing software that orchestrates large, complex systems and infrastructure.
- Strong familiarity with Git and modern GitOps-based software development lifecycle practices.
- Demonstrated ability to lead initiatives and collaborate effectively with dispersed teams to achieve rapid results.
- Skilled in dissecting and resolving multifaceted system and hardware issues.
- A passion for performance and for highly reliable software and hardware.
What We Offer:
- A competitive base salary as part of our comprehensive compensation package, which includes equity and benefits.
- The opportunity to work at a geo-agnostic company, meaning you work where you are. Exceptional candidates will thrive in asynchronous partnerships and remote collaboration.
- A commitment to hiring and promoting an exceptional workforce as diverse as the global populations we serve.
-
Site Reliability Engineer
4 weeks ago
Mountain View, California, United States | Moveworks | Full time
About Moveworks: Moveworks is a leading AI-powered automation platform that helps businesses streamline their operations and improve employee productivity. Our innovative technology enables employees to find information and get support in one place, reducing costs and increasing efficiency. Job Description: We are seeking a highly skilled Site Reliability...
-
Site Reliability Engineer
4 weeks ago
Mountain View, California, United States | Atlassian | Full time
About the Role: We are seeking a highly skilled Site Reliability Engineer to join our team at Atlassian. As a Site Reliability Engineer, you will play a critical role in ensuring the performance, reliability, and scalability of our cloud-based services. Responsibilities: Design, implement, and maintain scalable and reliable cloud infrastructure. Collaborate with...
-
Site Reliability Engineer
4 weeks ago
Mountain View, California, United States | Atlassian | Full time
About the Role: We're seeking a highly skilled Site Reliability Engineer to join our team. As a Site Reliability Engineer, you will play a critical role in ensuring the performance and reliability of our services. You will work closely with our teams to identify and resolve issues, and develop solutions to improve our systems. Key Responsibilities: Investigate...
-
Senior Site Reliability Engineer
4 weeks ago
Mountain View, California, United States | Groq | Full time
Unlock the Power of AI with Groq. We're on a mission to democratize access to AI, and we need your expertise to make it happen. As a Senior Site Reliability Engineer at Groq, you'll play a critical role in ensuring the reliability, scalability, and performance of our tools and services. Key Responsibilities: Design and implement scalable and reliable...
-
Staff Site Reliability Engineer
3 weeks ago
Mountain View, California, United States | Moveworks | Full time
About the Role: Moveworks is the universal AI copilot for search and automation across all your business applications. We give employees one place to go to find information and get support while reducing costs for your business. The Moveworks Copilot is powered by an industry-leading Reasoning Engine that uses a combination of public and proprietary language...
-
Site Reliability Engineer
4 weeks ago
Mountain View, California, United States | Tik Tok | Full time
About TikTok U.S. Data Security: TikTok is a leading destination for short-form mobile video, inspiring creativity and bringing joy to millions of users worldwide. Our mission is to empower creators and communities to express themselves authentically, while ensuring the security and integrity of our platform. Job Summary: We are seeking a highly skilled Site...
-
Senior Site Reliability Engineer
4 weeks ago
Mountain View, California, United States | Groq | Full time
Job Description: At Groq, we're revolutionizing the AI economy by making processing power more accessible, faster, and more affordable. As a Senior Site Reliability Engineer, you'll play a critical role in ensuring the reliability, scalability, and performance of our tools and services. Responsibilities: Design and implement scalable and reliable architectures...
-
Senior Site Reliability Engineer
4 weeks ago
Mountain View, California, United States | Groq | Full time
Unlock the Power of AI with Groq. At Groq, we're revolutionizing the AI economy by making processing power more accessible, faster, and more affordable. Our Language Processing Unit (LPU) outpaces the GPU in speed, power, efficiency, and cost-effectiveness, empowering a world where AI is universally accessible. Join Our Mission: We're seeking a Senior Site...
-
Site Reliability Engineer
3 weeks ago
Mountain View, California, United States | Insight Global | Full time
Site Reliability Engineer Opportunity in the Bay Area. We are seeking a highly motivated Site Reliability Engineer to join our team in the Bay Area. As a Site Reliability Engineer, you will be responsible for ensuring the reliability and scalability of our cloud infrastructure. Key Responsibilities:
* Strong Linux System Admin fundamentals (bash/shell...
-
Site Reliability Engineer
4 weeks ago
Mountain View, California, United States | NewsBreak | Full time
Transform Local News with NewsBreak. At NewsBreak, we're revolutionizing the way users interact with local news and their communities. Our mission is to foster safer, more vibrant, and authentically connected lives through robust collaborations with local publishers and businesses across the nation. As a Site Reliability Engineer, you'll play...
-
Platform Site Reliability Engineer
2 months ago
Mountain View, California, United States | Samsung Electronics America North America | Full time
Job Title: Platform Site Reliability Engineer. Samsung Ads is a thriving business poised for even greater success, and we're looking for a passionate leader to join our Global Ads Product & Engineering team. About the Role: We're the innovators behind the products, tech, and tools driving ad-based monetization. As a Site Reliability Engineer specializing in...
-
Site Reliability Engineer, Data Engineering
4 weeks ago
Mountain View, California, United States | Tik Tok | Full time
About the Role: This is a Site Reliability Engineer position focusing on data pipeline reliability for the Video Platform team in USDS. Data SREs monitor data and keep production batch and real-time processing jobs up and running with the highest level of availability, ensuring our users have the freshest, complete, and correct data...
-
Principal Site Reliability Engineer
4 weeks ago
Mountain View, California, United States | Groq | Full time
About Groq: Groq is a company that believes in an AI economy powered by human agency. We envision a world where AI is accessible to all, and we're working towards making that a reality. Job Description: We're looking for a Principal Site Reliability Engineer to join our team. As a Site Reliability Engineer, you will play a critical role in ensuring the...
-
Site Reliability Engineer
4 weeks ago
Mountain View, California, United States | Atlassian | Full time
Job Summary: We are seeking a highly skilled Site Reliability Engineer to join our team at Atlassian. As a Site Reliability Engineer, you will be responsible for ensuring the performance and reliability of our services, as well as addressing root causes of incidents and reducing incident rates. You will work closely with our development teams to identify and...
-
Principal Site Reliability Engineer
4 weeks ago
Mountain View, California, United States | Groq | Full time
Job Title: Principal Site Reliability Engineer. At Groq, we're revolutionizing the AI economy by making processing power more accessible, faster, and more affordable. Our Language Processing Unit (LPU) outpaces the GPU in speed, power, efficiency, and cost-effectiveness, empowering AI applications to reach new heights. Job Summary: We're seeking a seasoned...
-
Site Reliability Engineer
4 weeks ago
Mountain View, California, United States | Tik Tok | Full time
Job Summary: TikTok is seeking a highly skilled Site Reliability Engineer - Edge Services to join our team. As a key member of our Edge SRE team, you will be responsible for ensuring the reliability, fault-tolerance, and scalability of our edge services. Responsibilities: Design and implement solutions to optimize edge service performance and remove...
-
Site Reliability Engineer
4 weeks ago
Mountain View, California, United States | Atlassian | Full time
Job Summary: We are seeking a highly skilled Site Reliability Engineer to join our team at Atlassian. As a Site Reliability Engineer, you will play a critical role in ensuring the performance and reliability of our services. Responsibilities: Improve the performance and reliability of services. Address root causes of incidents and reduce incident rates. Deep dive...
-
Site Reliability Engineer, Data Platform USDS
4 weeks ago
Mountain View, California, United States | Tik Tok | Full time
At TikTok, we're passionate about creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. Our platform connects people from across the globe, and so does our workplace. As a Site Reliability Engineer in the Data Platform area, you'll have the...
-
Site Reliability Engineer
4 weeks ago
Mountain View, California, United States | Tik Tok | Full time
About TikTok U.S. Data Security: TikTok is the leading destination for short-form mobile video. Our mission is to inspire creativity and bring joy. U.S. Data Security (USDS) is a subsidiary of TikTok in the U.S. focused on providing oversight and protection of the TikTok platform and U.S. user data. Job Summary: We are seeking a highly skilled Site Reliability...
-
Site Reliability Engineer
4 weeks ago
Mountain View, California, United States | Samsung Electronics America North America | Full time
Site Reliability Engineer - DevOps Infrastructure. At Samsung Ads, we're transforming the advertising landscape with cutting-edge technology. As a Site Reliability Engineer - DevOps Infrastructure, you'll play a crucial role in ensuring the reliability, scalability, and performance of our advertising technology platform. Key Responsibilities: Design and...