Site Reliability Engineer

2 months ago


Mountain View, California, United States TikTok Full time
About the Role

We are seeking an experienced Site Reliability Engineer to join our USDS Video Platform team at TikTok. As a key member of our team, you will be responsible for ensuring the reliability and performance of our video system, which serves billions of users worldwide.

Responsibilities
  • Ensure the overall reliability of TikTok's video system, including video publishing and distribution.
  • Perform lifecycle management of production systems, including change management, service deployment, operations, and emergency response.
  • Monitor the system and respond to incidents to maintain the system's service level agreement (SLA); review and follow up on all production incidents.
  • Perform capacity management of compute, storage, and network bandwidth resources to ensure system stability and reduce infrastructure costs.
  • Provide strong support during major events to ensure the system can handle large volumes of Internet traffic.
  • Build tools, automations, visualizations, and monitors to facilitate the operation and optimization of the global infrastructure.
Requirements
  • Bachelor's degree in Computer Science or a related technical background involving software/system engineering, or equivalent working experience.
  • 2+ years of SRE or DevOps experience in large-scale online services.
  • Programming experience with at least one of the following languages: C, C++, Java, Python, C#, or Go.
Preferred Qualifications
  • Extensive knowledge of networking, operating systems, database systems, and container technology.
  • Good understanding of every aspect of microservice architecture, and hands-on experience troubleshooting large-scale distributed systems.
  • Hands-on experience with common open-source systems such as Linux, MySQL, MongoDB, Redis, and ELK.
  • Experience building solutions with AWS, Google, Azure, or other cloud services is a plus.
About TikTok

TikTok is a world-leading video platform that provides multimedia storage, delivery, and transcoding services. Our mission is to inspire creativity and bring joy to our users. We are committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives.

We are passionate about this mission and hope you are too. If you are passionate about ensuring software reliability, love problem-solving, and are prepared for exciting challenges, we would like you to join our team.



  • Mountain View, California, United States Moveworks Full time

    About Moveworks: Moveworks is a leading AI startup that provides a universal AI copilot for search and automation across all business applications. Our mission is to empower employees to work faster and more efficiently by eliminating repetitive support issues and delivering instant knowledge. Job Description: We are seeking a highly skilled Site Reliability...


  • Mountain View, California, United States Moveworks Full time

    About Moveworks: Moveworks is a leading AI-powered automation platform that helps businesses streamline their operations and improve employee productivity. Our innovative technology enables employees to find information and get support in one place, reducing costs and increasing efficiency. Job Description: We are seeking a highly skilled Site Reliability...


  • Mountain View, California, United States Atlassian Full time

    About the Role: We are seeking a highly skilled Site Reliability Engineer to join our team at Atlassian. As a Site Reliability Engineer, you will play a critical role in ensuring the performance, reliability, and scalability of our cloud-based services. Responsibilities: Design, implement, and maintain scalable and reliable cloud infrastructure. Collaborate with...


  • Mountain View, California, United States TikTok Full time

    About the Role: We are seeking a skilled Site Reliability Engineer to join our Applied Machine Learning (AML) team. As a Site Reliability Engineer, you will be responsible for designing, building, and maintaining highly available, scalable, and fault-tolerant systems. Responsibilities: Design and develop large-scale systems that meet the needs of our AML...


  • Mountain View, California, United States Atlassian Full time

    About the Role: We're seeking a highly skilled Site Reliability Engineer to join our team. As a Site Reliability Engineer, you will play a critical role in ensuring the performance and reliability of our services. You will work closely with our teams to identify and resolve issues, and develop solutions to improve our systems. Key Responsibilities: Investigate...


  • Mountain View, California, United States TikTok Full time

    Job Title: Site Reliability Engineer, Edge. At TikTok, we're committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. Our platform connects people from across the globe, and so does our workplace. About the Role: We're seeking a highly skilled Site Reliability Engineer to join our Edge team. As a...


  • Mountain View, California, United States Groq Full time

    Job Title: Senior Site Reliability Engineer. We are seeking a highly skilled Senior Site Reliability Engineer to join our team at Groq. As a key member of our infrastructure operations team, you will be responsible for ensuring the reliability, scalability, and performance of our tools and services. Key Responsibilities: Design and implement scalable and...


  • Mountain View, California, United States Groq Full time

    Unlock the Power of AI with Groq: We're on a mission to democratize access to AI, and we need your expertise to make it happen. As a Senior Site Reliability Engineer at Groq, you'll play a critical role in ensuring the reliability, scalability, and performance of our tools and services. Key Responsibilities: Design and implement scalable and reliable...


  • Mountain View, California, United States TikTok Full time

    About the Role: We are seeking a skilled Site Reliability Engineer to join our AML team, where you will play a critical role in designing, building, and maintaining highly available, scalable, and fault-tolerant systems. Responsibilities: Design and implement large-scale systems to ensure high availability and scalability. Monitor and analyze system performance,...


  • Mountain View, California, United States Moveworks Full time

    About the Role: Moveworks is the universal AI copilot for search and automation across all your business applications. We give employees one place to go to find information and get support while reducing costs for your business. The Moveworks Copilot is powered by an industry-leading Reasoning Engine that uses a combination of public and proprietary language...


  • Mountain View, California, United States TikTok Full time

    About TikTok U.S. Data Security: TikTok is a leading destination for short-form mobile video, inspiring creativity and bringing joy to millions of users worldwide. Our mission is to empower creators and communities to express themselves authentically, while ensuring the security and integrity of our platform. Job Summary: We are seeking a highly skilled Site...


  • Mountain View, California, United States Groq Full time

    Job Description: At Groq, we're revolutionizing the AI economy by making processing power more accessible, faster, and more affordable. As a Senior Site Reliability Engineer, you'll play a critical role in ensuring the reliability, scalability, and performance of our tools and services. Responsibilities: Design and implement scalable and reliable architectures...


  • Mountain View, California, United States Groq Full time

    Unlock the Power of AI with Groq: At Groq, we're revolutionizing the AI economy by making processing power more accessible, faster, and more affordable. Our Language Processing Unit (LPU) outpaces the GPU in speed, power, efficiency, and cost-effectiveness, empowering a world where AI is universally accessible. Join Our Mission: We're seeking a Senior Site...


  • Mountain View, California, United States Insight Global Full time

    Site Reliability Engineer Opportunity in the Bay Area: We are seeking a highly motivated Site Reliability Engineer to join our team in the Bay Area. As a Site Reliability Engineer, you will be responsible for ensuring the reliability and scalability of our cloud infrastructure. Key Responsibilities: Strong Linux System Admin fundamentals (bash/shell...


  • Mountain View, California, United States NewsBreak Full time

    Transform Local News with NewsBreak: At NewsBreak, we're revolutionizing the way users interact with local news and their communities. Our mission is to foster safer, more vibrant, and authentically connected lives through robust collaborations with local publishers and businesses across the nation. As a Site Reliability Engineer, you'll play...


  • Mountain View, California, United States Samsung Electronics America North America Full time

    Job Title: Platform Site Reliability Engineer. Samsung Ads is a thriving business poised for even greater success, and we're looking for a passionate leader to join our Global Ads Product & Engineering team. About the Role: We're the innovators behind the products, tech, and tools driving ad-based monetization. As a Site Reliability Engineer specializing in...


  • Mountain View, California, United States TikTok Full time

    About the Role: This is a Site Reliability Engineer position focusing on data pipeline reliability for the Video Platform team in USDS. Data SREs monitor data and keep production batch and real-time processing jobs up and running with the highest level of availability, ensuring our users have the freshest, most complete, and correct data...


  • Mountain View, California, United States Groq Full time

    About Groq: Groq is a company that believes in an AI economy powered by human agency. We envision a world where AI is accessible to all, and we're working towards making that a reality. Job Description: We're looking for a Principal Site Reliability Engineer to join our team. As a Site Reliability Engineer, you will play a critical role in ensuring the...


  • Mountain View, California, United States Atlassian Full time

    Job Summary: We are seeking a highly skilled Site Reliability Engineer to join our team at Atlassian. As a Site Reliability Engineer, you will be responsible for ensuring the performance and reliability of our services, as well as addressing root causes of incidents and reducing incident rates. You will work closely with our development teams to identify and...


  • Mountain View, California, United States Groq Full time

    Job Title: Principal Site Reliability Engineer. At Groq, we're revolutionizing the AI economy by making processing power more accessible, faster, and more affordable. Our Language Processing Unit (LPU) outpaces the GPU in speed, power, efficiency, and cost-effectiveness, empowering AI applications to reach new heights. Job Summary: We're seeking a seasoned...