AI Ops Site Reliability Engineer
3 weeks ago
Description

TikTok is the leading destination for short-form mobile video. Our mission is to inspire creativity and bring joy. TikTok has global offices including Los Angeles, New York, London, Paris, Berlin, Dubai, Singapore, Jakarta, Seoul and Tokyo.

Why Join Us

Creation is the core of TikTok's purpose. Our platform is built to help imaginations thrive. This is doubly true of the teams that make TikTok possible. Together, we inspire creativity and bring joy, a mission we all believe in and aim towards achieving every day. To us, every challenge, no matter how difficult, is an opportunity: to learn, to innovate, and to grow as one team. Status quo? Never. Courage? Always. At TikTok, we create together and grow together. That's how we drive impact for ourselves, our company, and the communities we serve. Join us.

Join our innovative Site Reliability Engineering (SRE) team, which merges software development with infrastructure operations to manage large-scale, highly distributed systems. We leverage cutting-edge AI technology, such as Large Language Models (LLMs), for efficiency and are actively shaping the future of AI Ops technology.

Key Responsibilities:
- Develop and implement AI-based software for efficient and intelligent management of service-oriented architecture (SOA), driving research on ML algorithms and leveraging AI technology to solve complex site reliability issues.
- Explore practical applications of LLM technology in the field of AI Ops, providing algorithmic services such as intelligent interaction, root cause analysis, and anomaly detection.
- Construct an LLM applications framework, integrate it into a unified SRE software platform, and provide intelligent services to enhance operational efficiency.
- Continuously keep up with cutting-edge LLM technologies, open-source solutions, and their applications in the field of AI Ops.

Qualifications
- Bachelor's degree in Computer Science or equivalent, with 5 years of experience as an ML Engineer or ML Applied Scientist.
- Experience with AI Ops, particularly with the stability of cloud platforms. This includes, but is not limited to, anomaly detection, log monitoring, fault diagnosis, and root cause analysis.
- Proficiency in the algorithmic principles of mainstream large language models (such as GPT, ChatGPT, LLaMA), fine-tuning strategies, prompt engineering, vector databases, and application paradigms like LangChain.
- Strong problem-solving and communication skills, excellent data sensitivity, and business understanding; capable of deriving valuable insights from complex business data.

TikTok is committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. Our platform connects people from across the globe and so does our workplace. At TikTok, our mission is to inspire creativity and bring joy. To achieve that goal, we are committed to celebrating our diverse voices and to creating an environment that reflects the many communities we reach. We are passionate about this and hope you are too.

TikTok is committed to providing reasonable accommodations in our recruitment processes for candidates with disabilities, pregnancy, sincerely held religious beliefs or other reasons protected by applicable laws. If you need assistance or a reasonable accommodation, please reach out to us at dataecommerce.accommodationstiktok.com

Regular
Experienced
-
Site Reliability Engineer
2 weeks ago
San Jose, United States HCLTech Full time
About HCLTech: HCLTech is a global technology company, home to 221,000+ people across 60 countries, delivering industry-leading capabilities centered around digital, engineering and cloud, powered by a broad portfolio of technology services and products. We work with clients across all major verticals, providing industry solutions for Engineering Services,...
-
San Francisco, United States Ponce Ai Full time
Job Description
What to Expect: We are seeking a skilled and creative ML Ops Engineer to join our team. As an ML Ops Engineer, you will be responsible for utilizing open-source diffusion image generation models to develop high-quality and visually appealing photorealistic images that incorporate the cosmetic medical procedure results. You’ll...
-
Director of Engineering, AI Platform
10 hours ago
San Francisco, United States Snorkel AI Full time
We're on a mission to democratize AI by building the definitive AI data development platform. The AI landscape has gone through incredible change between 2016, when Snorkel started as a research project in the Stanford AI Lab, and the generative AI breakthroughs of today. But one thing has remained constant: the data you use to build AI is the key to...
-
Director of Engineering, AI Platform
4 weeks ago
San Francisco, United States Snorkel AI, Inc. Full time
We are looking for a Director of Engineering to lead our AI Platform team. Our AI Platform team builds innovative software systems to power the Snorkel Flow platform. This includes services to train and serve generative AI and machine learning models using novel data-centric techniques, libraries to support AI workflows for a variety of data modalities and...
-
Site Reliability Engineer
1 day ago
San Francisco, United States Anthropic Full time
We are looking for a Site Reliability Engineer who will ensure the high availability and performance of our Kubernetes clusters that power machine learning research and services.
About Anthropic: Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole....
-
Site Reliability Engineer
2 days ago
San Francisco, United States Anthropic Full time
We are looking for a Site Reliability Engineer who will ensure the high availability and performance of our Kubernetes clusters that power machine learning research and services.
About Anthropic: Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole....
-
Site Reliability Engineer
7 hours ago
San Francisco, United States Instabase Full time
At Instabase, we're passionate about democratizing access to cutting-edge AI innovation to enable any organization to solve previously unsolvable unstructured data problems in their industry. With customers representing some of the largest and most complex organizations in the world, and investors like Greylock, Andreessen Horowitz, and Index Ventures, our...
-
Site Reliability Engineer
1 week ago
San Jose, United States Myriad Consulting Inc Full time
This role is also open to junior (3+ yoe) candidates and SRE leads (7+ yoe). The Site Reliability Engineering (SRE) team combines software and systems engineering to build and run large-scale, massively distributed, and fault-tolerant systems. In our team, you'll have the opportunity to manage the complex challenges of scale, while using expertise in coding,...
-
Site Reliability Engineer
2 days ago
San Francisco, United States Instabase Full time
At Instabase, we're passionate about democratizing access to cutting-edge AI innovation to enable any organization to solve previously unsolvable unstructured data problems in their industry. With customers representing some of the largest and most complex organizations in the world, and investors like Greylock, Andreessen Horowitz, and Index Ventures, our...
-
Infrastructure and Site Reliability Engineer
2 weeks ago
San Francisco, California, United States Observable Full time
Observable is seeking a full-time infrastructure and site reliability engineer to help improve, administer, and grow Observable systems as we scale to meet our customers' needs.
What you will do
Perform site reliability and ops work for Observable production and staging environments (manage servers, tweak WAF rules, optimize SQL queries, and more). Design and...
-
Site Reliability Engineer
2 days ago
San Francisco, United States Talkdesk Full time
At Talkdesk, we are courageous innovators focused on helping organizations around the world create better customer experiences. Our AI-powered cloud contact center solutions optimize our customers’ most critical customer service processes. We are recognized as a Contact Center as a Service (CCaaS) leader by influential research organizations including...
-
Senior Site Reliability Engineer
6 hours ago
San Jose, United States HireIO Inc Full time
Job Description
Introduction: We are an all-in-one video editing solution that helps you create incredible videos. With the mission of making content creation easier and more engaging, we were first launched on mobile platforms in April 2020. In less than a year, we were released in Brazil, US, Indonesia, Japan and several other countries. To...
-
AI Engineer
2 weeks ago
San Jose, United States Diverse Lynx Full time
Engineer - AI/Client, Python, Linux, C/C++, Shell Scripting. Bachelor's/master's in computer science, computer engineering, data science/analytics, or a related field. Strong Python programming skills. Good C/C++ programming skills. Excellent written/verbal communication skills. Experience in a field associated with the deployment of AI/Client models. Experience...
-
AI Engineer
1 month ago
San Jose, United States Diverse Lynx Full time
Engineer - AI/Client, Python, Linux, C/C++, Shell Scripting. Bachelor's/master's in computer science, computer engineering, data science/analytics, or a related field. Strong Python programming skills. Good C/C++ programming skills. Excellent written/verbal communication skills. Experience in a field associated with the deployment of AI/Client models. Experience...
-
AI Engineer
3 weeks ago
San Jose, United States Diverse Lynx Full time
Engineer - AI/Client, Python, Linux, C/C++, Shell Scripting. Bachelor's/master's in computer science, computer engineering, data science/analytics, or a related field. Strong Python programming skills. Good C/C++ programming skills. Excellent written/verbal communication skills. Experience in a field associated with the deployment of AI/Client models. Experience...
-
Sr Site Reliability Engineer
4 weeks ago
San Jose, United States Hireio, Inc. Full time
Job Description
Introduction: We are an all-in-one video editing solution that helps you create incredible videos. With the mission of making content creation easier and more engaging, we were first launched on mobile platforms in April 2020. In less than a year, we were released in Brazil, US, Indonesia, Japan and several other countries. To...
-
Senior Site Reliability Engineer
4 weeks ago
San Jose, United States Hireio, Inc. Full time
Job Description
Introduction: We are an all-in-one video editing solution that helps you create incredible videos. With the mission of making content creation easier and more engaging, we were first launched on mobile platforms in April 2020. In less than a year, we were released in Brazil, US, Indonesia, Japan and several other countries. To...
-
Site Reliability Engineer
2 days ago
San Francisco, United States Talkdesk Full time
At Talkdesk, we are courageous innovators focused on helping organizations around the world create better customer experiences. Our AI-powered cloud contact center solutions optimize our customers’ most critical customer service processes. We are recognized as a Contact Center as a Service (CCaaS) leader by influential research organizations including...
-
Site Reliability Engineer, Research Platform, SRE
19 hours ago
San Francisco, United States OpenAI Full time
About the team: Reliable services are what enable OpenAI to train the best AI models in the world and to bring the promise of safe, effective AI to the world. The SRE team in research is responsible for defining, measuring, and improving the reliability of the research platform. The SRE team works closely with the supercomputing and hardware health teams...
-
Senior Site Reliability Engineer
2 days ago
San Jose, United States OKX Full time
Who We Are: OKX is revolutionising world systems through our cutting-edge digital asset exchange, Web3 portal and blockchain ecosystems. We are deeply committed to shaping a fairer, more transparent and accessible society through blockchain technology, and to date we have 50+ million users, 3000+ employees and 180+ countries believing in the same vision as us....