Site Reliability Engineer
2 weeks ago
Overview
As a Site Reliability Engineer (SRE) at Together, you are responsible for keeping all user-facing services and production systems running smoothly. You are a blend of a pragmatic operator and a software engineer who applies sound engineering principles, operational discipline, and mature automation to our operating environments and codebase.

Qualifications
5+ years of professional SRE or related experience
Bachelor's degree in Computer Science or a related field, or equivalent work experience
Expert knowledge of Ansible (roles, playbooks), Terraform, and Kubernetes
Proficiency in programming/scripting languages
Direct experience in monitoring and observability practices
Advanced knowledge of cloud services
Ability to thrive in a collaborative environment involving different stakeholders and subject matter experts

Responsibilities
Be on an on-call (PagerDuty) rotation to respond to incidents that impact availability
Build and run our infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users
Build monitoring systems to ensure the highest quality service for our customers
Design and implement operational processes (such as deployments and upgrades)
Debug production issues across all services and levels of the stack
Identify improvements for the product architecture from the reliability, performance, and availability perspectives
Plan the growth of Together AI's infrastructure

About Together AI
Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers and engineers in our journey in building the next-generation AI infrastructure.

Compensation
We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is: $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity
Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. Please see our privacy policy.
-
Site Reliability Engineer
1 week ago
San Francisco, CA, United States Writemed Full time About Us Would you like to join one of the fastest-growing organizations with a goal of using the latest AI, GenAI, LLM, Cloud, and Digital Technologies to advance drug development and improve patient care pathways? WriteMed.AI helps Biopharma and Life Sciences companies reduce time to write medical publications and regulatory paperwork. Submit your CV and...
-
Site Reliability Engineer
7 days ago
San Francisco, CA, United States Writemed Full time About Us Would you like to join one of the fastest-growing organizations with a goal of using the latest AI, GenAI, LLM, Cloud, and Digital Technologies to advance drug development and improve patient care pathways? WriteMed.AI helps Biopharma and Life Sciences companies reduce time to write medical publications and regulatory paperwork. Want to make an...
-
Site Reliability Engineer
3 weeks ago
San Francisco, CA, United States P2P Full time Our mission is to bring web3 to a billion people, by providing builders with the tools they need to build exceptional onchain products. Alchemy is the only complete developer platform that offers the powerful APIs, SDKs, and tools necessary to build and scale onchain apps and rollups.
-
Site Reliability Engineer
3 weeks ago
San Francisco, CA, United States Air Apps Full time About Air Apps At Air Apps,...
-
Software Engineering
2 weeks ago
San Francisco, CA, United States Jobright.ai Full time Site Reliability Engineer - Inference role at Jobright.ai. Jobright is an AI-powered career platform that helps job seekers discover the top...
-
Site Reliability Engineer
3 weeks ago
San Francisco, CA, United States ConductorOne Full time ConductorOne is the first AI-native identity security platform that protects every identity: human, non-human, and AI. With powerful automation, platform-level AI, and out-of-the-box connectors, it centralizes access visibility, enforces fine-grained controls, enables just-in-time access, and automates user access reviews across all apps. It’s easy to use,...
-
Site Reliability Engineer
1 week ago
San Francisco, CA, United States SOLANA FOUNDATION Full time Our Mission Our mission is to bring web3 to a billion people, by providing builders with the tools they need to build exceptional onchain products. Alchemy is the only complete developer platform that offers the powerful APIs, SDKs, and tools...
-
Site Reliability Engineer
3 weeks ago
San Francisco, United States Alchemy Full time Our Mission Our mission is to bring web3 to a billion people, by providing builders with the tools they need to build exceptional onchain products. Alchemy is the only complete developer platform that offers the powerful APIs, SDKs,...
-
Site Reliability Engineer
2 weeks ago
San Francisco, United States Rivago Infotech Inc Full time Staff Site Reliability Engineer (SRE) Job Responsibilities As our Staff SRE, you'll be the primary expert responsible for our entire compute ecosystem. Your key responsibilities will include: Design, implement, and lead large-scale, cross-functional projects to improve the reliability, performance, and efficiency of our core services and infrastructure (10×...
-
Site Reliability Engineer
3 weeks ago
San Francisco, United States Workos Full time About WorkOS 🚀 WorkOS builds tools and services for developers to help them implement authentication, identity, authorization, and overall enterprise readiness. We’re a fully distributed team with employees across North American time zones. We’re well-funded, having raised an $80M Series B. Our fast-growing customer base includes hundreds of rapidly...