SRE/LLM Ops Engineer
3 weeks ago
At CluePoints, we’re redefining how clinical trials are run. As the premier provider of Risk-Based Quality Management (RBQM) and Data Quality Oversight software, we harness advanced statistics, artificial intelligence, and machine learning to ensure the quality, accuracy, and integrity of clinical trial data, helping life sciences organisations bring safer, more effective treatments to patients faster. We’re proud to be an ambitious, fast-growing technology scale‑up with a dynamic and diverse international team representing more than 40 nationalities. Collaboration, flexibility, and continuous learning are part of our DNA.

Role
The SRE, LLMOps (AI Platform) ensures our LLM‑powered services are reliable, observable, and safe in production on Azure and Kubernetes. You’ll blend classic SRE disciplines with LLM‑specific operations: robust evaluation pipelines, prompt/version governance, model/vendor failover, guardrails, and cost/performance monitoring. You know how to build automation with LangChain/LangGraph, operate API‑based LLMs in production, and manage the inherent non‑determinism of models through rigorous testing and observability.

What You'll Bring
Experience: 5+ years in SRE/DevOps/Platform Engineering, including 1–2+ years operating LLM or ML‑backed applications in production (API‑based or hosted models).
LLMOps: hands‑on experience with LangChain/LangGraph building end‑to‑end chains/agents and RAG flows; comfort with vector stores (e.g., Azure AI Search, Pinecone), prompt/version control, and dataset tooling.
Observability: proficiency in instrumenting LLM traces and app telemetry, alert tuning, and root‑cause analysis; familiarity with LangSmith and/or Arize Phoenix (token/cost tracking, latency, failure modes).
Cloud & platform: strong Azure and Kubernetes (AKS) background; GitOps (Flux/ArgoCD), Helm/Kustomize; CI/CD (GitHub Actions/GitLab/Jenkins); IaC (Terraform); secrets, networking, and security baselines.
Languages & tooling: Python (preferred) and one of TypeScript/Go; REST/GraphQL; OpenAI/Azure OpenAI/Anthropic APIs; experience with Redis caches, message queues, and feature flags.

What You'll Be Doing
Instrument deep observability: implement tracing for LLM chains/agents (inputs, outputs, token usage, latency, model/version), correlate with app metrics/logs, and set actionable alerts; leverage LangSmith/Arize Phoenix (or similar) and OpenTelemetry where appropriate.
Safety & guardrails: integrate content safety, PII redaction, jailbreak/prompt‑injection defenses, and policy‑based rails; document exceptions and reviewer workflows. Prefer native platform features (e.g., Azure AI Content Safety) or programmable rails (e.g., NVIDIA NeMo Guardrails).
Cost & capacity management: monitor token and request costs, throughput, and rate limits; implement caching, request shaping, and multi‑tier model selection to balance quality, latency, and spend.
Build evaluation & testing pipelines: create golden datasets and automated evals (offline + CI/CD + canary) to catch regressions from code, prompt, data, or model changes; use LangSmith/OpenAI Evals (or equivalents) to track quality trends over time.
Platform operations on Azure/Kubernetes: ensure secure, compliant, and cost‑efficient operation; maintain IaC, secrets, networking, scaling, and DR/BCP; partner with Security and QA on regulated SaaS controls.
Cross‑functional enablement: work with product/dev teams to set acceptance criteria for AI features, add runtime feature flags/kill‑switches, and embed evals/telemetry from day one.
Comprehensive Health Insurance (medical, dental, and online consultations, 100% employee coverage)
Life Insurance through UNUM
Cafeteria Plan with flexible monthly credits for wellness, entertainment, and travel
MultiSport Card, co‑financed 50/50
Employee Capital Plans (PPK) with 4% employer contribution
A hub‑based hybrid model that blends flexibility with purpose, connecting teams through collaboration, learning, and a vibrant social culture.

CluePoints is an equal opportunity employer committed to diversity and inclusion in the workplace.

Your personal data will be processed by CluePoints for recruitment purposes in accordance with Regulation (EU) 2016/679 (GDPR). If you wish for your data to be retained for future opportunities, please include the following statement in your CV: “I consent to the processing of my personal data by CluePoints for the purposes of future recruitment processes.”
-
Town of Poland, United States Experience One AG Full time
A leading company in AI engineering is seeking an experienced Principal AI Engineer to build agentic LLM experiences. In this role, you will design innovative user interactions and own the implementation of AI solutions. Requirements include excellent knowledge of software engineering, particularly in Python or...
-
AI Solution Architect
2 weeks ago
Town of Poland, United States Fujitsu Full time
Principal Solution Architect (AI Systems Lead): We are seeking a self‑driven AI Systems Lead to own vision‑to‑execution for enterprise AI initiatives, with a hands‑on focus on GenAI integration, security, and governance. You will define the AI vision with business stakeholders, translate it into an actionable roadmap, and lead cross‑functional...
-
Data Scientist
2 weeks ago
Town of Poland, United States Michael Page Full time
The Data Scientist (AI/LLM Specialist) will play a pivotal role in driving innovation and leveraging advanced AI models, including LLMs, to enhance technology capabilities within the healthcare industry. The client is a mid‑sized organisation operating within the healthcare industry, focused on using technology to improve healthcare solutions. Responsibilities: Develop...
-
Senior AI Engineer: LLM, RAG
2 weeks ago
Town of Poland, United States EPAM Systems Full time
A global digital engineering provider is seeking an AI Engineer in New York. This role involves designing solutions using LLMs and collaborating with cross-functional teams to enhance AI capabilities. Candidates should have over 3 years of Python experience, proficiency in AI agent design, and strong problem-solving skills. The company offers a flexible...
-
Senior AI Engineer
2 weeks ago
Town of Poland, United States EXUS Full time
A global technology company is seeking an AI Engineer to develop intelligent credit risk and collections systems. This remote-first role includes the opportunity for collaboration. The ideal candidate has over 5 years of experience in ML/AI, with a strong foundation in Python and LLM frameworks. Competitive salary from $100,000 to $170,000, with benefits...
-
SRE Engineer
3 weeks ago
Village of Waterloo, United States Cambio AI Inc. Full time
As a Site Reliability Engineer (SRE), you will play a key role in establishing and enhancing our engineering platform. You will help ensure the reliability, scalability, and efficiency of our systems while developing tools that improve engineering productivity. About Cambio: Cambio is a software platform for world-class real estate decarbonization. We help...
-
ML/LLM Technical Architect
2 weeks ago
Town of Poland, United States SoftServe Full time
SoftServe is a global digital solutions company, headquartered in Austin, Texas, and founded in 1993. With 2,000+ active projects across the USA, Europe, APAC, and LATAM, we deliver meaningful outcomes through bold thinking and deep expertise. Our people create impactful solutions, drive innovation, and genuinely enjoy what they do. The AI and Data Science...
-
Site Reliability Engineer
3 weeks ago
Town of Poland, United States DevOps projects Full time
Job Overview: As a Site Reliability Engineer (SRE) at Ververica, you will design, provision, and maintain the infrastructure for Ververica’s Unified Streaming Data Platform across multiple cloud providers, including AWS, GCP, and Azure. Your role will involve architectural improvements, implementation ownership, and driving...
-
Site Reliability Engineer
2 weeks ago
Town of Poland, United States XM Full time
Site Reliability Engineers (SRE) - Multiple Openings. The Role: You will join a team working with Observability, Escalations, Post-mortems, Correction of Errors, and other practices that will contribute to the company's goal of cloud resiliency. You will be responsible for driving processes around reliability, best practices, cultural change, and enforcement...
-
Site Reliability Engineer
3 weeks ago
Town of Poland, United States E-Solutions Full time
Build and maintain SRE dashboards using SLIs to measure and monitor SLO adherence. Define and implement auto-healing, resilient, and fault-tolerant systems from design through production. Serve as the primary contact for production application issues, coordinating with engineering teams to resolve incidents efficiently. Diagnose and...