L3 Production Support Engineer

7 days ago


San Francisco, California, United States Calfus Full time

Pune

About Calfus

At Calfus, we are known for delivering cutting-edge AI agents and products that transform businesses in ways previously unimaginable. We empower companies to harness the full potential of AI, unlocking opportunities they never imagined possible before the AI era. Our software engineering teams are highly valued by customers, whether start-ups or established enterprises, because we consistently deliver solutions that drive revenue growth. Our ERP solution teams have successfully implemented cloud solutions and developed tools that seamlessly integrate with ERP systems, reducing manual work so teams can focus on high-impact tasks.

None of this would be possible without talent like you. Our global teams thrive on collaboration, and we're actively looking for skilled professionals to strengthen our in-house expertise and help us deliver exceptional AI, software engineering, and solutions using enterprise applications.

As one of the fastest-growing companies in our industry, we take pride in fostering a culture of innovation where new ideas are always welcomed without hesitation. We are driven and expect the same dedication from our team members. Our speed, agility, and dedication set us apart, and we perform best when surrounded by high-energy, driven individuals.

To continue our rapid growth and deliver an even greater impact, we invite you to apply for our open positions and become part of our journey.

About the role:

The L3 Production Support Engineer is a backend-focused full-stack incident SME responsible for owning complex production incidents, driving root cause analysis, and implementing systemic improvements for the agentic on-call management platform. This role bridges incident command, deep backend engineering, and targeted frontend troubleshooting to ensure platform reliability at scale.

What You'll Do:

Incident Management & Leadership
  • Own Sev-1/Sev-2 incident response as incident commander or lead resolver, driving swift diagnosis and resolution
  • Lead post-incident RCAs, identifying systemic issues and driving long-term fixes across backend, infrastructure, and UI
  • Establish and refine incident response playbooks, runbooks, and escalation procedures
  • Participate in on-call rotation as primary/secondary responder with accountability for critical systems

Backend & Infrastructure Expertise
  • Perform deep production troubleshooting: log analysis, distributed tracing, metric correlation, and profiling under pressure
  • Diagnose and fix complex issues across microservices: scheduling engine, LLM orchestration, notification pipeline, and integrations
  • Optimize database queries, identify locking issues, and manage migrations in PostgreSQL under production constraints
  • Architect and implement Redis caching, rate limiting, and queue-based patterns for reliability and scale
  • Work with Kubernetes, container orchestration, and deployment pipelines; manage rollbacks and feature toggles during incidents

Full-Stack Incident Resolution
  • Resolve end-to-end incidents regardless of origin (backend API, database, LLM vendor, or React frontend)
  • Debug and ship targeted React fixes when UI is the fastest path to incident resolution
  • Drive code-level improvements in backend services (Python/FastAPI) to harden agent flows, retry logic, and error handling
  • Collaborate closely with dev teams on defects, performance bottlenecks, and architecture-level changes

Observability & Continuous Improvement
  • Design and tune monitoring, alerting, and SLO/SLI frameworks for the platform
  • Maintain and evolve critical runbooks, playbooks, and knowledge base entries as patterns emerge
  • Mentor L2 engineers on deep troubleshooting, escalation discipline, and incident best practices
  • Drive blameless post-mortems and systemic risk reduction across the platform

On your first day, we'll expect you to have:

Backend (Primary Focus)
  • 5–8+ years in backend engineering with strong hands-on experience in Python/FastAPI or equivalent
  • Deep knowledge of async APIs, background jobs, message queues (Celery, RabbitMQ, or similar), and distributed scheduling
  • Production-grade database skills: PostgreSQL query optimization, locking, migrations, and performance tuning
  • Redis expertise: caching patterns, rate limiting, streams, and pub/sub for real-time systems
  • Strong observability and on-call mindset: designing alerts, understanding SLOs/SLIs, error budgets, and Sev definitions
  • Proficiency with Kubernetes, Docker, container orchestration, and CI/CD pipelines (Jenkins, Bitbucket, GitHub Actions)
  • Understanding of cloud infrastructure (Azure preferred) and networking fundamentals

LLM & Agentic Systems
  • Solid grasp of LLM orchestration concepts: prompt engineering, tool-calling, context windows, rate limits, and vendor-specific behavior
  • Experience with LLM failure modes: hallucinations, token limits, timeout patterns, and cost/latency tradeoffs
  • Knowledge of agent frameworks (LangGraph, similar) and how they compose across microservices
  • Ability to debug LLM-driven flows: tracing prompts, understanding retry/backoff behavior, and validating tool outputs

Full-Stack (Secondary but Required)
  • 2–3+ years hands-on with React and TypeScript in production environments
  • Competency reading and modifying existing React code: components, hooks, routing, state management (Redux/Context)
  • Browser debugging skills: DevTools, React DevTools, network throttling, and performance profiling
  • Ability to implement targeted UI fixes: form validation, error handling, API error display, and minor UX hardening
  • Familiarity with frontend build pipelines: Webpack/Vite, environment configs, feature flags, and deployment strategies

Logging, Metrics & Troubleshooting
  • Expert-level log parsing and correlation across services using structured logging (JSON, correlation IDs)
  • Proficiency with observability platforms (Prometheus, Grafana, Datadog, New Relic, or similar)
  • Ability to construct and execute production queries under incident time pressure
  • Strong shell scripting (bash/Python) for diagnostics, automation, and custom monitoring

Required Soft Skills
  • Incident command maturity: composure under pressure, clear communication, and decisive decision-making during critical outages
  • Technical depth with breadth: deep backend knowledge + sufficient full-stack awareness to own end-to-end incidents
  • Mentorship mindset: capable of raising L2 engineers through code review, pairing, and RCA participation
  • Documentation discipline: ability to capture runbooks, architecture decisions, and lessons learned clearly
  • Cross-functional collaboration: working effectively with dev, SRE, platform, and business teams during incidents

Experience Requirements
  • Minimum 6–10 years in backend/platform/SRE roles with at least 3+ years in production support, incident response, or on-call engineering
  • Proven track record leading Sev-1/Sev-2 incidents in distributed, multi-service systems
  • Experience with at least one agentic AI or LLM-integrated product (customer-facing or internal tools)
  • Comfortable with continuous on-call rotation and on-demand availability for critical incidents

Nice-to-Have
  • Experience with on-call/incident management platforms (PagerDuty, Squadcast, Opsgenie, or custom solutions)
  • Familiarity with RBAC, SSO, and authentication/authorization patterns
  • Knowledge of RAG (Retrieval Augmented Generation) systems

Success Metrics
  • Incident resolution: Mean Time to Resolution (MTTR) for Sev-2/3 incidents and escalation quality for Sev-1 incidents
  • Runbook effectiveness: % of L2 team successfully using documented runbooks without L3 escalation
  • RCA quality: systemic issues identified and fixed; Sev-1 recurrence rate < 1% within 30 days
  • Mentorship impact: L2 engineers able to independently handle higher-complexity issues over 6–12 months
  • On-call reliability: response times, ticket accuracy, and team feedback on L3 support quality


  • Software Engineer L3

    2 weeks ago


    San Francisco, California, United States HiringAgents Full time

    Job title: Junior Backend Software Engineer | Client: Voxel | Location: San Francisco, California, United States - On-Site | Contract type: Full-time, permanent | Contract duration: | Salary: | About The Role: Industrial labor is incredibly dangerous work: almost 3 million people in the US per year are injured in the workplace for entirely preventable and at times, fatal or...


  • San Francisco, California, United States Salesforce Full time

    To get the best candidate experience, please consider applying for a maximum of 3 roles within 12 months to ensure you are not duplicating efforts. | Job Category: Software Engineering | Job Details | About Salesforce: Salesforce is the #1 AI CRM, where humans with agents drive customer success together. Here, ambition meets action. Tech meets trust. And innovation isn't...


  • San Francisco, California, United States Sephora Full time

    Job ID: 279186 | Store Name/Number: CA-FSC SF Off (0174) | Address: 350 Mission St, 20th Floor, San Francisco, CA 94105, United States (US) | Hourly/Salaried: Salaried (Exempt) | Job Type: Full Time | Position Type: Regular | Job Function: Information Technology | Belong to Something Beautiful: At Sephora, beauty is about feeling seen, valued, and empowered, individually and...


  • San Francisco, California, United States Cypress HCM Full time

    Senior Support Engineer (Founding Engineer) | Location: Remote in the United States | Employees: 35 | Industry: Data, Machine Learning, AI | Reports To: Vice President, Engineering | Responsibilities: As the first support engineer on the team, you will serve as the primary technical point of contact for our customers – troubleshooting issues, responding to...


  • San Francisco, California, United States Meter Full time

    Every firmware release we ship runs through our QA lab. The faster and more reliable that lab is, the faster Meter innovates. You'll own the physical and logical backbone that powers testing across all our products, from PDUs and Firewalls to Access points, enabling our quality and engineering teams to ship high-quality firmware faster. What Success Looks...


  • San Francisco, California, United States Axius Full time

    Company Description | Job Description: • Advanced level of server, desktop and remote support knowledge. This experience should include Administration of the following: Windows Server (2000, 2003, 2008), Active Directory, and other third party software and tools (Altiris, Ghost, Anti-Virus, vCloud, etc.) • Intermediate to Advanced understanding and...


  • San Francisco, California, United States Retool Full time

    ABOUT RETOOL: Nearly every company in the world runs on custom software for critical operations like tracking performance metrics, handling customer support workflows, building admin dashboards, and countless other processes you might not have even thought of. But most companies don't have adequate resources to properly invest in these tools, leading to a lot...

  • Product Engineer

    2 days ago


    San Francisco, California, United States Pharos Full time

    Company Overview: Pharos is an early-stage startup dedicated to improving patient safety in hospitals through advanced AI-powered reporting and analytics. Our mission is to make healthcare safer by automating hospital quality reporting and helping staff identify and prevent the root causes of avoidable harm. Our vision is an AI system reviewing every chart at...


  • San Francisco, California, United States Rec Gen Full time

    Founding Technical Support Engineer | Full-time | On-site in San Francisco | $110k–$165k + equity. Rec Gen is partnered with a fast-growing, YC-backed SaaS company that is transforming one of the world's largest, most underserved industries. With more than 400 paying customers and strong recurring revenue, they are scaling rapidly and hiring their first...

  • Product Engineer

    5 days ago


    San Francisco, California, United States Fuku Full time $150,000 - $400,000

    Product Engineer – New Product, First Owner | Location: San Francisco, CA | Compensation: $150,000 – $400, %–1% Equity | Type: Full-Time | Visa Sponsorship: H-1B, O-1, OPT | Priority: High | About Company: Client is a profitable, fast-growing, YC-backed company supported by top investors (GC, SV Angel, A Capital, Liquid 2, etc.). The company provides live company...