Global Head of Site Reliability Engineering

4 days ago

Remote, Oregon, United States Socure Full time

Why Socure?

At Socure, we're on a mission—to verify 100% of good identities in real time and eliminate identity fraud from the internet.

Using predictive analytics and advanced machine learning trained on billions of signals to power RiskOS, Socure has created the most accurate identity verification and fraud prevention platform in the world. Trusted by thousands of leading organizations, from top banks and fintechs to government agencies, we solve real, high-impact problems at scale. Come join us.

Overview

Socure is the leader in digital identity verification and fraud prevention. We are hiring a bold, hands-on Global Head of Site Reliability Engineering (SRE) to own end-to-end reliability for the platform that powers identity, fraud, and compliance decisions for thousands of organizations across regulated industries. You will lead the global reliability charter for our mission‑critical services and data platform, including public sector programs where Socure is authorized at FedRAMP Moderate and operates in AWS GovCloud (US).

You will set the strategy and build the systems that keep Socure always‑on: multi‑region resilience, graceful degradation, disaster readiness, and real‑time observability across a fast‑evolving stack. You will also lead our red‑team quality assurance function to design and run chaos engineering experiments that harden our infrastructure, data, and application layers under real‑world failure conditions. You will own developer experience for reliability—leading company‑wide CI/CD pipelines, release engineering, and ephemeral environments for rapid, isolated testing—and drive an AI‑first SRE strategy, applying machine learning for anomaly detection, adaptive alerting, automated runbooks, incident summarization, and capacity forecasting.

Socure's platform safeguards highly sensitive, confidential data at massive scale, with workloads that demand low‑latency decisioning and continuous availability. This role is for a systems builder and culture carrier who has operated at the frontier of scale, reliability, and safety.

Why this role is compelling

Own global reliability for a platform trusted by financial institutions, fintechs, marketplaces, telecom, healthcare, and public sector programs—where availability, integrity, and clear evidence are non‑negotiable.
Steer reliability strategy for RiskOS, our risk orchestration engine that unifies identity, fraud, and compliance decisions and integrates a broad partner ecosystem—so improvements compound across every product and integration.
Lead with real impact: institutionalize best practices from large‑scale cloud incidents into a next‑generation reliability program that measurably improves uptime, latency, and time‑to‑recovery.
Shape developer experience at scale: own our CI/CD ecosystem, ephemeral test environments, and change‑management controls that enable safer, faster delivery for all engineering teams.
Work at the platform frontier: a real‑time Identity Graph, a powerful orchestration engine with deep explainability, and a modernization program toward product‑aligned, multi‑account AWS architecture with parity across commercial and GovCloud environments.

What you'll do

Define the global reliability strategy and roadmap across availability, latency, durability, data integrity, cost efficiency, and safety—mapped to clear business outcomes and service level objectives.
Architect multi‑region, multi‑zone resilience patterns with automated failover, graceful degradation, and progressive delivery; validate readiness through continuous game days and fault‑injection experiments.
Build and lead a world‑class red‑team QA and chaos engineering program across infrastructure, data pipelines, and applications; codify attack playbooks and steady‑state guardrails to improve detection and recovery.
Establish a unified observability practice: end‑to‑end tracing, high‑signal alerting, health and saturation indicators, user‑journey telemetry, and incident command protocols—standardized into a single, actionable operations view.
Drive rigorous incident management: real‑time incident command, rapid mitigation, blameless post‑incident reviews, durable corrective actions, and automated safeguards.
Ensure public sector readiness and continuous authorization: sustain FedRAMP Moderate posture, prove environment parity between commercial and GovCloud, and strengthen controls for data residency, deletion, and audit evidence.
Partner with product engineering to make reliability a product feature: embed reliability patterns into RiskOS workflows and make Identity Graph‑based decisions observable, explainable, and resilient by default.
Lead developer tooling and release engineering: own CI/CD pipelines, test sandboxes and ephemeral environments, and the golden paths that make shipping changes safe, repeatable, and fast.
Advance an AI‑first SRE strategy: deploy ML for anomaly detection, incident prediction, adaptive alerting, automated runbooks, incident summarization, and capacity forecasts; measure impact via concrete reliability and efficiency wins.
Lead capacity planning and performance engineering across compute, storage, and networking—delivering consistently low‑latency decisions at peak volumes.
Attract, grow, and retain exceptional reliability engineers and leaders across regions; run a humane, effective, continuously improving on‑call program.

What you'll bring

Deep experience leading reliability for large‑scale, always‑on platforms with highly sensitive data—owning availability, latency, durability, and security across multiple product lines and regions.
Mastery of modern cloud architecture (AWS), product‑aligned multi‑account patterns, real‑time observability, progressive delivery, and automated disaster recovery, with a track record of measurable reliability gains.
Experience building red‑team and chaos engineering programs that surface systemic weaknesses, improve mean time to mitigate, and harden systems over time.
Proven leadership of developer tooling at scale: CI/CD, release engineering, and ephemeral environment strategies that increase velocity while reducing risk.
Strong partnership with product, data, and security; fluency in data lifecycle, retention and deletion, privacy, and governance for regulated industries and public sector.
A people‑first leadership style: you raise the bar on hiring and mentoring, set crisp principles, and build an ownership culture grounded in curiosity, accountability, and continuous learning.

Socure is an equal opportunity employer that values diversity in all its forms within our company. We do not discriminate based on race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
If you need an accommodation during any stage of the application or hiring process—including interview or onboarding support—please reach out to your Socure recruiting partner directly.


Site Reliability Engineer

5 days ago

Remote, Oregon, United States Cutover Full time

An inclusive work environment is an empowering one. At Cutover, we lead with empathy and enable others to succeed through curiosity, kindness, and self-expression. Location: Remote, United States. This role requires on-call shifts, roughly 1 in 4 weeks and 1 in 4 weekends - 2nd Shift: 2:00pm - 11:00pm PST (10:00 PM - 7:00 AM UTC). Cutover provides enterprise...
Senior Site Reliability Engineer

6 days ago

Remote, Oregon, United States Maxihost Full time

About 's global computing platform was launched in 2019, enabling businesses to programmatically deploy single-tenant Bare Metal instances in different parts of the world. We are a team of passionate individuals about hardware, software, and network infrastructure looking to build the fastest, easiest-to-use, developer-centric single-tenant Cloud...
Staff Site Reliability Engineer

2 weeks ago

Remote, Oregon, United States AlphaSense Full time

About AlphaSense: The world's most sophisticated companies rely on AlphaSense to remove uncertainty from decision-making. With market intelligence and search built on proven AI, AlphaSense delivers insights that matter from content you can trust. Our universe of public and private content includes equity research, company filings, event transcripts, expert...
Site Reliability Engineer

2 weeks ago

Remote, Oregon, United States ADT Full time $200,000 - $250,000 per year

ADT is transitioning to an in-office model. New team members will work from home but should plan to return to an in-office model at a later date. We will keep you well informed and supported throughout the transition. Summary: We are seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our team. As an SRE, you will be responsible for...
Site Reliability Engineer

2 weeks ago

Remote, Oregon, United States JWay Group Full time

Sr. Site Reliability Engineer, Stack Management. As a Site Reliability Engineer, you will be responsible for architecting, maintaining, and managing our client's infrastructure which includes solving some of the most challenging cloud access and data security problems for enterprise customers. Job Responsibilities: Maintain and support existing IT infrastructure...
Site Reliability Engineer, SaaS

2 weeks ago

Remote, Oregon, United States Veeam Software Full time

Veeam, the #1 global market leader in data resilience, believes businesses should control all their data whenever and wherever they need it. Veeam provides data resilience through data backup, data recovery, data portability, data security, and data intelligence. Based in Seattle, Veeam protects over 550,000 customers worldwide who trust Veeam to keep...
Site Reliability Engineer

6 days ago

Remote, Oregon, United States 2Prod Technologies Corp. Full time

About 2Prod: 2Prod Technologies Corp. supports the federal government in delivering secure, scalable cloud solutions that advance critical national missions. Position Summary: 2Prod Technologies Corp. is seeking a Site Reliability Engineer (SRE) with strong GitLab expertise to support and enhance enterprise platforms. This role will focus primarily on GitLab...
Senior Site Reliability Engineer

6 days ago

Remote, Oregon, United States Granicus Full time

The Company: Serving the People Who Serve the People. Granicus is driven by the excitement of building, implementing, and maintaining technology that is transforming the Govtech industry by bringing governments and their constituents together. We are on a mission to support our customers with meeting the needs of their communities and implementing our technology...
Lead Site Reliability Engineer

4 days ago

Remote, Oregon, United States Canary Technologies Corp Full time

About Us Canary Technologies is changing the game for hotels with modern software powered by Canary's hospitality-specific AI platform. Canary is utilized by 20,000+ hoteliers in 100+ countries to equip hoteliers with the technology they need to work smarter and wow their guests. Major hotel brands such as Wyndham, Marriott, IHG, Four Seasons, Rosewood, and...
Senior Site Reliability Engineer

7 days ago

Remote, Oregon, United States Fortress Information Security Full time

Senior Site Reliability Engineer. Location: Remote. Compensation: $160, ,000 per year, depending on experience and qualifications. Employment Type: Full-Time. What you can expect as the Senior Site Reliability Engineer at Fortress… The Senior Site Reliability Engineer is responsible for ensuring the reliability, performance, and scalability of critical systems and...
