Staff Software Engineer, Reliability

23 hours ago

New York, New York, United States | Metropolis | Full time | $180,000 - $200,000

Who we are

Metropolis is an artificial intelligence company that uses computer vision technology to enable frictionless, checkout-free experiences in the real world. Today, we are reimagining parking to enable millions of consumers to just "drive in and drive out." We envision a future where people transact in the real world with a speed, ease and convenience that is unparalleled, even online. Tomorrow, we will power checkout-free experiences anywhere you go to make the everyday experiences of living, working and playing remarkable - giving us back our most valuable asset, time.

Who you are

We are building a hyperscaler company and need someone to own reliability across the entire Metropolis platform. As a Staff or Senior Software Engineer focused on Reliability, you'll establish and drive the comprehensive reliability practices that ensure system availability, resilience, and observability for our mission-critical mobility infrastructure serving millions of transactions.

This is your opportunity to build reliability from first principles – architecting failover systems, implementing chaos engineering practices, and improving the observability foundation that will enable Metropolis to scale to new markets while maintaining 99.9%+ uptime. You'll be the technical owner of our reliability posture, working on everything from multi-region failover architectures to incident response workflows to SLO-based alerting strategies.

Our platform handles real-time payment processing, customer authentication, and parking facility operations – systems that cannot go down. You'll tackle challenges like external service failover, dependency mirroring to prevent upstream outages, database replication and automatic promotion, and building the monitoring and alerting infrastructure that ensures we detect and respond to issues in minutes, not hours.

If you're energized by the challenge of ensuring system reliability at scale, building robust failover mechanisms, implementing comprehensive observability, and establishing the practices that prevent incidents before they occur, this role is for you. You'll work alongside highly technical teams across the organization, influencing architecture decisions and establishing reliability standards that affect every service we build.

What you'll do

Own the overall reliability posture for the Metropolis platform, establishing practices, metrics, and systems that ensure 99.9%+ uptime across all services
Design and implement automatic failover mechanisms for critical external dependencies (Twilio for SMS/voice, Stripe for payments) with circuit breakers, retry policies, and degraded mode operations
Architect and build active-passive or active-active regional deployment strategies with database replication, automated failover, and DNS-based traffic routing, including disaster recovery planning and testing
Establish comprehensive monitoring using Datadog for APM, logs, and metrics correlation; implement synthetic monitoring, SLO-based alerting, on-call rotation, and escalation policies; build service health dashboards that show customer impact
Own the incident management process, including workflows, tooling, post-mortem culture, and runbook automation, and drive down mean time to recovery (MTTR) from detection to resolution
Drive adoption of resilience patterns across all services including health checks, graceful degradation, feature flags, rate limiting, backpressure mechanisms, and chaos engineering practices
Build and maintain local mirrors for critical dependencies (Maven/NPM/Docker registries) with artifact caching, dependency pinning, and vulnerability scanning to prevent build failures from upstream outages

What we're looking for

8+ years of backend software engineering experience with deep focus on distributed systems and platform infrastructure
Expert-level Java proficiency with deep understanding of JVM performance, concurrency, and ecosystem tooling. Scala experience is a big plus
Production experience with microservices architecture, container orchestration (Kubernetes), and cloud platforms (AWS)
Strong systems thinking with proven ability to design and implement large-scale, high-availability distributed systems that handle significant load
Observability expertise including hands-on production experience with metrics, logging, tracing, and alerting systems in high-load environments
Database and data systems knowledge including relational databases, event streaming (Kafka, SQS), caching strategies, and data consistency patterns
Experience with AI-powered development tools such as Claude Code, GitHub Copilot, or similar agentic coding tools for enhanced productivity – context engineering in particular
Excellent technical communication with ability to design and document complex systems, lead technical discussions, and collaborate across multiple teams local to the New York City, Seattle, or Los Angeles area

While not required, these are a plus:

SRE or Reliability Engineering experience at companies known for operational excellence or high-growth startups where you built reliability practices from the ground up
Incident response leadership including experience building incident management processes, conducting blameless post-mortems, and driving MTTR reduction initiatives in production environments
Chaos engineering experience with tools like Chaos Monkey, Gremlin, or similar, including designing and executing game days and failure injection testing
Performance optimization experience with profiling, benchmarking, capacity planning, and system tuning at hyperscale, including optimizing high-throughput, low-latency systems
Open source contributions or technical blog writing that demonstrates depth of expertise in reliability engineering, distributed systems, or production operations

Our Stack

Languages + Frameworks: TypeScript, React, Scala (principally), Java (limited)
Datastores: MySQL, PostgreSQL, Snowflake
Cloud: AWS
Version control: Git & GitHub
AI Tooling: Copilot on GitHub
Observability: Datadog

When you join Metropolis, you'll join a team of world-class product leaders and engineers, building an ecosystem of technologies at the intersection of parking, mobility, and real estate. Our goal is to build an inclusive culture where everyone has a voice and the best idea wins. You will play a key role in building and maintaining this culture as our organization grows.

The anticipated base salary for this position is $180,000.00 USD to $200,000.00 USD annually. The actual base salary offered is determined by a number of variables, including, as appropriate, the applicant's qualifications for the position, years of relevant experience, distinctive skills, level of education attained, certifications or other professional licenses held, and the location of residence and/or place of employment. Base salary is one component of Metropolis's total compensation package, which may also include access to or eligibility for healthcare benefits, a 401(k) plan, short-term and long-term disability coverage, basic life insurance, a lucrative stock option plan, bonus plans and more.

Metropolis values in-person collaboration to drive innovation, strengthen culture, and enhance the Member experience. Our corporate team members hold to our office-first model, which requires employees to be on-site at least four days a week, fostering organic interactions that spark creativity and connection.

Metropolis may utilize an automated employment decision tool (AEDT) to assess or evaluate your candidacy for employment or promotion. AEDTs are used to assist in assessing a candidate's application relative to the required job qualifications and responsibilities listed in the job posting.

As part of this process, Metropolis retains data relevant to your candidacy, including personal information, for a period that is reasonably necessary for the use of the tool. If you are hired for the position, your data may become part of your employee records.

Metropolis Technologies is an equal opportunity employer. We make all hiring decisions based on merit, qualifications, and business needs, without regard to race, color, religion, sex (including gender identity, sexual orientation, or pregnancy), national origin, disability, veteran status, or any other protected characteristic under federal, state, or local law.
