Platform Reliability Engineer
12 hours ago
About
Hydra Host is a fast-growing baremetal HPC infrastructure company building a backbone that our customers rely on to provide rock-solid infrastructure to host their cloud systems and perform mission-critical training and inference. Uptime, reliability, and observability are critical to everything we do. We're hiring a Platform Reliability Engineer to own QA systems and processes, monitoring, and backend service delivery, and to ensure our systems meet internal and customer SLAs. This person will work with devops, datacenter, device, and marketplace teams to ensure these goals are in sync and that systems are in place across all domains.
We support mission‑critical tooling for our enterprise customers. Our success depends on delivering exceptional customer support and operational excellence. We've integrated several dozen datacenters and we are expanding quickly to disrupt the traditional cloud compute model.
We are a small team that relies on lean and automated processes to manage enterprise-scale infrastructure without the need for enterprise-scale red tape and armies of technicians. This role is ideal for someone who thrives at the intersection of automation engineering, systems reliability, tooling development, and operational excellence.
What You'll Do
As our Platform Reliability Engineer, you will:
- Design, deploy, and maintain QA systems used by our development teams to test integration and live system responses across full-stack deployments in local, live, and ephemeral environments.
- Evaluate and integrate monitoring and QA tools to find the right tools for the job.
- Create a unified monitoring platform and processes that datacenter and device teams will integrate to monitor their components (live servers, lifecycle, networks, power, etc.).
- Maintain monitoring processes and dashboards to provide complete visibility into the health, performance, and reliability of our CI systems, software deployments, and testing platforms.
- Create and maintain a systems test suite, in collaboration with our product managers, to validate marketplace changes against all business functions in live and ephemeral QA environments.
- Integrate all of the aforementioned systems to create holistic platform health statistics reporting.
- Design disaster-recovery processes in collaboration with devops.
- Ensure we are meeting uptime SLAs across all platform deployments.
- Work with datacenter and device teams to define service-level indicators (SLIs), service-level objectives (SLOs), and SLAs.
- Establish observability standards across the stack: logs, metrics, traces, alerts, and actionable on-call playbooks.
- Automate everything from monitoring setups to incident responses to eliminate manual toil and increase reliability.
- Drive incident response, root cause analysis, and post-mortems. Turn incident findings into tooling and process improvements.
- Establish the monitoring infrastructure and dashboards that enable everyone — from engineers to execs — to know what's going on.
- Act as the reliability partner to engineering teams: review systems for reliability concerns, help design QA requirements and testing, and help teams meet reliability targets.
Required Qualifications
- 5–8+ years of experience in Reliability Engineering, DevOps, or infrastructure roles focused on large-scale, high-uptime production environments.
- Deep familiarity with monitoring and observability tooling: you've implemented and managed systems, esp. Prometheus, Grafana, and Zabbix.
- Strong experience with service orchestration in multi-region environments (Nomad, Kubernetes, cloud VMs, distributed databases).
- Track record of managing production system uptime and SLAs and building tools to support it.
- Experience writing and reviewing post-mortems and using those findings to drive improvements in tools and process.
- Proficient with scripting and programming languages (Python, Go, Bash, etc.) for automating operational tasks.
- Strong proficiency with infrastructure as code and devops workflows.
- Experience with distributed tracing, log aggregation, and alert tuning.
- Passion for building systems that fail gracefully, alert correctly, and empower others to operate confidently.
- Excellent communication skills: you can write clear documentation, drive incident reviews, and communicate reliability risks to technical and non-technical stakeholders.
Preferred Qualifications
- Experience working with baremetal infrastructure.
- Experience working in a high-stakes, strong-ownership environment such as a start-up.
- Background in performance engineering or capacity planning.
- Familiarity with compliance or customer-facing SLAs (SOC2, uptime guarantees, etc.).
- Experience implementing SRE practices at large scale.
What We Offer
- Competitive compensation: base salary + performance bonus + equity.
- The opportunity to build, own, and set the standards for the monitoring and testing infrastructure for a high-growth infrastructure company.
- Exposure to high-performance computing and state-of-the-art GPU environments.
- A core role in ensuring our systems are reliable, observable, and meet customer SLAs.
- Work on real-world distributed systems challenges in production.
- Remote work environment with a strong culture of ownership and autonomy.
- No red tape: find the right solution, work with the team, get feedback, and get the job done.