We have other current jobs related to this field, which you can find below.


  • Spring, Texas, United States HP Development Company, L.P. Full time

    Position Title: Cloud Platform Site Reliability Engineer Overview: The Cloud Platform Site Reliability Engineer will play a crucial role in ensuring the stability, scalability, and automation of the Gen AI Platform. This position involves working across major cloud providers such as AWS, Azure, and GCP to facilitate smooth deployment, operation, and...


  • Spring, United States WALT Labs Full time

    At WALT Labs, we are committed to empowering businesses to leverage the transformative power of cloud technology, facilitating innovation and operational efficiency. Specializing in managed services across Google Cloud Platform (GCP) and Amazon Web Services (AWS), we seek a dedicated local Site Reliability Engineer (SRE) who is...


  • Spring, Texas, United States HP Development Company, L.P. Full time

    Position: SRE - Cloud Platform. Overview: The SRE - Cloud Platform is tasked with ensuring the reliability, scalability, and automation of the Gen AI Platform. This role encompasses working across major cloud providers including AWS, Azure, and GCP to facilitate seamless deployment, operation, and monitoring of the platform. Key Responsibilities: • Enhance and...


  • Spring, United States CHEMICAL & INDUSTRIAL ENGINEER Full time

    Description: C&I is looking for a talented and driven individual to join our team! About the opportunity: Staff Mechanical Engineer. We are currently seeking a Staff Mechanical Engineer in our office to help us achieve our company mission. The Mechanical Engineering Team is responsible for preparing technical reports, studies,...


  • Spring Grove, United States System One Full time

    Direct hire opportunity at an industrial manufacturing plant in Spring Grove. 100% onsite. Overview: Participates in daily standard work as part of the Spring Grove Business Plan and the Business System. Applies engineering principles in the elimination of equipment downtime, design for reliability of new installations or redesigns, fosters...

  • Reliability Engineer

    3 weeks ago


    Spring Grove, United States System One Holdings, LLC Full time

    Direct hire opportunity at an industrial manufacturing plant in Spring Grove. 100% onsite. Overview: Participates in daily standard work as part of the Spring Grove Business Plan and the Business System. Applies engineering principles in the elimination of equipment downtime, design for reliability of new installations or redesigns, fosters continuous...


  • Silver Spring, United States eSimplicity Full time

    About Us: eSimplicity is a modern digital services company that delivers innovative federal and commercial IT solutions designed to improve the health and lives of millions of Americans while defending our national interests. Our solutions and services improve healthcare for millions of Americans, protect our borders,...

  • Reliability Engineer

    3 weeks ago


    Big Spring, United States Delek US Full time

    Reliability Engineer - Rotating Equipment Location: Big Spring, TX, US, 79720 Are you looking for a career in a dynamic and innovative company that values versatility, growth, and teamwork? Look no further than Delek US Holdings! **WHAT IS DELEK? WHAT DO WE DO?** We are a boutique-sized diversified downstream energy company with a range of assets, including...

  • Reliability Engineer

    2 months ago


    Spring Grove, United States Pixelle Specialty Solutions Full time

    Company Description: Pixelle Specialty Solutions™ Spring Grove Paper Mill is the largest specialty paper company in North America, with fully integrated pulp and paper operations in Chillicothe, Ohio, Spring Grove, Pennsylvania, Stevens Point, Wisconsin and a coating operation in Fremont, Ohio. Supported by an experienced...

  • Reliability Engineer

    3 months ago


    Spring Grove, United States Pixelle Specialty Solutions Full time

    Company Description: Pixelle Specialty Solutions™ Spring Grove Paper Mill is the largest specialty paper company in North America, with fully integrated pulp and paper operations in Chillicothe, Ohio, Spring Grove, Pennsylvania, Stevens Point, Wisconsin and a coating operation in Fremont, Ohio. Supported by an experienced...


  • Spring, Texas, United States WALT Labs Full time

    About WALT Labs: We are dedicated to enabling organizations to harness the transformative capabilities of cloud technology, driving innovation and enhancing operational effectiveness. Position Overview: We are in search of a passionate and skilled Site Reliability Engineer (SRE) who thrives on technology, excels in troubleshooting, and is committed to...


  • Spring Grove, Pennsylvania, United States System One Full time

    Position Overview: We are seeking a dedicated Reliability Engineer to contribute to our industrial manufacturing operations. This role involves engaging in standard daily practices aligned with our business objectives and systems. Key Responsibilities: - Drive enhancements to our Reliability Initiatives across the facility. - Employ effective problem-solving...


  • Spring Grove, Pennsylvania, United States System One Full time

    Position Overview: We are seeking a dedicated Reliability Engineer to join our team at System One. This role is essential in enhancing the operational efficiency of our manufacturing processes. Key Responsibilities: 1. Drive improvements in our Reliability Initiatives across the facility. 2. Utilize established problem-solving methodologies to address...


  • Spring Grove, Pennsylvania, United States System One Full time

    Position Overview: We are seeking a dedicated Reliability Engineer to contribute to our industrial manufacturing operations. This role involves active participation in daily activities aligned with our business objectives and operational excellence. Key Responsibilities: 1. Drive enhancements to our Reliability Initiatives across the facility. 2. Employ...


  • Spring Grove, Pennsylvania, United States System One Full time

    Position Overview: We are seeking a dedicated Reliability Engineer to join our team at System One. This role is pivotal in ensuring the operational efficiency of our manufacturing processes. Key Responsibilities: 1. Drive enhancements to our Reliability Initiatives across the facility. 2. Employ effective problem-solving methodologies to address operational or...


  • Big Spring, Texas, United States Delek US Full time

    Position Overview:The Senior Maintenance Reliability Engineer plays a crucial role in ensuring the uninterrupted functionality and dependability of various mechanical systems within the organization. This position focuses on supporting scheduled maintenance and addressing operational equipment challenges, while applying engineering best practices and...


  • Big Spring, United States Delek US Full time

    Sr Reliability Engineer - Maintenance Location: Big Spring, TX, US, 79720 Are you looking for a career in a dynamic and innovative company that values versatility, growth, and teamwork? Look no further than Delek US Holdings! **What is Delek? What do we do?** We are a boutique-sized diversified downstream energy company with a range of assets, including...


  • Spring Valley, United States Strides Pharma Inc Full time

    Reliability Technician / Mechanic. Full Time. NY-Chestnut Ridge Site, Chestnut Ridge, NY, US. Job Summary: The Technician III, Reliability performs facility reliability maintenance activities in accordance with approved procedures requiring limited supervision. Proficient at analyzing and troubleshooting the root cause of facility,...

Site Reliability Engineer

2 months ago


Spring, United States WALT Labs Full time

At WALT Labs, we are committed to empowering businesses to leverage the transformative power of cloud technology, facilitating innovation and operational efficiency. Specializing in managed services across Google Cloud Platform (GCP) and Amazon Web Services (AWS), we seek a dedicated local Site Reliability Engineer (SRE) who is passionate about technology, excels in problem-solving, and is dedicated to providing unparalleled customer service. You will become the subject-matter expert (SME) on the scale, resiliency, and uptime of our own environments and the customer environments we support.

Role Summary

As a critical member of our team, the SRE will provide technical support and expertise to our managed services clients. This role involves diagnosing and resolving complex issues across diverse cloud environments and technologies, ensuring high performance and reliability. The ideal candidate is a tech enthusiast, eager to expand their knowledge and skills daily, committed to problem-solving and delivering customer-focused solutions within defined Service Level Agreement (SLA) guidelines.

Key Responsibilities:

  • Ensure high availability and reliability of software systems and infrastructure by building out SLOs and SLAs and continually improving system reliability.
  • Design, implement, and maintain monitoring and alerting systems to detect and address issues proactively, mainly using Datadog, GCP Cloud Monitoring, and PagerDuty/incident.io.
  • Debug and troubleshoot production issues across various customer environments, technology stacks, and cloud providers, primarily focusing on GCP and AWS.
  • Participate in an on-call rotation to respond to and resolve production incidents, and conduct root cause analyses (RCAs) and postmortems to identify and address underlying issues.
  • Develop and maintain runbooks and playbooks for incident response and troubleshooting.
  • Proactively identify bottlenecks and areas for improvement, and optimize systems and application environments accordingly.
  • Conduct load testing and capacity planning to ensure systems can handle expected traffic and growth.
  • Develop and maintain infrastructure as code (Terraform) and configuration management (for example, Ansible and Helm).
  • Work closely with development teams to understand system architecture, identify potential reliability risks, and implement solutions.
  • Collaborate with operations teams to ensure smooth deployment and operation of software systems.
  • Master a broad range of technologies, including but not limited to VMs, container orchestration, networking, security, databases, data warehouses, serverless technologies, and storage solutions.
  • Proficiently deploy applications into Kubernetes using Helm, and manage Kubernetes administration and troubleshooting.
  • Provide direct support to clients during production outages, offering expert assistance to swiftly rectify issues, adhering to SLA expectations.
  • Diligently document solutions and processes, constantly seeking to improve knowledge, skills, and operational efficiency.

Requirements

  • Candidates located in the Houston, TX area are preferred, but we accept fully remote candidates within the United States.
  • 3+ years of experience in an SRE role.
  • A deep understanding of how important SLOs, SLIs, and KPIs are to the systems you support, with observability as your daily grounding point.
  • Extensive knowledge of all major services in GCP (Cloud Run, BigQuery, GKE, etc.).
  • In-depth knowledge of all major services in AWS
  • Experience in setting up and managing monitoring solutions like Datadog, Google Cloud Operations Suite, CloudWatch, Nagios, and Zabbix.
  • Familiarity with various CI/CD systems (Jenkins, Codefresh, GitLab CI, GitHub Actions, Argo CD).
  • Exceptional problem-solving capabilities, the ability to work under pressure, and strong critical thinking skills.
  • Ability to act as the voice and commander of incidents, both internal and customer-facing.
  • A passion for technology and an unquenchable thirst for learning new skills.
  • A customer-focused mindset, dedicated to delivering the highest level of service.

Benefits

  • We cover 100% of your base medical plan
  • Dental, vision, disability, and life insurance available
  • Generous PTO policy that increases with longevity
  • 401(k)
  • Professional development and advancement opportunities
  • Bonus incentives