Cloud Operations Reliability Engineer

2 weeks ago


Spring, Texas, United States WALT Labs Full time

About WALT Labs: We are dedicated to enabling organizations to harness the transformative capabilities of cloud technology, driving innovation and enhancing operational effectiveness.

Position Overview: We are in search of a passionate and skilled Site Reliability Engineer (SRE) who thrives on technology, excels in troubleshooting, and is committed to delivering exceptional client service.

As a Subject Matter Expert (SME) in the scalability, resilience, and uptime of both our infrastructure and the environments we support, you will play a pivotal role in our operations.

Key Responsibilities:

  • Ensure the continuous availability and dependability of software systems and infrastructure.
  • Develop and refine Service Level Objectives (SLOs) and Service Level Agreements (SLAs) to enhance system reliability.
  • Design, implement, and sustain monitoring and alerting frameworks to proactively identify and resolve issues, utilizing tools such as Datadog and GCP Cloud Monitoring.
  • Diagnose and troubleshoot production challenges across a variety of customer environments and technology stacks, with a primary focus on GCP and AWS.
  • Participate in an on-call rotation to address and resolve production incidents, conducting Root Cause Analyses (RCAs) and postmortems to identify and rectify issues.
  • Create and maintain comprehensive runbooks and playbooks for incident management and troubleshooting.
  • Proactively analyze systems and application environments to identify performance bottlenecks and opportunities for improvement.
  • Conduct load testing and capacity planning to ensure systems can accommodate anticipated traffic and growth.
  • Develop and manage Infrastructure as Code (IaC) using tools like Terraform and Configuration Management solutions such as Ansible and Helm.
  • Collaborate closely with development teams to understand system architecture, identify potential reliability risks, and implement effective solutions.
  • Work alongside operations teams to ensure seamless deployment and functioning of software systems.
  • Master a diverse range of technologies, including VMs, container orchestration, networking, security, databases, data warehouses, serverless technologies, and storage solutions.
  • Efficiently deploy applications into Kubernetes using Helm, and manage Kubernetes administration and troubleshooting.
  • Provide direct support to clients during production outages, delivering expert assistance to swiftly resolve issues while adhering to SLA expectations.
  • Document solutions and processes meticulously, continuously seeking to enhance knowledge, skills, and operational efficiency.

Qualifications:

  • Minimum of 3 years of experience in a Site Reliability Engineer role.
  • Strong understanding of SLOs, SLIs, and KPIs, leveraging observability as a foundational element in daily operations.
  • Extensive knowledge of major services in GCP (e.g., Cloud Run, BigQuery, GKE).
  • In-depth familiarity with key services in AWS.
  • Experience in establishing and managing monitoring solutions such as Datadog, Google Cloud Operations Suite, CloudWatch, Nagios, and Zabbix.
  • Proficiency with various CI/CD systems (e.g., Jenkins, Codefresh, GitLab CI, GitHub Actions, Argo CD).
  • Exceptional problem-solving skills, with the ability to perform under pressure and strong critical thinking capabilities.
  • A genuine passion for technology and a relentless desire to acquire new skills.
  • A customer-centric approach, dedicated to delivering the highest level of service.

Benefits:

  • Comprehensive coverage of your base medical plan.
  • Options for dental, vision, disability, and life insurance.
  • Generous paid time off policy that increases with tenure.
  • 401(k) plan.
  • Opportunities for professional development and career advancement.
  • Bonus incentives.


  • Spring, Texas, United States HP Development Company, L.P. Full time

    Position Title: Cloud Platform Site Reliability Engineer Overview: The Cloud Platform Site Reliability Engineer will play a crucial role in ensuring the stability, scalability, and automation of the Gen AI Platform. This position involves working across major cloud providers such as AWS, Azure, and GCP to facilitate smooth deployment, operation, and...


  • Spring, Texas, United States HP Development Company, L.P. Full time

    Position: SRE - Cloud Platform Overview: The SRE - Cloud Platform is tasked with ensuring the reliability, scalability, and automation of the Gen AI Platform. This role encompasses working across major cloud providers including AWS, Azure, and GCP to facilitate seamless deployment, operation, and monitoring of the platform. Key Responsibilities: • Enhance and...


  • Houston, Texas, United States SLB Full time

    Employer: Schlumberger Technology Corporation Full-time or part-time: Full-time Job title: Site Reliability Engineer Job Location: 1430 Enclave Parkway, Houston, TX 77077 Job Description: Create ultra-scalable and highly reliable software systems through system design consulting, capacity planning, system health monitoring, and sustainable incident...


  • Austin, Texas, United States Procore Technologies Full time

    About the Role: We are seeking a highly skilled Senior Database Reliability Engineer to join our Product & Technology Team at Procore Technologies. As a key member of our Data Division, you will play a critical role in building and maintaining our next-generation construction data platform. Key Responsibilities: Design and implement distributed data storage...


  • Spring, United States WALT Labs Full time

    Job Description: At WALT Labs, we are committed to empowering businesses to leverage the transformative power of cloud technology, facilitating innovation and operational efficiency. Specializing in managed services across Google Cloud Platform (GCP) and Amazon Web Services (AWS), we seek a dedicated local Site Reliability Engineer (SRE) who is...


  • Texas, United States Addison Group Full time

    Position: Cloud Engineer – Google Cloud Platform (GCP) Overview: As a Cloud Engineer specializing in Google Cloud Platform (GCP), you will spearhead pivotal cloud projects, including the transition from AWS to GCP, the engineering of new cloud solutions, and the enhancement of our cloud infrastructure. Your role will be essential in managing cloud platform...


  • Texas, United States Addison Group Full time

    Job Title: Cloud Engineer – Google Cloud Platform (GCP) Location: Remote Salary Range: $130K - $150K (No Bonus) Role Overview: As a Cloud Engineer specializing in Google Cloud Platform (GCP), you will spearhead several vital cloud projects, including transitioning from AWS to GCP, designing new systems, and enhancing our cloud infrastructure. Your leadership...


  • Austin, Texas, United States Visa Full time

    About the Role: We are seeking a visionary Middleware Reliability Engineering Lead to drive the design, implementation, and operation of our next-generation middleware platform. Key Responsibilities: Lead and inspire a high-performing team of engineers to deliver world-class middleware solutions. Design and implement a robust middleware support and deployment...


  • Texas, United States Addison Group Full time

    Job Title: Cloud Engineer – Google Cloud Platform (GCP) Location: Remote Salary Range: $130K - $150K (No Bonus) Role Overview: As a Cloud Engineer specializing in Google Cloud Platform (GCP), you will spearhead several essential cloud projects, including the transition from AWS to GCP, engineering new cloud solutions, and enhancing our cloud infrastructure....


  • Texas, United States Wipro Full time

    About Wipro: Wipro Limited (NYSE: WIT, BSE: 507685, NSE: WIPRO) stands as a premier technology services and consulting firm dedicated to crafting innovative solutions that meet the intricate digital transformation demands of our clients. Our extensive portfolio encompasses consulting, design, engineering, operations, and cutting-edge technologies, empowering...


  • Texas, United States Wipro Full time

    About Wipro: Wipro Limited (NYSE: WIT, BSE: 507685, NSE: WIPRO) stands as a premier technology services and consulting firm dedicated to crafting innovative solutions that tackle clients' most intricate digital transformation challenges. Our extensive suite of capabilities in consulting, design, engineering, operations, and emerging technologies empowers...


  • Texas, United States Wipro Full time

    About Wipro: Wipro Limited (NYSE: WIT, BSE: 507685, NSE: WIPRO) stands as a premier technology services and consulting firm dedicated to crafting innovative solutions that meet the intricate digital transformation demands of our clients. We harness our extensive array of capabilities in consulting, design, engineering, operations, and cutting-edge...


  • Texas, United States Wipro Full time

    About Wipro: Wipro Limited (NYSE: WIT, BSE: 507685, NSE: WIPRO) stands as a prominent technology services and consulting firm dedicated to crafting innovative solutions that tackle the most intricate digital transformation challenges faced by clients. Our extensive portfolio encompasses consulting, design, engineering, operations, and cutting-edge...


  • Texas, United States Wipro Full time

    About Wipro: Wipro Limited (NYSE: WIT, BSE: 507685, NSE: WIPRO) stands as a premier technology services and consulting firm dedicated to crafting innovative solutions that meet the intricate digital transformation requirements of our clients. Our extensive range of capabilities encompasses consulting, design, engineering, operations, and cutting-edge...


  • Texas, United States TalentOla Full time

    Job Title: Terraform Engineer About the Role: We are seeking a highly skilled Terraform Engineer to join our team at TalentOla. As a Terraform Engineer, you will be responsible for designing, implementing, and maintaining cloud infrastructure using Terraform. Key Responsibilities: Design and implement cloud infrastructure using Terraform. Maintain and optimize...


  • Austin, Texas, United States Visa Full time

    Position Overview: Are you driven by the challenge of creating and optimizing high-performance, resilient systems? Visa's Product Reliability Engineering (PRE) team is on the lookout for a forward-thinking Middleware Reliability Engineering (MWRE) Lead to spearhead the architecture, execution, and management of our innovative middleware platform. Key...

  • Cloud Software

    3 months ago


    Houston, Texas, United States SLB Full time

    A Cloud Software & Data Engineer is responsible for developing data engineering applications using third-party and in-house frameworks, leveraging a broad set of development skills that cover data engineering and data accessibility. The Cloud Software & Data Engineer is responsible for the complete software lifecycle - analysis, design, development,...


  • Silver Spring, Maryland, United States GAMA-1 Technologies Full time

    Cloud Computing Solutions Specialist: GAMA-1 Technologies, LLC is in search of a seasoned remote systems engineer to enhance and support infrastructure services tailored for high-performance computing applications operating within the OAR Cloud. This position necessitates a blend of cloud proficiency, web development capabilities, and a robust comprehension of...


  • Silver Spring, Maryland, United States GAMA-1 Technologies Full time

    Cloud Computing Solutions Specialist: GAMA-1 Technologies, LLC is in search of a highly skilled remote systems engineer to enhance and support infrastructure services tailored for high-performance computing applications operating within the OAR Cloud. This position demands a blend of cloud computing proficiency, web development capabilities, and a robust...