We have other current jobs related to this field that you can find below


  • Dallas, United States Themesoft Inc. Full time

    Role: Site Reliability Engineer. Location: Dallas, Texas. Full Time. Salary: $140,000 + Bonus + Benefits. The Site Reliability Engineer is a fundamental piece of the Site Reliability Engineering team. Site Reliability Engineering is accountable for the availability, reliability, and performance of the services and platforms in a highly transactional 24x7 environment....


  • Dallas, United States Themesoft Inc. Full time

    The Site Reliability Engineer is a fundamental piece of the Site Reliability Engineering team. Site Reliability Engineering is accountable for the availability, reliability, and performance of the services and platforms in a highly transactional 24x7 environment. The role: Monitor application performance, take steps to improve overall application performance...


  • Dallas, United States Diamondpick Full time

    Hi, hope you are doing well. Please find the below JD. Title: SRE Engineer. Location: Dallas, TX. Type of Hire: Full Time. Job Description: The Site Reliability Engineer is a fundamental piece of the Site Reliability Engineering team. Site Reliability Engineering is accountable for the availability, reliability, and performance of the services and platforms in a...


  • Dallas, United States Appspace Full time

    Your Role as a Site Reliability Engineer: Our Cloud Operations team seeks a Site Reliability Engineer who is passionate about problem-solving, automating, and maintaining Appspace’s Cloud Platform to support the needs of our Engineering and Customer Care teams. The ideal candidate will see manual work as an opportunity to exercise automation, will...


  • Dallas, United States VDart Inc Full time

    Job Description. Title: SRE / Site Reliability Engineer. Location: TX/Dallas, Hybrid/Onsite. Duration: 1 Year. Skills: Help build a Site Reliability Engineering culture by sharing your best practices, approaches, documentation, and code with other engineering teams. Apply automation and software to any tasks or parts of the system that would benefit from...


  • Dallas, United States Diverse Lynx Full time

    Job Title: Site Reliability Engineer Location: Dallas, TX//Onsite Duration: Full Time-Only Job Description Responsible for ensuring the reliability of systems, minimizing downtime, and maintaining service-level objectives (SLOs). Developing, automating, and implementing automation tools to streamline processes, deploy applications, and manage...


  • Dallas, United States Motion Recruitment Full time

    Job Description Our client, an independent services business that focuses on delivering a unified operating model for cloud, data, IoT and managed services, is looking for a Site Reliability Engineer who will be accountable for the availability, reliability, and performance of the services and platforms in a highly transactional 24x7 environment. This...


  • Dallas, United States Signify Health Full time

    How will this role have an Impact? Join Signify Health's vibrant Site Reliability Engineering team as a Site Reliability Engineer. We're seeking passionate individuals from diverse technical backgrounds. Reporting to the Manager of Site Reliability Engineering, we offer a collaborative environment that values each team member's unique contribution and...


  • Dallas, United States Saxon Global Full time

    As a member of the Production Support/SRE team you will work cross-functionally with a variety of teams and be a core contributor to every significant engineering service or solution that we deliver to our stakeholders. You'll excel if you have enthusiasm for digging deep and a flair for technical communication and prioritization. You will work directly...


  • Dallas, United States Signify Health Full time

    Job Description: How will this role have an Impact? Join Signify Health's vibrant Site Reliability Engineering team as a Site Reliability Engineer. We're seeking passionate individuals from diverse technical backgrounds. Reporting to the Manager of Site Reliability Engineering, we offer a collaborative environment that values each team...


  • Dallas, United States Dice Full time

    Dice is the leading career destination for tech experts at every stage of their careers. Our client, Galaxy i Technologies, Inc., is seeking the following. Apply via Dice today! Site Reliability Engineer Location: Dallas TX Onsite Full Time Skill: Site Reliability Engineer Ensures supported applications are functioning and available by minimizing downtime...


  • Dallas, United States VIZIO Full time

    About the Team: VIZIO releases firmware & software for millions of customers in a time efficient manner. Our goal is to maintain 99.9% uptime for our customers. We are seeking a Site Reliability Engineer to join our expanding organization. The Site Reliability Engineer will report to the Manager, DevOps Security and will play a crucial role in enhancing the...


  • Dallas, United States Motion Recruitment Partners LLC Full time

    Our client, a large managed service provider focused on digital solutions and transformation, is looking for a Site Reliability Engineer to join their team. This person will be responsible for monitoring their application performance, making suggestions to improve performance and stability, and taking the lead on implementing those improvements. The ideal...


  • Dallas, United States Diverse Lynx Full time

    Role: Site Reliability Engineer/DevOps Engineer. Location: Dallas, TX (Onsite). Duration: Full-time. Job Description: Skill: Site Reliability Engineer. Ensures supported applications are functioning and available by minimizing downtime and maximizing performance. Provides technical expertise to the stakeholders and end user ensuring continuous...


  • Dallas, United States JPMorganChase Full time

    Job Description: There's nothing more exciting than being at the center of a rapidly growing field in technology and applying your skillsets to drive innovation and modernize the world's most complex and mission-critical systems. As a Site Reliability Engineer III at JPMorgan Chase within the Enterprise technology, Infrastructure platforms team, you...


  • Dallas, Texas, United States JPMorganChase Full time

    Job Description: There's nothing more exciting than being at the center of a rapidly growing field in technology and applying your skillsets to drive innovation and modernize the world's most complex and mission-critical systems. As a Site Reliability Engineer III at JPMorgan Chase within the Enterprise technology, Infrastructure platforms team, you will solve...


  • Dallas, United States Apple Full time

    Site Reliability Engineering (SRE) Manager - Apple Service Engineering Austin, Texas, United States Software and Services Imagine what you could do here. At Apple, great ideas have a way of becoming great products, services, and customer experiences very quickly. Bring passion and dedication to your job and there's no telling what you could accomplish! Join...


  • Dallas, United States Motion Recruitment Full time

    Our client, a large managed service provider focused on digital solutions and transformation, is looking for a Site Reliability Engineer to join their team. This person will be responsible for monitoring their application performance, making suggestions to improve performance and stability, and taking the lead on implementing those improvements. The ideal...


  • Dallas, United States Motion Recruitment Full time

    Job Description: Our client, a large managed service provider focused on digital solutions and transformation, is looking for a Site Reliability Engineer to join their team. This individual will oversee the functionality and performance of their application, coming up with ideas to make it more stable and efficient, and leading the implementation of those...

Manager, Site Reliability Engineering

2 months ago


Dallas, United States Redwood Software Full time

Important: We have been made aware that individuals are posing as Redwood recruiters in an attempt to deceive candidates into sharing personal information. Redwood employees will only contact you from an “@redwood.com” email domain. If you have questions or suspect an email is fraudulent, please contact us at recruitment@redwood.com.

For this role, we are considering applicants in the United States or the United Kingdom.

OUR MISSION

At Redwood Software we unleash human potential. We empower our customers with lights-out automation for their mission-critical business processes. Redwood Software is the leader in full stack automation for mission-critical business processes. With the first SaaS-based composable automation platform specifically built for ERP, we believe in the transformative power of automation. Our unparalleled solutions empower organizations to orchestrate, manage and monitor their workflows across any application, service or server – in the cloud or on premise – with confidence and control.

CORE VALUES

  • One Team. One Redwood
  • Make Your Own Weather
  • Obsess over Customer Success
  • Work the Problem
  • Be Curious
  • Own the Outcome
  • Respect Each Other

YOUR IMPACT

The SRE Manager is responsible for leading the Site Reliability Engineering (SRE) team, owning and optimizing the incident management process, and ensuring the reliability and performance of the company's SaaS products. This role requires strong leadership, excellent communication skills, and the ability to work collaboratively across various departments to achieve organizational goals. The ideal candidate will have a deep understanding of cloud infrastructure, incident response, and customer support.

Leadership and Team Management:

  • Lead and manage the SRE team, providing guidance, training, and support.
  • Own and lead the incident management process, ensuring incidents are managed effectively from detection to resolution.
  • Establish and maintain incident management policies and procedures.
  • Act as the primary point of contact for all incident-related activities, ensuring clear communication with stakeholders.
  • Manage and build a global team to scale with the growing demand of the SaaS product offering.

Incident Response and Resolution:

  • Oversee the day-to-day management of alerts, system checks, and issue escalation.
  • Ensure the team provides 24x7 on-call support for critical SaaS events and emergencies.
  • Coordinate and lead incident response efforts, ensuring timely and effective resolution of incidents.
  • Perform Root Cause Analysis (RCA) and take corrective actions to prevent recurrence.
  • Ensure Mean Time to Resolution (MTTR) targets for escalated tickets are met by implementing effective escalation procedures and monitoring performance.

Service Level Management:

  • Define, monitor, and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) to ensure reliability and performance standards are met.
  • Regularly review and analyze service performance data to identify areas for improvement and ensure compliance with SLAs.

Process Improvement and Automation:

  • Proactively develop and implement monitoring and alerting systems within the EKS/K8s ecosystem.
  • Enhance infrastructure health by implementing automated checks and remediation scripts.
  • Continuously improve deployment code and automate manual tasks to streamline operations.

Collaboration and Communication:

  • Work closely with Support, Customer Success, Migration, and Professional Services teams to ensure exceptional customer service.
  • Maintain clear and detailed documentation of issues, remediation steps, and RCAs.
  • Work closely with management, product architects, and product team leads to highlight product issues impacting our SaaS offering quality, performance, and SLAs.
  • Communicate effectively with customers and internal teams, ensuring transparency and understanding of incident impacts and resolutions.

Innovation and Technology Integration:

  • Stay current with new technologies and integrate them into the cloud infrastructure to enhance performance and reliability.
  • Deploy applications to EKS/K8s clusters using Terraform and Helm and maintain existing infrastructure under Docker Swarm.

YOUR EXPERIENCE

  • Proven experience as an AWS Cloud Engineer with hands-on expertise in EKS, Terraform, and Helm.
  • Strong background in Docker and Docker Swarm.
  • In-depth knowledge of AWS IAM roles, policies, and CloudWatch logs.
  • Proficient in Linux environments and scripting languages such as Bash and Python.
  • Excellent understanding of web technologies, REST APIs, and DevSecOps principles.
  • Experience with monitoring solutions like Grafana and Prometheus.
  • Exceptional oral and written communication skills.
  • Strong customer-facing communication skills, capable of effectively explaining issues and RCAs.
  • Experience in product/application support for SaaS-based products.
  • Understanding of APIs, databases, systems architecture, and design.
  • AWS Certified Solutions Architect.
  • Working knowledge of IaC, CI/CD and observability.

Desired Attributes

  • Ability to work independently and collaboratively within a team.
  • Strong problem-solving skills and the ability to troubleshoot issues in production environments.
  • Customer-focused mindset, always considering the impact on customers when planning deployments and updates.
  • Ability to lead and motivate a team, fostering a culture of continuous improvement and excellence.

This role requires a proactive leader who can manage and optimize the incident management process, ensuring the highest level of support and service for our SaaS product offerings.
