Director, Site Reliability Engineering

4 weeks ago


Raleigh, United States Arch Capital Group Ltd. Full time
With a company culture rooted in collaboration, expertise and innovation, we aim to promote progress and inspire our clients, employees, investors and communities to achieve their greatest potential. Our work is the catalyst that helps others achieve their goals. In short, We Enable Possibility℠.

The Director, Site Reliability Engineering (SRE) is a pivotal role in the technology infrastructure team, responsible for ensuring the highest levels of reliability, scalability, and performance. This leadership role will set the vision and strategic direction for a skilled SRE team, aligning with the strategic objectives of the IT Infrastructure team, and fostering a culture of continuous improvement and operational excellence. This role will require a deep understanding of cloud-based infrastructure services and technologies, distributed systems, product delivery platforms, DevOps, automation, monitoring and a proactive approach to preventing and mitigating potential issues. The incumbent must also foster a culture of innovation and collaboration within a team of highly skilled engineers to meet the organization's evolving needs and deliver a superior digital experience to our product teams and customers.

*This is a hybrid role, onsite twice a week at our Greensboro and Raleigh offices.

Leadership & Strategy
  • Develop and implement a comprehensive SRE strategy that aligns with IT Infrastructure team, IT, and company objectives.
  • Lead the SRE team, setting clear goals and expectations, and providing mentorship and career development opportunities.
  • Collaborate with cross-functional teams to enhance system reliability and efficiency.
Technical Expertise
  • Oversee systems related to the availability of our infrastructure ecosystem, including cloud services and internal tooling.
  • Ensure the team has a deep understanding of, and expertise in, the system architecture, not limited to Kubernetes and OpenShift, but encompassing the entire product delivery stack.
Team Management
  • Manage the SRE team, ensuring effective resource allocation and prioritization of POCs and initiatives.
  • Drive the adoption of best practices in incident management and post-mortem analysis.
Incident Management
  • Be a leader in the response to high-impact infrastructure incidents, ensuring swift resolution and minimal disruption.
  • Implement proactive monitoring and measures to prevent future incidents and improve system resilience.
Communications
  • Articulate the value and accomplishments of the SRE team to stakeholders at all levels.
  • Foster a transparent communication environment within the team and across the organization.
  • Work closely with shared infrastructure services teams (including other SRE teams) within the corporation to establish a productive and transparent partnership and help establish consistent SRE and Infrastructure practices across the company.
Knowledge & Skills:
  • Proven expertise in large-scale complex system engineering and administration including cloud-based infrastructure in Microsoft Azure.
  • Strong leadership skills with the ability to inspire and motivate a high-performing team.
  • Excellent problem-solving abilities and a data-driven approach to decision-making.
  • Technical leadership skills, including collaboration, technical problem-solving, and leading complex, mission-critical initiatives.
  • In-depth understanding of Kubernetes concepts, components, and APIs, with hands-on experience in orchestration of containerized applications using OpenShift (on-premises or in the cloud). Experience with OpenShift's value-added features, such as advanced CI/CD pipelines for containerized product delivery.
  • Experience with GitHub, GitHub Actions, and/or Argo CD or similar technologies.
  • Strong background in agile service delivery, with a focus on iterative service improvement.
Education & Experience:
  • A bachelor's degree in Computer Science, Engineering, or related field; a master's degree is preferred.
  • At least 10 years of experience in IT Infrastructure, system administration, or reliability engineering with a minimum of 5 years in a leadership role.
  • A track record of managing complex infrastructure initiatives and leading incident response efforts.



Do you like solving complex business problems and working with talented colleagues, and do you have an innovative mindset? Arch may be a great fit for you. If this job isn't the right fit but you're interested in working for Arch, create a job alert. Simply create an account and opt in to receive emails when we have job openings that meet your criteria. Join our talent community to share your preferences directly with Arch's Talent Acquisition team.

  • Raleigh, United States Red Hat Full time

    About the Job. Red Hat is seeking a Site Reliability Engineer (SRE) to develop, scale, and operate our OpenShift managed cloud services. OpenShift is Red Hat's enterprise Kubernetes distribution. As an SRE you will contribute to running OpenShift at...


  • Raleigh, United States Associates Systems LLC Full time

    Site Reliability Engineer. Required Experience & Skills: Due to the work you’ll perform and interactions with DoD programs, you will need to be a US citizen with the ability to obtain and maintain a DoD Secret Security Clearance. BS in Computer Science, Engineering, Applied Mathematics, or a related technical field along with 7-9 years relevant work...


  • Raleigh, North Carolina, United States Associates Systems LLC Full time

    Essential Qualifications for Site Reliability Engineer: As part of your responsibilities and interactions with defense programs, you must be a US citizen capable of obtaining and maintaining a DoD Secret Security Clearance. A Bachelor’s degree in Computer Science, Engineering, Applied Mathematics, or a similar technical discipline is required, along with 7-9...


  • Raleigh, North Carolina, United States Veradigm® Full time

    Welcome to Veradigm. Our mission is to be the most trusted provider of innovative solutions that empower all stakeholders across the healthcare continuum to deliver world-class outcomes. Our vision is a connected community of health that spans continents and borders. With the largest community of clients in healthcare, Veradigm is able to deliver an...


  • Raleigh, United States Booz Allen Hamilton Full time

    The Opportunity: Everyone is trying to “harness the power of the cloud,” but not everyone knows how. As a site reliability engineer, you know how to build resilient platforms that meet customer needs and take advantage of the power of containerization both in the cloud and on premises. What if you could use your engineering skills to improve warfighter...


  • Raleigh, United States Veradigm Full time

    Welcome to Veradigm, where our Mission is transforming health, insightfully. Join the Veradigm team and help solve many of today's healthcare challenges being addressed by biopharma, health plans, healthcare providers, health technology partners, and the patients they serve. At Veradigm, our primary focus is on harnessing the power of research, analytics,...


  • Raleigh, United States Allscripts Full time

    Welcome to Veradigm, where our Mission is transforming health, insightfully. Join the Veradigm team and help solve many of today’s healthcare challenges being addressed by biopharma, health plans, healthcare providers, health technology partners, and the patients they serve. At Veradigm, our primary focus is on harnessing the power of research, analytics,...


  • Raleigh, North Carolina, United States Celonis Full time

    About Celonis: Celonis stands as the global frontrunner in Process Mining technology and is recognized as one of the fastest-growing SaaS companies worldwide. We are dedicated to harnessing the potential of data and intelligence to enhance productivity within business operations, and we invite you to be a part of this journey. Role Overview: Join a...


  • Raleigh, North Carolina, United States Ally Full time

    General Information. Reference Number: 17885. Remote Work: No. About Ally and Your Career: At Ally Financial, our success is intrinsically linked to the success of our employees. We prioritize the well-being of our team members, recognizing their diverse interests, families, and aspirations. Our commitment to work-life balance, health, and inclusivity is reflected...


  • Raleigh, United States Veradigm® Full time

    Welcome to Veradigm! Our Mission is to be the most trusted provider of innovative solutions that empower all stakeholders across the healthcare continuum to deliver world-class outcomes. Our Vision is a Connected Community of Health that spans continents and borders. With the largest community of clients in healthcare, Veradigm is able to deliver an...


  • Raleigh, North Carolina, United States Citrix Systems Inc Full time

    Location: Fully on-site in Raleigh, NC. About Our Team: Are you passionate about working in a dynamic and agile environment? If you thrive in a setting that encourages innovation and collaboration, we want to hear from you. Our team is embarking on an exciting journey as we transition back to our roots, focusing on our SaaS offerings and positioning ourselves...


  • Raleigh, United States Delta System and Software Full time

    Job Title: Site Reliability Engineer. Location: Cary, NC. Day 1 onsite requirement. Permanent hire. - Must have good knowledge of Google Cloud Platform (GCP) - Required to have hands-on experience in defining and creating CUJ, SLO, SLI, and Error Budgeting based on NFR - S...


  • Raleigh, United States Cisco Full time

    Who We Are Today’s results-oriented business environment is more than that – it’s a period of disruption between the pandemic, global business change and internal process complexity. For us to focus on simplicity and the best customer experience, we need great talent and the right skillsets to be successful. This is now a mantra for our Cisco...


  • Raleigh, North Carolina, United States Biogen Idec Full time

    Job Overview. Position Summary: The Senior Reliability Engineer plays a crucial role in applying Reliability Engineering principles to enhance the design specifications and operational efficiency of essential assets throughout the organization. This position involves the development of analytical techniques to assess the reliability of components, machinery, and...


  • Raleigh, North Carolina, United States Biogen Idec Full time

    Job Overview. About the Position: The Senior Reliability Engineer is responsible for implementing Reliability Engineering principles to enhance design specifications and operational efficiency of essential assets throughout the organization. This role involves developing analytical techniques to assess the reliability of components, machinery, and processes. The...


  • Raleigh, United States Cisco Full time

    Who We Are Today’s results-oriented business environment is more than that – it’s a period of disruption between the pandemic, global business change and internal process complexity. For us to focus on simplicity and the best customer experience, we need great talent and the right skillsets to be successful. This is now a mantra for our Cisco...


  • Raleigh, United States Biogen Idec Full time

    Job Description About This Role The Sr. Reliability Engineer I applies Reliability Engineering methodologies to optimize design requirements and performance of critical assets across the site. Originates and develops analysis methods for determining reliability of components, equipment and processes. Acquires data and analyzes the data. Prepares and...


  • Raleigh, North Carolina, United States Veradigm® Full time

    Welcome to Veradigm. Our mission is to be the most trusted provider of innovative solutions that empower all stakeholders across the healthcare continuum to deliver world-class outcomes. Our vision is a connected community of health that spans continents and borders. With the largest community of clients in healthcare, Veradigm is able to deliver an...

  • Reliability Engineer

    3 weeks ago


    Raleigh, United States Amentum Full time

    Amentum is seeking a Reliability Engineer to join our team in Winston Salem, NC! Typical work schedule is 1st Shift, 7:00 am – 3:30 pm; hours may vary based on business demand. Weekend hours may be scheduled to support our 24/7 operation. The Reliability Engineer acts as a Lean Maintenance SME and adds support to maintenance teams with development of...

  • Reliability Engineer

    2 months ago


    Raleigh, United States Amentum Full time

    Amentum is seeking a Reliability Engineer to join our team in Winston Salem, NC! Typical work schedule is 1st Shift, 7:00 am – 3:30 pm; hours may vary based on business demand. Weekend hours may be scheduled to support our 24/7 operation. The Reliability Engineer acts as a Lean Maintenance SME and adds support to maintenance teams with development of...