Principal Site Reliability Engineer



Plano, Texas, United States AT&T Full time
Job Title: Principal Site Reliability Engineer

AT&T is seeking a highly skilled Principal Site Reliability Engineer to join our team. As a key member of our engineering organization, you will be responsible for ensuring the high availability, reliability, and resiliency of our customer-facing experiences and shared omnichannel platforms.

Key Responsibilities:
  • Provide 24x7 Tier 1 support for customer-facing applications across eCommerce, Care, and Retail platforms built on a microservices-based architecture, both on-premises and in the cloud.
  • Manage and triage escalated issues, incidents, and outages, and drive prompt resolution.
  • Provide prompt visibility and status of escalated issues, incidents, and outages to leadership, business partners, and other key stakeholders.
  • Own Site Reliability Engineering responsibilities for the application: build a functional and technical knowledge base, create run books, develop observability (alerts, monitoring, and dashboards) that enables proactive incident and problem detection, triage incidents, and help Tier 2 conduct blameless post-mortems (after-action reviews).
  • Oversee daily T1 operations of on-premises and hosted applications and experiences, including data centers, compute, storage, data networks, monitoring, and NOC.
  • Work with Release Management to identify and mitigate risks in upcoming production changes.
  • Work closely with Product Development and Tier 2 SRE teams to ensure knowledge transfer on system changes well before those changes are operationalized.
  • Optimize the overall T1 on-call process and incident response workflow, including managing the team's on-call rotation, alert rules, communication methods, and incident response plans.
  • Provide metrics and status reports and review them with leadership and stakeholder communities; establish processes for metrics gathering, reporting, and communication.
  • Stay current on feature development and how it could affect the system's overall reliability.
  • Assist in developing, publishing, and continually updating technology operations and support Standard Operating Procedures and detailed T1 documentation based on industry best practices.
Requirements:
  • Bachelor's degree in Computer Science, Engineering, or a related field.
  • 10+ years of demonstrated leadership experience building cross-organizational consensus.
  • 10+ years of demonstrated experience building and managing high-performing teams.
  • 10+ years of demonstrated experience with incident management, incident response, and site reliability, including managing a Tier 1 Production Operations team.
  • 10+ years of experience supporting large-scale eCommerce, Care, and Retail POS platforms and supporting applications in production in a leadership capacity.
  • Solid understanding and experience in Application Performance Monitoring tools like Dynatrace, AppDynamics, Introscope, etc.
  • Hands-on experience with Customer Experience Analytics and Session-Based tools like Quantum Metric or Tealeaf.
  • Hands-on experience with Synthetic Monitoring tools like Catchpoint.
  • Experience working within scaled agile development teams.
  • Experience developing and implementing customer journey dashboards to enable proactive monitoring of customer experience availability.
  • Experience designing and managing a world-class technical operations organization including 24x7 support and outage/incident management.
  • Solid knowledge of Operations practices and demonstrated experience increasing Operational capability maturity within an organization.
  • Excellent communication and presentation skills; the ability to present complex technical information in a clear and concise manner.
  • Proficient at analyzing and interpreting large amounts of data with the capacity to synthesize information and translate into effective and actionable insights.
  • Exceptional organization and planning skills, strong analytical abilities, and process-driven orientation.
  • Unrelenting sense of customer focus, urgency, and accuracy with an execution mindset.
  • Self-starter, creative, enthusiastic, innovative, and collaborative outlook.
Primary Technical Skills:
  • Java, Spring, WebLogic, AKS, PL/SQL, and CI/CD tools.
  • Microservices-based architecture using Java, J2EE, Jenkins, Maven, Linux, and K8s, both on-prem and in the cloud.
  • Docker, Kubernetes, Microsoft Azure Cloud, and Unix.
  • Relational and NoSQL databases like Oracle and Cassandra.
  • Experience with visualization tools like Kibana and Grafana.
EFK Stack Experience Preferred:
  • Creation of Dashboards on Dynatrace, ELK, and Grafana.
Secondary Technical Skills (Optional, Yet Highly Desirable):
  • Salesforce Development (Apex, Visualforce, Lightning), Salesforce Sales Cloud and Service Cloud, MuleSoft, Dynatrace, and ELK (Elasticsearch, Logstash, Kibana) for monitoring and logging.
  • Hands-on experience supporting Salesforce applications.
  • Sales and Service Cloud.
  • Experience with Marketing Cloud.
  • Experience in the high-tech, software, and/or wireless/telecom industries is highly desired.
  • Understanding of integration technologies and API Gateway, Mobile, and iOS technology stacks; experience with MuleSoft desired.
  • Solid technical background with understanding of and/or experience in software development, web technologies, and customer communications such as email, SMS, and push notifications.

