Principal Site Reliability Engineer

2 months ago


Plano, United States AT&T Full time

NOTE: This position is hybrid, with 3 days a week onsite at our Plano, Texas location. (This is NOT remote.)


This is a very senior-level Principal Site Reliability Engineer role.


Join AT&T and reimagine the communications and technologies that connect the world. Our Consumer Technology experience team is delivering innovative and reliable technology solutions to power differentiated, simplified customer experiences. Bring your bold ideas and fearless risk-taking to redefine connectivity and transform how the world shares stories and experiences that matter. When you step into a career with AT&T, you won’t just imagine the future, you’ll create it.


The Principal System Engineer of Operations Tier 1 is responsible for helping lead a team dedicated to proactively ensuring high availability, reliability and resiliency of AT&T's customer- and agent-facing experiences and shared omnichannel platforms.


Responsibilities

  • Provide 24x7 Tier 1 support for customer- and agent-facing applications operating across eCommerce, Care, & Retail platforms built on a microservices-based architecture, on-prem and in the cloud, including SaaS: Salesforce, Salesforce Marketing Cloud, MuleSoft, etc.
  • Manage escalated issues, incidents and outages; triage them and drive prompt resolution.
  • Provide prompt visibility and status of escalated issues, incidents and outages to leadership, business partners and other key stakeholders.
  • Own Site Reliability Engineering aspects such as developing a functional and technical knowledge base for the application, creating runbooks, building application observability (alerts, monitoring and dashboards) that enables proactive incident and problem detection, triaging incidents, and helping Tier 2 conduct blameless post-mortems (after-action reviews).
  • Oversee daily T1 operations of on-premises and hosted applications and experiences, including data centers, compute, storage, data networks, monitoring and NOC.
  • Work with Release Management on upcoming production changes to identify and mitigate risks.
  • Work closely with Product Development & Tier 2 SRE teams to ensure knowledge transfer on system changes well in advance of those changes being operationalized.
  • Optimize the overall T1 on-call process and incident response workflow, including managing the team’s on-call rotation, alert rules, communication methods and incident response plans.
  • Provide metrics and status reports and review them with leadership and stakeholder communities; establish processes for metrics gathering, reporting and communication.
  • Stay current on feature development and how it could affect the system’s overall reliability.
  • Assist in developing, publishing and continually updating technology operations and support Standard Operating Procedures and detailed T1 documentation based on industry best practices.
  • Provide technical leadership with great communication skills and the ability to create and organize a self-motivated team.
  • Conduct rigorous due diligence on all plans.
  • Drive team engagement. Motivate individuals and teams beyond current scope of influence.
  • Champion and facilitate breakthrough solutions. Take appropriate, intelligent risks.
  • Create, enable and cultivate a culture of responsibility and accountability.
  • Lead by example and operate with transparency, integrity and respect.


Qualifications:

A suitable candidate for this position must possess the following applicable knowledge, skills and abilities. In addition, the candidate must be able to demonstrate his/her competencies and provide applicable examples to support them.

  • Bachelor's degree in Computer Science or Engineering, or a related field
  • 10+ years of demonstrated leadership experience building cross-organizational consensus
  • 10+ years of demonstrated experience building and managing high-performing teams
  • 10+ years of demonstrated experience with incident management, incident response, and site reliability, including managing a Tier 1 Production Operations team
  • 10+ years of experience, in a leadership capacity, supporting large-scale eCommerce, Care, & Retail POS platforms and their supporting applications in production
  • Solid understanding and experience in Application Performance Monitoring tools like Dynatrace, AppDynamics, Introscope, etc.
  • Hands-on experience with Customer Experience Analytics & Session Based tools like Quantum Metric or Tealeaf
  • Hands-on experience with Synthetic Monitoring tools like Catchpoint
  • Experience working within a scaled agile development team.
  • Experience developing and implementing customer journey dashboards to enable proactive monitoring of customer experience availability.
  • Experience designing and managing a world-class technical operations organization including 24x7 support and outage/incident management.
  • Solid knowledge of Operations practices and demonstrated experience increasing Operational capability maturity within an organization.
  • Excellent communication and presentation skills; the ability to present complex technical information in a clear and concise manner.
  • Proficient at analyzing and interpreting large amounts of data with the capacity to synthesize information and translate into effective and actionable insights.
  • Exceptional organization and planning skills, strong analytical abilities, and process-driven orientation
  • Unrelenting sense of customer-focus, urgency and accuracy with an execution mindset
  • Self-starter, creative, enthusiastic, innovative and collaborative outlook


Primary technical skills should include:

  • Java, Spring, WebLogic, AKS, PL/SQL, and CI/CD tools
  • Microservices-based architecture using Java, J2EE, Jenkins, Maven, Linux and K8s, both on-prem and in the cloud.
  • Docker, Kubernetes and Microsoft Azure Cloud, Unix
  • Relational & NoSQL databases like Oracle & Cassandra
  • Experience with visualization tools like Kibana and Grafana. EFK stack experience preferred. (Hands-on experience is a must.)
  • Creation of dashboards on Dynatrace, ELK and Grafana. (Hands-on experience is a must.)


Secondary technical skills (optional, yet highly desirable):

  • Salesforce Development (Apex, Visualforce, Lightning), Salesforce Sales Cloud & Service Cloud, MuleSoft, Dynatrace and ELK (Elasticsearch, Logstash, Kibana) for monitoring and logging
  • Hands-on experience supporting Salesforce applications
  • Sales & Service Cloud
  • Experience with Marketing Cloud
  • Experience within high tech, software and/or wireless/telecom industry highly desired
  • Understanding of integration technologies and API Gateway, Mobile and iOS technology stack; experience with MuleSoft desired
  • Solid technical background with understanding and/or experience in software development, web technologies and customer communications such as email, SMS and push notifications.