Lead Site Reliability Engineer

1 week ago


Plano, TX, United States | Toyota | Full time
Overview

Who we are

Collaborative. Respectful. A place to dream and do. These are just a few words that describe what life is like at Toyota. As one of the world's most admired brands, Toyota is growing and leading the future of mobility through innovative, high-quality solutions designed to enhance lives and delight those we serve. We're looking for talented team members who want to Dream. Do. Grow. with us.

An important part of the Toyota family is Toyota Financial Services (TFS), the finance and insurance brand for Toyota and Lexus in North America. While TFS is a separate business entity, it is an essential part of this world-changing company, delivering on Toyota's vision to move people beyond what's possible. At TFS, you will help create a best-in-class customer experience in an innovative, collaborative environment.

To save you time when applying, please note that Toyota does not offer sponsorship of job applicants for employment-based visas or any other work authorization for this position at this time.

Who we're looking for

Toyota Financial Services is seeking a skilled and hands-on Lead Site Reliability Engineer - Cloud Platform to help scale and support the reliability, automation, and observability of our AWS infrastructure. In this role, you'll work closely with Cloud Platform Development, Production Engineering, and Incident Management teams to ensure our systems are resilient, self-healing, and ready for business-critical operations.

This role is ideal for someone who brings deep experience in cloud infrastructure and SRE best practices, enjoys solving complex reliability challenges, and is passionate about automation and continuous improvement.

What you'll be doing
  • Operate and optimize cloud-native infrastructure in AWS, with a focus on EKS, Lambda, CloudWAN, Systems Manager, and ECR
  • Build and maintain self-healing automation workflows to reduce manual toil and improve uptime
  • Create and manage AWS Systems Manager (SSM) Automation Documents for operational efficiency
  • Define and track SLIs/SLOs and error budgets to improve system reliability
  • Implement observability using Dynatrace and AWS-native tools (e.g., CloudWatch)
  • Develop and maintain infrastructure as code using Terraform for repeatable, scalable deployments
  • Enhance and support CI/CD pipelines using GitHub and Harness
  • Participate in incident management and on-call rotations, and lead blameless postmortems
  • Collaborate with cloud development teams to improve architecture, delivery, and system performance
  • Troubleshoot cloud infrastructure and networking issues and perform root cause analysis (RCA)
  • Continuously identify opportunities to improve reliability, performance, and operational processes
What you bring
  • 7+ years of experience in SRE, DevOps, or Cloud Infrastructure roles
  • Solid understanding of SRE principles: SLIs, SLOs, error budgets, incident response
  • Hands-on experience with AWS services such as EKS, Lambda, CloudWAN, EC2, S3, RDS, Redshift, Systems Manager
  • Strong knowledge of network architecture and protocols within AWS
  • Experience building automated remediation and self-healing systems
  • Proficiency with Terraform, Python, Bash, and infrastructure as code principles
  • Experience with CI/CD tools (GitHub, Harness) and observability platforms (Dynatrace, CloudWatch)
  • Familiarity with ITSM processes and cloud security best practices
  • Excellent troubleshooting, problem-solving, and collaboration skills
  • Ability to work independently and within a cross-functional team environment
Added bonus if you have
  • Bachelor's degree in Information Technology or related field
  • AWS Certifications (e.g., DevOps Engineer, Solutions Architect)
  • Experience with integration tools like MuleSoft, Apache Camel, or message streaming platforms
What we'll bring

During your interview process, our team can fill you in on all the details of our industry-leading benefits and career development opportunities. A few highlights include:
  • A work environment built on teamwork, flexibility, and respect
  • Professional growth and development programs to help advance your career, as well as tuition reimbursement
  • Team Member Vehicle Purchase Discount
  • Toyota Team Member Lease Vehicle Program (if applicable)
  • Comprehensive health care and wellness plans for your entire family
  • Toyota 401(k) Savings Plan featuring a company match, as well as an annual retirement contribution from Toyota regardless of whether you contribute
  • Paid holidays and paid time off
  • Referral services related to prenatal services, adoption, childcare, schools and more
  • Tax Advantaged Accounts (Health Savings Account, Health Care FSA, Dependent Care FSA)
  • Relocation assistance (if applicable)



Belonging at Toyota

Our success begins and ends with our people. We embrace all perspectives and value unique human experiences. Respect for all is our North Star. Toyota is proud to have 10+ different Business Partnering Groups across 100 different North American chapter locations that support team members' efforts to dream, do and grow without questioning that they belong.

Applicants for our positions are considered without regard to race, ethnicity, national origin, sex, sexual orientation, gender identity or expression, age, disability, religion, military or veteran status, or any other characteristics protected by law.

Have a question, need assistance with your application, or require any special accommodations? Please send an email to talent.acquisition@toyota.com.
