Director of Site Reliability Engineering, AI Infrastructure

1 month ago


Oklahoma City, Oklahoma, United States Oracle Full time

Job Description

  • MS or BS in Computer Science, or equivalent experience.
  • 5+ years of experience managing technology teams.
  • 10+ years of software engineering experience.
  • Proven experience as a Director of Site Reliability Engineering or a similar leadership role, with a track record of successfully managing and scaling SRE teams.
  • Strong knowledge of cloud infrastructure, distributed systems, and network architecture.
  • Demonstrated ability to manage and prioritize multiple projects and initiatives in a fast-paced, dynamic environment.
  • Excellent problem-solving and troubleshooting skills, with the ability to analyze complex systems and identify areas for improvement.
  • Strong leadership and communication skills, with the ability to effectively collaborate with cross-functional teams and influence decision-making at all levels of the organization.

Preferred Qualifications

  • Experience with NVIDIA training technologies (CUDA, NCCL).
  • Working familiarity with networking protocols (TCP/IP, UDP, HTTP) and standard network architectures.
  • Strong technical knowledge in distributed systems, high performance computing, and GPU systems.
  • Experience with AI model training infrastructure.

Career Level - M4

Responsibilities

  • Lead and manage a global team of Site Reliability Engineers with 24x7 coverage, providing guidance, mentorship, and support to ensure high performance and professional growth.
  • Define and drive the SRE strategy, goals, and objectives in alignment with the overall business objectives.
  • Develop and maintain a culture of operational excellence and continuous improvement within the SRE team.
  • Collaborate with cross-functional teams, including engineering, product, and operations, to identify and address reliability and performance challenges.
  • Build and maintain proactive monitoring and alerting systems to quickly detect and resolve issues.
  • Implement and maintain robust incident management processes, including incident response, post-incident analysis, and remediation.
  • Lead and participate in capacity planning and performance optimization initiatives to ensure system scalability and availability.
  • Drive the adoption of best practices, automation, and standardization to streamline operations and enhance system reliability.
  • Stay up to date with industry trends and emerging technologies in site reliability engineering and leverage them to enhance our systems and processes.

Disclaimer

Certain US customer or client-facing roles may be required to comply with applicable requirements, such as immunization and occupational health mandates.

Range and benefit information provided in this posting is specific to the stated locations only.

US: Hiring Range: from $120,300 to $291,900 per annum. May be eligible for bonus, equity, and compensation deferral.

Oracle maintains broad salary ranges for its roles in order to account for variations in knowledge, skills, experience, market conditions and locations, as well as reflect Oracle's differing products, industries and lines of business.

Candidates are typically placed into the range based on the preceding factors as well as internal peer equity.

Oracle US offers a comprehensive benefits package which includes the following:
  1. Medical, dental, and vision insurance, including expert medical opinion
  2. Short term disability and long term disability
  3. Life insurance and AD&D
  4. Supplemental life insurance (Employee/Spouse/Child)
  5. Health care and dependent care Flexible Spending Accounts
  6. Pre-tax commuter and parking benefits
  7. 401(k) Savings and Investment Plan with company match
  8. Paid time off: Flexible Vacation is provided to all eligible employees assigned to a salaried (non-overtime eligible) position. Accrued Vacation is provided to all other employees eligible for vacation benefits. For employees working at least 35 hours per week, the vacation accrual rate is 13 days annually for the first three years of employment and 18 days annually for subsequent years of employment. Vacation accrual is prorated for employees working between 20 and 34 hours per week. Employees working fewer than 20 hours per week are not eligible for vacation.
  9. 11 paid holidays
  10. Paid sick leave: 72 hours of paid sick leave upon date of hire. Refreshes each calendar year. Unused balance will carry over each year up to a maximum cap of 112 hours.
  11. Paid parental leave
  12. Adoption assistance
  13. Employee Stock Purchase Plan
  14. Financial planning and group legal
  15. Voluntary benefits including auto, homeowner and pet insurance

The role will generally accept applications for at least three calendar days from the posting date or as long as the job remains posted.

About Us

As a world leader in cloud solutions, Oracle uses tomorrow's technology to tackle today's problems. True innovation starts with diverse perspectives and various abilities and backgrounds.

When everyone's voice is heard, we're inspired to go beyond what's been done before. It's why we're committed to expanding our inclusive workforce that promotes diverse insights and perspectives.

We've partnered with industry leaders in almost every sector, and continue to thrive after 40+ years of change by operating with integrity.

Oracle careers open the door to global opportunities where work-life balance flourishes. We offer a highly competitive suite of employee benefits designed on the principles of parity and consistency. We put our people first with flexible medical, life insurance and retirement options. We also encourage employees to give back to their communities through our volunteer programs.

We're committed to including people with disabilities at all stages of the employment process. If you require accessibility assistance or accommodation for a disability at any point, let us know by calling , option one.

Disclaimer:

Oracle is an Equal Employment Opportunity Employer*. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability and protected veterans' status, or any other characteristic protected by law. Oracle will consider for employment qualified applicants with arrest and conviction records pursuant to applicable law.

*Which includes being a United States Affirmative Action Employer.
