Platform Owner AIOps SRE

2 days ago


Waltham, MA, United States National Grid Full time

About us

Every day, we deliver safe and secure energy to homes, communities, and businesses, connecting people to the energy they need for their lives. Our expertise and track record position us uniquely to shape the sustainable future of our industry as the pace of change accelerates. To succeed, we must anticipate customer needs, reduce energy delivery costs, and pioneer flexible energy systems. This requires delivering on our promises and seeking opportunities for growth.

In IT and Digital, we collaborate closely with the diverse energy businesses within the National Grid group, revolutionizing operations through technology. Embracing Agile methodologies and Digital mindsets, we drive efficiency and bring new capabilities to internal and external customers as we lead the charge towards a carbon-free future.

Our work is critical, as National Grid powers millions of homes and businesses in the UK and US, and the technology we employ is vital to this task. The successful applicant for this position will play a crucial role in our mission, supported by our multicultural, customer-centric global team, with opportunities for professional development.

National Grid is hiring a Platform Owner AIOps SRE. This position offers remote flexibility, with the requirement that candidates reside in one of the following states: New York (NY), New Jersey (NJ), Massachusetts (MA), Connecticut (CT), Vermont (VT), Rhode Island (RI), Maine (ME), or New Hampshire (NH).

Job Purpose

As a Platform Owner of AI Ops and SRE, your primary objective is to design and oversee the implementation of complex systems that meet functional and non-functional requirements. You will play a key role in developing system design policies, standards, and innovation processes specific to AI Ops and SRE. Additionally, you will actively monitor emerging technologies and assess their potential impact on the organization. Your responsibilities will include driving the strategic vision for AI Ops and SRE within the platform, ensuring alignment among stakeholders, and promoting a cohesive approach to AI Ops and SRE implementation.

Key Accountabilities

As a Platform Owner of AI Ops and SRE, your primary responsibility is to develop comprehensive strategies for implementing AI Ops and SRE practices within the organization. This involves understanding business requirements, assessing technical capabilities, and identifying areas where AI and automation can be leveraged to enhance reliability, performance, and operational efficiency.

Your key responsibilities as a Platform Owner of AI Ops and SRE include:
* Developing AI Ops and Site Reliability Engineering (SRE) Strategies: You will be responsible for developing strategies that incorporate AI Ops and SRE practices within the data center and cloud domain. This involves understanding business requirements, assessing technical capabilities, and identifying opportunities to leverage AI and automation for improved reliability and performance.
* Designing Cloud Architecture Solutions: You will design cloud and on-premise architecture solutions that integrate AI technologies and SRE principles. This includes designing scalable and resilient systems, implementing monitoring and alerting mechanisms, and ensuring high availability and fault tolerance.
* Collaborating with Development and Operations Teams: You will work closely with development and operations teams to provide technical guidance and ensure the successful implementation of AI Ops and SRE practices. This involves reviewing designs, providing recommendations, and promoting best practices for building and operating reliable and efficient cloud-based applications.
* Implementing AI-Driven Monitoring and Analytics: You will implement AI-driven monitoring and analytics solutions within the cloud domain. This includes leveraging machine learning and data analysis techniques to identify and predict system anomalies, performance bottlenecks, and potential failures.
* Establishing Incident Response and Resolution Processes: You will define and establish incident response and resolution processes aligned with SRE practices. This includes setting up incident management frameworks, defining escalation paths, and implementing effective incident response strategies to minimize downtime and ensure quick resolution.
* Driving Continuous Improvement and Optimization: You will drive continuous improvement and optimization efforts within the cloud domain. This involves analyzing system metrics, conducting root cause analysis, and implementing changes to optimize cloud performance, reliability, and efficiency. Automation and self-healing mechanisms will be employed to enhance system resilience and reduce manual intervention.
* Staying Current with Industry Trends: It is crucial to stay updated with the latest industry trends, technologies, and best practices related to AI Ops, SRE, cloud, and on-premises computing. This includes attending conferences, participating in relevant communities, and continuously learning and exploring new tools and techniques to enhance the organization's AI Ops and SRE capabilities within the cloud and on-premises domains.
* Creating and delivering traceable and auditable customer success metrics for the platform services/products.
* Monitoring and analyzing platform performance metrics and reporting on the overall health of the platform to senior leadership.
* Managing the infrastructure platform within budget guardrails to ensure alignment with company priorities and goals.
* Collaborating with Transversal Teams to align Non-Functional Requirements (NFRs) and prioritize them jointly.

Requirements

* Bachelor's degree in a relevant discipline, or an equivalent combination of education, training, and experience.
* 7-10 years of related experience.
* Foster one-team culture with ownership, collaboration, and empathy across functions.
* 5 or more years of people management experience with relevant industry and professional certifications.
* Manage risks and communicate project status, issues, and risks clearly and timely to stakeholders.
* Collaborate with colleagues and suppliers in different time zones and communicate effectively with both technical and business people.
* 3-5 years of experience with cloud platforms such as Azure (preferred), Amazon Web Services (AWS), or Google Cloud Platform (GCP), which is essential for managing and optimizing cloud-based infrastructure.
* Containerization and Orchestration: Proficiency in containerization technologies like Docker and container orchestration platforms like Kubernetes is important for deploying and managing containerized applications at scale.
* Infrastructure-as-Code (IaC): Knowledge of infrastructure-as-code tools such as Terraform or AWS CloudFormation is valuable for automating the provisioning and management of infrastructure resources.
* Monitoring and Observability: Familiarity with monitoring and observability tools like Prometheus, Grafana, ServiceNow, ELK Stack (Elasticsearch, Logstash, Kibana), or Splunk is crucial for monitoring system performance, analyzing logs, and troubleshooting issues.
* Continuous Integration and Continuous Deployment (CI/CD): Experience with CI/CD pipelines and related tools such as GitHub or GitLab CI/CD.
* Configuration Management: Knowledge of configuration management tools like Ansible, Puppet, or Chef is valuable for managing and automating configuration changes across infrastructure and application environments.
* Proficiency in incident management tools like ServiceNow, PagerDuty, or VictorOps, as well as collaboration platforms like Slack or Microsoft Teams, is essential for effective incident response and coordination.
* Understanding of networking concepts, protocols, and security best practices is important for managing network infrastructure, implementing secure access controls, and ensuring system and data protection.
* Scripting and Programming Languages: Familiarity with scripting languages like Python, Bash, or PowerShell, as well as programming languages like Java, Go, or Ruby, enables automation and customization of various tasks and workflows.
* Database Technologies: Knowledge of database technologies such as MySQL, PostgreSQL, MongoDB, or Redis is valuable for managing and optimizing database systems and ensuring data integrity and availability.

Your Rewards

Rewarding work and a collaborative, team-oriented culture are just the beginning. Review our digital benefit guide at ngbenefitslivebrighter.com for full details and descriptions.

Salary

New England: $179k - $211k a year

Downstate NY: $192k - $226k a year

Upstate NY: $160k - $188k a year

This position has a career path that provides advancement opportunities within and across bands as you develop and evolve in the position, gaining experience and expertise and acquiring and applying technical skills. Candidates will be assessed and provided offers against the minimum qualifications of this role and their individual experience.

National Grid is an equal opportunity employer that values a broad diversity of talent, knowledge, experience and expertise. We foster a culture of inclusion that drives employee engagement to deliver superior performance to the communities we serve. National Grid is proud to be an affirmative action employer. We encourage minorities, women, individuals with disabilities and protected veterans to join the National Grid team.
