Site Reliability Engineer, VP

3 weeks ago


Dallas, Texas, United States | The Goldman Sachs Group | Full time
Job Summary

As a Site Reliability Engineer, VP at The Goldman Sachs Group, you will be responsible for ensuring the reliability and scalability of our Procmon platform. This platform is a highly scalable and reliable ecosystem for scheduling business-critical jobs across the firm.

Key Responsibilities
  • Own technical operations for systems that manage hundreds of thousands of compute cores
  • Build observability into new deployments to ensure robustness from day one, and into mature deployments to identify and implement improvements
  • Troubleshoot and resolve issues with block devices, file descriptors, and packet loss
  • Lead real-time outage investigations and present postmortems to senior management
  • Define SLIs and SLOs and partner with development teams to ensure systems are sufficiently well designed and instrumented
  • Partner with our development team throughout development and operations
  • Plan and manage deployments and migrations (including end-of-life programs)
  • Plan and implement robust business continuity and security programs
  • Provide regional coverage for the Procmon platform and participate in on-call support

Requirements
  • 5+ years of relevant professional experience
  • 3+ years of experience with Linux fundamentals and system administration
  • 3+ years of networking experience (familiarity with TCP/IP, IP routing, firewalls, secure tunneling protocols)
  • 3+ years of experience working with distributed computing systems and cloud computing environments
  • Excellent problem-solving and automation skills
  • Proficiency in at least one programming language; the team uses a mix of Go, Python and Erlang
  • Able to operate effectively in a mission-critical, highly regulated financial services environment

About The Goldman Sachs Group

At The Goldman Sachs Group, we commit our people, capital and ideas to help our clients, shareholders and the communities we serve to grow.

We believe who you are makes you better at what you do. We're committed to fostering and advancing diversity and inclusion in our own workplace and beyond by ensuring every individual within our firm has a number of opportunities to grow professionally and personally, from our training and development opportunities and firmwide networks to benefits, wellness and personal finance offerings and mindfulness programs.



