Senior Engineer, Site Reliability Engineering

3 weeks ago


Chicago, United States Balyasny Asset Management Full time

We are looking for a Senior Site Reliability Engineer who can cultivate our SRE philosophy, processes, and technologies from the ground up. As a Senior Site Reliability Engineer within the Platform group, you will lay the groundwork for our SRE infrastructure. Your role will entail driving standards and fostering adoption across our technology teams, whilst closely partnering with our DevOps and Cloud teams. With a hands-on approach, you'll work across both cloud and on-premises hosting platforms, ensuring the reliability and scalability of our trading systems and production environments. This is a chance to play a pivotal role in transforming our operational capabilities and enhancing performance across a wide array of environments and platforms.

As a Site Reliability Engineer at BAM, you will:

  • Develop and promote our SRE philosophy, establishing best practices and processes that will be instrumental in scaling our infrastructure.
  • Create and maintain thorough documentation for SRE processes, systems design, and incident post-mortems to foster a culture of learning and improvement.
  • Drive adoption of SRE principles across various technology teams, acting as a mentor and advisor to embedded SREs.
  • Implement end-to-end observability and monitoring solutions using Prometheus, Grafana, Loki, and AWS CloudWatch, ensuring high visibility into application performance and infrastructure health.
  • Utilize and build standards around Sentry for application monitoring and error tracking to proactively identify and address reliability issues.
  • Review and define standards for application reliability requirements within our Kubernetes environment, ensuring application configuration is optimized for performance, cost, and reliability.
  • Develop automation and tooling to improve the efficiency and reliability of deployment pipelines, system health checks, and recovery procedures.
  • Collaborate with development teams to enhance service stability, scalability, and fault tolerance through SRE best practices like blameless post-mortems and service level objectives (SLOs).
  • Conduct regular reviews of infrastructure and application metrics, logs, and traces to proactively spot and address potential issues before they affect customers.
  • Introduce a reliability-by-default approach to software delivery.

Core Tech Stack:

  • Languages: Python, Java, NodeJS, C#, Shell
  • Public cloud: AWS
  • CI/CD: TeamCity, Octopus, Jenkins
  • Configuration Management: Puppet, Ansible
  • Infrastructure Code: Terraform, CloudFormation
  • Application Management: Kubernetes, Docker, Helm
  • OS: Linux and Windows
  • Observability: Prometheus, Amazon CloudWatch, Sentry, Grafana, Loki

To be considered a good cultural fit, you must be:

  • An ambitious self-starter
  • Hungry to learn
  • Driven towards success
  • A very strong and efficient communicator
  • Able to multi-task and excel in a fast-paced trading environment
  • A problem solver, able to develop quick and sound solutions to complex problems

To be considered a good fit, you must have:

  • 5+ years of experience in SRE or similar roles within complex, distributed systems environments.
  • A Bachelor’s degree in engineering, computer science, information systems, or equivalent experience.
  • Proficiency with key SRE technologies such as Prometheus, Grafana, Loki, AWS CloudWatch, and Sentry.
  • Extensive knowledge of container orchestration using Kubernetes and containerization with Docker.
  • Hands-on experience with both cloud (AWS preferred) and on-premises hosting platforms.
  • Proven ability to script in languages like Python, Bash, or Go to automate routine tasks and deployment pipelines.
  • Strong understanding of CI/CD principles, agile methodologies, and DevOps culture.
  • Excellent troubleshooting and problem-solving skills, with a systematic approach to handling unexpected situations.
  • High level of initiative, passion for reliability engineering, detail orientation, and follow-through capabilities.
  • Exceptional interpersonal and communication skills, with the ability to explain complex technical concepts to a diverse audience.
  • Experience with immutable infrastructure and with infrastructure automation and provisioning tools such as AWS CloudFormation or Terraform.
  • Strong knowledge of Linux administration, particularly RHEL and CentOS.
  • Strong knowledge of distributed systems concepts, including best practices and troubleshooting.
  • Knowledge of Windows Server administration and automation with PowerShell.
  • Operational understanding of networking concepts, architecture, and best practices, especially as they relate to hybrid cloud integration.
  • Analytical skills: ability to troubleshoot, logically assess problems, and determine solutions.
  • Detailed documentation skills: ability to represent ideas, requirements, reference architecture, and problems in clear, concise, and business-friendly documents.

Bonus points for:

  • Experience in a high throughput/low latency environment
  • Experience with successful SRE team build outs
  • Experience with security patterns and distributed authentication
  • Experience managing high-pressure incident response
  • Experience with Chaos Engineering technologies
  • Contributions to open source libraries, projects, or communities
  • Any AWS, Azure, or GCP resource specializations or certifications
  • Any Kubernetes resource specializations or certifications

Don’t have all the skills listed above? Have extra skills you think are important that we haven’t thought of? Please let us know by applying and telling us a bit more about yourself and why you think you’re qualified.




  • Chicago, Illinois, United States Motion Recruitment Full time

    A financial company is looking for senior level Site Reliability Engineers to join their team in troubleshooting applications and managing their Azure environment. This will be a contract-to-hire position that is hybrid 3 days a week in the Chicago area. Expertise in Terraform, YAML, and Azure infrastructure is mandatory. This company is a global leader in...


  • Chicago, United States Deere & Company Full time

    Advanced Options 28 open jobs. Use your resume to get matched with the right job. Senior Platform Engineer (Chicago, Visa Sponsorship available) Reliability Engineer Dubuque, Iowa, United States Reliability Engineer Dubuque, Iowa, United States Senior Software Engineer - DevOps eCommerce (Chicago) SOFTWARE ENGINEER (Chicago, IL or Moline, IL - Hybrid) SAP...


  • Chicago, United States Allied Reliability Full time

    Overview: The Maintenance Reliability Engineer is responsible for implementing machinery and process improvements using management of change best practices while promoting values of a safe, environmentally compliant workplace, and philosophy of continuous improvement with the workforce. Responsibilities: Process Improvements and Operational Upgrading Works...


  • Chicago, United States Allied Reliability Full time

    Overview: The Maintenance Reliability Engineer is responsible for implementing machinery and process improvements using management of change best practices while promoting values of a safe, environmentally compliant workplace, and philosophy of continuous improvement with the workforce. Responsibilities: Process Improvements and Operational Upgrading Works...


  • Chicago, United States Motion Recruitment Partners, LLC Full time

    A financial company is looking for senior level Site Reliability Engineers to join their team in troubleshooting applications and managing their Azure environment. This will be a contract-to-hire position that is hybrid 3 days a week in the Chicago area. Expertise in Terraform, YAML, and Azure infrastructure is mandatory. This company is a global leader in...


  • Chicago, United States Motion Recruitment Full time

    A financial company is looking for senior level Site Reliability Engineers to join their team in troubleshooting applications and managing their Azure environment. This will be a contract-to-hire position that is hybrid 3 days a week in the Chicago area. Expertise in Terraform, YAML, and Azure infrastructure is mandatory. This company is a global leader in...


  • Chicago, United States Motion Recruitment Partners LLC Full time

    A financial company is looking for senior level Site Reliability Engineers to join their team in troubleshooting applications and managing their Azure environment. This will be a contract-to-hire position that is hybrid 3 days a week in the Chicago area. Expertise in Terraform, YAML, and Azure infrastructure is mandatory. This company is a global leader in...


  • Chicago, United States Rackera Inc Full time

    Find the role below. Role: Site Reliability Engineer. Location: Chicago, Illinois. Long-term project. Job Description: 6+ years of application development experience using modern technologies and architecture, including experience collaborating with technology teams. 2+ years of Site Reliability Engineering experience. Good understanding of at least one...


  • Chicago, United States JobRialto Full time

    Top 3 requirements: Ecommerce experience (think Nordstrom, Target, where you purchase a product); Java Spring Boot; Kubernetes. Plusses: Azure Kubernetes preferred. Description: Client is looking for a forward-thinking, energetic Site Reliability Engineering Manager to join our team. Client serves the ecommerce needs of leading and growing grocery retailers...

  • Site Reliability Engineer

    46 minutes ago


    Chicago, United States JobRialto Full time

    Top 3 requirements: Ecommerce experience (think Nordstrom, Target, where you purchase a product); Java Spring Boot; Kubernetes. Plusses: Azure Kubernetes preferred. Description: Client is looking for a forward-thinking, energetic Site Reliability Engineering Manager to join our team. Client serves the ecommerce needs of leading and growing grocery retailers with...


  • Chicago, United States Cleo Full time

    Site Reliability Engineer At Cleo, we make doing business easy! Cleo is an established software company with a start-up feel. We have awesome products, which go hand in hand with our awesome culture! We are devoted to our people and pride ourselves on creating a fun, laid-back, but fast-paced work environment. Not only do we work hard, we play hard. We have...


  • Chicago, United States McDonald's Corporation Full time

    Job Description: This opportunity is part of the DevOps COE in the CPP Delivery office, where our mission is to help our product engineering teams deliver faster with improved quality and reliability. We work cross-functionally with our global product teams and market teams in defining and executing on our automation test strategy, improving our build and deploy...


  • Chicago, United States R2 Global Full time

    Our client, a financial services giant, is looking for a Principal SRE professional to join the team and lead observability efforts throughout a major cloud project and beyond. This role will work 3x's a week in the Downtown Chicago area onsite. Key Responsibilities: Lead and mentor a team of site reliability engineers, fostering a culture of collaboration,...


  • Chicago, United States R2 Global Full time

    Our client, a financial services giant, is looking for a Principal SRE professional to join the team and lead observability efforts throughout a major cloud project and beyond. Take the next step in your career now, scroll down to read the full role description and make your application. This role will work 3x's a week in the Downtown Chicago area onsite. ...


  • Chicago, United States R2 Global Full time

    Our client, a financial services giant, is looking for a Principal SRE professional to join the team and lead observability efforts throughout a major cloud project and beyond. This role will work 3x's a week in the Downtown Chicago area onsite. Key Responsibilities: Lead and mentor a team of site reliability engineers, fostering a culture of collaboration,...


  • Chicago, United States Info Way Solutions Full time

    Site Reliability Engineer in Wealth Management. Chicago (IL) / Tempe (AZ), Onsite. Job Role: This role will be responsible for application observability, maintenance, and support; for proactively identifying and implementing preventive measures; and for evaluating and recommending techniques, practices, or technologies that would enhance business needs. As an SRE...


  • Chicago, United States Oak Street Health Full time

    Role Description: As an Engineer I - Site Reliability Engineer (SRE), you will be responsible for ensuring the reliability, scalability, and performance of our systems and applications. You will work closely with cross-functional teams to implement automation, optimize processes, and enhance observability to maintain high availability and performance of our...


  • Chicago, United States AmericanEagle.com Full time

    Americaneagle.com is a family-owned web design, development, and digital marketing agency with a passionate belief in the power of technology to positively transform business practices. Our focus is on helping customers grow and achieve success in the digital space. We cover a variety of different industries, including eCommerce, associations & nonprofits,...