Site Reliability Engineer

3 months ago


Atlanta, United States Softworld, a Kelly Company Full time

The Cloud Site Reliability Engineer (SRE) works closely with the cloud development team, the IT operations team, and business partners to streamline and implement enhanced monitoring and alerting capability across the infrastructure and application layers. By leveraging automation tools, SREs address and resolve issues, minimizing manual workload and enhancing system scalability and reliability. Their core focus is standardization and automation to build and run fault-tolerant systems. SREs typically have a background in software engineering, systems engineering, or system administration, coupled with substantial IT operations experience. They oversee availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.


  • Write and develop code to automate processes such as analyzing logs, testing production environments, and responding to issues
  • Collaborate with agile teams and business partners to develop specifications that resolve problems and address enhancement needs, with a focus on monitoring and metrics for operational readiness
  • Identify bottlenecks in development and deployment processes and design automation solutions to mitigate them
  • Develop new capabilities for displaying, monitoring, and alerting on key performance indicators by tracking business transactions in real time
  • Maintain and grow knowledge of platform configuration management, monitoring of established metrics, and troubleshooting
  • Provide continuous feedback to development teams on system stability, defect analysis, and system enhancements
  • Design and develop alert escalation and incident response automation
  • Provide production support for cloud service outages and incidents, and work on both tactical and strategic plans for outage prevention
  • Provide feedback on the resiliency and maintainability of solutions to Cloud and App architects
  • Conduct disaster recovery scenario generation and testing
  • Implement sustainable, audit-ready processes that support information technology controls, including deployment execution, access management, audits, incident management, and related requirements



Must-have technical skills:

  • At least 3 years' experience as a site reliability engineer on a cross-functional agile team working in Azure
  • Working knowledge of agile development methodologies (Scrum, sprints, Kanban, etc.) and tools (Azure DevOps, etc.)
  • At least 3 years' hands-on experience using IaC tools (Terraform, GitHub, Ansible, and Packer)
  • Proven experience across testing, integration, source code management, deployment, and containerization
  • Sound problem-solving skills, with the ability to quickly process complex information and present it clearly and simply
  • Experience with cloud technologies and services, including those for Compute, Storage, Databases, and API Management
  • On-premises to cloud migration experience



  • Atlanta, Georgia, United States Allied Reliability Full time

    About the Position: As a Reliability Engineer - Electrical at Allied Reliability, you will play a key role in developing and implementing strategies to improve the reliability and efficiency of our equipment and systems. This position requires a high level of technical expertise, as well as excellent analytical and problem-solving skills. Key...


  • Atlanta, United States Motion Recruitment Full time

    Job Title: Automation Engineer - Cloud and Reliability. Job Responsibilities: Develop scripts to automate processes and reduce toil and failures. Monitor the health of applications, batch processes, and data feeds. Set up monitoring systems and develop dashboards for performance tracking. Lead and triage major incidents, investigating and troubleshooting...


  • Atlanta, Georgia, United States Engle Martin & Associates Full time

    About the Job: Engle Martin & Associates is seeking an experienced Site Reliability Engineer to join our team. As a Site Reliability Engineer, you will be responsible for ensuring the reliability, scalability, and performance of our systems. You will work closely with development teams to design, deploy, and maintain scalable and reliable systems using modern...


  • Atlanta, Georgia, United States Softworld, a Kelly Company Full time

    Job Description: The Cloud Site Reliability Engineer works closely with the cloud development team, the IT operations team, and business partners to streamline and implement enhanced monitoring and alerting capability across the infrastructure and application layers. Responsibilities: Writing and developing code to automate processes, such as analyzing logs, testing...


  • Atlanta, Georgia, United States Resource Informatics Group Inc Full time

    Job Overview: As a Site Reliability Engineer at Resource Informatics Group Inc, you will be part of a team devoted to providing automated solutions and services for Cox Automotive. Your mission will be to measure, evaluate, and plan for visible, reliable application delivery and maintenance. We are looking for engineers who are passionate about...


  • Atlanta, United States Cox Automotive Full time

    Cox Automotive is looking for a Senior Site Reliability Engineer (SRE) to join our Manheim Logistics SRE team. The SRE team is tasked with designing and maintaining AWS infrastructure and deployment pipelines for Manheim Logistics' 15+ development teams. The team has standardized on a Docker-based infrastructure solution and is adding...


  • Atlanta, United States Canonical Full time

    Job Description: Canonical is a leading provider of open source software and operating systems to the global enterprise and technology markets. Our platform, Ubuntu, is very widely used in breakthrough enterprise initiatives such as public cloud, data science, AI, engineering innovation and IoT. Our customers include the world's leading...


  • Atlanta, United States Motion Recruitment Full time

    ONLY W2. Required Skills & Experience: Manage and optimize data streaming and API components in OpenShift on-premises and AWS. Proactively review the application's APIs and processes to identify opportunities to optimize the response times for various application components. Automate various types of testing including data quality checks, automate...


  • Atlanta, United States Disability Solutions Full time

    Position Type: Full time. Type of Hire: Experienced (relevant combo of work and education). Education Desired: Bachelor's Degree. Travel Percentage: 0%. Job Description: Are you curious, motivated, and forward-thinking? At FIS you'll have the opportunity to work on some of the most challenging and relevant issues in financial services and technology. Our...


  • Atlanta, Georgia, United States Disability Solutions Full time

    Job Overview: FIS is a leading provider of disability solutions, and we are seeking a skilled Site Reliability Specialist to join our team. Job Description: We are looking for an experienced professional who can participate in all day-to-day activities of operating the payment infrastructure to maintain high stability, reduce service downtime, and improve quality...


  • Atlanta, United States ACL Digital Full time

    Title: Site Reliability Engineer. Work Location: Atlanta, GA. Duration: 12 months. Site Reliability Engineer (SRE) with AWS Cloud and Application Monitoring Experience. We are seeking a skilled Site Reliability Engineer (SRE) with expertise in AWS cloud infrastructure and robust application monitoring capabilities. As an integral part of our team, you...


  • Atlanta, Georgia, United States Inabia Software & Consulting Inc. Full time

    About the Position: We are looking for a highly skilled Site Reliability Engineer to join our team at Inabia Software & Consulting Inc. As a key member of our engineering team, you will be responsible for designing, building, and operating large-scale distributed systems. Main Responsibilities: Kubernetes Cluster Management: Design, deploy, and manage...


  • Atlanta, Georgia, United States RIT Solutions, Inc. Full time

    Responsibilities and Qualifications: The DevOps Engineer will be responsible for ensuring the reliability, scalability, and performance of our cloud-hosted applications. This role requires a strong understanding of DevOps practices, including CI/CD pipelines and automation scripts. The ideal candidate will have experience working with Kubernetes, AWS EKS, and...


  • Atlanta, Georgia, United States Allied Reliability Full time

    Job Summary: We are seeking a Plant Electrical Engineer to join our maintenance team at Allied Reliability. As an integral part of our operations, you will play a critical role in ensuring the smooth operation of our plant's machinery and equipment. This is an excellent opportunity for a highly skilled engineer with experience in industrial electrical...


  • Atlanta, Georgia, United States RIT Solutions, Inc. Full time

    About Us: RIT Solutions, Inc. is a leading provider of innovative technology solutions. We are committed to delivering high-quality products and services that meet the evolving needs of our customers. Job Description: We are seeking a skilled Mid-Level DevOps Engineer to join our team. As a key member of our engineering team, you will be responsible for...


  • Atlanta, United States Motion Recruitment Partners, LLC Full time

    A leading provider in the world of insurance protection is looking to add a Site Reliability Engineer to their team. Integrating physical and digital risk mitigation solutions to reduce fraud in the insurance sector is key. Day-to-day tasks involve using Azure services, including AKS and Azure DevOps for CI/CD pipelines. Working in the environment on their...