Site Reliability Engineer

6 days ago


Austin, United States Unreal Gigs Full time
Job Description

Introduction:

Are you a systems expert who thrives on maintaining high availability, scalability, and performance in complex, distributed environments? Do you enjoy solving infrastructure challenges and automating everything in sight? If you're passionate about building resilient systems and ensuring 24/7 uptime, then our client has the perfect role for you. We’re looking for a Site Reliability Engineer (SRE) (aka The Uptime Guardian) to drive system reliability, automate operations, and ensure our services stay available even under pressure.

As a Site Reliability Engineer with our client, you’ll focus on building and maintaining highly reliable, scalable infrastructure that supports our products and services. You’ll be responsible for ensuring that our systems are optimized, automated, and robust enough to handle the demands of modern applications. This role blends software engineering, operations, and problem-solving, making it perfect for someone who enjoys working across multiple areas of the tech stack.

Key Responsibilities:

  • System Monitoring and Incident Management:
    • Set up and manage monitoring, logging, and alerting systems using tools like Prometheus, Grafana, or ELK Stack. You’ll proactively identify and resolve issues before they impact users and be responsible for managing incidents when they arise.
  • Automation and Infrastructure as Code (IaC):
    • Automate everything. From infrastructure provisioning to deployments and scaling, you’ll use tools like Terraform, Ansible, or Puppet to manage infrastructure as code. You’ll ensure that systems are built to scale and adapt automatically to load.
  • High Availability and Performance Optimization:
    • Ensure services and applications are always available and optimized for performance. You’ll design and implement strategies to improve uptime, reduce latency, and scale services efficiently, using techniques such as load balancing, failover systems, and clustering.
  • Disaster Recovery and Backup Solutions:
    • Design, implement, and test disaster recovery strategies and backup solutions. You’ll ensure that systems and data are recoverable in the event of an outage or failure, minimizing downtime and impact on users.
  • Collaboration with Development and DevOps Teams:
    • Work closely with developers and DevOps engineers to ensure that new features are reliable and scalable. You’ll collaborate to implement reliability engineering practices such as service level indicators (SLIs) and service level objectives (SLOs) and enforce best practices for system reliability.
  • On-Call Responsibilities and Incident Response:
    • Participate in on-call rotations to respond to incidents, troubleshoot problems, and bring systems back to normal operation. You’ll ensure smooth communication during outages and post-mortems to improve future reliability.
  • Capacity Planning and Scalability:
    • Perform capacity planning to ensure systems can handle traffic increases and growth. You’ll predict future demand and ensure that infrastructure scales smoothly to accommodate it.

Requirements

Required Skills:

  • System Reliability and Automation Expertise: Experience with building and maintaining highly reliable systems and automating infrastructure management using tools like Terraform, Ansible, or Puppet. You’re skilled at optimizing systems for uptime and performance.
  • Monitoring and Incident Management: Proficiency in setting up and managing monitoring, logging, and alerting systems like Prometheus, Grafana, or ELK Stack. You have experience with incident management and problem resolution.
  • Cloud Infrastructure Management: Hands-on experience managing cloud infrastructure on platforms such as AWS, GCP, or Azure. You’re skilled at deploying and maintaining scalable systems in the cloud.
  • Performance Optimization: Expertise in optimizing systems for low latency, high throughput, and minimal downtime. You understand load balancing, caching strategies, and database performance optimization.
  • Security and Compliance: Understanding of security best practices, encryption, and compliance frameworks such as SOC 2 or GDPR. You ensure that systems are secure while maintaining reliability.

Educational Requirements:

  • Bachelor’s degree in Computer Science, Systems Engineering, or a related field. Equivalent experience in site reliability engineering, systems administration, or DevOps is also valued.
  • Certifications such as AWS Certified Solutions Architect, Certified Kubernetes Administrator (CKA), or SRE Practitioner are a plus.

Experience Requirements:

  • 3+ years of experience in site reliability engineering or a similar role, with a focus on system automation, performance optimization, and cloud infrastructure management.
  • Proven experience managing large-scale, distributed systems with a focus on maintaining uptime, monitoring, and incident resolution.
  • Hands-on experience with containerization (Docker) and orchestration (Kubernetes) in a production environment.

Benefits

  • Health and Wellness: Comprehensive medical, dental, and vision insurance plans with low co-pays and premiums.
  • Paid Time Off: Competitive vacation, sick leave, and 20 paid holidays per year.
  • Work-Life Balance: Flexible work schedules and telecommuting options.
  • Professional Development: Opportunities for training, certification reimbursement, and career advancement programs.
  • Wellness Programs: Access to wellness programs, including gym memberships, health screenings, and mental health resources.
  • Life and Disability Insurance: Life insurance and short-term/long-term disability coverage.
  • Employee Assistance Program (EAP): Confidential counseling and support services for personal and professional challenges.
  • Tuition Reimbursement: Financial assistance for continuing education and professional development.
  • Community Engagement: Opportunities to participate in community service and volunteer activities.
  • Recognition Programs: Employee recognition programs to celebrate achievements and milestones.


  • Austin, Texas, United States Apex Systems Full time

    Job Description. Position: Site Reliability Engineer. Location: Remote. Duration: 1 year. Rate: $67/hr W-2. We are seeking a highly skilled Site Reliability Engineer to join our team at Apex Systems. As a Site Reliability Engineer, you will play a critical role in ensuring the reliability, scalability, and performance of our cloud-based infrastructure. Key...


  • Austin, Texas, United States Cape Henry Associates, Acquired by JANUS Research Group Full time

    Janus is looking for a seasoned Site Reliability Engineer / DevSecOps Developer to help grow our capability with our DoD clients. Develop Infrastructure as Code (IaC): designing, implementing, and maintaining infrastructure using IaC technologies (e.g. Terraform or similar), ensuring scalable, reliable, and efficient platforms. Collaborate with data and other...


  • Austin, United States JobRialto Full time

    Skills: 6+ years of experience in systems and platform operations and technology; experience with on-prem and public cloud (AWS, EKS); scripting languages like Python; Linux Administration and Cloud, DevOps experience would be a plus. Team: As a member of the Site Reliability Engineering & Production Services team, you will work with other technology...


  • Austin, Texas, United States Apple Full time

    About the Role: We are seeking a highly skilled Site Reliability Engineer to join our team at Apple. As a Site Reliability Engineer, you will play a critical role in ensuring the reliability, scalability, and performance of our systems and services. Key Responsibilities: Design, build, and maintain robust infrastructure and automation solutions. Work closely with...


  • Austin, Texas, United States Thales Full time

    About the Role: Thales is seeking an experienced Site Reliability Engineer to join our team. As a Site Reliability Engineer, you will be responsible for ensuring the reliability, performance, and security of our cloud-based services. Key Responsibilities: Collaborate with project managers and service delivery managers to analyze traffic trends and capacity...


  • Austin, Texas, United States Expedia Group Full time

    Principal Site Reliability Engineer: We are looking for a highly qualified and seasoned Principal Site Reliability Engineer (SRE) to enhance our operations. The successful candidate will play a crucial role in guaranteeing the stability, scalability, and efficiency of our systems and services. You will collaborate closely with both development and operational...

  • Software Engineer

    5 days ago


    Austin, United States Apple Full time

    Carrier Services offer seamless integration of Apple Retail Stores and Apple Online store with major US Carriers for iPhone activations. We are looking for a talented Site Reliability Engineer to join our growing team. As an SRE, you will be responsi...


  • Austin, Texas, United States Apple Full time

    About the Role: We are seeking a highly skilled Site Reliability Engineering Manager to join our team at Apple. As a Site Reliability Engineering Manager, you will be responsible for leading a team that provides the platform for mission-critical cloud systems to maintain constant uptime, scale seamlessly, and allow for new applications and services to...


  • Austin, United States Visa Full time

    Company Description Visa is a world leader in payments and technology, with over 259 billion payments transactions flowing safely between consumers, merchants, financial institutions, and government entities in more than 200 countries and territories each year. Our mission is to connect the world through the most innovative, convenient, reliable, and secure...


  • Austin, Texas, United States Expedia Group Full time

    Principal Software Development Engineer - Site Reliability: We are looking for a highly proficient and seasoned Principal Software Development Engineer (SRE) to enhance our team. The successful candidate will be accountable for maintaining the reliability, scalability, and performance of our systems and services. You will collaborate closely with both...


  • Austin, United States Thales USA, Inc. Full time

    Location: Austin, United States of America. Thales people architect identity management and data protection solutions at the heart of digital security. Business and governments rely on us to bring trust to the billions of digital interactions they hav...


  • Austin, United States Computer Futures Full time

    Position Summary: We are seeking a highly skilled and experienced Site Reliability Engineer (SRE) to join our client in Austin. The ideal candidate will have a strong background in infrastructure as code (IaC), automation, container orchestration, and monitoring solutions. As an SRE, you will play a critical role in ensuring the reliability, scalability, and...


  • Austin, Texas, United States Apple Full time

    About the Role: We are seeking a highly skilled Site Reliability Engineering Manager to join our Apple Service Engineering team. As a key member of our team, you will be responsible for establishing and maintaining the reliability and scalability of our cloud services. Key Responsibilities: Lead a team of engineers in providing a platform for mission-critical...


  • Austin, Texas, United States NinjaOne Full time

    About the Role: At NinjaOne we are passionate about building unified IT solutions that simplify the way IT organizations work. We are currently looking for a Site Reliability Engineering Manager to join our Platform Engineering team and help us scale our products to millions of end-users. You will have the opportunity to build the SRE team from the ground up...


  • Austin, United States Terminal Industries Full time

    About Us Terminal builds software that digitizes, indexes, and automates the yard, leveraging best-in-class machine learning. Our platform provides warehouse operators with the intelligence needed to optimize their usage of trucks, trailers, chassis, containers and personnel. These are the fundamental operating assets of commerce - and represent the last...


  • Austin, Texas, United States Expedia Group Full time

    Principal Software Development Engineer - Site Reliability: We are in search of a highly qualified and seasoned Principal Software Development Engineer (SRE) to enhance our operations. The ideal candidate will be tasked with ensuring the dependability, scalability, and efficiency of our services and systems. You will collaborate closely with both development...


  • Austin, TX, United States Visa Full time

    Company Description Visa is a world leader in payments and technology, with over 259 billion payments transactions flowing safely between consumers, merchants, financial institutions, and government entities in more than 200 countries and territories each year. Our mission is to connect the world through the most innovative, convenient, reliable, and secure...