Sr Site Reliability Engineer

2 weeks ago


Atlanta, United States STORD Full time

Stord is the leading commerce enablement provider of fulfillment services and technology that powers seamless checkout and delivery experiences for high-volume mid-market and enterprise brands across all channels. Stord manages over $5 billion of commerce annually through its fulfillment, warehousing, transportation, and operator-built software suite including OMS, Pre- and Post-Purchase, and WMS platforms.

With Stord, brands can sell more, save money, and reduce headaches. With Stord, brands can increase cart conversion, improve unit economics, and drive customer loyalty. Stord's end-to-end commerce solutions combine best-in-class omnichannel fulfillment and shipping with leading technology to ensure fast shipping, reliable delivery promises, easy access to more channels, and improved margins on every order.

Hundreds of leading DTC and B2B companies like AG1, Native, Tula, American Giant, and more trust Stord to make their supply chains a competitive advantage. Stord is headquartered in Atlanta with facilities across the United States, Canada, and Europe. Stord is backed by top-tier investors including Kleiner Perkins, Franklin Templeton, Founders Fund, and Salesforce Ventures.

Join us to help empower commerce brands with the best end-to-end customer and delivery experience.

About the SRE Position:

Stord is looking for a mission-driven Senior SRE to be a driving force behind an exceptionally resilient, efficient, and secure infrastructure and platform. You will be relied upon to deliver a catalog of high-quality, world-class products and services to our customers at scale. We aim to establish a dynamic operational environment that seamlessly integrates cutting-edge technologies, embraces automation, encourages a high degree of ownership, and fosters a culture of continuous improvement.

The SRE team is committed to accelerating development, enabling continuous delivery, enhancing security, and ensuring operational excellence. This role is integral to designing and implementing the infrastructure and developer tooling that will enable Stord to scale its systems and processes and to improve reliability and availability efficiently.

SRE operates cross-functionally, collaborating with product management, software developers, data science, and other operations teams. At Stord, each member of the team can impact all aspects of the development process, from ideation and design through delivery, maintenance, and operations.

What You'll Do:

  • Collaborate with cross-functional teams to design and implement CI/CD pipelines that automate fast and safe delivery of software to our customers, enable experimentation, create fast feedback loops and developer self-service capabilities.
  • Lead efforts in automating deployment, monitoring, and infrastructure management.
  • Proactively identify and resolve performance bottlenecks, system failures, and security vulnerabilities.
  • Minimize or eliminate degradations and failures related to fault tolerance, security, availability, and performance.
  • Develop SLOs and SLIs to manage risk through continuous monitoring and measurement of system performance.
  • Build, manage, and deploy highly available, self-healing, customer-facing production infrastructure and applications (microservice and event-based architectures) using Docker, Kubernetes, Helm, and Terraform.
  • Leverage 12 Factor App methodology when building and deploying all our services and systems.
  • Implement best practice infrastructure as code (IaC) principles for configuration management and deployment of infrastructure.
  • Enhance operational efficiency by identifying repetitive tasks and developing automation to eliminate toil work.
  • Implement robust metrics, monitoring and alerting for proactive issue identification and resolution.
  • Participate in incident response, on-call rotation and post-incident reviews to ensure 24/7 availability of critical systems and to learn from failures and continuously improve system reliability.
  • Implement and enforce security best practices for infrastructure and applications.
  • Collaborate with security teams to ensure compliance with industry standards and regulations.
  • Empower others by sharing knowledge through documentation, training, and mentorship.
What You'll Need:
  • Proven experience as a Senior DevOps Engineer or Senior Site Reliability Engineer.
  • Strong expertise in cloud platforms such as AWS, GCP or Azure.
  • Strong experience with CI/CD tools (GitHub Actions, GitLab CI, CircleCI) and version control systems (Git).
  • Proficiency with infrastructure-as-code tools (e.g., Terraform, Ansible, CloudFormation).
  • Hands-on experience with containerization and container orchestration tools like Docker and Kubernetes.
  • Solid understanding of networking, security, and system engineering.
  • Experience with monitoring and logging tools (e.g., Datadog, Prometheus, Grafana, ELK stack).
  • Strong scripting skills in languages such as Python, Shell or similar.
  • Familiarity with security best practices and compliance requirements.
  • Excellent problem-solving and troubleshooting skills.
  • Ability to work collaboratively in a fast-paced, agile environment.
  • Passion for building the highest-quality solutions for the long term that delight the customer (both internal and external customers).
  • Automation-first mindset.
  • High degree of ownership and pride for work.
Bonus Points:
  • Industry certifications (AWS, GCP, Linux Foundation: CKA, CKS, CKAD)
  • Bachelor's or higher degree in Computer Science, Information Technology, or a related field.
  • Previous startup experience
  • Previous logistics or supply chain experience

Culture Snapshot:

Our team is passionate about sitting at the intersection of enterprise technology and global logistics. The Stord company culture is electric, and we are proud to offer a career experience that will make you excited to come to work every day. We are creating an environment of continuous improvement through collaboration and diverse thinking by solving challenging problems and working with talented and smart colleagues. At Stord you will have daily opportunities to learn and inspire those around you. You will be surrounded by a team of self-starters who are motivated to have an impact through driving results.

Below are a few perks of joining our team:
  • Competitive salary and bonus
  • Friendly, Passionate, and Intelligent Employee Base
  • Creative Problem Solving and Entrepreneurial Thinking
  • Fast-Paced Environment
  • Low-Ego, Solution-Driven Culture
  • Community Involvement and Volunteer Opportunities
  • Employee Resource Groups: Women of Stord, JEDI (Justice, Equity, Diversity, & Inclusion), Stord-Serves, & More
Benefits:
  • 401(k)
  • Medical, Dental, and Vision Insurance
  • Life and Disability Insurance
  • Health Savings Account (HSA) option
  • Employee Assistance Program (EAP) - Mental Health Resources
  • Paid Parental Leave
  • Gym Stipend
  • Paid Time Off
  • Paid holidays
  • And more

We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status. Stord participates in E-Verify and will provide the federal government with your Form I-9 information to confirm that you are authorized to work in the U.S.
