Staff Site Reliability Engineer

1 week ago

remote us Crisis Text Line Full time

Crisis Text Line provides free, 24/7, high-quality text-based mental health support and crisis intervention by empowering a community of trained volunteers to support people in their moments of need.

Our mission is at the intersection of empathy and innovation: we promote mental well-being for people wherever they are.

Our vision is an empathetic world where nobody feels alone. Our core values are at the heart of all we do: connect with empathy, center equity, get it done together, and reflect and evolve.

Why you should join our team:
- Our work is transforming the way people in pain access support at their fingertips
- Our work is innovative in the crisis response space
- Our dynamic, fun, and diverse culture
- Our meaningful cause, led by empathy and innovation
- Our strong values at the center of all we do
- Our commitment to diversity, equity and inclusion
- Our commitment to engagement and belonging
- Our commitment to our employees and their holistic wellbeing
- Our value of work/life balance
- Our growth mindset and prioritization of professional development
- Our leaders who truly care

What you'll be doing:

Job Summary: As a Staff Site Reliability Engineer, you will play a crucial role in ensuring the reliability, scalability, and security of our platform, helping to architect, build, scale, and maintain the tooling that supports our software engineering teams and the infrastructure that enables our staff and volunteers to provide the Crisis Text Line service. You will assist with technical leadership within the SRE team and work closely with developers to optimize performance, implement best practices, and maintain a secure environment. Much of your time will be focused on improving the productivity of other engineers. The team you help lead and the services you develop make it possible for our organization to pursue its mission and support our texters, volunteers, and staff.
This position requires a strong background in infrastructure management and Site Reliability Engineering practices.

Responsibilities:

Function 1 - Infrastructure Management
- Develop and maintain monitoring, logging, and alerting systems to ensure system health and performance.
- Spread knowledge, provide mentorship, and promote technical best practices.
- Write and review high-quality, easy-to-read, and testable code that follows best practices.

Function 2 - Automation and Incident Response
- Design, implement, and maintain our highly available and scalable AWS infrastructure that powers our service.
- Automate repetitive tasks and processes to improve efficiency and reduce manual intervention.
- Assist in the operation of your team by providing engineering input and estimating work timelines during scoping meetings.
- Conduct regular security audits and vulnerability assessments, addressing any identified issues.

Function 3 - Communication & Collaboration
- Assist in leading and mentoring a team of SREs, fostering a collaborative and innovative work environment.
- Implement and enforce security best practices across the infrastructure and development processes.
- Respond to and resolve incidents, minimizing downtime and ensuring quick recovery.
- Support and encourage a diversity of backgrounds, voices, and perspectives on the engineering team.

Qualifications:
- Bachelor's degree in Computer Science, Engineering, or related field (Master's degree preferred).
- Proven experience as a Staff SRE or in a similar SRE role, with a strong focus on infrastructure and DevOps.
- Proficiency in cloud platforms (e.g., AWS, GCP, Azure) and infrastructure as code (e.g., Terraform, CloudFormation).
- Strong scripting and automation skills (e.g., Python, Bash, Ansible) and in-depth knowledge of containerization and orchestration (e.g., Docker, Kubernetes).
- Proven experience in implementing CI/CD pipelines and tools (e.g., Jenkins, GitLab CI, GitHub Actions) and observability tools (e.g., Datadog, Prometheus, Grafana, ELK stack).
- A commitment to ethical practices, data privacy, and security.

Preferred Qualifications:
- Master's degree in Computer Science, Engineering, or a related field, or equivalent experience.
- Experience implementing Failure Injection / Chaos Engineering practices.
- Cloud Solution Architect certifications or completed training (e.g., AWS Cloud Practitioner Essentials and/or AWS Certified Solutions Architect - Associate), GCP, or Azure.
- Strong experience with AWS Solution Architecture across Next.js, Go, PHP APIs, GraphQL, Databricks, and AI/ML workloads.
- Knowledge of compliance and regulatory standards (e.g., GDPR, HIPAA, ISO 27001, SOC2, etc.).
- Experience in a non-profit or mission-driven organization.

Reliable High-Speed Internet Required: Must have a stable high-speed internet connection to support seamless remote collaboration, virtual meetings, online job tasks, etc.

The full salary range for this position, across all United States geographies, is $126,000-$162,500 per year. The upper portion of the salary range is typically reserved for existing employees who demonstrate strong performance over time. Starting salary will vary by location, qualifications, and prior experience; during the interview process, candidates will learn the starting salary range applicable for their location.
We pay competitively in the tech-forward nonprofit space and offer a robust benefits package.

Only candidates in the following states will be eligible for employment: CA, CO, CT, FL, GA, HI, IL, IN, IA, MD, MA, MI, MO, NJ, NM, NY, NC, OH, PA, TN, TX, UT, VA, WA.

Benefits:

Crisis Text Line employee benefits are thoughtfully designed using an equity lens, acknowledging that we are all unique human beings with individual life circumstances that require flexibility and support. Benefits include:

- 20 paid holidays, including:
  - Federal holidays like Juneteenth and Labor Day
  - Election day
  - Holiday break from Dec 24 through January 1
  - 2 renewal days
  - 2 floating holidays
- Flexible paid time off, including:
  - 15 vacation days
  - 3 personal days
  - 7 sick days
- Medical, dental, and vision benefits for the staff member and family at no cost to the employee
- 403B retirement plan (the nonprofit equivalent of a 401K): 3% contribution by Crisis Text Line to support building financial wellness, regardless of personal contribution
- 12 weeks paid parental leave (after 6 months of employment)
- Student loan repayment (after 2 years of continuous full time service)
- Family support through a virtual childcare platform
- Stipends/Allowances:
  - Mental health (Monthly)
  - Internet Service (Monthly)
  - Professional Development (Annual)
  - Wellness (Annual)
  - Home office setup (One time/First year)

(Benefits are only for US-based employees. International benefits may differ).

Site Reliability Engineer

1 day ago

remote, us Epam Full time

Join our dynamic team as a Site Reliability Engineer and lead the way in optimizing and automating our Linux-based infrastructure. With 3 to 5 years of experience in Site Reliability Engineering, DevOps, or Infrastructure, you will play a crucial role in elevating our capabilities and ensuring high-impact, internet-facing production...
Senior Site Reliability Engineer

7 days ago

remote, us Epam Full time

Are you a seasoned professional with a passion for site reliability engineering and a knack for leading strategic initiatives? Join our dynamic team at EPAM, a leading global provider of digital platform engineering and software development services. We are seeking a Senior Site Reliability Engineer who can make a significant impact...
Azure DevOps Site Reliability Engineer

2 weeks ago

remote, us Epam Full time

Are you a skilled Azure DevOps Site Reliability Engineer with a passion for ensuring business continuity and helping businesses always be near their clients? Do you have experience in optimizing and supporting OSDU deployment, performing monitoring including incident resolution, and suggesting improvements? If so, we have an exciting...
Site Reliability Engineer

1 week ago

remote, us Epam Full time

We are seeking a Site Reliability Engineer (Azure) to join our team. Responsibilities: As a Lead Azure SRE, you will be responsible for driving the reliability, performance, and scalability of cloud-based applications and services. Your expertise in Kubernetes, scripting, troubleshooting, and observability will be instrumental in...
Senior Site Reliability Engineer

1 week ago

remote, us Epam Full time

Join EPAM as a Senior Site Reliability Engineer specializing in AWS! In this role, you'll ensure fleet services reliability and availability under the SRE model. If you have a good track record of highly scalable, distributed systems projects and previous experience working as an SRE, we'd love to hear from you. EPAM is a leading...
Site Reliability Engineer

2 days ago

Remote, Oregon, United States ADT Full time $200,000 - $250,000 per year

ADT is transitioning to an in-office model. New team members will work from home but should plan to return to an in-office model at a later date. We will keep you well informed and supported throughout the transition. Summary: We are seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our team. As an SRE, you will be responsible for...
AWS - Site Reliability Engineer

4 days ago

remote, us Epam Full time

Join EPAM as an AWS SRE. In this role, you'll collaborate with service teams to improve the reliability and efficiency of workloads and services using SRE practices. If you're a senior engineer with a good track record of highly scalable, distributed systems projects in the past 5 years, we'd love to hear from you. EPAM is a leading...
AWS Cloud Site Reliability Engineer

1 week ago

remote, us Epam Full time

Join EPAM as an AWS Cloud Site Reliability Engineer. In this role, you'll transfer security processes, manage authentication technologies, and support the implementation of a Palo Alto firewall. If you have 3+ years of experience with AWS, proficiency in designing and managing data migration processes, and superior communication...
Senior Site Reliability Engineer

2 weeks ago

Remote, United States Webflow Full time

At Webflow, our mission is to bring development superpowers to everyone. Webflow is the leading visual development platform for building powerful websites without writing code. By combining modern web development technologies into one platform, Webflow enables people to build websites visually, saving engineering time, while clean code seamlessly generates...
Senior Site Reliability Engineer II

4 days ago

Remote, Oregon, United States Shutterfly Full time $106,000 - $151,000 per year

At Shutterfly, we make life's experiences unforgettable. We believe there is extraordinary power in the self-expression. That's why our family of brands helps customers create products and capture moments that reflect who they uniquely are. Shutterfly is looking for a Senior Site Reliability Engineer to join our team. Shutterfly is undergoing a comprehensive...
