Site Reliability Engineer

2 weeks ago

Seattle, WA, United States Kaav Inc. Full time

Who we are

We are a yoga-inspired technical apparel company up to big things. The practice and philosophy of yoga informs our overall purpose to elevate the world through the power of practice. We are proud to be a growing global company with locations all around the world, from Vancouver to Shanghai, and places in between. We owe our success to our innovative product, our emphasis on our stores, our commitment to our people, and the incredible connections we get to make in every community we are in.

About this team

Site Reliability Engineering

We are looking for a motivated engineer to join the Foundations team, which is responsible for observability and monitoring in Site Reliability Engineering, guiding the digital organization to improve the practice of reliability here at lululemon. We are a consultative enablement team providing guidance and support to product engineering teams for the development of high-quality and resilient software systems through the use of monitoring tools and practices. SRE partners with many product engineering teams across digital and beyond to infuse the concepts and practices of reliability into engineering processes and deliverables. The Foundations team owns the management of our monitoring tools and the best practices for using those tools to provide total visibility into our systems. This role requires a vision and strategy for monitoring and how to manage it across a disparate organization.

As an SRE Engineer, you will be responsible for designing, implementing, and maintaining robust monitoring solutions, creating insightful dashboards, identifying relevant metrics, and driving efficient problem management practices. You will help identify observability maturity opportunities and roadblocks to success for digital teams, and clear those roadblocks. You will partner closely with Product Owners and Scrum Masters to manage scope and strike a balance between support and investment work. You are expected to clearly communicate risks to deliverables to your partners.

A day in the life

Collaborate with cross-functional teams to implement monitoring, logging, and tracing solutions that provide actionable insights and enable efficient troubleshooting and root cause analysis
Identify opportunities for improvement and organize efforts/team members needed to address the areas of improvement
Work closely with technical and business partners to integrate monitoring into their products using Datadog
Build SLO and status dashboards for digital teams on Datadog
Ensure your team delivers expertise and support that helps product teams increase the reliability of our systems
Collaborate with other technical teams to align on best practices and standards
Balance incoming support requests with internal investment work using data to inform your decisions
Identify gaps in our observability tooling and infrastructure and recommend and implement appropriate solutions to enhance our monitoring capabilities
Drive automation efforts to ensure efficient collection, storage, and analysis of observability data, leveraging tools and technologies such as Datadog, Splunk, and distributed tracing frameworks
Participate in incident response, post-incident root cause analysis, and problem management activities, providing expertise and recommendations to improve system reliability and prevent future incidents
Stay updated with the latest industry trends and advancements in observability and SRE practices, and drive the adoption of new tools and methodologies to enhance our observability capabilities
Mentor and guide junior team members, sharing your knowledge and expertise to foster a culture of learning and continuous improvement within the SRE Observability and Foundations team

Qualifications

Bachelor's degree in computer science/engineering or equivalent
5-8+ years of experience in software engineering or SRE roles, with a specific focus on observability
Familiarity with logging and monitoring solutions, log aggregation platforms, and distributed tracing frameworks
Experience in formulating and applying Service Level Objectives (SLOs)
Strong analytical and problem-solving skills, with a focus on root cause analysis and troubleshooting complex issues
Excellent collaboration and communication skills, with the ability to work effectively in cross-functional teams
Proven experience in driving automation initiatives and improving system reliability through observability practices
Relevant certifications such as Terraform Associate Certification and Certified Kubernetes Administrator

Bonus

Expertise in monitoring tools such as Datadog and Splunk
E-commerce experience preferred
Product ownership experience

Must haves

Acknowledges the presence of choice in every moment and takes personal responsibility for their life.
Possesses an entrepreneurial spirit and continuously innovates to achieve great results.
Communicates with honesty and kindness, and creates the space for others to do the same.
Leads with courage, knowing the possibility of greatness is bigger than the fear of failure.
Fosters connection by putting people first and building trusting relationships.
Integrates fun and joy as a way of being and working, aka doesn't take themselves too seriously.

Required Skills : Cloud
Additional Skills : Network Engineer

Site Reliability Engineer

4 days ago

Seattle, WA, United States Apple Full time

Role Number: 200635067-3337 Summary The Apple Service Engineering - SRE team is looking for Site Reliability Engineers with experience in developing processes, tools, and automation for managing distributed systems in production environments. Our SRE team combines software and systems engineering and system administration practices to build and run...
Site Reliability Engineer, Python

2 days ago

Seattle, WA, United States Next Step Systems LTD Full time

Site Reliability Engineer, Python, Seattle, WA There are 5 openings available for the Site Reliability Engineer position. These will be onsite opportunities in either Los Angeles, CA; New York City, NY; or Seattle, WA. Responsibilities: - Manage cloud infrastructure, provide resource allocation, system upgrades, user access control, etc. - Perform deep...
Senior Site Reliability Engineer

1 week ago

Seattle, WA, United States Dat Services Inc Full time

About DAT DAT is an award-winning employer of choice and a next-generation SaaS technology company that has been at the leading edge of innovation in transportation supply chain logistics for 45 years. We continue to transform the industry year over year, by deploying a suite of software solutions to millions of customers every day - customers who depend on...
Senior Site Reliability Engineer

2 weeks ago

Seattle, WA, United States Zillow Group Full time

About the team The SRE team at Zillow Group empowers product teams to efficiently run "Zillow 2.0" services by reducing human error, focusing on automation, and providing deep insight into application behavior and health. By applying software engineering principles to infrastructure and operations, the team creates and manages scalable, reliable distributed...
Site Reliability Engineer-Remote

2 weeks ago

Seattle, WA, United States Georgia IT Inc Full time

Site Reliability Engineer Location - Remote - must be willing to work PST - High preference for someone local to Seattle Duration - 12 months Rate: DOE US Citizens and Green cards & GC-EAD Only. No Third-party C2C available for this job 8-10+ years of Site Reliability / DevOps Engineering Experienced with PowerShell Scripting. Should have extensive...