
Site Reliability Engineer

2 months ago


Seattle, United States CoreWeave Full time

CoreWeave is a specialized cloud provider, delivering a massive scale of GPU compute resources on top of the industry’s fastest and most flexible infrastructure. CoreWeave builds cloud solutions for compute-intensive use cases (VFX and rendering, machine learning and AI, batch processing, and Pixel Streaming) that are up to 35 times faster and 80% less expensive than the large, generalized public clouds. Learn more at www.coreweave.com.

About the role:

The Cloud Operations Team is the heart of CoreWeave’s operational practice.

In this role, you’ll help define and shape how Site Reliability Engineering (SRE) is implemented at CoreWeave.

The Cloud Operations team defines and implements tooling and processes that enable operational best practices and continual improvement across all engineering teams.

As an ‘SRE of SREs,’ you’ll define and implement system and workflow automation so that service owners can rapidly identify and mitigate availability and performance regressions.

Collaborating across engineering, you’ll support service-owning SREs with the ‘picks and shovels’ they need to excel at running their services.

You will work with a team of 8-10 mixed-specialization engineers and have the opportunity to work on the full gamut of rewarding challenges that come with building the AI Cloud in a communicative, supportive, and high-performing environment.

As a member of the Cloud Operations Team you have the opportunity to:

With a customer-first mindset, establish reliability and quality assessment patterns for all CoreWeave systems.

Improve the performance, security, reliability, and scalability of internally and externally facing services.

Develop dashboards, alerts, automated remediation, and insights into the customer experience using observability tools.

Create and maintain Kubernetes operators, custom controllers, and other tools to intelligently scale our operational capability.

Establish and integrate incident and change management tools and workflows.

Act as Incident Commander for priority incidents and lead postmortems.

Participate in the on-call rotation as needed while we establish and operationalize this new team.

Enable and evangelize reliability engineering across CoreWeave’s engineering teams.

Grow, change, invest in your teammates, be invested-in, share your ideas, listen to others, be curious, have fun, and, above all, be yourself.

Wondering if you’re a good fit?

We believe in investing in our people, and we value candidates who can bring their own diverse experiences to our teams, even if you aren't a 100% skill or experience match.

Here are some qualities we’ve found compatible with our team. If a portion of this resonates with you, we’d love to talk.

You have experience operating services in production and are interested in driving engineering practices such as reliability at scale, testing (load, recovery, system, etc.), progressive deployments, error budgets, observability, and fault-tolerant design.

You have experience automating manual processes and integrating various operations and productivity tools.

You’ve done some Linux shell scripting and/or can navigate a *nix-based operating system (with the right cheat sheet, if required).

You are familiar with debugging and administering Linux and Kubernetes environments.

You’re comfortable with the idea of codifying practices into Kubernetes controllers, operators, and other applications using a modern programming language.

You have experience with incident management for your team or an organization.

You’re comfortable in open source environments.

You’re excited to join a team with diverse perspectives and backgrounds that believes in tackling challenges, growing hand in hand, and winning together.

Why CoreWeave?

At CoreWeave, we work hard, have fun, and move fast.

We’re in an exciting stage of hyper-growth that you will not want to miss out on. We’re not afraid of a little chaos, and we’re constantly learning. Our team cares deeply about how we build our product and how we work together, which is represented through our core values:

Be Curious at your Core

Act like an Owner

Empower Employees

Deliver Best In-Class Client Experience

Achieve More Together

We support and encourage an entrepreneurial outlook and independent thinking. We foster an environment that encourages collaboration and provides the opportunity to develop innovative solutions to complex problems. As we get set for takeoff, the growth opportunities within the organization are constantly expanding. You will be surrounded by some of the best talent in the industry, who will want to learn from you, too. Come join us!

Benefits

We offer a competitive salary and benefits, including:

Medical, dental and vision insurance - 100% paid for the employee

Company paid Life Insurance

Voluntary supplemental life insurance

Short and long-term disability insurance

Flexible Spending Account

Mental Wellness Benefits through Spring Health

Family-Forming support provided by Carrot

Paid Parental Leave

Flexible, full-service childcare support with Kinside

401(k) with a generous employer match

Flexible PTO

Catered lunch each day in our offices

Weekly massages in NJ office

A casual work environment

Work culture focused on innovative disruption

California Consumer Privacy Act - California applicants only

CoreWeave is an equal opportunity employer, committed to our diversity and inclusiveness. We will consider all qualified applicants without regard to race, color, nationality, gender, gender identity or expression, sexual orientation, religion, disability or age.

