Manager, Site Reliability Engineering
2 weeks ago
Department s to ensure smooth integration of applications and systems. Define and enforce Service Level Objectives (SLOs) and Service Level Agreements (SLAs) to ensure system reliability and uptime. Monitor system performance, troubleshoot issues, and ensure timely incident response, root cause analysis, and problem resolution. Implement effective monitoring, logging, and alerting systems to proactively identify and mitigate potential issues. Stay up-to-date with industry trends, emerging technologies, and best practices related to SRE and DevOps, and apply them to improve operational efficiency. Identify potential risks to system reliability and implement strategies to mitigate them. Ensure that all systems and processes comply with relevant regulations, standards, and best practices.
Minimum Qualifications:
Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent practical experience).
Proven experience as a Site Reliability Engineer or similar role, with at least 3-5 years of hands-on experience in managing production systems.
Strong expertise in the listed technologies: Ansible, Concourse CI, Jenkins, GitHub Actions, EKS (Kubernetes), Linux Administration, Terraform.
Demonstrated experience in leading and managing a team of technical professionals for at least 2 years.
Solid understanding of SRE principles, including reliability, scalability, availability, and performance.
Proficient in scripting and automation (e.g., Python, Bash, or similar).
Experience with infrastructure-as-code (IaC) tools, configuration management, and CI/CD pipelines.
Knowledge of cloud platforms (e.g., AWS, Azure, or Google Cloud) and containerization technologies (e.g., Docker).
Excellent problem-solving skills and the ability to thrive in a fast-paced, dynamic environment.
Strong communication and leadership skills, with the ability to collaborate effectively with both technical and non-technical stakeholders.
Preferred Qualifications:
Relevant certifications, such as Certified Kubernetes Administrator (CKA) or AWS Certified DevOps Engineer.
Experience with monitoring and observability tools (e.g., Datadog, New Relic, Prometheus, Grafana, ELK Stack).
Familiarity with agile methodologies and experience working in an Agile/Scrum environment.
It Pays to Work Here
The compensation & benefits package for this role includes:
Competitive starting salary
A discretionary annual bonus
Long-term incentive in the form of a new hire equity grant
Comprehensive health plans
401(k) with company matching
Paid Parental Leave
Flexible time off
Salary Range: The base salary range for this role is $172,000 - $215,000 in the State of New York, the State of California and the State of Washington. This range is not inclusive of our discretionary bonus or equity package. When determining a candidate’s compensation, we consider a number of factors including skillset, experience, job scope, and current market data.
-
Site Reliability Engineer
2 weeks ago
USA, United States TwinStream Full time $120,000 - $140,000 per year
Who are we: In 2019, our founders were working as engineers solving complex cross-domain problems within government organisations. TwinStream was formed to consolidate their collective expertise and experience into one business, providing technical excellence and exceptional service to their clients. We have teams working both on-site with clients and remotely...
-
Site Reliability Engineer
1 week ago
USA, United States Baseten Full time $200,000 - $250,000 per year
ABOUT BASETEN
Baseten powers inference for the world's most dynamic AI companies, like OpenEvidence, Clay, Mirage, Gamma, Sourcegraph, Writer, Abridge, Bland, and Zed. By uniting applied AI research, flexible infrastructure, and seamless developer tooling, we enable companies operating at the frontier of AI to bring cutting-edge models into production. With...
-
Site Reliability Engineer with 2K
3 days ago
USA, United States eTek IT Services Full time
Job Description
Position: Site Reliability Engineer
Location: Remote
Duration: 1 year
Required Qualification:
6+ years of demonstrated influence across one or more teams for large scale projects that drive impact and improvement across the organization
6+ years of developing tools for automation of processes or augmenting off-the-shelf tool functionality
6+...
-
Staff Site Reliability Engineer, Platform
7 days ago
(usa), United States GEMINI Full time
Department: Platform
Our Platform organization’s purpose is to enable Gemini to scale effectively and empower our engineering teams to focus on building innovative financial products and experiences for individuals around the world. Platform focuses on building a scalable and secure foundations platform, enabling Engineering to deploy, validate, and...
-
Senior Site Reliability Engineer, Onchain
2 weeks ago
(usa), United States GEMINI Full time
Department: Onchain
The Role: Senior Site Reliability Engineer
The infrastructure team at Gemini creates and manages software tools and platforms, automates the creation and support of this infrastructure, helps integrate complex processes, and supports secure data access. Security of customers’ digital assets and personal information held with Gemini is...
-
Site Reliability Engineer
22 hours ago
USA, VA, McLean ( Greensboro Dr, Hamilton), United States Booz Allen Hamilton Full time
Site Reliability Engineer
The Opportunity:
Everyone is trying to "harness the cloud," but not everyone knows how. As a DevOps engineer, you're eager to develop, manage, and secure a container platform that meets your client's needs and takes advantage of cloud capabilities. We need you to help us develop container management software to solve some of our...
-
Principal Reliability Engineer
6 days ago
MA: Innovation Dr Tewks Bdg North Street Building, Tewksbury, MA, USA, United States RTX Full time $101,000 - $203,000
Date Posted:
Country: United States of America
Location: MA134: Innovation Dr Tewks Bdg North Street Building 400, Tewksbury, MA, 01876 USA
Position Role Type: Onsite
U.S. Citizen, U.S. Person, or Immigration Status Requirements: Active and transferable U.S. government issued security clearance is required prior to start date. U.S. citizenship is required, as...
-
Intern
2 weeks ago
usa, United States SURVICE Engineering Company Full time
Join Us in Making a Difference in the Lives of Those Defending our Nation!
Why SURVICE?
Come join the SURVICE Engineering mission to protect, enhance, and enable those who defend the United States. Since 1981, we have supported the DoD community, as well as Homeland Security, advanced technologies, environmental, and commercial markets. Our employees have...
-
Senior Infrastructure Engineer
3 days ago
USA, United States Octane Full time
Octane is unlocking the power of financial products for merchants and consumers. Our cutting-edge technology and innovative financial products empower businesses with more control and flexibility, enabling them to deliver seamless digital experiences, drive customer loyalty, and build long-term value. Octane supports merchants throughout the sales cycle:...