Senior Site Reliability/Observability Engineer

4 months ago


San Francisco, United States Evolver Full time
Job Description

Senior Site Reliability/Observability Engineer 

Overview

Under the supervision of the Director of Infrastructure, the Senior DevOps Engineer will collaborate with infrastructure developers, architects, and vendors to maintain the Site Reliability Engineering practice. He/she will primarily be responsible for designing, building, and supporting the application stack in an operationally reliable and cost-effective manner.

The Infrastructure and Cloud Services team designs, develops, and improves services, platforms, and processes to improve the end-to-end reliability and maintainability of our mission-critical applications and services. As stewards of the four golden signals – latency, traffic, errors, and saturation – you will proactively seek out system weaknesses and remediate discovered issues before they cause production incidents, using observability principles and trend analysis, and you will test resiliency using Chaos Engineering.

Duties and Responsibilities

  • Develop deployment automation to provide a fully functional cloud stack in Azure that supports new and existing environments. 
  • Design and spearhead Observability Platform and contribute towards other reliability engineering related automation/tooling. 
  • Develop templates or scripts to automate everyday developer or operations functions. 
  • Monitor the performance of systems in a cloud-based computing environment, including overall system health, reliability, performance, and cost. Identify bottlenecks and scale resources up or down to meet demand and expectations.
  • Perform proactive daily system monitoring including reviewing system and application logs as well as responding to, triaging, troubleshooting and remediating incidents. 
  • Design, implement, and improve observability-related systems covering applications, infrastructure telemetry, and metric visualizations. 
  • Collaborate with partner and stakeholder teams to improve observability to support business objectives by identifying, collecting, and visualizing different metrics 
  • Develop tools, dashboards, and training to provide teams with deeper insights into application and infrastructure performance. 
  • Create and review documentation and processes covering recurring issues, new standard operating procedures, knowledge transfer material, etc.
  • Design and build an SRE function that owns application availability and performance and manages them through automation and proactive/predictive alerts, using data analytics toolsets to identify areas of improvement for Dev and Ops teams.
  • Implement comprehensive service monitoring to ensure uptime and performance, including synthetic monitoring, real user traffic monitoring, application performance monitoring, system-level monitoring, and dashboards.
  • Define, measure, and meet SLAs/SLOs focusing on availability, performance, incidents, and chronic quality issues. Arm developers with deeper insights into application performance and service health issues to reduce MTTA and MTTR.
  • Identify performance bottlenecks and anomalous system behavior, and resolve the root cause of service issues.
  • Lead, mentor and help grow other engineers, both SRE and generalist roles. 
  • Maintain and measure reliability, latency, and scalability for complex systems 
  • Develop operational best practices to control cloud usage costs. 
  • Build and maintain scalable and performant production infrastructure, improving and ensuring reliability across systems. 
  • Assist IT, Development, and Data teams with Cloud management tasks. 
  • Assist production support teams with troubleshooting anomalies. 
  • Plan and prioritize activities. 
  • Work on multiple projects at the same time. 
  • Report activities and progress to the team in daily scrum meetings.
  • Perform tasks as required by management. 
  • Provide support after hours and on weekends, when necessary. 

Basic Qualifications 

  • 5 years of hands-on experience with Azure Infrastructure and development solutions, including instrumenting .NET Core and Angular/React applications.
  • 5 years of experience in designing delivery pipelines, installations, configurations, automations, and monitoring of various cloud services, including IaaS, PaaS, and SaaS.
  • 5 years of experience leveraging New Relic, Datadog, or Splunk as a comprehensive observability tool, including Azure Log Analytics Workspace and App Insights.
  • Bachelor’s degree in Computer Science or a related discipline, or 7 years of applicable work experience.

Preferred Qualifications

  • Hands-on experience with infrastructure configuration automation and Infrastructure as Code (IaC, Terraform) 
  • Have a track record of leading successful SRE/DevOps projects 
  • Ability to go into depth on topics such as scaling, networking, monitoring, and security of containers in production.
  • Extensive experience building scalable platforms leveraging containers in a production environment.  
  • Solid experience with DevOps implementations, migrations and upgrades within the Microsoft Cloud Azure solution suite. 
  • Experience with building and maintaining monitoring, logging, and/or tracing related systems 
  • Experience automating and running large-scale production services in Azure. 
  • Proficiency in Data Engineering or Data Platforms – Azure Data Factory, Snowflake, and Azure SQL Managed Instance. 
  • Ability to build, adapt, and standardize common frameworks, infrastructure, and processes across the organization to ensure production cloud workloads are stable and resilient.
  • Experience and understanding of deployment strategies: Basic, Blue/Green, Canary, multi-service, Rolling, and A/B Testing. 
  • Experience with OS-level virtualization tools such as Docker. 
  • Experience with container orchestration tools in Azure such as Azure Kubernetes Service and Azure Container Instances. 
  • Experience with DevOps CI/CD tools and concepts – Jira/Confluence, Git, Gitflow, Ansible, Azure DevOps.
  • Experience in Windows and Linux as build environments and optimizing which to use in different scenarios. 
  • Experience with SAST, DAST, SCA, and other code security tools.
  • Experience with Security Policy as Code and Observability as Code.
  • Experience with scripting and programming languages – PowerShell, .NET, C#, Python.
  • Understanding of micro-services, containerization, and other app modernization strategies. 
  • Experience with content delivery networks (CDNs) such as Cloudflare and Fastly.
  • Understanding of APIs, SDKs, and other integration methods. 
  • Familiarity with Network Infrastructure and Security. 
  • Knowledge of IP networking, private tunnels, VPNs, DNS, load balancing, and firewalls.
  • Familiarity with Windows and Linux. 
  • Candidate must have excellent communication (verbal and written) and interpersonal skills, including effectively communicating with technical and non-technical team members. 
  • Implement and stay abreast of Cloud, SRE and DevOps industry best practices and tooling. 
  • Experience operating in a highly regulated industry. 
  • Detail-oriented with excellent organizational and analytical skills.
  • Ability to plan and take initiative to accomplish objectives in a timely fashion.
  • Ability to prioritize work and meet deadlines.
  • Ability to establish and maintain effective working relationships with team members, supervisors, vendors, and employees from other departments. 

 

Company Description

Evolver is an equal opportunity/affirmative action employer. All qualified applicants will receive consideration for employment without regard to sex, gender identity, sexual orientation, race, color, religion, national origin, disability, protected Veteran status, age, or any other characteristic protected by law.