Site Reliability Engineer II

4 days ago


Redmond, Washington, United States Microsoft Full time

Overview
Microsoft has an exciting opportunity for a Site Reliability Engineer II in the Cloud+AI Azure Data Team. Microsoft's Azure Data engineering team is leading the transformation of analytics in the world of data with products like databases, data integration, big data analytics, messaging & real-time analytics, and business intelligence.

The products in the Azure Data portfolio include Microsoft Fabric, Azure SQL Databases, Azure Cosmos Databases, Azure PostgreSQL, Azure Data Factory, Azure Synapse Analytics, Azure Service Bus, Azure Event Grid, and Power BI. Our mission is to build a data platform for the age of AI, powering a new class of data-first applications and driving a data culture. This team will be responsible for deploying and operating our Azure Data services in a Secure Work Area, including the infrastructure for collaboration within an Air-Gapped environment.

In this role, you will have the opportunity to work with engineers who enable a broad set of Azure services to be consumed by internal and external customers in highly secure and regulated industries. The systems and software you build will be required to meet the security policy and assurance requirements of both public and private sector customers.

Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Responsibilities
The scale of our operations is enormous. Microsoft's products and services are overwhelmingly consumed online, and billions of people use them every day. We need people who enjoy analyzing complicated problems, coming up with creative solutions, and working in focused teams to build things no one has thought of before, all in the service of production reliability.

  • Acts as a Designated Responsible Individual (DRI) working on call to monitor service for degradation, downtime, or interruptions. Alerts stakeholders as to the status and gains approval to restore system/product/service for simple problems. Responds within Service Level Agreement (SLA) timeframe. Escalates issues to appropriate owners.
  • Contributes to efforts to collect, classify, and analyze data with little oversight on a range of metrics (e.g., health of the system, where bugs might be occurring). Contributes to the refinement of product features by escalating findings from analyses to inform decisions regarding the engineering of products.
  • Contributes to the development of automation within production and deployment of a complex product feature. Runs code in simulated or other non-production environments to confirm functionality and error-free runtime for products with little to no oversight.
  • Contributes to efforts to ensure the correct processes are followed to achieve a high degree of security, privacy, safety, and accessibility. Checks for visible evidence to demonstrate compliance for product areas. Develops and holds an understanding of the implications of onboarding new technologies following expectations of compliance at Microsoft.
  • Remains current in skills by investing time and effort into staying abreast of current developments that will improve the availability, reliability, efficiency, observability, and performance of products while also driving consistency in monitoring and operations at scale.
  • Applies best practices to reliably build code that is based on well-established methods. Follows best practices for product development and scaling to customer requirements and applies best practices for meeting scaling needs and performance expectations.
  • Maintains communication with key partners across the Microsoft ecosystem of engineers. Considers partners across teams and their end goals for products to drive and achieve desirable user experiences and to fit the dynamic needs of partners/customers through product development.
  • Maintains operations of live service as issues arise on a rotational, on-call basis. Implements solutions and mitigations to more complex issues impacting performance or functionality of Live Site service and escalates as necessary. Reviews and writes issue postmortems and shares insights with the team.
  • Acts as a Designated Responsible Individual (DRI) and guides other engineers by developing and following the playbook, working on call to monitor system/product/service for degradation, downtime, or interruptions. Alerts stakeholders as to status and initiates actions to restore system/product/service for simple problems and complex problems when appropriate. Responds within Service Level Agreement (SLA) timeframe. Drives efforts to reduce incident volume, looking globally at incidents and providing broad resolutions. Escalates issues to appropriate owners.
  • Drives efforts to integrate instrumentation for gathering telemetry data on system behavior such as performance, reliability, availability, usage, and safety mechanisms. Drives sustaining feedback loops from telemetry resulting in subsequent designs. Creates outputs of telemetry such as notifications or dashboards.
  • Drives efforts to collect, classify, and analyze data on a range of metrics (e.g., health of the system, where bugs might be occurring). Drives the refinement of products through data analytics and makes informed decisions in engineering products through data integration.

Qualifications
Required/minimum qualifications:

  • Bachelor's Degree in Computer Science or related technical field AND 2+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python

  • OR equivalent experience.

Other Requirements
Security Clearance Requirements: Candidates must be able to meet Microsoft, customer, and/or government security screening requirements for this role. These requirements include, but are not limited to, the following specialized security screenings:

  • The successful candidate must have an active U.S. Government Top Secret Clearance with access to Sensitive Compartmented Information (SCI) based on a Single Scope Background Investigation (SSBI) with Polygraph. The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. Failure to maintain or obtain the appropriate U.S. Government clearance and/or customer screening requirements may result in employment action up to and including termination.
  • Clearance Verification: This position requires successful verification of the stated security clearance to meet federal government customer requirements. You will be asked to provide clearance verification information prior to an offer of employment.
  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
  • Citizenship & Citizenship Verification: This position requires verification of U.S. citizenship due to citizenship-based legal restrictions. Specifically, this position supports U.S. federal, state, and/or local government agency customers and is subject to certain citizenship-based restrictions where required or permitted by applicable law. To meet this legal requirement, citizenship will be verified via a valid passport, other approved documents, or a verified U.S. government clearance.

Software Engineering IC3 - The typical base pay range for this role across the U.S. is USD $100,600 - $199,000 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $131,400 - $215,400 per year.

Certain roles may be eligible for benefits and other compensation.

This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.


