Infrastructure Reliability Engineer

2 weeks ago


Redmond, Washington, United States Microsoft Corporation Full time

Join us in transforming the educational landscape.

Are you interested in being part of a team dedicated to developing innovative applications and services that will redefine the learning journey? Do you aspire to work within an organization that prioritizes customer satisfaction and fosters an inclusive team environment? If this resonates with you, we may have the perfect opportunity. The Education engineering team plays a crucial role in our mission and serves as an incubator for M365 initiatives. We are committed to creating exceptional experiences for students, educators, administrators, and parents, with the ultimate goal of empowering every learner globally to achieve greater success.

We are in search of a Site Reliability Engineer (SRE) who possesses a balanced blend of systems engineering, software development, and online services experience, coupled with a strong commitment to quality. Your role will involve envisioning, designing, and delivering our highly scalable services that cater to millions of educators and learners worldwide.

Our engineering culture is characterized by data-driven decision-making, dynamism, and inclusivity. Team members are encouraged to explore creative ideas, formulate hypotheses, and implement them iteratively to learn and adapt swiftly. We are passionate about enhancing service agility and deploying to production frequently. Our services are built as distributed RESTful APIs deployed in Azure, utilizing the Office 365 Substrate layer and SharePoint, while our user interfaces are developed using React.

At Microsoft, our mission is to empower every individual and organization on the planet to achieve more. As part of our team, we unite with a growth mindset, innovate to empower others, and collaborate to achieve our shared objectives. Daily, we uphold our values of respect, integrity, and accountability to cultivate a culture of inclusion where everyone can thrive both at work and beyond.

Required Qualifications:

  • 3+ years of technical experience in software engineering, network engineering, or systems administration.
  • Alternatively, a Bachelor's Degree in Computer Science, Information Technology, or a related field.

Other Requirements:

  • Compliance with Microsoft, customer, and/or government security screening requirements is essential for this role. This includes, but is not limited to, specialized security screenings.

Preferred Qualifications:

  • Experience in software development, particularly in automation-related tasks. Proficiency in scripting languages such as Bash and PowerShell, or in compiled languages such as C and C#, is highly valued, although other languages are acceptable.
  • Understanding of modern software and systems architectures, including load balancing, queuing, caching, distributed systems failure modes, and microservices.
  • Strong troubleshooting skills, including the ability to trace remote call chains across multiple service layers and a solid understanding of monitoring in distributed systems.
  • Familiarity with the following technologies is preferred:
      • Azure VMs, Key Vault, Service Fabric, ARM templates, Traffic Manager, Storage Accounts, Redis cache, and DevOps release and build pipelines.
      • Docker containers, Kubernetes, and the fundamentals of Windows and Linux operating systems.
      • Certificate lifecycle management, including creation, administration, domain registration, Azure Key Vault storage, and revocation.
      • Source code management using Git, querying data in Azure Data Explorer, building Power BI dashboards, and authoring workflows using Power Automate.

Compensation: The typical base pay range for this role across the U.S. is USD $76,400 - $151,800 per year, with variations applicable to specific work locations.

Benefits: Additional benefits and compensation details are available based on the nature of your employment with Microsoft.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable laws, regulations, and ordinances.

We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you require assistance or a reasonable accommodation due to a disability during the application or recruiting process, please send a request via the appropriate channels.

Key Responsibilities:

  • Develop code, scripts, systems, or platforms that automate moderately complex but repetitive operational processes (e.g., monitoring, alerting, deploying products and updates, debugging) at scale.
  • Analyze data from telemetry pipelines and monitoring tools that detail operational metrics (e.g., availability, reliability, performance, efficiency) of systems, platforms, or products operating at scale.
  • Respond to incidents during regular on-call rotations by identifying the level of impact, troubleshooting complex issues, and deploying appropriate fixes to resolve root causes.
  • Share details related to incidents and their resolution through post-mortem reports and during regular review meetings with more experienced engineers and members of product engineering teams.
  • Embody our culture and values.


