Software Engineer II
6 days ago
Overview
The AI Platform organization builds the end-to-end Azure AI stack, from the infrastructure layer to the PaaS and user experience offerings for AI application builders, researchers, and major partner groups across Microsoft. The platform is core to Azure's innovation, differentiation and operational efficiency, as well as the AI-related capabilities of all of Microsoft's flagship products, from M365 and Teams to GitHub Copilot and Bing Copilot. We are the team building the Azure OpenAI service, AI Foundry, Azure ML Studio, Cognitive Services, and the global Azure infrastructure for managing the GPU and NPU capacity running the largest AI workloads on the planet.
One of the major, mature offerings of AI Platform is Azure ML Services. It provides data scientists and developers a rich experience for defining, training, fine-tuning, deploying, monitoring, and consuming machine learning models. We provide the infrastructure and workload management capabilities powering Azure ML Services, and we engage directly with some of the major internal research and applied ML groups using these services, including Microsoft Research and the Bing WebXT team.
As part of AI Platform, the AI Infra team is looking for a Software Engineer II - AI Infrastructure (Scheduler) - CoreAI, with an initial focus on the Scheduler subsystem. The scheduler is the "brains" of the AI Infra control plane. It governs access to the GPU and NPU capacity of the platform according to a complex system of workload preference rules, placement constraints, optimization objectives, and dynamically interacting policies that aim to maximize hardware utilization and fulfill the greatly varying needs of users and the AI Platform partner services in terms of workload types, prioritization, and capacity targeting flexibility. The scheduler's set of capabilities is broad and ambitious. It manages quota, capacity reservations, SLA tiers, preemption, auto-scaling, and a wide range of configurable policies. Global scheduling is a distinctive major feature that overcomes the regional segmentation of the Azure compute fleet by treating the GPU capacity as a single global virtual pool, which greatly increases capacity availability and utilization for major classes of ML workload. We have achieved this capability without introducing a global single point of failure: regional instances of the scheduler service interact via peer-to-peer protocols to share capacity inventory and coordinate the handoff of jobs for scheduling. Our system manages a significant amount of GPU capacity even outside Azure datacenters, through a unified model and operational process and highly generalized, flexible workload scheduling capabilities.
To manage the inherent complexity of the Scheduler subsystem and enable it to meet stringent expectations of high service reliability, availability, and throughput, we emphasize rigorous engineering, utmost precision and quality, and ownership, from feature design to livesite. A quality mindset, attention to detail, development process rigor, and data-driven design and problem-solving skills are key for success in our mission-critical control plane space.
Responsibilities
-
Work on the design and development of the core AI Infrastructure distributed and in-cluster services that support large scale AI training and inferencing.
-
Develop, test, and maintain control plane services written in C#, hosted on Service Fabric or Kubernetes (AKS) clusters.
-
Enhance systems and applications to ensure high stability, efficiency and maintainability, low latency, and tight cloud security.
-
Provide operational support and DRI (on-call) responsibilities for the service.
-
Develop and foster a deep understanding of the machine learning concepts, use cases, and relevant services used by our customers.
-
Collaborate closely with service engineers, product managers, and internal applied research and data science teams within Microsoft to build better solutions together.
-
Investigate use of tools and cloud services and prototype solutions for problems in our control plane space.
-
Embody our culture (https://careers.microsoft.com/v2/global/en/culture) and values (https://www.microsoft.com/en-us/about/corporate-values).
Qualifications
Required Qualifications
-
Bachelor's Degree in Computer Science or related technical field AND 2+ years technical engineering experience with coding in languages including, but not limited to, C++, C#, Java, Scala, Rust, Go, TypeScript, OR equivalent experience.
Other Requirements
-
Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include but are not limited to the following specialized security screenings:
-
Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
Preferred Qualifications
-
OOP proficiency and practical familiarity with common code design patterns
-
2+ years of experience with service development in a distributed environment, in a dev-ops role, including concurrency management and stateful resource management
-
Master's degree in Computer Science or a related technical field
-
Hands-on experience with public cloud services at the IaaS level
-
Advanced knowledge of C# and .NET
-
Proficiency with use of complex data structures and algorithms, preferably in the setting of a resource allocator/scheduler, workflow/execution orchestration engine, database engine, or similar
-
Significant experience with unit testing and writing testable code
-
Technical communication skills: verbal and written
-
First-hand experience with building large-scale, multi-tenant global services with high availability
-
Experience with building and operating "stateful" and critical control plane services; handling challenges with data size and data partitioning; related use of a NoSQL cloud database
-
Experience with mapping complex object models to relational and non-relational datastores
-
Dev-ops experience with microservices architecture in a complex infrastructure and operational environment
-
Service reliability and fundamentals engineering; instrumentation for KPIs or performance analysis; demonstrated service and code quality mindset
-
Performance engineering: work on scalability, profiling; CPU, memory and I/O use optimization techniques
-
Applied knowledge of Kubernetes: service model, workload packaging and deployment, programmatic extensibility (CRDs, operators); or equivalent knowledge of Service Fabric
-
Server-side Windows programming and performance engineering
-
Data analytics skills, in particular with Kusto
-
Experience working in a geo-distributed team
#AIPLATFORM
#AICORE
Software Engineering IC3 - The typical base pay range for this role across the U.S. is USD $100,600 - $199,000 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $131,400 - $215,400 per year.
Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here:
https://careers.microsoft.com/us/en/us-corporate-pay
This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.
Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations. (https://careers.microsoft.com/v2/global/en/accessibility.html)