System Development Engineer, Annapurna Labs, Machine Learning Fleet Operations

2 days ago


Austin TX United States Amazon Full time
AWS Utility Computing (UC) provides product innovations that continue to set AWS’s services and features apart in the industry. As a member of the UC organization, you’ll support the development and management of Compute, Database, Storage, Platform, and Productivity Apps services in AWS, including support for customers who require specialized security solutions for their cloud services. Additionally, this role may involve exposure to and experience with Amazon's growing suite of generative AI services and other cutting-edge cloud computing offerings across the AWS portfolio.

Annapurna Labs (our organization within AWS UC) designs silicon and software that accelerates innovation. Customers choose us to create cloud solutions that solve challenges that were unimaginable a short time ago—even yesterday. Our custom chips, accelerators, and software stacks enable us to take on technical challenges that have never been seen before, and deliver results that help our customers change the world.

In Annapurna Labs we are at the forefront of hardware/software co-design not just in Amazon Web Services (AWS) but across the industry. The Machine Learning Fleet Operations Team is looking for candidates interested in diving deep into our "fleet" of Machine Learning servers deployed around the world.

We are seeking an engineer who is comfortable debugging emergent problems in GPU and server hardware, writing scripts in languages such as Python, Bash and/or Golang, running large scale experiments on a fleet of complex hardware, developing data infrastructure and analyzing trends, and developing automation software to scale operations.

Key job responsibilities:

  1. Serve on a team responsible for system remediation, operational excellence, and customer experience on bleeding-edge ML products.
  2. Utilize data to root cause hardware failures and identify live trends on the most complex systems in AWS.
  3. Implement and improve system level testing across the product lifecycle.
  4. Develop software which can be maintained, improved upon, documented, tested, and reused.
  5. Dive deep on issues at the intersection of hardware and software.

A day in the life:

The MLA Fleet Operations team was formed to maintain an exceptionally high quality bar for our fleet of advanced machine learning server products. We perfect the customer experience by developing scalable software for rapid incident response times and data visualization as well as diving deep into hardware issues as they arise.

About the team:

Our team is dedicated to supporting new members. We have a broad mix of experience levels and tenures, and we’re building an environment that celebrates knowledge-sharing and mentorship. Our senior members enjoy one-on-one mentoring and thorough, but kind, code reviews. We care about your career growth and strive to assign projects that help you develop your engineering expertise so you feel empowered to take on more complex tasks in the future.

Diverse Experiences:

AWS values diverse experiences. Even if you do not meet all of the qualifications and skills listed in the job description, we encourage candidates to apply. If your career is just starting, hasn’t followed a traditional path, or includes alternative experiences, don’t let it stop you from applying.

About AWS:

Amazon Web Services (AWS) is the world’s most comprehensive and broadly adopted cloud platform. We pioneered cloud computing and never stopped innovating — that’s why customers from the most successful startups to Global 500 companies trust our robust suite of products and services to power their businesses.

Inclusive Team Culture:

Here at AWS, it’s in our nature to learn and be curious. Our employee-led affinity groups foster a culture of inclusion that empowers us to be proud of our differences. Ongoing events and learning experiences, including our Conversations on Race and Ethnicity (CORE) and AmazeCon (gender diversity) conferences, inspire us to never stop embracing our uniqueness.

Work/Life Balance:

We value work-life harmony. Achieving success at work should never require sacrifices at home, which is why we strive for flexibility as part of our working culture. When we feel supported in the workplace and at home, there’s nothing we can’t achieve in the cloud.

Mentorship & Career Growth:

We’re continuously raising our performance bar as we strive to become Earth’s Best Employer. That’s why you’ll find endless knowledge-sharing, mentorship and other career-advancing resources here to help you develop into a better-rounded professional.

BASIC QUALIFICATIONS

- 2+ years of non-internship professional software development experience.
- 1+ years of experience designing or architecting new and existing systems (design patterns, reliability, and scaling).
- 1+ years of administrative experience in networking, storage systems, and operating systems, plus hands-on systems engineering experience.
- Knowledge of systems engineering fundamentals (networking, storage, operating systems).
- Experience programming with at least one modern language such as C++, C#, Java, Python, Golang, PowerShell, Ruby.
- Experience with Linux/Unix.

PREFERRED QUALIFICATIONS

- Experience building services using AWS products.

