Principal Infrastructure SRE

2 weeks ago


Santa Clara, United States NVIDIA Full time

NVIDIA has been reinventing computer graphics, PC gaming, and accelerated computing for 30 years. It is a unique legacy of innovation that’s fueled by great technology and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, generative AI, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work.

We are seeking a highly skilled Senior Staff Infrastructure Performance Engineer to join our dynamic team. Our company is at the forefront of technological innovation, and we are dedicated to driving efficiency and optimizing the performance of our infrastructure, both on-prem and in the cloud. Join us in this exciting endeavor.

What you will be doing:
- Lead initiatives to transform IT Compute platform architecture to build new service offerings across On-Prem & Cloud.
- Define and implement metrics to measure the efficiency of compute platforms & services and drive efficiency.
- Collect and review system data for capacity and planning purposes, analyze capacity data and develop plans for appropriate level enterprise-wide systems, and coordinate with management personnel in implementing changes.
- Develop and maintain tools for collecting, analyzing, and visualizing data for reporting, alerting, monitoring.
- Collaborate with NVIDIA leadership, senior engineers, program managers, and product managers to develop compelling IT products and services that meet customer needs.

What we need to see:
- Bachelor’s degree in Engineering, Computer Science, Mathematics, or related field, or equivalent experience.
- 12+ years of proven experience in compute platform engineering with a focus on automation.
- Experience with design and deployment of virtualization architectures, including VMware, OpenShift, or KubeVirt platforms.
- Strong analytical skills with the ability to define and track key performance metrics.
- Experience in developing tools for data analysis and performance profiling, and in development with Terraform and configuration management tools.
- Proficiency in programming languages such as Go and/or Python.
- Experience with running large environments consisting of bare metal, large-scale virtualized environments with a mix of tens of thousands of VMs, and cloud infrastructure.

Ways to stand out from the crowd:
- Deep understanding of other infrastructure components like Storage, DNS, AD, Security Tools, etc.
- Hands-on experience with cloud platforms such as AWS, Azure, or Google Cloud Platform.
- Solid understanding of microservices architecture, infrastructure as code (IaC), and configuration management tools.
- Understanding of AI ops and how to leverage LLMs to automate various optimization initiatives.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you.

The base salary range is 196,000 USD - 310,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.
You will also be eligible for equity and benefits.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

