Infrastructure Reliability Engineer



San Jose, California, United States | Western Digital | Full time
Job Overview

Company Overview:
At Western Digital, we are dedicated to driving global innovation and redefining the limits of technology, transforming what was once deemed impossible into reality.

As a company rooted in problem-solving, we empower individuals to achieve remarkable feats through advanced technology. Our innovations have played a pivotal role in monumental achievements, including supporting space exploration.

We collaborate with some of the most dynamic and rapidly expanding organizations worldwide. Our contributions range from powering competitive gaming platforms, to enabling systems that make cities safer and vehicles more connected, to supporting the data centers behind many of the largest corporations and public cloud services. Western Digital is at the forefront of creating a smarter, more connected future.

Whether you're binge-watching your favorite series, engaging on social media, or shopping online, Western Digital is the backbone of the storage infrastructure that supports these experiences. Our products, including flash memory cards, are designed to capture and preserve your most cherished moments.

We provide a comprehensive array of technologies, storage devices, and platforms tailored for both businesses and consumers. Our data-centric solutions encompass the Western Digital, G-Technology™, SanDisk, and WD brands.

Position Summary:
As a Site Reliability Engineer (SRE) within the Secure Development Factory (SDF) at Western Digital, you will be integral to our engineering processes, delivering essential software development tools and infrastructure that enable our engineering teams to produce high-quality products efficiently. Your role will be crucial in ensuring the reliability, scalability, and performance of our IT infrastructure and DevOps tools.

In this position, you will lead by example, collaborating closely with engineering teams to align our initiatives with customer needs. Your technical acumen, adaptability, and commitment to excellence will be key in driving success and empowering stakeholders to accelerate product delivery while maintaining high standards of security, development velocity, stability, code quality, and overall code health.

Key Responsibilities:

  • Monitoring and Observability: Design and enhance monitoring solutions to provide real-time insights into system performance.
  • Best Practices Implementation: Advocate for and apply best practices in SRE, DevOps, and automation to boost platform stability and performance.
  • Process Automation: Spearhead automation initiatives to optimize workflows, minimize manual tasks, and enhance operational efficiency.
  • System Architecture: Contribute to the design and architecture of systems and applications, ensuring alignment with reliability and scalability objectives.
  • Technical Leadership: Assume technical ownership within the SRE team, fostering a collaborative and growth-oriented environment.
  • Reliability Ownership: Take responsibility for system reliability, achieving Service Level Objectives (SLOs) and ensuring customer satisfaction.
  • Collaborative Solutions: Work closely with engineering teams to understand customer requirements and co-develop solutions.
  • Continuous Learning: Stay abreast of emerging technologies and adapt swiftly to changing requirements and challenges.
  • Knowledge Sharing: Actively engage in upskilling and sharing knowledge within the team.
  • Team Collaboration: Foster a positive team culture through effective collaboration with colleagues.
  • Professional Conduct: Exhibit professionalism, integrity, and a commitment to ethical standards.
  • Documentation: Maintain comprehensive and organized documentation of systems and processes.

Qualifications:

  • Bachelor's degree in Computer Science, Information Technology, Electrical Engineering, or Mechanical Engineering, with 6 to 10 years of hands-on experience in DevOps tools and SRE practices.
  • Proven experience in administering DevOps tools such as Artifactory, Jenkins, Git, and security testing tools.
  • Strong understanding of server infrastructure, virtualization, storage, and networking.
  • Exceptional analytical and problem-solving skills for managing complex technology issues.
  • Extensive experience with Ansible automation, including researching, writing, maintaining, and optimizing roles, playbooks, and modules.
  • Proficiency in shell scripting, Python, and infrastructure-as-code tools like Terraform.
  • Experience in developing and customizing CI/CD pipelines for diverse application requirements.
  • Familiarity with monitoring tools such as Icinga, Splunk, Prometheus, and Grafana.
  • Knowledge of containerization and orchestration technologies like Docker and Kubernetes is advantageous.
  • Automation-first mindset with a focus on integrating security measures into systems.
  • Experience with load balancers, LDAP/SSO integration, and security endpoint configurations.
  • Familiarity with cloud computing platforms (e.g., AWS, Azure, GCP) is a plus.
  • Excellent communication and collaboration skills.

Additional Information:
Western Digital values diversity and is committed to creating an inclusive environment where every individual can thrive. We believe that diverse perspectives lead to the best outcomes for our employees, our company, and our customers.

We are dedicated to providing opportunities for applicants with disabilities and ensuring that all candidates can navigate our hiring process successfully.


