Current jobs related to Kafka Site Reliability Engineer - Santa Clara, California - Palo Alto Networks


  • Santa Clara, California, United States Palo Alto Networks Full time

    We are revolutionizing the cybersecurity landscape with our cloud-delivered security services, and our cloud infrastructure is rapidly expanding globally.We're seeking experienced SREs and software engineers interested in production engineering to help us scale the world's largest enterprise security cloud infrastructure.Palo Alto Networks has transformed...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About the RoleWe are seeking a highly skilled Senior Staff Site Reliability Engineer to join our CDL/SLS team at Palo Alto Networks. As a key member of our team, you will be responsible for designing, building, and operating reliable and secure cloud infrastructure.Our Infrastructure Platform stack includes Terraform, Kubernetes, GitLab CI/CD, GitOps,...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About the RolePalo Alto Networks is seeking a highly skilled Senior Staff Site Reliability Engineer to join our CDL/SLS team. As a key member of our team, you will be responsible for designing, building, and operating reliable and secure cloud infrastructure.Key ResponsibilitiesContribute to the success of SRE and DevOps teamsDevelop expertise in new...


  • Santa Clara, California, United States Diverse Lynx Full time

    Job Title: Site Reliability EngineerWe are seeking a highly skilled Site Reliability Engineer to join our team at Diverse Lynx LLC. As a Site Reliability Engineer, you will be responsible for ensuring the reliability, scalability, and performance of our cloud-based systems.Key Responsibilities:Design, implement, and maintain cloud infrastructure on AWS,...


  • Santa Clara, California, United States Syntricate Technologies Full time

    Job Title: Site Reliability EngineeringWe are seeking a highly skilled Site Reliability Engineer to join our team at Syntricate Technologies. As a Site Reliability Engineer, you will be responsible for ensuring the reliability and scalability of our cloud-based systems.Key Responsibilities:Design and implement scalable and reliable cloud infrastructure using...


  • Santa Clara, California, United States Syntricate Technologies Full time

    Job DescriptionWe are seeking a highly skilled Site Reliability Engineer to join our team at Syntricate Technologies. As a Site Reliability Engineer, you will be responsible for ensuring the reliability, scalability, and performance of our cloud-based systems.Key Responsibilities:Design, implement, and maintain cloud infrastructure on AWS, including EC2,...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Job DescriptionPalo Alto Networks is seeking a highly skilled Site Reliability Engineer to join our team. As a Site Reliability Engineer, you will be responsible for designing, building, maintaining, and scaling production services and server farms within our FedRAMP SASE product portfolio.Key ResponsibilitiesDesign and implement scalable and reliable...


  • Santa Clara, California, United States NVIDIA Full time

    Unlock the Power of Cloud ServicesWe are seeking a highly motivated Site Reliability Engineer to join our Applications Infrastructure organization.This team is responsible for automating, deploying, and maintaining infrastructure for various NVIDIA AI workflows and applications such as Metropolis, ACE, and Riva hosted in the cloud.The SRE role focuses on...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Job DescriptionPalo Alto Networks is seeking a highly skilled Site Reliability Engineer to join our team. As a Site Reliability Engineer, you will be responsible for designing, building, and maintaining scalable and reliable infrastructure for our cloud-based products.Key Responsibilities:Design and implement scalable and reliable infrastructure for...


  • Santa Clara, California, United States NVIDIA Full time

    As a Senior Manager in Site Reliability Engineering (SRE) at NVIDIA, you will lead a team dedicated to the design, construction, and maintenance of expansive production systems, emphasizing high efficiency and availability. This role spans various domains, including software and systems engineering, cloud-scale storage, data management, and services. SRE...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Job DescriptionPalo Alto Networks is seeking a highly skilled Principal Site Reliability Engineer to join our team. As a key member of our engineering team, you will be responsible for designing, building, and operating reliable and secure cloud infrastructure.You will work closely with our development team to ensure that applications are production-ready,...


  • Santa Clara, California, United States NVIDIA Full time

    About NVIDIANVIDIA is a leader in the field of artificial intelligence, machine learning, and datacenter acceleration. Our company has a rich history of innovation, with a legacy that dates back to the invention of the GPU in 1999. This groundbreaking technology sparked the growth of the PC gaming market, redefined modern computer graphics, and...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About the RoleWe are seeking a highly skilled Principal Site Reliability Engineer to join our team at Palo Alto Networks. As a key member of our Global Customer Operation Team, you will be responsible for designing, building, maintaining, and scaling production services and server farms within our FedRAMP SASE product portfolio.Key ResponsibilitiesDesign and...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Job DescriptionPalo Alto Networks is seeking a highly skilled Site Reliability Engineer to join our team. As a Site Reliability Engineer, you will play a critical role in designing, building, maintaining, and scaling production services and server farms within our FedRAMP SASE product portfolio.Your ResponsibilitiesDesign and Implement Scalable...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Job OverviewPalo Alto Networks is seeking a highly skilled Cloud Infrastructure Engineer to join our CDL/SLS team. As a Senior Staff Site Reliability Engineer, you will be responsible for designing, building, and operating reliable and secure cloud infrastructure.Our team is at the forefront of innovation, constantly pushing the boundaries of what is...


  • Santa Clara, California, United States NVIDIA Full time

    NVIDIA is a leader in AI, machine learning, and datacenter acceleration. Our company is expanding its leadership into datacenter networking with ethernet switches, NICs, and DPUs. We have continuously reinvented ourselves over two decades, with our invention of the GPU in 1999 sparking the growth of the PC gaming market, redefining modern computer graphics,...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About the RoleWe are seeking a highly skilled Senior Staff Site Reliability Engineer to join our team at Palo Alto Networks. As a key member of our Cloud Infrastructure team, you will be responsible for designing, building, and operating reliable and secure cloud infrastructure.Our ideal candidate will have a strong background in cloud computing, with...


  • Santa Clara, California, United States Palo Alto Networks Full time

    About UsPalo Alto Networks is a leader in the cybersecurity industry, dedicated to protecting the digital way of life. Our mission is to be the cybersecurity partner of choice, and we're looking for innovators who share our passion for shaping the future of cybersecurity.We're a company built on disruption, and we're looking for individuals who are...


  • Santa Clara, California, United States Palo Alto Networks Full time

    Job DescriptionPalo Alto Networks is seeking a highly skilled Senior Staff Site Reliability Engineer to join our CDL/SLS team. As a key member of our infrastructure team, you will be responsible for designing, building, and operating reliable and secure cloud infrastructure.Key Responsibilities:Develop expertise in new technologies and contribute to the...


  • Santa Clara, California, United States Software Technology, Inc Full time

    Job Title: Service Reliability EngineerSoftware Technology, Inc is seeking a highly skilled Service Reliability Engineer to join our team.Key Responsibilities:Develop and implement DevOps practices to ensure 24x7 SaaS operationCollaborate with micro-service software developers, architects, and field integration resources to architect and deliver Ericsson's...

Kafka Site Reliability Engineer

2 months ago


Santa Clara, California, United States Palo Alto Networks Full time
Job Title: Principal Kafka Site Reliability Engineer DevOps

We are revolutionizing the cybersecurity landscape with our cloud-delivered security services, and our cloud infrastructure is rapidly expanding with a global presence.

We're seeking exceptional SREs and software engineers interested in production engineering to help us scale the largest enterprise security cloud infrastructure worldwide.

About Us

Palo Alto Networks has disrupted the enterprise firewall market, growing from a start-up to a multi-billion-dollar company.

Our Application Framework, the latest offering in our cloud-delivered security services, ingests security events from hundreds of thousands of firewalls deployed globally to provide a massive data analytics platform for deep inspection, anomaly detection, and actionable security automation.

Our cloud infrastructure is home to complex distributed systems and virtualization software platforms that enable big data processing for security services, sandboxing, and malware detection, URL categorization, and malicious site/domain identification, and security research/response.

Responsibilities
  1. Maintain and scale production Kafka clusters with high ingestion rates, Zookeeper clusters, and other big data pipeline systems such as Kafka and HDFS.
  2. Improve scalability, service reliability, capacity, and performance.
  3. Write automation code for managing, monitoring, measuring, expanding, and healing clusters.
  4. Participate in the occasional on-call rotation supporting the infrastructure.
  5. Roll up your sleeves to troubleshoot incidents, formulate theories, and test hypotheses to find the root cause.
Qualifications
  • Hands-on experience with managing production Kafka clusters.
  • Strong development/automation skills, with a focus on Python.
  • In-depth understanding of Kafka cluster management, Zookeeper, partitioning, topic replication, and mirroring.
  • Excellent grasp of monitoring and metrics collection, performance tuning, and troubleshooting complicated distributed systems.
  • Tools-first mindset, with a focus on building tools for efficiency and ease of use.

We're looking for organized, focused individuals who can build, improve, resolve, and deliver. Good communication skills, teamwork, and a character of taking ownership are essential.