Principal Site Reliability Engineer

1 week ago

Santa Clara, California, United States Fortinet Full time

At Fortinet, we strive to provide a supportive, collaborative environment where people are empowered to do the best work of their careers.

Our team members enjoy solving complex problems and obsess over getting the details right. We love what we do and are proud of our work to secure cloud and container environments for thousands of B2B customers worldwide.

Our team is growing, and we are looking for engineers with a passion for automation. You will help support the FortiCNAPP platform and play a key role in building, operating, and improving the FortiCNAPP Cloud Security Platform, the world's best real-time cloud-native threat detection system.

Our team develops and supports the infrastructure layers spanning our cloud accounts, network/connectivity, workload management, observability, and storage services. We build tooling that automates operations so we can scale the FortiCNAPP infrastructure and service. To be successful, you will design, define, develop, deploy, and operate internal tooling, APIs, and frameworks that streamline our workflows and automate our infrastructure.

About this role: As a Principal Site Reliability Engineer at FortiCNAPP, you will lead the design, implementation, and optimization of our highly scalable, resilient, and efficient platform infrastructure. You will drive strategic initiatives to enhance operational excellence, mentor teams, and set the standard for reliability and automation across the organization. Your expertise will shape the future of FortiCNAPP's infrastructure, ensuring it meets the demands of our customers and supports rapid growth.

Responsibilities:

Architect and implement advanced automation strategies to maximize operational efficiency and minimize toil across the FortiCNAPP platform.
Lead the design, development, and enhancement of infrastructure systems to ensure world-class scalability, resiliency, and performance.
Proactively identify and resolve complex, systemic issues through innovative automation, tooling, and architectural solutions, preventing customer-facing incidents.
Drive the evolution of monitoring, instrumentation, and observability systems to anticipate and mitigate scalability and reliability risks before they impact customers.
Champion company-wide adoption of reliability best practices, establishing key metrics, SLAs, and milestones to embed scalability and resiliency into all engineering processes.
Collaborate with cross-functional teams to define and implement industry-leading practices for infrastructure, deployment, and operational workflows.
Provide technical leadership and mentorship to engineering and operations teams, fostering a culture of reliability, automation, and continuous improvement.
Lead incident response and post-mortem processes, driving root cause analysis and implementing preventive measures.
Participate in an on-call rotation, serving as an escalation point for complex issues and guiding the team through critical incidents.
Influence strategic technology decisions, evaluating and integrating cutting-edge tools, services, and methodologies to enhance platform reliability.

Minimum Qualifications:

10+ years of DevOps/SRE experience, with at least 5 years in a senior or lead role managing production systems at scale.
Expert-level development and automation skills, with a proven track record of building sophisticated tools and workflows.
Deep expertise in Infrastructure as Code (e.g., Terraform) and supporting tools (e.g., Atlantis, ArgoCD, Flux).
Advanced experience with Kubernetes and its ecosystem (e.g., Helm, operators, Kustomize), including managing large-scale, production-grade clusters.
Extensive experience with multiple cloud providers and managed services (e.g., AWS: EKS, EC2, S3, RDS, Secrets Manager; GCP, Azure).
Proven ability to architect and operate highly reliable, fault-tolerant cloud infrastructure that supports rapid microservice deployment with robust monitoring and high availability.
Exceptional cross-team communication and leadership skills, with experience driving alignment across engineering, product, and operations teams.
Deep knowledge of large-scale system building blocks, including load balancing, distributed/cloud computing, container orchestration, and advanced monitoring/observability.
Expert understanding of cloud networking, including VPC configuration, cross-cloud connectivity, and hybrid cloud architectures.
Proficiency in one or more programming languages (e.g., Python, Go, Rust) for building tools and automation frameworks.

Preferred Qualifications:

Extensive experience designing and implementing advanced monitoring and observability systems (e.g., Prometheus, Grafana, New Relic, Datadog, OpenTelemetry).
Strong advocate for "everything as code" principles, with experience institutionalizing IaC and GitOps practices across teams.
Deep expertise in Java application servers, JVM tuning, and performance optimization for high-throughput systems.
Experience leading cross-functional initiatives to improve system reliability, such as chaos engineering, disaster recovery planning, or zero-downtime deployments.

Educational Requirements:

Bachelor's or Master's degree in Computer Science, Computer Engineering, or a related field.

The US base salary range for this full-time position is $202,000-$247,000. Fortinet offers employees a variety of benefits, including medical, dental, vision, life and disability insurance, 401(k), 11 paid holidays, vacation time, and sick time as well as a comprehensive leave program.

Wage ranges are based on various factors including the labor market, job type, and job level. Exact salary offers will be determined by factors such as the candidate's subject knowledge, skill level, qualifications, experience, and geographic location.

All roles are eligible to participate in the Fortinet equity program. Bonus eligibility is reviewed at time of hire and annually at the Company's discretion.

Why Join Us:

We encourage candidates from all backgrounds and identities to apply. We offer a supportive work environment and a competitive Total Rewards package to support you with your overall health and financial well-being.

Embark on a challenging, enjoyable, and rewarding career journey with Fortinet. Join us in bringing solutions that make a meaningful and lasting impact on our 660,000+ customers around the globe.
