Site Reliability Engineer

2 weeks ago


Sunnyvale, United States Blue Ribbon Global Technologies Full time
Job Description

Note: Hello. The manager would like to hold a Supplier Call to better explain the specifics they are looking for in this role. Please do not add anyone until after this call. I have scheduled this call for Friday 9/6/24 @ 1pm EST.

Description:

This is a Site Reliability Engineer Role for Sam's Cash Application team.

Role and Responsibilities include:

  • Production Ticket Handling and Troubleshooting: Requires strong analytical and problem-solving skills and knowledge of root cause analysis (RCA) and root cause corrective action (RCCA). Guides team members in RCA and RCCA to identify the origins of, and prevent, defects and performance gaps. Analyzes complex problems involving multiple parties, networks, hardware, software, and cloud computing technologies.
  • Assesses immediate restoration versus root cause based on consequences and resource requirements. Analyzes the issues and plans a series of steps to enhance an application's availability and reliability, potentially including reconfiguration, integration, removal, or the addition of application components. Analyzes trends to proactively prevent incidents and provide historical summary reports.
  • Disaster Recovery Planning: Requires knowledge of disaster recovery procedures and processes and of enterprise disaster recovery systems. Coordinates partial and full tests of contingency and disaster recovery plans. Creates and maintains data center contingency documents and action plans. Defines and documents contingency and disaster recovery procedures. Leads the identification of critical functions for the assigned area of responsibility. Creates and tests plans for operating in a remote back-up environment. Coordinates the day-to-day activities of control measures used in recovery plans.
  • Monitoring and Alerting: Requires knowledge of: Monitoring and alerting tools (Splunk, Prometheus, Grafana); Monitoring metrics and key performance indicators (for example, availability, MTBF, MTTR); SLIs and SLOs (for example, request latency, availability, error rates, saturation); Distributed tracing; Alerting logic.
  • Establishes metrics to monitor network, software, or system performance. Establishes SLOs/SLAs to determine availability goals of systems/services. Sets alerting priorities by identifying the most important systems based on criticality. Oversees daily system monitoring, including verifying the integrity and availability of all hardware and services, reviewing system and application logs, and verifying the completion of scheduled jobs.
  • Leads end-to-end audits of monitors and alarms based on subsystem knowledge. Provides proactive updates to executive leadership on potential customer-impacting issues. Analyzes systems and makes recommendations to prevent possible incidents using knowledge of complex and company-wide systems.

Data Reporting and Metrics:

  • Advanced SQL skills to pull complex data reports from multiple sources; familiarity with Databricks or GCP BigQuery; ability to write advanced Splunk queries that join multiple indices to stitch data together; use of a data-driven decision-making process to analyze the impact of production issues and prioritize them.

Additional Information:

What project or initiative will they be working on?

  • Sam's Cash Reward Project

Will this role be hybrid?

  • Yes

If hybrid, how many days per week will need to be in office?

  • 2-3 times a week

Top 3 Skills Needed or Required

  • Strong technical, analytical, and problem-solving skills; experience triaging and troubleshooting production issues
  • Monitoring and alerting skills (Splunk, Prometheus, Grafana)
  • Data reporting and metrics skills (SQL, Python, PySpark, Databricks)

What is the makeup of the team?

  • Team of 8 engineers including Java backend engineers, Site Reliability Engineer and Data Engineers, supporting Sam's Cash Core Application Operations.

Additional Job Details

  • Location can be Sunnyvale, CA, Bentonville, AR, or Dallas, TX

Required Skills: Grafana
Additional Skills: Cloud Developer
