Reliability Engineer

5 days ago


Tampa, Florida, United States Data Management Group Full time
Job Description

We are seeking a highly skilled Reliability Engineer to join our team at Data Management Group. As a key member of our Technology, Infrastructure & Operations teams, you will play a critical role in ensuring the reliability and performance of our cloud-based services and applications.

Key Responsibilities
  • Develop and Maintain Comprehensive Monitoring Solutions: Design and implement monitoring solutions for cloud-based services and applications, ensuring real-time insights into system performance and health.
  • Configure Monitoring Tools and Systems: Set up monitoring tools and systems to collect relevant metrics, logs, and traces, providing a clear understanding of system performance and capacity.
  • Create Custom Dashboards and Reports: Develop custom monitoring dashboards and reports using Splunk, Datadog, Dynatrace, or other tools to provide actionable insights for stakeholders.
  • Proactively Identify and Address Performance Issues: Continuously monitor cloud infrastructure performance and capacity, anticipating and addressing potential scalability issues to ensure system reliability and resilience.
  • Automate Tasks to Streamline Operational Processes: Automate tasks to reduce manual intervention and improve operational efficiency.
  • Collaborate with Cross-Functional Teams: Investigate and resolve critical incidents together with cross-functional teams, ensuring minimal impact on end users.
  • Work with Problem Management Team: Participate in post-mortem analysis of incidents to identify root causes and implement preventive measures.
  • Configure and Maintain Custom Dashboards and Alerts: Set up and maintain custom dashboards and alerts in various monitoring tools to ensure timely notification of performance issues.
  • Develop Scripts for Monitoring: Write scripts in PowerShell, Python, and shell to enhance system monitoring and performance analysis.
  • Develop Metrics for Business and Technical Teams: Create metrics for both business and technical teams to determine the health of systems and inform decision-making.
  • Provide On-Call Support: Provide on-call support as needed to ensure system reliability and performance.
  • Lead Performance Engineering Efforts: Lead and coordinate performance engineering efforts for medium to large initiatives, ensuring effective implementation of performance engineering best practices.
  • Collect and Document System Performance Characteristics: Collect and document expected system performance and operational characteristics to inform system design and implementation.
  • Develop and Execute Performance Tests: Develop and execute performance tests, including load, stress, endurance, fail-over, and interoperability tests, to ensure system reliability and performance.
  • Conduct Technical Analysis of Performance Test Results: Conduct technical analysis of performance test results and production systems, providing recommendations on performance tuning, systems, and infrastructure.
  • Define Strategy for Performance Diagnostics and Monitoring: Define the strategy for enabling performance diagnostics and monitoring using Application Performance Management (APM) tools, other monitoring tools, and diagnostic techniques.
  • Collaborate with Developers: Collaborate with developers to promote the concept of performance engineering during all phases of the Software Development Life Cycle (SDLC) to detect and correct performance issues earlier in the lifecycle.
  • Lead Peer Reviews: Lead peer reviews to ensure the completeness of all test assets created.
  • Resolve Performance and Stability Issues: Resolve performance and stability issues in the performance test environment.
  • Develop Performance Engineering Work Plan: Develop the performance engineering work plan structure and project schedule.
  • Review Architectural Design for Performance Risks: Review architectural designs for performance risks and potential issues.
  • Prepare Capacity Analysis: Prepare capacity analysis when applicable.


  • Tampa, Florida, United States OnMed Full time

    Job Overview. POSITION TITLE: Infrastructure Reliability Engineer. REPORTS TO: Director, Service Delivery. LOCATION: This role requires presence in the designated office, with occasional travel to various operational sites as necessary. About OnMed: OnMed is dedicated to enhancing access to quality, affordable, and equitable healthcare. Our innovative CareStation...


  • Tampa, Florida, United States LMI Full time

    About the Role: LMI is seeking a skilled Cloud Reliability Engineer to join our team in Tampa, Florida. Job Summary: The Cloud Reliability Engineer will be responsible for building and maintaining IT infrastructure resources that serve the Chief Digital and Artificial Intelligence Office's (CDAO) data analysis and data management requirements. Key...


  • Tampa, Florida, United States Cognizant Full time

    Site Reliability Engineering Manager (Remote). We are looking for a highly skilled professional Site Reliability Engineering Manager to enhance our team. The ideal candidate will possess robust technical expertise in Dynatrace SRE Performance Validation and Python programming. The Site Reliability Engineering Manager will be instrumental in maintaining the...


  • Tampa, Florida, United States SSI People Full time

    Job Title: Site Reliability Engineer. Location: Remote. Key Responsibilities: Engage with software development teams to ensure that operational requirements are effectively integrated into the Application Performance Management (APM) tools and new software releases. Establish monitoring and alerting benchmarks. Create clear and precise Service Level Objectives...


  • Tampa, Florida, United States Striveworks Full time

    Position Overview: As a Lead Site Reliability Engineer (SRE) within Striveworks' professional services team, you will be entrusted from day one to oversee specific product implementations by managing, refining, and enhancing our on-premises and cloud-based computing infrastructures. Your role is pivotal in ensuring the successful rollout of our software...


  • Tampa, Florida, United States eTeam Full time

    Position: Site Reliability Engineer. Work Arrangement: Remote. Contract Duration: 6+ Months. Overview: This role is designed for a fully remote environment, ideally suited for candidates in the Eastern or Central time zones. Occasional office visits may be required for conferences, events, or meetings as necessary. Key Qualifications: A minimum of 5 years of...


  • Tampa, Florida, United States Striveworks Full time

    Striveworks - Site Reliability Engineer. The Site Reliability Engineer at Striveworks is crucial in the deployment and upkeep of software solutions tailored for clients, ensuring seamless integration and customization. Key responsibilities encompass the automation of infrastructure-as-code, incident management, and collaboration with platform developers. The...


  • Tampa, Florida, United States Striveworks Full time

    Striveworks - Infrastructure Reliability Engineer. The Infrastructure Reliability Engineer at Striveworks is essential in the deployment and upkeep of software solutions tailored for clients, ensuring seamless integration and customization. Key responsibilities encompass the automation of infrastructure-as-code, incident management, and teamwork with platform...


  • Tampa, Florida, United States Striveworks Full time

    Striveworks - Site Reliability Engineer. The Site Reliability Engineer at Striveworks is essential in implementing and sustaining software solutions for our clients, ensuring seamless integration and customization. Key responsibilities encompass the automation of infrastructure-as-code, incident management, and collaboration with platform developers. The ideal...


  • Tampa, Florida, United States Striveworks Full time

    Striveworks - Site Reliability Engineer. The Site Reliability Engineer at Striveworks is essential in the deployment and maintenance of software solutions tailored for clients, ensuring seamless integration and customization. Key responsibilities encompass the automation of infrastructure-as-code, incident management, and collaboration with platform...


  • Tampa, Florida, United States Striveworks Full time

    Striveworks - Site Reliability Engineer. The Site Reliability Engineer at Striveworks is essential in implementing and overseeing software solutions for clients, ensuring effective integration and tailored customization. Key responsibilities encompass the automation of infrastructure-as-code, managing incident responses, and collaborating closely with platform...


  • Tampa, Florida, United States Striveworks Full time

    Striveworks - Site Reliability Engineer. The Site Reliability Engineer at Striveworks is essential in the deployment and maintenance of software solutions tailored for clients, ensuring effective integration and customization. Key responsibilities encompass the automation of infrastructure-as-code, incident management, and collaboration with platform...


  • Tampa, Florida, United States Striveworks Full time

    Become a pivotal member of our team as a Lead Site Reliability Engineer (SRE) at Striveworks. Striveworks is in search of a skilled Lead Site Reliability Engineer to oversee, refine, and elevate our software deployments across both on-premises and cloud infrastructures. In this critical role, you will ensure the effective deployment of our software solutions...


  • Tampa, Florida, United States Striveworks Full time

    Join Striveworks as a Lead Site Reliability Engineer. Striveworks is seeking a skilled Lead Site Reliability Engineer (SRE) to oversee, refine, and enhance our software deployments across both on-premises and cloud platforms. In this pivotal role, you will ensure the effective deployment of our software solutions to our clients, collaborating closely with a...


  • Tampa, Florida, United States Actalent Full time

    Position Overview: The Maintenance Reliability Engineer plays a crucial role in supporting operations and maintenance by troubleshooting equipment malfunctions. This position requires providing technical expertise and delivering engineered solutions to enhance equipment reliability. Key Responsibilities: Manage smaller capital projects by offering technical...


  • Tampa, Florida, United States Gopher Resource Full time

    Position Overview: The Asset Reliability Manager is tasked with spearheading initiatives aimed at enhancing system dependability and equipment integrity. This role is pivotal in achieving optimal operational uptime, ensuring that plant machinery is adequately serviced and monitored through the implementation of PDCA feedback mechanisms. The manager fosters a...


  • Tampa, Florida, United States Striveworks Full time

    Position Overview: We are seeking a skilled Senior Site Reliability Engineer (SRE) at Striveworks, dedicated to overseeing, refining, and advancing our software deployments across both on-premises and cloud infrastructures. In this pivotal role, you will ensure the effective deployment of our software solutions to our clientele, collaborating closely with a...


  • Tampa, Florida, United States Striveworks Full time

    Position Overview: We are seeking a skilled Senior Site Reliability Engineer (SRE) at Striveworks, dedicated to overseeing, refining, and advancing our software implementations across both on-premises and cloud infrastructures. In this pivotal role, you will ensure the effective deployment of our software offerings to clients, collaborating closely with a...


  • Tampa, Florida, United States Striveworks Full time

    Position Overview: We are seeking a skilled Senior Site Reliability Engineer (SRE) at Striveworks, responsible for overseeing, refining, and advancing our software deployments across both on-premises and cloud infrastructures. In this pivotal role, you will ensure the effective deployment of our software solutions to clients while collaborating closely with a...


  • Tampa, Florida, United States Striveworks Full time

    Position Overview: We are seeking a skilled Senior Site Reliability Engineer (SRE) at Striveworks, responsible for overseeing, refining, and advancing our software deployments across both cloud and on-premises environments. In this pivotal role, you will ensure the effective deployment of our software solutions to our clientele while collaborating closely...