Site Reliability Engineer

2 days ago


remote, us Epam Full time

Description

We are seeking a Site Reliability Engineer (Azure) to join our team.

Responsibilities

As a Lead Azure SRE, you will be responsible for driving the reliability, performance, and scalability of cloud-based applications and services. Your expertise in Kubernetes, scripting, troubleshooting, and observability will be instrumental in ensuring a seamless and efficient cloud operations environment.

  • Take ownership of managing Kubernetes clusters, ensuring their reliability, scalability, and performance. Implement best practices for deploying, monitoring, and optimizing containerized applications in a cloud environment
  • Use scripting skills in Python, Bash, and PowerShell to develop automation tools and streamline repetitive tasks. Automate infrastructure provisioning, deployment, and maintenance to achieve operational efficiency
  • Demonstrate expertise in troubleshooting cloud environments, diagnosing and resolving issues to maintain high availability and performance. Implement proactive monitoring and alerting solutions to identify and address potential problems before they escalate
  • Integrate with Azure DevOps to optimize the CI/CD pipeline, enabling continuous delivery and deployment of applications. Collaborate with development teams to streamline the release process and ensure smooth deployments
  • Implement and maintain the modern observability stack, including tools like Grafana, Prometheus, Loki, etc. Leverage these tools to monitor the health and performance of systems and applications, enabling quick identification and resolution of incidents

Requirements

  • Kubernetes
  • Scripting (Python, Bash, PowerShell, in that order of preference)
  • Troubleshooting in cloud environments
  • Azure DevOps
  • Good understanding of the modern observability stack, i.e., tools like Grafana, Prometheus, Loki, etc.

Nice to have

  • Experience working with Windows
  • Knowledge of CI/CD (especially Azure DevOps)
  • Knowledge of Istio
  • Knowledge of GitOps tools (like ArgoCD)

We offer

  • Career plan and real growth opportunities
  • Unlimited access to LinkedIn learning solutions
  • International Mobility Plan within 25 countries
  • Constant training, mentoring, online corporate courses, eLearning and more
  • English classes with a certified teacher
  • Support for employee initiatives (Algorithms club, Toastmasters, agile club and more)
  • Enjoyable working environment (gaming room, napping area, amenities, events, sports teams and more)
  • Flexible work schedule and dress code
  • Collaborate in a multicultural environment and share best practices from around the globe
  • Hired directly by EPAM & % under payroll
  • Benefits required by law (IMSS, INFONAVIT, 25% vacation bonus)
  • Major medical expenses insurance: Life, Major medical expenses with dental & visual coverage (for the employee and direct family members)
  • 13% employee savings fund, capped at the legal limit
  • Grocery coupons
  • 30-day December bonus
  • Employee Stock Purchase Plan
  • 12 vacation days plus 4 floating days
  • Official Mexican holidays, plus 5 extra holidays (Maundy Thursday and Friday, November 2nd, December 24th & 31st)
  • Monthly non-taxable amount for the electricity and internet bills

EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.



  • remote, us Crisis Text Line Full time

    Crisis Text Line provides free, 24/7, high-quality text-based mental health support and crisis intervention by empowering a community of trained volunteers to support people in their moments of need. Our mission is at the intersection of empathy and innovation: we promote mental well-being for people wherever they are. Our vision is an empathetic world...


  • remote, us Epam Full time

    Description: Are you a skilled Azure DevOps Site Reliability Engineer with a passion for ensuring business continuity and helping businesses always be near their clients? Do you have experience in optimizing and supporting OSDU deployment, performing monitoring including incidents resolution, and suggesting improvements? If so, we have an exciting...


  • remote, us Epam Full time

    Description: Join EPAM as a Senior Site Reliability Engineer specializing in AWS! In this role, you'll ensure fleet services reliability and availability under the SRE model. If you have a good track record of highly scalable, distributed systems projects and previous experience working as an SRE, we'd love to hear from you. EPAM is a leading...


  • Remote, United States SS&C Technologies Holdings Full time

    Job Description: Site Reliability Engineer (SRE), SS&C Technologies. Locations: Remote FL, TX, GA, NC, AZ, TN. About SS&C Technologies: SS&C Technologies is a global investment and financial services software provider for the economic and healthcare industries. Named to the Fortune 1000 list as the top U.S. company based on revenue, SS&C is headquartered in...


  • remote, us Epam Full time

    Description: Are you a skilled Cloud Site Reliability Engineer with experience in AWS or GCP? Do you have a passion for maintaining CI/CD frameworks, integrating observatory stacks, and supporting Cloud applications? If so, we have an exciting opportunity for you! We're currently seeking a Cloud Site Reliability Engineer to join our vibrant team....


  • remote, us Epam Full time

    Description: Join EPAM as an AWS SRE. In this role, you'll collaborate with service teams to improve the reliability and efficiency of workloads and services using SRE practices. If you're a senior engineer with a good track record of highly scalable, distributed systems projects in the past 5 years, we'd love to hear from you. EPAM is a leading...


  • Remote, Oregon, United States Priority Technology Holdings, LLC Full time

    Job title: Principal Site Reliability Engineer. Reports to: Director, Site Reliability Engineering. Department: Engineering. Location: Remote. Grade: 21. About Priority: Priority Technology Holdings, Inc. is a leading financial technology company on a mission to deliver a personalized, easy-to-adopt financial toolset that accelerates cash flow and optimizes working...


  • Remote, Oregon, United States Careviso Full time

    Senior Site Reliability Engineer. Location: Remote in the United States. About the Role: We're looking for a Senior Site Reliability Engineer or DevOps Engineer to join our small but growing infrastructure team. You'll work alongside our existing Site Reliability team to build and maintain the systems that keep our platform reliable, secure, and observable....


  • remote, us Epam Full time

    Description: Join EPAM as an AWS Cloud Site Reliability Engineer. In this role, you'll transfer security processes, manage authentication technologies, and support the implementation of a Palo Alto firewall. If you have 3+ years of experience with AWS, proficiency in designing and managing data migration processes, and superior communication...


  • Remote, Oregon, United States MinIO Full time

    MinIO is the industry leader in high-performance object storage and the company behind the world's fastest, most widely deployed object store, powering production infrastructure for more than half of the Fortune 500, including 9 of the 10 largest global automakers and all 10 of the largest U.S. banks. Our enterprise offering, AIStor, is engineered to handle...