RaaS Chaos Engineer
2 weeks ago
Job Description:
We are committed to building resilient and reliable systems that can withstand the test of time. We understand the importance of chaos engineering in ensuring the stability and performance of our products, and we are seeking an experienced Chaos Engineer to join our team. The ideal candidate will have a strong background in building chaos tools using Golang, GraphQL, and Elasticsearch, and will be passionate about improving the reliability and fault tolerance of our systems.
Responsibilities:
• Design, develop, and maintain chaos engineering tools using Golang, GraphQL, and Elasticsearch to inject faults and simulate failure scenarios in our systems.
• Collaborate with cross-functional teams to identify potential weaknesses in our infrastructure and applications, and develop mitigation strategies to prevent outages and performance degradation.
• Develop and implement chaos experiments to validate the effectiveness of our systems under various failure conditions.
• Work closely with the engineering, operations, and QA teams to ensure that our chaos engineering practices are aligned with the overall objectives of our organization.
• Analyse system performance and incident data to continuously improve the reliability and resilience of our systems.
• Participate in on-call rotations to provide support for production incidents and ensure the smooth operation of our services.
• Stay current on industry trends and advancements in chaos engineering, and continuously explore opportunities to enhance our tools and processes.
Requirements:
• Bachelor's degree in Computer Science, Software Engineering, or a related field
• 4+ years of experience in chaos engineering, reliability engineering, or a similar role
• Hands-on experience with high-availability (HA) and disaster-recovery (DR) simulations
• Strong proficiency in Golang, with experience building chaos tools using Golang, GraphQL, and Elasticsearch
• In-depth understanding of distributed systems, microservices architecture, and containerization technologies (such as Docker and Kubernetes)
• Knowledge of best practices in software development, testing, and deployment, including CI/CD pipelines and automation tools
• Familiarity with monitoring and observability tools (such as Prometheus, Grafana, and Elastic Stack)
• Excellent problem-solving skills, with the ability to troubleshoot complex issues and develop effective solutions
• Strong communication and collaboration skills, with the ability to work effectively in a fast-paced, team-oriented environment
Preferred Qualifications:
• Experience with additional programming languages (such as Python, Java, or C++)
• Familiarity with cloud platforms (such as AWS, GCP, or Azure)
• Experience in SRE (Site Reliability Engineering) or similar roles, working with large-scale, distributed systems
• Certification in Chaos Engineering (such as the Certified Chaos Engineering Professional) or other relevant industry certifications