Job Description
Location: St. Louis, MO (3 days per week in the office)
Note: Relocation is mandatory.
Responsibilities:
Design and run chaos experiments to test system reliability, fault tolerance, and recovery.
Build automated chaos tests using tools such as Gremlin, Litmus, Chaos Mesh, and AWS Fault Injection Simulator (FIS).
Identify failure points in microservices, APIs, and cloud infrastructure.
Collaborate with SRE, DevOps, and Development teams to improve resilience.
Document findings, create remediation plans, and drive resilience best practices.
Required Skills:
6+ years of experience in SRE, DevOps, or Platform Engineering, with strong knowledge of distributed systems.
Hands-on experience with chaos engineering tools (Gremlin, Litmus, AWS Fault Injection Simulator, Chaos Mesh).
Strong knowledge of Kubernetes, microservices, container orchestration, and cloud (AWS/Azure/GCP).
Experience with monitoring tools (Prometheus, Grafana, Datadog, Splunk).
Solid scripting skills in Python, Bash, or Go.