Tutorial: Chaos Engineering Playbook – Test Your Resilience
Inject faults into your systems and measure recovery. Master resilience testing with Apparatus chaos tools.
What You’ll Learn
- ✅ Understand chaos engineering principles and why it matters
- ✅ Inject CPU spikes, memory exhaustion, and network latency
- ✅ Measure system behavior under stress (MTTR, degradation)
- ✅ Design resilience test scenarios
- ✅ Run chaos campaigns across clusters
- ✅ Analyze results and identify weak points
- ✅ Build resilience into your architecture
Prerequisites
- Apparatus running — Server at http://localhost:8090
- Target application — App you want to test (running and healthy)
- Monitoring tools (optional) — APM, dashboards, logs to watch system behavior
- curl installed — For running chaos experiments
- Understanding of: Response time, availability, graceful degradation
Time Estimate
~40 minutes (setup → run experiments → analyze results)
What You’ll Build
By the end, you’ll have:
- A chaos experiment that stresses your system
- Measurement of recovery time (MTTR)
- Evidence of graceful degradation (or failures to fix)
- A reusable chaos playbook for recurring tests
Section 1: Chaos Engineering Fundamentals
What is Chaos Engineering?
Chaos engineering is the practice of deliberately breaking things in production (or production-like environments) to find weaknesses before they cause real outages.
The Principle:
“Break things on purpose to understand how they break, then fix it before it matters.”
Why It Matters
Most companies discover their system is fragile during real disasters, not during testing. Chaos engineering lets you find problems safely:
- ❌ Before chaos engineering: App crashes unexpectedly, users are affected
- ✅ With chaos engineering: You find and fix the crash in a controlled test first
Real-World Scenarios
Netflix’s Chaos Monkey:
- Randomly terminates production instances during working hours, so engineers are on hand to respond
- If users don’t notice, resilience is good
- If users do notice, they fix it before a real failure happens
AWS’s Game Days:
- Teams simulate outages (network down, database unreachable)
- Teams respond as if it’s a real incident
- They learn what breaks and practice recovery
Chaos Experiment Pattern
Every chaos experiment follows this pattern:
1. Measure baseline (system performing normally)
2. Inject fault (CPU spike, latency, memory allocation)
3. Observe impact (response time increases? Errors? Timeouts?)
4. Stop fault (system recovers?)
5. Measure recovery (how long until normal? Graceful or crash?)
6. Analyze (what broke? How can we fix it?)
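The whole loop can be scripted. A minimal sketch, assuming the target endpoint http://myapp:3000/api/users and the Apparatus cpu-spike endpoint used later in this tutorial:
#!/usr/bin/env bash
# Sketch of the measure -> inject -> observe -> recover loop
TARGET="http://myapp:3000/api/users"      # assumed target endpoint
APPARATUS="http://localhost:8090"

# 1. Baseline
baseline=$(curl -s -o /dev/null -w "%{time_total}" "$TARGET")
echo "Baseline: ${baseline}s"

# 2. Inject fault: 5-second CPU spike
curl -s -X POST "$APPARATUS/api/chaos/cpu-spike" \
  -H "Content-Type: application/json" \
  -d '{"duration": 5000, "target": "http://myapp:3000"}'

# 3-5. Observe during the fault and through recovery
for i in $(seq 1 15); do
  t=$(curl -s -o /dev/null -w "%{time_total}" "$TARGET")
  echo "$(date +%H:%M:%S) sample $i: ${t}s"
  sleep 1
done
# 6. Analyze: compare the samples against the baseline printed above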
Section 2: Types of Chaos Experiments
CPU Spike (Processing Overload)
What it does: Consumes 80%+ of CPU cycles
When to use: Test if app handles CPU-bound bottlenecks
Expected behavior:
- Response times increase (more contention)
- Throughput decreases
- Error rate stays roughly the same (if requests are queued properly)
Real-world scenario:
- Batch job starts unexpectedly
- Traffic spike from marketing campaign
- Runaway process consumes CPU
Memory Exhaustion (OOM Scenarios)
What it does: Allocates N MB of RAM, leaving less available to the app
When to use: Test memory pressure and caching behavior
Expected behavior:
- GC (garbage collection) pauses increase
- Response times become erratic (worse under load)
- May see OutOfMemory errors if too aggressive
Real-world scenario:
- Memory leak gradually consumes heap
- Caching layer fills up
- Multiple processes compete for RAM
Network Latency (Slow Connections)
What it does: Adds N milliseconds delay to all network calls
When to use: Test timeout handling and retry logic
Expected behavior:
- Requests take longer
- Some timeouts may occur
- Fallbacks/circuit-breakers should activate
Real-world scenario:
- WAN latency (slow intercontinental link)
- Database over network becomes slow
- Third-party API responds slowly
Packet Loss (Unreliable Networks)
What it does: Randomly drops N% of packets
When to use: Test recovery from connection failures
Expected behavior:
- Some requests fail
- Retries should handle gracefully
- If no retry logic, app breaks
Real-world scenario:
- WiFi in coffee shop (2-5% packet loss)
- Cellular network congestion
- Poor intercontinental connectivity
Section 3: CPU Spike Experiment
Experiment 1: Simple CPU Spike
Goal: See how your app responds to sudden CPU spike
Setup: Run a CPU spike for 5 seconds and measure response times
Step 1: Measure Baseline
Before causing chaos, measure normal performance:
# Measure response time under normal load
time curl http://myapp:3000/api/users
Record the result:
- Response time: ___ ms
- Status: 200 OK
- Timestamp: 10:30:00
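A single curl sample is noisy. A short loop that averages ten samples gives a steadier baseline (sketch, same endpoint as above; requires bc):
# Take 10 samples and average them for a steadier baseline
total=0
for i in $(seq 1 10); do
  t=$(curl -s -o /dev/null -w "%{time_total}" http://myapp:3000/api/users)
  total=$(echo "$total + $t" | bc -l)
done
echo "Average baseline: $(echo "scale=3; $total / 10" | bc -l) s"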
Step 2: Trigger CPU Spike
curl -X POST http://localhost:8090/api/chaos/cpu-spike \
-H "Content-Type: application/json" \
-d '{
"duration": 5000,
"target": "http://myapp:3000"
}'
Response:
{
"experimentation_id": "chaos-cpu-001",
"status": "running",
"message": "CPU spike initiated",
"startedAt": "2026-02-21T20:30:00Z"
}
Step 3: Monitor During the Spike
While the chaos is running (5 seconds), repeatedly measure response time:
# Run this in a loop while chaos is happening
for i in {1..5}; do
echo "Attempt $i at $(date +%H:%M:%S)"
time curl http://myapp:3000/api/users
sleep 1
done
You should see:
- Attempt 1: 50ms (normal)
- Attempt 2: 150ms (degraded)
- Attempt 3: 250ms (worse)
- Attempt 4: 300ms (peak degradation)
- Attempt 5: 100ms (starting to recover)
Step 4: Measure Recovery
After the 5-second spike ends, continue measuring:
# Measure recovery over next 10 seconds
for i in {1..10}; do
echo "Recovery check $i at $(date +%H:%M:%S)"
time curl http://myapp:3000/api/users
sleep 1
done
What to look for:
- How long before response time returns to baseline?
- Any lingering errors?
- Any cascading failures?
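To put a number on recovery automatically, here is a sketch that polls once per second until the response time is back within ~20% of a 50 ms baseline (the baseline and threshold are assumptions; substitute your own measurements):
BASELINE=0.050       # assumed baseline in seconds
THRESHOLD=0.060      # "recovered" = within ~20% of the baseline
start=$(date +%s)
while true; do
  t=$(curl -s -o /dev/null -w "%{time_total}" http://myapp:3000/api/users)
  if (( $(echo "$t <= $THRESHOLD" | bc -l) )); then
    echo "Recovered in $(( $(date +%s) - start ))s (last sample: ${t}s)"
    break
  fi
  sleep 1
done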
Step 5: Analyze Results
Create a recovery chart:
Time | Response | Status | Notes
--------|----------|--------|----------
10:30:00| 50ms | 200 | Baseline
10:30:01| 50ms | 200 | Spike starts
10:30:02| 150ms | 200 | CPU busy
10:30:03| 250ms | 200 | Degraded
10:30:04| 300ms | 200 | Peak impact
10:30:05| 250ms | 200 | Still under load
10:30:06| 100ms | 200 | Recovering
10:30:07| 50ms | 200 | ✅ Recovered
Calculate metrics:
- Response time increase: (300 - 50) / 50 = 500% degradation
- MTTR (Mean Time To Recovery): 10:30:07 - 10:30:01 = 6 seconds
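The same arithmetic in shell form, if you want to plug in your own numbers:
baseline=50; peak=300
echo "Degradation: $(( (peak - baseline) * 100 / baseline ))%"   # prints 500%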
Section 4: Memory Exhaustion Experiment
Experiment 2: Memory Pressure
Goal: See how app behaves when memory becomes scarce
Step 1: Baseline Memory State
Check system memory:
free -h # On Linux
# or
vm_stat # On Mac
Record available memory: ___ GB
Step 2: Trigger Memory Spike
Allocate roughly 50% of available memory (2048 MB here; scale to your machine):
curl -X POST http://localhost:8090/api/chaos/memory-spike \
-H "Content-Type: application/json" \
-d '{
"duration": 10000,
"memoryMB": 2048,
"target": "http://myapp:3000"
}'
Step 3: Monitor Behavior
During memory spike, watch for:
- Response time increase
- Error rate increase
- GC (garbage collection) pauses
- Out-of-Memory errors
# Monitor response times every 2 seconds
watch -n 2 'curl -s -w "%{time_total}\n" -o /dev/null http://myapp:3000/api/users'
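To correlate memory pressure with latency, a sketch that logs available memory and response time side by side into a CSV (Linux; assumes free reports an "available" column):
echo "timestamp,available_mb,response_s" > memory-chaos.csv
for i in $(seq 1 15); do
  avail=$(free -m | awk '/^Mem:/ {print $7}')   # "available" column on modern procps
  t=$(curl -s -o /dev/null -w "%{time_total}" http://myapp:3000/api/users)
  echo "$(date +%H:%M:%S),$avail,$t" >> memory-chaos.csv
  sleep 2
done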
Step 4: Measure Recovery
After spike ends, check:
- Memory released immediately?
- Any lingering slowness?
- Did anything crash?
free -h # Check if memory freed
curl http://myapp:3000/health # Check if app still responsive
Step 5: Results
Memory experiment expected outcomes:
| Outcome | What It Means |
|---|---|
| ✅ Response time stable | App has good memory management |
| ⚠️ Temporary slowness | GC pauses are normal, recovers quickly |
| ❌ Out-of-Memory errors | App not resilient to memory pressure |
| ❌ App crash | Memory management broken, needs fixing |
Section 5: Network Latency Experiment
Experiment 3: Slow Database/API Calls
Goal: Test how app handles slow external services
Scenario: Your database becomes slow and every query takes 1 second instead of 10 ms
Step 1: Identify External Dependency
Find an external call your app makes:
- Database query (SELECT * FROM users)
- API call to third-party service
- Cache lookup (Redis)
Step 2: Inject Latency
Simulate slow external service:
curl -X POST http://localhost:8090/api/chaos/latency-inject \
-H "Content-Type: application/json" \
-d '{
"duration": 10000,
"latencyMs": 1000,
"target": "http://myapp:3000",
"affectPath": "/api/users" # Only this endpoint
}'
Step 3: Test with Load
While latency is injected, send requests:
# Rapid fire requests to see timeout behavior
for i in {1..20}; do
curl http://myapp:3000/api/users -m 5 # 5 second timeout
echo ""
sleep 0.5
done
Watch for:
- Requests timing out after 5 seconds?
- Circuit breaker kicking in?
- Fallback data returned?
- Error messages helpful?
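To turn those observations into a number, a sketch that counts successes and failures while the latency is active (the 5-second timeout mirrors the loop above; timeouts and HTTP errors both count as failures):
ok=0; failed=0
for i in $(seq 1 20); do
  if curl -sf -o /dev/null -m 5 http://myapp:3000/api/users; then
    ok=$((ok + 1))
  else
    failed=$((failed + 1))
  fi
  sleep 0.5
done
echo "OK: $ok  Failed: $failed  Error rate: $(( failed * 100 / (ok + failed) ))%"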
Step 4: Analyze Timeout Behavior
Expected outcomes:
| Behavior | Resilience Level |
|---|---|
| ✅ Request times out after 5s, fallback returned | Excellent |
| ⚠️ Request hangs for 30s, then fails | Poor (timeout too long) |
| ❌ App becomes unresponsive | Bad (no timeout) |
Section 6: Running a Chaos Scenario
Build an Integrated Chaos Experiment
Instead of running experiments in isolation, create a multi-step scenario:
{
"name": "Resilience Under Stress",
"target": "http://myapp:3000",
"steps": [
{
"id": "baseline",
"name": "Measure baseline (normal load)",
"action": "redteam.xss",
"params": { "path": "/api/users", "param": "q" }
},
{
"id": "cpu-spike",
"name": "Spike CPU to 80%",
"action": "chaos.cpu",
"params": { "duration": 10000 }
},
{
"id": "measure-under-cpu",
"name": "Measure performance during CPU spike",
"action": "redteam.xss",
"params": { "path": "/api/users", "param": "q" },
"dependsOn": "cpu-spike"
},
{
"id": "wait-recovery",
"name": "Wait 5 seconds for recovery",
"action": "delay",
"params": { "duration": 5000 }
},
{
"id": "verify-recovered",
"name": "Verify normal performance restored",
"action": "redteam.xss",
"params": { "path": "/api/users", "param": "q" }
}
]
}
This scenario:
- Measures normal response time
- Spikes CPU for 10 seconds
- Measures response time under load
- Waits for recovery
- Verifies system recovered
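To run it, save the JSON above to a file and submit it to Apparatus. The scenario submission endpoint below (/api/scenarios) is an assumption; check your Apparatus documentation for the exact path:
# NOTE: /api/scenarios is assumed here; adjust to match your Apparatus version
curl -X POST http://localhost:8090/api/scenarios \
  -H "Content-Type: application/json" \
  -d @resilience-under-stress.json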
Section 7: Analyzing Chaos Results
Key Metrics to Track
| Metric | How to Measure | Good Target |
|---|---|---|
| Response Time | time curl or APM | +100% degradation max |
| MTTR | Time from fault stop until response time returns to baseline | < 2 minutes |
| Error Rate | (failed requests / total) × 100 | < 5% during chaos |
| Graceful Degradation | Can app serve requests with reduced functionality? | Yes |
| Auto-Recovery | No manual intervention needed? | Yes |
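If you log each request as timestamp,response_seconds,http_status (the CSV layout here is an assumption), the average response time and error rate fall out of one awk pass:
# results.csv columns: timestamp,response_seconds,http_status
awk -F, '{ n++; sum += $2; if ($3 != 200) failed++ }
     END { printf "requests=%d avg=%.3fs error_rate=%.1f%%\n", n, sum/n, 100*failed/n }' results.csv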
Example: Chaos Report
EXPERIMENT: CPU Spike for 10 seconds
BASELINE:
- Response time: 50ms
- Error rate: 0%
- Throughput: 200 req/s
DURING SPIKE:
- Response time: 500ms (10x increase)
- Error rate: 2% (some timeouts)
- Throughput: 50 req/s (75% reduction)
RECOVERY:
- Time to 200ms response: 15 seconds
- Time to 50ms response: 30 seconds
- Time to 0% error rate: 20 seconds
FINDINGS:
⚠️ Response time increased 10x (target: < 2x)
⚠️ MTTR is 30 seconds (target: < 10 seconds)
✅ No errors after 20 seconds
✅ Graceful degradation (requests still processed)
RECOMMENDATIONS:
1. Add more CPU capacity (horizontal scaling)
2. Implement request queuing (don't drop requests under load)
3. Reduce timeout to 10 seconds (currently 30)
4. Consider circuit breaker pattern
Section 8: Chaos Playbook Patterns
Pattern 1: Database Failure
Simulate: Database becomes unavailable
{
"name": "Database Unavailability Test",
"steps": [
{
"id": "kill-db",
"name": "Stop database connection",
"action": "chaos.connection-kill",
"params": { "target": "mydb:5432" }
},
{
"id": "measure-impact",
"name": "Measure app behavior",
"action": "redteam.xss",
"params": { "path": "/api/users", "param": "q" },
"timeout": 10000
},
{
"id": "restore",
"name": "Restore database",
"action": "chaos.connection-restore",
"params": { "target": "mydb:5432" }
}
]
}
Expect:
- App returns 503 (Service Unavailable) quickly
- Or uses cache/fallback data
- NOT: App hangs for 30 seconds
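A quick way to check that fail-fast expectation while the connection is down (sketch): the request should come back with an error status in well under a second rather than hanging until the timeout.
# Should return quickly with an error code (e.g. 503), not hang for 30 seconds
time curl -s -o /dev/null -w "%{http_code}\n" -m 5 http://myapp:3000/api/users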
Pattern 2: Cascading Failure
Simulate: One service slow, causes others to back up
{
"name": "Cascading Failure Test",
"steps": [
{
"id": "slow-cache",
"name": "Slow down cache (Redis)",
"action": "chaos.latency",
"params": { "target": "redis:6379", "latencyMs": 2000 }
},
{
"id": "measure-cascade",
"name": "Measure if app is affected",
"action": "redteam.xss",
"params": { "path": "/api/users", "param": "q" },
"timeout": 5000
}
]
}
Expect:
- App uses fallback quickly (doesn’t wait for slow cache)
- NOT: Entire app becomes slow
Pattern 3: Peak Load
Simulate: 10x normal traffic suddenly
{
"name": "Peak Load Test",
"steps": [
{
"id": "baseline",
"name": "Normal traffic baseline",
"action": "chaos.attack",
"params": { "pattern": "normal", "rate": 100 }
},
{
"id": "spike",
"name": "10x traffic surge",
"action": "chaos.attack",
"params": { "pattern": "burst", "rate": 1000 }
},
{
"id": "verify",
"name": "Check if requests still handled",
"action": "redteam.xss",
"params": { "path": "/api/health", "param": "q" }
}
]
}
Section 9: Troubleshooting Chaos Experiments
Experiment Doesn’t Cause Expected Degradation
Symptom: CPU spike runs but response time doesn’t change
Causes:
- The Apparatus CPU spike runs on a different machine than the target app, so the app never feels it
- CPU isn’t the bottleneck (it’s I/O, network, disk)
- Duration too short to measure
Solution:
- Increase duration to 20+ seconds
- Run Apparatus and target app on same machine
- Try memory or latency experiments instead
App Crashes During Experiment
Symptom: App crashes or restarts when chaos runs
Good news: You found a real problem! Your app isn’t resilient.
Solution:
- Document exactly what caused crash (which chaos experiment)
- Fix the underlying issue (add error handling, timeouts, circuit breaker)
- Re-run experiment to verify fix
Can’t Measure Results
Symptom: Response times all over the place, hard to see pattern
Solution:
- Increase sample size (run 100 requests, not 10)
- Average results: (sum of times) / count
- Use monitoring tools (APM, Prometheus) for real data
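A sketch of that larger sample: run 100 requests and let awk compute the average, so a single outlier doesn't mislead you:
for i in $(seq 1 100); do
  curl -s -o /dev/null -w "%{time_total}\n" http://myapp:3000/api/users
done | awk '{ sum += $1; n++ } END { printf "Average over %d requests: %.3fs\n", n, sum/n }'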
Summary
You’ve learned how to:
- ✅ Understand chaos engineering principles
- ✅ Design and run CPU spike experiments
- ✅ Test memory exhaustion and recovery
- ✅ Inject network latency to test timeout handling
- ✅ Build multi-step chaos scenarios
- ✅ Analyze results and measure MTTR
- ✅ Identify resilience gaps and fix them
Next Steps
Continue your chaos engineering journey:
- Run against your own app — Design experiments specific to your architecture
- Automate in CI/CD — Run chaos tests on every deployment
- Game Days — Invite your team to respond to chaos together
- Production Testing — Once confident, run (safe) chaos in production
- Learn from Netflix — Read Netflix's Chaos Monkey write-ups and the Principles of Chaos Engineering
Made with ❤️ for DevOps, SREs, and resilience engineers