Litmus Chaos - Enterprise-Grade Chaos Engineering for Kubernetes
Chaos engineering validates system resilience by introducing controlled failures. LitmusChaos makes this practice accessible for Kubernetes environments, giving SRE teams the tools to rehearse failure scenarios in a controlled way before they occur in production.
Why Chaos Engineering Matters for SREs
Real failures don’t wait for convenient times. Networks partition during peak traffic. Nodes fail during deployments. Memory leaks surface under load. Traditional testing catches functional bugs but misses reliability gaps.
LitmusChaos helps answer critical SRE questions:
- Does your app recover when pods get terminated?
- How does network latency affect user experience?
- Can your system handle node failures gracefully?
- Do circuit breakers work under actual load?
Architecture Deep Dive
LitmusChaos follows a control plane and execution plane architecture:
Control Plane (Chaos Center)
The control plane manages experiment lifecycle:
- Portal: Web UI for experiment design and scheduling
- Authentication: RBAC integration with Kubernetes
- Workflow Engine: Orchestrates complex experiment sequences
- Database: Stores experiment results and metrics
Execution Plane
The execution plane runs experiments:
- Chaos Agent: Deployed per cluster, executes experiments
- Chaos Operator: Manages Custom Resource lifecycle
- Experiment Pods: Run actual chaos injection logic
- Chaos Exporter: Pushes metrics to monitoring systems
This separation means you can manage experiments centrally while running them across multiple clusters.
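To make this concrete, here is a hedged sketch of how you would inspect both planes on a connected cluster (resource names assume a default install into the litmus namespace):

```bash
# Execution plane: the CRDs the Chaos Operator reconciles, plus any resources created from them
kubectl get crds | grep litmuschaos.io
kubectl get chaosengines,chaosexperiments,chaosresults --all-namespaces

# Execution-plane components (operator, exporter, agent); pod names vary per install
kubectl get pods -n litmus
```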
Core Components Explained
ChaosExperiment: The Experiment Template
ChaosExperiment defines what chaos to inject. Think of it as a reusable template:
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-memory-hog
  labels:
    instance: litmus
spec:
  definition:
    scope: Namespaced
    permissions:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["create", "delete", "get", "list", "patch", "update"]
    image: "litmuschaos/go-runner:latest"
    args:
      - -c
      - ./experiments -name pod-memory-hog
    command:
      - /bin/bash
    env:
      - name: MEMORY_CONSUMPTION
        value: '500'
      - name: TOTAL_CHAOS_DURATION
        value: '60'
    labels:
      experiment: pod-memory-hog
```
Key fields:
| Field | Purpose | Description |
|---|---|---|
| permissions | RBAC rules | Kubernetes permissions the experiment needs |
| image | Container image | Docker image that runs the chaos logic |
| env variables | Configuration | Parameters to configure experiment behavior |
| scope | Access level | Cluster or Namespace level permissions |
ChaosEngine: The Experiment Executor
ChaosEngine links experiments to target workloads:
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-memory-chaos
spec:
  engineState: 'active'
  appinfo:
    appns: 'default'
    applabel: 'app=nginx'
    appkind: 'deployment'
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-memory-hog
      spec:
        components:
          env:
            - name: MEMORY_CONSUMPTION
              value: '1000'
            - name: PODS_AFFECTED_PERC
              value: '50'
        probe:
          - name: "nginx-health-check"
            type: "httpProbe"
            mode: "Continuous"
            runProperties:
              probeTimeout: 10s
              interval: 5s
              retry: 3
            httpProbe/inputs:
              url: "http://nginx-service:80/health"
              method:
                get:
                  criteria: "=="
                  responseCode: "200"
```
Important concepts:
| Concept | Purpose | Details |
|---|---|---|
| appinfo | Target selection | Selects target workload using labels and namespace |
| chaosServiceAccount | Permissions | Service account with required RBAC permissions |
| probe | Health monitoring | Continuous health checks during experiments |
| components.env | Parameter override | Override default experiment parameters |
ChaosResult: The Experiment Outcome
ChaosResult captures experiment execution data:
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosResult
metadata:
  name: nginx-memory-chaos-pod-memory-hog
# Populated by the chaos operator; results live under .status
status:
  experimentStatus:
    phase: "Completed"
    verdict: "Pass"
    probeSuccessPercentage: "100"
  history:
    passedRuns: 1
    failedRuns: 0
    stoppedRuns: 0
```
Experiment Categories for SRE Teams
Resource Chaos
Test how apps behave when compute resources are constrained:
CPU Hog: Consume CPU cycles to simulate high load
```yaml
- name: TOTAL_CHAOS_DURATION
  value: '120'
- name: CPU_CORES
  value: '2'
- name: PODS_AFFECTED_PERC
  value: '25'
```
Memory Hog: Fill up pod memory to test OOM handling
```yaml
- name: MEMORY_CONSUMPTION
  value: '500' # MB
- name: PODS_AFFECTED_PERC
  value: '50'
```
Network Chaos
Simulate network issues that cause real outages:
Network Loss: Drop packets to test retry logic
```yaml
- name: NETWORK_PACKET_LOSS_PERCENTAGE
  value: '5'
- name: DESTINATION_IPS
  value: 'service-b.namespace.svc.cluster.local'
```
Network Latency: Add delays to test timeout handling
```yaml
- name: NETWORK_LATENCY
  value: '2000' # milliseconds
- name: JITTER
  value: '200'
```
Network Partition: Block traffic between services
```yaml
- name: DESTINATION_IPS
  value: 'database.production.svc.cluster.local'
- name: NETWORK_PACKET_LOSS_PERCENTAGE
  value: '100'
```
Pod Chaos
Test application resilience patterns:
Pod Delete: Verify graceful shutdown and startup
```yaml
- name: FORCE
  value: 'false' # Graceful termination
- name: CHAOS_INTERVAL
  value: '30' # Seconds between deletions
```
Container Kill: Test signal handling
```yaml
- name: TARGET_CONTAINER
  value: 'app-container'
- name: SIGNAL
  value: 'SIGKILL'
```
Node Chaos
Test cluster-level resilience:
Node Drain: Simulate node maintenance
```yaml
- name: TARGET_NODE
  value: 'worker-node-1'
- name: FORCE
  value: 'false'
```
Disk Fill: Test disk space monitoring
```yaml
- name: FILL_PERCENTAGE
  value: '80'
- name: EPHEMERAL_STORAGE_MEBIBYTES
  value: '1000'
```
Advanced SRE Use Cases
Game Day Scenarios
Create complex failure scenarios that mirror real incidents:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: black-friday-gameday
spec:
  entrypoint: gameday-scenario
  templates:
    - name: gameday-scenario
      steps:
        - - name: baseline-check
            template: health-probe
        - - name: traffic-spike
            template: load-generator
          - name: database-latency
            template: network-chaos
        - - name: pod-failures
            template: pod-delete-chaos
        - - name: recovery-check
            template: health-probe
```
Progressive Chaos
Start small and increase blast radius:
```yaml
# Week 1: Single pod
- name: PODS_AFFECTED_PERC
  value: '10'

# Week 2: Multiple pods
- name: PODS_AFFECTED_PERC
  value: '25'

# Week 3: Majority of pods
- name: PODS_AFFECTED_PERC
  value: '60'
```
Multi-Service Dependencies
Test cascade failures across service boundaries:
```yaml
spec:
  experiments:
    - name: service-a-pod-delete
      spec:
        probe:
          - name: service-b-health
            type: httpProbe
          - name: service-c-health
            type: httpProbe
```
Observability Integration
Prometheus Metrics
LitmusChaos exports key metrics:
```
# Experiment results
litmuschaos_experiment_passed_total
litmuschaos_experiment_failed_total
litmuschaos_experiment_awaited_total

# Experiment duration
litmuschaos_experiment_duration_seconds

# Probe success rate
litmuschaos_probe_success_percentage
```
Query experiment success rate:
```
rate(litmuschaos_experiment_passed_total[5m]) /
rate(litmuschaos_experiment_total[5m]) * 100
```
Grafana Dashboards
Essential panels for SRE teams:
| Panel | Metric | Purpose |
|---|---|---|
| Success Rate | Experiment pass/fail ratio | Track chaos experiment reliability |
| MTTR | Recovery time during chaos | Measure system recovery speed |
| Availability | Service uptime percentage | Monitor service health during tests |
| Resource Usage | CPU/Memory utilization | Track resource impact of chaos |
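As a starting point, here are hedged PromQL sketches for the Success Rate and Availability panels, reusing the metric names listed earlier (verify them against what your chaos exporter actually exposes):

```
# Success Rate: passed vs. total experiments over the last day
sum(increase(litmuschaos_experiment_passed_total[24h]))
  / (sum(increase(litmuschaos_experiment_passed_total[24h]))
   + sum(increase(litmuschaos_experiment_failed_total[24h]))) * 100

# Availability proxy: average probe success percentage during chaos windows
avg(litmuschaos_probe_success_percentage)
```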
Alert Rules
Monitor chaos experiment health:
```yaml
groups:
  - name: chaos-engineering
    rules:
      - alert: ChaosExperimentFailed
        expr: litmuschaos_experiment_failed_total > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Chaos experiment failed"

      - alert: ChaosExperimentStuck
        expr: litmuschaos_experiment_awaited_total > 300
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Chaos experiment stuck for >5 minutes"
```
Production Deployment Patterns
Namespace Isolation
Deploy chaos operator per namespace for blast radius control:
```bash
# Install operator in staging namespace
helm install litmus-staging litmuschaos/litmus \
  --namespace=staging-chaos \
  --set chaos.enabled=true \
  --set portal.enabled=false

# Install operator in production namespace
helm install litmus-prod litmuschaos/litmus \
  --namespace=production-chaos \
  --set chaos.enabled=true \
  --set portal.enabled=false
```
RBAC Configuration
Limit chaos scope with proper permissions:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: chaos-engineer
rules:
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["delete", "get", "list"]
    # Note: RBAC resourceNames require exact names (no wildcards) and pod names are
    # generated, so scope pod access by namespace here and by labels in the ChaosEngine
```
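A Role grants nothing until it is bound to a subject; a minimal binding sketch, reusing the litmus-admin service account name from the earlier ChaosEngine examples:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-engineer
  namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: chaos-engineer
subjects:
  - kind: ServiceAccount
    name: litmus-admin
    namespace: production
```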
Multi-Cluster Setup
Central control plane with distributed execution:
```yaml
# Control plane cluster
chaos-center:
  enabled: true
  server:
    service:
      type: LoadBalancer

# Execution plane clusters
chaos-agent:
  enabled: true
  controlPlane:
    endpoint: "https://chaos-center.example.com"
    accessKey: "agent-access-key"
```
CI/CD Integration
GitOps Workflow
Store experiments in Git for version control:
```
chaos-experiments/
├── staging/
│   ├── pod-delete.yaml
│   ├── memory-hog.yaml
│   └── network-latency.yaml
├── production/
│   ├── pod-delete.yaml
│   └── network-partition.yaml
└── workflows/
    ├── weekly-gameday.yaml
    └── release-validation.yaml
```
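If Argo CD manages this repository, a minimal Application sketch can sync each environment folder to its cluster (the repo URL, application name, and namespaces below are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: chaos-experiments-staging   # placeholder name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/chaos-experiments.git   # placeholder repo
    targetRevision: main
    path: staging
  destination:
    server: https://kubernetes.default.svc
    namespace: staging-chaos
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```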
GitHub Actions Integration
Run chaos tests in CI pipeline:
```yaml
name: Chaos Testing

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 10 * * 1' # Weekly on Monday

jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup kubectl
        uses: azure/setup-kubectl@v3

      - name: Run staging chaos tests
        run: |
          kubectl apply -f chaos-experiments/staging/
          kubectl wait --for=condition=complete \
            chaosresult/nginx-chaos-pod-delete \
            --timeout=300s

      - name: Check experiment results
        run: |
          RESULT=$(kubectl get chaosresult nginx-chaos-pod-delete \
            -o jsonpath='{.status.experimentStatus.verdict}')
          if [ "$RESULT" != "Pass" ]; then
            echo "Chaos experiment failed"
            exit 1
          fi
```
Deployment Gates
Block deployments if chaos tests fail:
```yaml
# Azure DevOps pipeline
- stage: ChaosValidation
  dependsOn: Deployment
  jobs:
    - job: RunChaosTests
      steps:
        - task: Kubernetes@1
          inputs:
            command: apply
            arguments: -f $(System.DefaultWorkingDirectory)/chaos/

        - task: Kubernetes@1
          inputs:
            command: wait
            arguments: --for=condition=complete chaosresult/app-chaos-result --timeout=300s

        - powershell: |
            $result = kubectl get chaosresult app-chaos-result -o jsonpath='{.status.experimentStatus.verdict}'
            if ($result -ne "Pass") {
              Write-Host "##vso[task.logissue type=error]Chaos test failed: $result"
              exit 1
            }
```
Security Considerations
Least Privilege Access
Grant minimal permissions for chaos operations:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-runner
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: chaos-runner
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["delete", "get", "list"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list"]
```
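The ClusterRole only takes effect once it is bound to the chaos-runner service account; a minimal binding sketch:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: chaos-runner
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: chaos-runner
subjects:
  - kind: ServiceAccount
    name: chaos-runner
    namespace: default
```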
Network Policies
Restrict chaos agent network access:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: chaos-agent-policy
spec:
  podSelector:
    matchLabels:
      app: chaos-agent
  policyTypes:
    - Ingress
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: kube-system
      ports:
        - protocol: TCP
          port: 443 # Kubernetes API
    - to:
        - podSelector:
            matchLabels:
              app: prometheus
      ports:
        - protocol: TCP
          port: 9090 # Metrics export
```
Audit Logging
Track all chaos activities:
```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: Metadata
    namespaces: ["default", "production"]
    resources:
      - group: "litmuschaos.io"
        resources: ["chaosengines", "chaosexperiments"]
    verbs: ["create", "update", "patch", "delete"]
```
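An audit Policy is only evaluated when the API server is started with audit flags; on self-managed clusters that typically looks like the following (paths are illustrative, and managed services such as AKS or EKS expose audit logs through their own diagnostics settings instead):

```bash
# kube-apiserver flags required for the policy above to apply
--audit-policy-file=/etc/kubernetes/audit-policy.yaml
--audit-log-path=/var/log/kubernetes/audit.log
--audit-log-maxage=30
```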
Troubleshooting Common Issues
Experiment Stuck in Running State
Check experiment pod logs:
```bash
kubectl logs -l experiment=pod-delete -n litmus
```
Common causes:
| Issue | Cause | Solution |
|---|---|---|
| Permission denied | Insufficient RBAC | Check service account permissions |
| No targets found | Wrong label selectors | Verify app labels and selectors |
| Connection timeout | Network issues | Check cluster connectivity |
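For the permission and selector cases, you can check directly what the chaos service account is allowed to do and whether the selector matches anything (service account name, namespace, and label are carried over from the earlier examples):

```bash
# Can the chaos service account delete pods in the target namespace?
kubectl auth can-i delete pods \
  --as=system:serviceaccount:default:litmus-admin \
  -n default

# Does the applabel selector actually match any pods?
kubectl get pods -n default -l app=nginx
```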
Probe Failures
Debug probe configuration:
```bash
kubectl describe chaosresult experiment-name
```
Check probe endpoints:
```bash
kubectl run debug --image=curlimages/curl --rm -it -- \
  curl -v http://service-name:80/health
```
Resource Cleanup
Remove stuck experiments:
```bash
kubectl patch chaosengine experiment-name \
  --type merge -p '{"spec":{"engineState":"stop"}}'

kubectl delete chaosengine experiment-name --force --grace-period=0
```
Best Practices for SRE Teams
| Practice | Actions | Benefits |
|---|---|---|
| Start Small | Begin with non-production, single pod failures, low-traffic periods | Minimize risk, build confidence |
| Automate Everything | Version control experiments, GitOps deployment, automated analysis | Consistent execution, repeatable results |
| Document Learnings | Record outcomes, document weaknesses, share knowledge | Team learning, improved runbooks |
| Measure Impact | Track MTTR, monitor availability, quantify improvements | Data-driven reliability gains |
Getting Started with LitmusChaos
Installation Options
Helm Installation (Recommended for production):
```bash
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update

# Install with custom values
helm install litmus litmuschaos/litmus \
  --namespace=litmus \
  --create-namespace \
  --set portal.server.service.type=LoadBalancer \
  --set mongodb.service.port=27017
```
Kubectl Installation (Quick start):
```bash
kubectl apply -f https://litmuschaos.github.io/litmus/2.14.0/litmus-2.14.0.yaml
```
Verify Installation:
```bash
kubectl get pods -n litmus
kubectl get crds | grep chaos
```
First Experiment
Create a simple pod deletion experiment:
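The engine below assumes a target nginx deployment labeled app=nginx and a litmus-admin service account with chaos RBAC already exist in the default namespace; if you need a throwaway target, a quick sketch:

```bash
# Create a disposable nginx target; kubectl applies the label app=nginx automatically
kubectl create deployment nginx --image=nginx --replicas=3 -n default
```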
```bash
# Install experiment definition
kubectl apply -f https://hub.litmuschaos.io/api/chaos/2.14.0?file=charts/generic/pod-delete/experiment.yaml

# Create chaos engine
cat <<EOF | kubectl apply -f -
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  appinfo:
    appns: 'default'
    applabel: 'app=nginx'
    appkind: 'deployment'
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'false'
EOF
```
Check experiment results:
```bash
kubectl get chaosresult
kubectl describe chaosresult nginx-chaos-pod-delete
```
Hands-on Demo: Litmus on Azure AKS
I built a complete demo to get you started fast:
What’s included:
| Component | Description | Value for SRE Teams |
|---|---|---|
| AKS Setup | Complete cluster configuration | Production-ready environment |
| Installation Guide | Step-by-step Litmus deployment | Quick start implementation |
| SRE Experiments | Pod failures, network chaos, node disruptions | Real-world test scenarios |
| Azure Monitor | Observability integration | Enterprise monitoring |
| Incident Scenarios | Based on actual production incidents | Proven failure patterns |
| Templates | Infrastructure and experiment templates | Accelerated deployment |
Built for SRE teams who want to start chaos engineering today.
LitmusChaos transforms how SRE teams approach reliability testing. By systematically introducing failures, teams build confidence in their systems and discover weaknesses before customers do. Start small, automate everything, and measure the impact on your reliability metrics.