Litmus Chaos - Enterprise-Grade Chaos Engineering for Kubernetes
Discover Litmus, a CNCF incubating project that provides a complete chaos engineering platform with GitOps integration, an extensive experiment library, and enterprise features for building resilient cloud-native applications.
Chaos engineering validates system resilience by introducing controlled failures. LitmusChaos makes this practice accessible for Kubernetes environments, giving SRE teams the tools to test failure scenarios before they happen in production.
Why Chaos Engineering Matters for SREs#
Real failures don’t wait for convenient times. Networks partition during peak traffic. Nodes fail during deployments. Memory leaks surface under load. Traditional testing catches functional bugs but misses reliability gaps.
LitmusChaos helps answer critical SRE questions:
- Does your app recover when pods get terminated?
- How does network latency affect user experience?
- Can your system handle node failures gracefully?
- Do circuit breakers work under actual load?
Architecture Deep Dive#
LitmusChaos follows a control plane and execution plane architecture:
Control Plane (Chaos Center)#
The control plane manages experiment lifecycle:
- Portal: Web UI for experiment design and scheduling
- Authentication: RBAC integration with Kubernetes
- Workflow Engine: Orchestrates complex experiment sequences
- Database: Persists experiment results and metrics (MongoDB)
Execution Plane#
The execution plane runs experiments:
- Chaos Agent: Deployed per cluster, executes experiments
- Chaos Operator: Manages Custom Resource lifecycle
- Experiment Pods: Run actual chaos injection logic
- Chaos Exporter: Pushes metrics to monitoring systems
This separation means you can manage experiments centrally while running them across multiple clusters.
Core Components Explained#
ChaosExperiment: The Experiment Template#
ChaosExperiment defines what chaos to inject. Think of it as a reusable template:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-memory-hog
  labels:
    instance: litmus
spec:
  definition:
    scope: Namespaced
    permissions:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["create", "delete", "get", "list", "patch", "update"]
    image: "litmuschaos/go-runner:latest"
    args:
      - -c
      - ./experiments -name pod-memory-hog
    command:
      - /bin/bash
    env:
      - name: MEMORY_CONSUMPTION
        value: '500'
      - name: TOTAL_CHAOS_DURATION
        value: '60'
    labels:
      experiment: pod-memory-hog
Key fields:
| Field | Purpose | Description |
|---|---|---|
| permissions | RBAC rules | Kubernetes permissions the experiment needs |
| image | Container image | Docker image that runs the chaos logic |
| env variables | Configuration | Parameters to configure experiment behavior |
| scope | Access level | Cluster or Namespace level permissions |
ChaosEngine: The Experiment Executor#
ChaosEngine links experiments to target workloads:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-memory-chaos
spec:
  engineState: 'active'
  appinfo:
    appns: 'default'
    applabel: 'app=nginx'
    appkind: 'deployment'
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-memory-hog
      spec:
        components:
          env:
            - name: MEMORY_CONSUMPTION
              value: '1000'
            - name: PODS_AFFECTED_PERC
              value: '50'
        probe:
          - name: "nginx-health-check"
            type: "httpProbe"
            mode: "Continuous"
            runProperties:
              probeTimeout: 10s
              interval: 5s
              retry: 3
            httpProbe/inputs:
              url: "http://nginx-service:80/health"
              method:
                get:
                  criteria: "=="
                  responseCode: "200"
Important concepts:
| Concept | Purpose | Details |
|---|---|---|
| appinfo | Target selection | Selects target workload using labels and namespace |
| chaosServiceAccount | Permissions | Service account with required RBAC permissions |
| probe | Health monitoring | Continuous health checks during experiments |
| components.env | Parameter override | Override default experiment parameters |
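The httpProbe above is only one probe type; a cmdProbe runs an arbitrary command and compares its output, which is handy when health is not exposed over HTTP. A sketch following the Litmus probe schema (the deployment name and replica threshold are placeholders, and the probe's pod needs kubectl plus suitable RBAC to run this command):

```yaml
probe:
  - name: "ready-replica-check"
    type: "cmdProbe"
    mode: "Edge"              # evaluate before and after chaos injection
    cmdProbe/inputs:
      # Placeholder: check that the nginx deployment keeps >= 2 ready replicas
      command: "kubectl get deploy nginx -o jsonpath='{.status.readyReplicas}'"
      comparator:
        type: "int"
        criteria: ">="
        value: "2"
    runProperties:
      probeTimeout: 10s
      interval: 5s
      retry: 2
```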
ChaosResult: The Experiment Outcome#
ChaosResult captures experiment execution data:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosResult
metadata:
  name: nginx-memory-chaos-pod-memory-hog
status:
  experimentStatus:
    phase: "Completed"
    verdict: "Pass"
    probeSuccessPercentage: "100"
  history:
    passedRuns: 1
    failedRuns: 0
    stoppedRuns: 0
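The verdict field is what automation keys off. A minimal POSIX-shell helper (hypothetical; the jsonpath in the usage comment assumes the verdict is stored under the ChaosResult status, as in recent Litmus releases) turns a verdict into a pass/fail exit code:

```shell
# chaos_gate: succeed only when the supplied verdict is "Pass".
# Usage with a live cluster:
#   chaos_gate "$(kubectl get chaosresult <name> \
#     -o jsonpath='{.status.experimentStatus.verdict}')"
chaos_gate() {
  if [ "$1" = "Pass" ]; then
    echo "chaos gate: pass"
  else
    echo "chaos gate: fail (verdict=$1)"
    return 1
  fi
}
```

Because the verdict is passed in as an argument, the helper can be exercised locally without a cluster.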
Experiment Categories for SRE Teams#
Resource Chaos#
Test how apps behave when compute resources are constrained:
CPU Hog: Consume CPU cycles to simulate high load
- name: TOTAL_CHAOS_DURATION
  value: '120'
- name: CPU_CORES
  value: '2'
- name: PODS_AFFECTED_PERC
  value: '25'
Memory Hog: Fill up pod memory to test OOM handling
- name: MEMORY_CONSUMPTION
  value: '500' # MB
- name: PODS_AFFECTED_PERC
  value: '50'
Network Chaos#
Simulate network issues that cause real outages:
Network Loss: Drop packets to test retry logic
- name: NETWORK_PACKET_LOSS_PERCENTAGE
  value: '5'
- name: DESTINATION_IPS
  value: 'service-b.namespace.svc.cluster.local'
Network Latency: Add delays to test timeout handling
- name: NETWORK_LATENCY
  value: '2000' # milliseconds
- name: JITTER
  value: '200'
Network Partition: Block traffic between services
- name: DESTINATION_IPS
  value: 'database.production.svc.cluster.local'
- name: NETWORK_PACKET_LOSS_PERCENTAGE
  value: '100'
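These env fragments plug into a ChaosEngine exactly like the memory example. A sketch wiring the latency parameters to a hypothetical checkout deployment (assumes the pod-network-latency experiment from the generic chart is installed; the app label is a placeholder):

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: checkout-latency-chaos
spec:
  engineState: 'active'
  appinfo:
    appns: 'default'
    applabel: 'app=checkout'   # placeholder target workload
    appkind: 'deployment'
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: NETWORK_LATENCY
              value: '2000'
            - name: JITTER
              value: '200'
            - name: TOTAL_CHAOS_DURATION
              value: '120'
```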
Pod Chaos#
Test application resilience patterns:
Pod Delete: Verify graceful shutdown and startup
- name: FORCE
  value: 'false' # Graceful termination
- name: CHAOS_INTERVAL
  value: '30' # Seconds between deletions
Container Kill: Test signal handling
- name: TARGET_CONTAINER
  value: 'app-container'
- name: SIGNAL
  value: 'SIGKILL'
Node Chaos#
Test cluster-level resilience:
Node Drain: Simulate node maintenance
- name: TARGET_NODE
  value: 'worker-node-1'
- name: FORCE
  value: 'false'
Disk Fill: Test disk space monitoring
- name: FILL_PERCENTAGE
  value: '80'
- name: EPHEMERAL_STORAGE_MEBIBYTES
  value: '1000'
Advanced SRE Use Cases#
Game Day Scenarios#
Create complex failure scenarios that mirror real incidents:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: black-friday-gameday
spec:
  entrypoint: gameday-scenario
  templates:
    - name: gameday-scenario
      steps:
        - - name: baseline-check
            template: health-probe
        - - name: traffic-spike       # runs in parallel with database-latency
            template: load-generator
          - name: database-latency
            template: network-chaos
        - - name: pod-failures
            template: pod-delete-chaos
        - - name: recovery-check
            template: health-probe
Progressive Chaos#
Start small and increase blast radius:
# Week 1: Single pod
- name: PODS_AFFECTED_PERC
  value: '10'

# Week 2: Multiple pods
- name: PODS_AFFECTED_PERC
  value: '25'

# Week 3: Majority of pods
- name: PODS_AFFECTED_PERC
  value: '60'
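The same ramp can be scripted by patching one ChaosEngine in place. A sketch (hypothetical helper reusing the nginx-memory-chaos engine from earlier; by default it only echoes the kubectl commands, pass "run" to execute them):

```shell
# ramp_chaos: step PODS_AFFECTED_PERC up across runs by patching the
# ChaosEngine. Dry run (echo only) unless called as: ramp_chaos run
ramp_chaos() {
  run="echo"
  [ "${1:-}" = "run" ] && run=""
  for pct in 10 25 60; do
    $run kubectl patch chaosengine nginx-memory-chaos --type merge \
      -p "{\"spec\":{\"experiments\":[{\"name\":\"pod-memory-hog\",\"spec\":{\"components\":{\"env\":[{\"name\":\"PODS_AFFECTED_PERC\",\"value\":\"$pct\"}]}}}]}}"
  done
}
```

In practice you would space the steps out over days or weeks rather than looping; the loop just shows the patch shape.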
Multi-Service Dependencies#
Test cascade failures across service boundaries:
spec:
  experiments:
    - name: service-a-pod-delete
      spec:
        probe:
          - name: service-b-health
            type: httpProbe
          - name: service-c-health
            type: httpProbe
Observability Integration#
Prometheus Metrics#
LitmusChaos exports key metrics:
# Experiment results
litmuschaos_experiment_passed_total
litmuschaos_experiment_failed_total
litmuschaos_experiment_awaited_total
# Experiment duration
litmuschaos_experiment_duration_seconds
# Probe success rate
litmuschaos_probe_success_percentage
Query experiment success rate:
rate(litmuschaos_experiment_passed_total[5m]) /
(rate(litmuschaos_experiment_passed_total[5m]) +
 rate(litmuschaos_experiment_failed_total[5m])) * 100
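If dashboards query this ratio often, a Prometheus recording rule precomputes it. A sketch using the metric names listed above (the rule name follows the usual `level:metric:operations` convention and is only a suggestion):

```yaml
groups:
  - name: chaos-recording
    rules:
      - record: chaos:experiment_success_ratio:rate5m
        expr: |
          rate(litmuschaos_experiment_passed_total[5m])
          /
          (rate(litmuschaos_experiment_passed_total[5m])
           + rate(litmuschaos_experiment_failed_total[5m]))
```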
Grafana Dashboards#
Essential panels for SRE teams:
| Panel | Metric | Purpose |
|---|---|---|
| Success Rate | Experiment pass/fail ratio | Track chaos experiment reliability |
| MTTR | Recovery time during chaos | Measure system recovery speed |
| Availability | Service uptime percentage | Monitor service health during tests |
| Resource Usage | CPU/Memory utilization | Track resource impact of chaos |
Alert Rules#
Monitor chaos experiment health:
groups:
  - name: chaos-engineering
    rules:
      - alert: ChaosExperimentFailed
        expr: litmuschaos_experiment_failed_total > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Chaos experiment {{ $labels.experiment }} failed"
      - alert: ChaosExperimentStuck
        expr: litmuschaos_experiment_awaited_total > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Chaos experiment awaiting completion for >5 minutes"
Production Deployment Patterns#
Namespace Isolation#
Deploy chaos operator per namespace for blast radius control:
# Install operator in staging namespace
helm install litmus-staging litmuschaos/litmus \
--namespace=staging-chaos \
--set chaos.enabled=true \
--set portal.enabled=false
# Install operator in production namespace
helm install litmus-prod litmuschaos/litmus \
--namespace=production-chaos \
--set chaos.enabled=true \
--set portal.enabled=false
RBAC Configuration#
Limit chaos scope with proper permissions:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: chaos-engineer
rules:
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["delete", "get", "list"]
    # Note: resourceNames accepts only exact names; Kubernetes RBAC has no
    # wildcards. Scope chaos with the namespace and the engine's label selector.
Multi-Cluster Setup#
Central control plane with distributed execution:
# Control plane cluster
chaos-center:
  enabled: true
  server:
    service:
      type: LoadBalancer

# Execution plane clusters
chaos-agent:
  enabled: true
  controlPlane:
    endpoint: "https://chaos-center.example.com"
    accessKey: "agent-access-key"
CI/CD Integration#
GitOps Workflow#
Store experiments in Git for version control:
chaos-experiments/
├── staging/
│   ├── pod-delete.yaml
│   ├── memory-hog.yaml
│   └── network-latency.yaml
├── production/
│   ├── pod-delete.yaml
│   └── network-partition.yaml
└── workflows/
    ├── weekly-gameday.yaml
    └── release-validation.yaml
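Any GitOps tool can sync this layout; with Argo CD, for example, the staging folder maps to one Application. A sketch (the repo URL and project are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: staging-chaos
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/chaos-experiments.git  # placeholder
    targetRevision: main
    path: staging
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:
      prune: true   # remove engines deleted from Git
```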
GitHub Actions Integration#
Run chaos tests in CI pipeline:
name: Chaos Testing
on:
  push:
    branches: [main]
  schedule:
    - cron: '0 10 * * 1'  # Weekly on Monday
jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup kubectl
        uses: azure/setup-kubectl@v3
      - name: Run staging chaos tests
        run: |
          kubectl apply -f chaos-experiments/staging/
          # ChaosResult exposes no 'complete' condition, so wait on the
          # verdict field instead (requires kubectl >= 1.23)
          kubectl wait chaosresult/nginx-chaos-pod-delete \
            --for=jsonpath='{.status.experimentStatus.verdict}'=Pass \
            --timeout=300s
      - name: Check experiment results
        run: |
          RESULT=$(kubectl get chaosresult nginx-chaos-pod-delete \
            -o jsonpath='{.status.experimentStatus.verdict}')
          if [ "$RESULT" != "Pass" ]; then
            echo "Chaos experiment failed"
            exit 1
          fi
Deployment Gates#
Block deployments if chaos tests fail:
# Azure DevOps pipeline
- stage: ChaosValidation
  dependsOn: Deployment
  jobs:
    - job: RunChaosTests
      steps:
        - task: Kubernetes@1
          inputs:
            command: apply
            arguments: -f $(System.DefaultWorkingDirectory)/chaos/
        - task: Kubernetes@1
          inputs:
            command: wait
            # ChaosResult has no 'complete' condition; wait on the verdict field
            arguments: --for=jsonpath='{.status.experimentStatus.verdict}'=Pass chaosresult/app-chaos-result --timeout=300s
        - powershell: |
            $result = kubectl get chaosresult app-chaos-result -o jsonpath='{.status.experimentStatus.verdict}'
            if ($result -ne "Pass") {
              Write-Host "##vso[task.logissue type=error]Chaos test failed: $result"
              exit 1
            }
Security Considerations#
Least Privilege Access#
Grant minimal permissions for chaos operations:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-runner
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: chaos-runner
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["delete", "get", "list"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list"]
Network Policies#
Restrict chaos agent network access:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: chaos-agent-policy
spec:
  podSelector:
    matchLabels:
      app: chaos-agent
  policyTypes:
    - Ingress
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: kube-system
      ports:
        - protocol: TCP
          port: 443  # Kubernetes API
    - to:
        - podSelector:
            matchLabels:
              app: prometheus
      ports:
        - protocol: TCP
          port: 9090  # Metrics export
Audit Logging#
Track all chaos activities:
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: Metadata
    namespaces: ["default", "production"]
    resources:
      - group: "litmuschaos.io"
        resources: ["chaosengines", "chaosexperiments"]
    verbs: ["create", "update", "patch", "delete"]
Troubleshooting Common Issues#
Experiment Stuck in Running State#
Check experiment pod logs:
kubectl logs -l experiment=pod-delete -n litmus
Common causes:
| Issue | Cause | Solution |
|---|---|---|
| Permission denied | Insufficient RBAC | Check service account permissions |
| No targets found | Wrong label selectors | Verify app labels and selectors |
| Connection timeout | Network issues | Check cluster connectivity |
Probe Failures#
Debug probe configuration:
kubectl describe chaosresult experiment-name
Check probe endpoints:
kubectl run debug --image=curlimages/curl --rm -it -- \
curl -v http://service-name:80/health
Resource Cleanup#
Remove stuck experiments:
kubectl patch chaosengine experiment-name \
--type merge -p '{"spec":{"engineState":"stop"}}'
kubectl delete chaosengine experiment-name --force --grace-period=0
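For repeated cleanups, the two commands can be wrapped in a helper. A sketch (hypothetical; by default it only echoes the kubectl commands so you can review them, pass "run" as the second argument to execute):

```shell
# cleanup_chaos <engine> [run]: stop a ChaosEngine, then delete it.
# Without "run", the kubectl commands are echoed instead of executed.
cleanup_chaos() {
  engine=$1
  run="echo"
  [ "${2:-}" = "run" ] && run=""
  $run kubectl patch chaosengine "$engine" --type merge \
    -p '{"spec":{"engineState":"stop"}}'
  $run kubectl delete chaosengine "$engine" --wait=true
}
```

Stopping the engine first lets the chaos runner revert its injection before the resource is removed, which is gentler than a forced delete.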
Best Practices for SRE Teams#
| Practice | Actions | Benefits |
|---|---|---|
| Start Small | Begin with non-production, single pod failures, low-traffic periods | Minimize risk, build confidence |
| Automate Everything | Version control experiments, GitOps deployment, automated analysis | Consistent execution, repeatable results |
| Document Learnings | Record outcomes, document weaknesses, share knowledge | Team learning, improved runbooks |
| Measure Impact | Track MTTR, monitor availability, quantify improvements | Data-driven reliability gains |
Getting Started with LitmusChaos#
Installation Options#
Helm Installation (Recommended for production):
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update
# Install with custom values
helm install litmus litmuschaos/litmus \
--namespace=litmus \
--create-namespace \
--set portal.server.service.type=LoadBalancer \
--set mongodb.service.port=27017
Kubectl Installation (Quick start):
kubectl apply -f https://litmuschaos.github.io/litmus/2.14.0/litmus-2.14.0.yaml
Verify Installation:
kubectl get pods -n litmus
kubectl get crds | grep chaos
First Experiment#
Create a simple pod deletion experiment:
# Install experiment definition
kubectl apply -f https://hub.litmuschaos.io/api/chaos/2.14.0?file=charts/generic/pod-delete/experiment.yaml
# Create chaos engine
cat <<EOF | kubectl apply -f -
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  appinfo:
    appns: 'default'
    applabel: 'app=nginx'
    appkind: 'deployment'
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'false'
EOF
Check experiment results:
kubectl get chaosresult
kubectl describe chaosresult nginx-chaos-pod-delete
Hands-on Demo: Litmus on Azure AKS#
I built a complete demo to get you started fast:
What’s included:
| Component | Description | Value for SRE Teams |
|---|---|---|
| AKS Setup | Complete cluster configuration | Production-ready environment |
| Installation Guide | Step-by-step Litmus deployment | Quick start implementation |
| SRE Experiments | Pod failures, network chaos, node disruptions | Real-world test scenarios |
| Azure Monitor | Observability integration | Enterprise monitoring |
| Incident Scenarios | Based on actual production incidents | Proven failure patterns |
| Templates | Infrastructure and experiment templates | Accelerated deployment |
Built for SRE teams who want to start chaos engineering today.
LitmusChaos transforms how SRE teams approach reliability testing. By systematically introducing failures, teams build confidence in their systems and discover weaknesses before customers do. Start small, automate everything, and measure the impact on your reliability metrics.

