Litmus Chaos - Enterprise-Grade Chaos Engineering for Kubernetes

Chaos engineering validates system resilience by introducing controlled failures. LitmusChaos makes this practice accessible for Kubernetes environments, giving SRE teams the tools to test failure scenarios before they happen in production.

Why Chaos Engineering Matters for SREs

Real failures don’t wait for convenient times. Networks partition during peak traffic. Nodes fail during deployments. Memory leaks surface under load. Traditional testing catches functional bugs but misses reliability gaps.

LitmusChaos helps answer critical SRE questions:

  • Does your app recover when pods get terminated?
  • How does network latency affect user experience?
  • Can your system handle node failures gracefully?
  • Do circuit breakers work under actual load?

Architecture Deep Dive

LitmusChaos follows a control plane and execution plane architecture:

Control Plane (Chaos Center)

The control plane manages experiment lifecycle:

  • Portal: Web UI for experiment design and scheduling
  • Authentication: RBAC integration with Kubernetes
  • Workflow Engine: Orchestrates complex experiment sequences
  • Database: Stores experiment results and metrics

Execution Plane

The execution plane runs experiments:

  • Chaos Agent: Deployed per cluster, executes experiments
  • Chaos Operator: Manages Custom Resource lifecycle
  • Experiment Pods: Run actual chaos injection logic
  • Chaos Exporter: Pushes metrics to monitoring systems

This separation means you can manage experiments centrally while running them across multiple clusters.

Core Components Explained

ChaosExperiment: The Experiment Template

ChaosExperiment defines what chaos to inject. Think of it as a reusable template:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-memory-hog
  labels:
    instance: litmus
spec:
  definition:
    scope: Namespaced
    permissions:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["create","delete","get","list","patch","update"]
    image: "litmuschaos/go-runner:latest"
    args:
    - -c
    - ./experiments -name pod-memory-hog
    command:
    - /bin/bash
    env:
    - name: MEMORY_CONSUMPTION
      value: '500'
    - name: TOTAL_CHAOS_DURATION
      value: '60'
    labels:
      experiment: pod-memory-hog

Key fields:

Field | Purpose | Description
----- | ------- | -----------
permissions | RBAC rules | Kubernetes permissions the experiment needs
image | Container image | Docker image that runs the chaos logic
env variables | Configuration | Parameters to configure experiment behavior
scope | Access level | Cluster or Namespace level permissions
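
Before wiring a template into an engine, it helps to confirm it is actually installed. A quick check, assuming experiment definitions were installed into the litmus namespace:

# List the experiment templates available in the cluster
kubectl get chaosexperiments -n litmus

# Inspect the tunables a template exposes
kubectl describe chaosexperiment pod-memory-hog -n litmus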

ChaosEngine: The Experiment Executor

ChaosEngine links experiments to target workloads:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-memory-chaos
spec:
  engineState: 'active'
  appinfo:
    appns: 'default'
    applabel: 'app=nginx'
    appkind: 'deployment'
  chaosServiceAccount: litmus-admin
  experiments:
  - name: pod-memory-hog
    spec:
      components:
        env:
        - name: MEMORY_CONSUMPTION
          value: '1000'
        - name: PODS_AFFECTED_PERC
          value: '50'
      probe:
      - name: "nginx-health-check"
        type: "httpProbe"
        mode: "Continuous"
        runProperties:
          probeTimeout: 10s
          interval: 5s
          retry: 3
        httpProbe/inputs:
          url: "http://nginx-service:80/health"
          method:
            get:
              criteria: "=="
              responseCode: "200"

Important concepts:

Concept | Purpose | Details
------- | ------- | -------
appinfo | Target selection | Selects target workload using labels and namespace
chaosServiceAccount | Permissions | Service account with required RBAC permissions
probe | Health monitoring | Continuous health checks during experiments
components.env | Parameter override | Override default experiment parameters
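
Applying the engine triggers the Chaos Operator to launch a runner pod, which in turn starts the experiment pod. A rough way to follow that progress, using the resource names from the example above:

# Apply the engine and watch the runner/experiment pods appear
kubectl apply -f nginx-memory-chaos.yaml
kubectl get pods -n default -w

# Overall status is reported on the engine resource itself
kubectl describe chaosengine nginx-memory-chaos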

ChaosResult: The Experiment Outcome

ChaosResult captures experiment execution data:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosResult
metadata:
  name: nginx-memory-chaos-pod-memory-hog
status:
  experimentStatus:
    phase: "Completed"
    verdict: "Pass"
    probeSuccessPercentage: "100"
  history:
    passedRuns: 1
    failedRuns: 0
    stoppedRuns: 0
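
In scripts you usually only need the verdict; assuming the status layout above, it can be pulled with a jsonpath query:

kubectl get chaosresult nginx-memory-chaos-pod-memory-hog \
  -o jsonpath='{.status.experimentStatus.verdict}'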

Experiment Categories for SRE Teams

Resource Chaos

Test how apps behave when compute resources are constrained:

CPU Hog: Consume CPU cycles to simulate high load

- name: TOTAL_CHAOS_DURATION
  value: '120'
- name: CPU_CORES  
  value: '2'
- name: PODS_AFFECTED_PERC
  value: '25'

Memory Hog: Fill up pod memory to test OOM handling

- name: MEMORY_CONSUMPTION
  value: '500'  # MB
- name: PODS_AFFECTED_PERC  
  value: '50'

Network Chaos

Simulate network issues that cause real outages:

Network Loss: Drop packets to test retry logic

- name: NETWORK_PACKET_LOSS_PERCENTAGE
  value: '5'
- name: DESTINATION_IPS
  value: 'service-b.namespace.svc.cluster.local'

Network Latency: Add delays to test timeout handling

- name: NETWORK_LATENCY
  value: '2000'  # milliseconds
- name: JITTER
  value: '200'

Network Partition: Block traffic between services

- name: DESTINATION_IPS
  value: 'database.production.svc.cluster.local'
- name: NETWORK_PACKET_LOSS_PERCENTAGE
  value: '100'

Pod Chaos

Test application resilience patterns:

Pod Delete: Verify graceful shutdown and startup

- name: FORCE
  value: 'false'  # Graceful termination
- name: CHAOS_INTERVAL
  value: '30'     # Seconds between deletions

Container Kill: Test signal handling

- name: TARGET_CONTAINER
  value: 'app-container'
- name: SIGNAL
  value: 'SIGKILL'

Node Chaos

Test cluster-level resilience:

Node Drain: Simulate node maintenance

- name: TARGET_NODE
  value: 'worker-node-1'
- name: FORCE
  value: 'false'

Disk Fill: Test disk space monitoring

- name: FILL_PERCENTAGE
  value: '80'
- name: EPHEMERAL_STORAGE_MEBIBYTES
  value: '1000'
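
Any of the env snippets above plugs into the components.env block of a ChaosEngine, exactly as in the nginx example earlier. A minimal sketch for the CPU hog, assuming the pod-cpu-hog experiment template is installed:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: app-cpu-chaos
spec:
  engineState: 'active'
  appinfo:
    appns: 'default'
    applabel: 'app=nginx'
    appkind: 'deployment'
  chaosServiceAccount: litmus-admin
  experiments:
  - name: pod-cpu-hog
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION
          value: '120'
        - name: CPU_CORES
          value: '2'
        - name: PODS_AFFECTED_PERC
          value: '25'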

Advanced SRE Use Cases

Game Day Scenarios

Create complex failure scenarios that mirror real incidents:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: black-friday-gameday
spec:
  entrypoint: gameday-scenario
  templates:
  - name: gameday-scenario
    steps:
    - - name: baseline-check
        template: health-probe
    - - name: traffic-spike
        template: load-generator
      - name: database-latency
        template: network-chaos
    - - name: pod-failures
        template: pod-delete-chaos
    - - name: recovery-check  
        template: health-probe

Progressive Chaos

Start small and increase blast radius:

# Week 1: Single pod
- name: PODS_AFFECTED_PERC
  value: '10'

# Week 2: Multiple pods  
- name: PODS_AFFECTED_PERC
  value: '25'

# Week 3: Majority of pods
- name: PODS_AFFECTED_PERC
  value: '60'

Multi-Service Dependencies

Test cascade failures across service boundaries:

spec:
  experiments:
  - name: service-a-pod-delete
    spec:
      probe:
      - name: service-b-health
        type: httpProbe
      - name: service-c-health  
        type: httpProbe

Observability Integration

Prometheus Metrics

LitmusChaos exports key metrics:

# Experiment results
litmuschaos_experiment_passed_total
litmuschaos_experiment_failed_total
litmuschaos_experiment_awaited_total

# Experiment duration
litmuschaos_experiment_duration_seconds

# Probe success rate
litmuschaos_probe_success_percentage

Query experiment success rate:

sum(rate(litmuschaos_experiment_passed_total[5m])) /
  (sum(rate(litmuschaos_experiment_passed_total[5m])) + sum(rate(litmuschaos_experiment_failed_total[5m]))) * 100

Grafana Dashboards

Essential panels for SRE teams:

Panel | Metric | Purpose
----- | ------ | -------
Success Rate | Experiment pass/fail ratio | Track chaos experiment reliability
MTTR | Recovery time during chaos | Measure system recovery speed
Availability | Service uptime percentage | Monitor service health during tests
Resource Usage | CPU/Memory utilization | Track resource impact of chaos
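
The availability and resource-usage panels can be sketched from the exported probe metric plus the standard cAdvisor metrics; the label selectors below are placeholders for your target workload:

# Availability proxy: probe success during experiments
avg(litmuschaos_probe_success_percentage)

# Resource usage of the target workload during chaos
sum(rate(container_cpu_usage_seconds_total{namespace="default", pod=~"nginx.*"}[5m]))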

Alert Rules

Monitor chaos experiment health:

groups:
- name: chaos-engineering
  rules:
  - alert: ChaosExperimentFailed
    expr: litmuschaos_experiment_failed_total > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "Chaos experiment  failed"
      
  - alert: ChaosExperimentStuck
    expr: litmuschaos_experiment_awaited_total > 300
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Chaos experiment stuck for >5 minutes"

Production Deployment Patterns

Namespace Isolation

Deploy the chaos operator per namespace to control blast radius:

# Install operator in staging namespace
helm install litmus-staging litmuschaos/litmus \
  --namespace=staging-chaos \
  --set chaos.enabled=true \
  --set portal.enabled=false

# Install operator in production namespace  
helm install litmus-prod litmuschaos/litmus \
  --namespace=production-chaos \
  --set chaos.enabled=true \
  --set portal.enabled=false

RBAC Configuration

Limit chaos scope with proper permissions:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: chaos-engineer
rules:
- apiGroups: ["litmuschaos.io"]
  resources: ["chaosengines", "chaosexperiments", "chaosresults"]
  verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["delete", "get", "list"]
  # Kubernetes RBAC does not support wildcards in resourceNames, so scope chaos
  # targets via the ChaosEngine appinfo labels and namespace instead

Multi-Cluster Setup

Central control plane with distributed execution:

# Control plane cluster
chaos-center:
  enabled: true
  server:
    service:
      type: LoadBalancer

# Execution plane clusters  
chaos-agent:
  enabled: true
  controlPlane:
    endpoint: "https://chaos-center.example.com"
    accessKey: "agent-access-key"

CI/CD Integration

GitOps Workflow

Store experiments in Git for version control:

chaos-experiments/
├── staging/
│   ├── pod-delete.yaml
│   ├── memory-hog.yaml
│   └── network-latency.yaml
├── production/
│   ├── pod-delete.yaml
│   └── network-partition.yaml
└── workflows/
    ├── weekly-gameday.yaml
    └── release-validation.yaml
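
If a GitOps controller such as Argo CD watches this repository, the staging experiments are applied automatically on every commit. A minimal Application sketch, with the repo URL and namespaces as placeholders:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: chaos-experiments-staging
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/chaos-experiments.git  # placeholder
    targetRevision: main
    path: staging
  destination:
    server: https://kubernetes.default.svc
    namespace: staging
  syncPolicy:
    automated:
      prune: true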

GitHub Actions Integration

Run chaos tests in CI pipeline:

name: Chaos Testing
on:
  push:
    branches: [main]
  schedule:
    - cron: '0 10 * * 1'  # Weekly on Monday

jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    
    - name: Setup kubectl
      uses: azure/setup-kubectl@v3
      
    - name: Run staging chaos tests
      run: |
        kubectl apply -f chaos-experiments/staging/
        kubectl wait --for=condition=complete \
          chaosresult/nginx-chaos-pod-delete \
          --timeout=300s
          
    - name: Check experiment results
      run: |
        RESULT=$(kubectl get chaosresult nginx-chaos-pod-delete \
          -o jsonpath='{.status.experimentStatus.verdict}')
        if [ "$RESULT" != "Pass" ]; then
          echo "Chaos experiment failed"
          exit 1
        fi

Deployment Gates

Block deployments if chaos tests fail:

# Azure DevOps pipeline
- stage: ChaosValidation
  dependsOn: Deployment
  jobs:
  - job: RunChaosTests
    steps:
    - task: Kubernetes@1
      inputs:
        command: apply
        arguments: -f $(System.DefaultWorkingDirectory)/chaos/
        
    - task: Kubernetes@1
      inputs:
        command: wait
        arguments: --for=condition=complete chaosresult/app-chaos-result --timeout=300s
        
    - powershell: |
        $result = kubectl get chaosresult app-chaos-result -o jsonpath='{.status.experimentStatus.verdict}'
        if ($result -ne "Pass") {
          Write-Host "##vso[task.logissue type=error]Chaos test failed: $result"
          exit 1
        }

Security Considerations

Least Privilege Access

Grant minimal permissions for chaos operations:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-runner
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: chaos-runner
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["delete", "get", "list"]
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]  
  verbs: ["get", "list"]

Network Policies

Restrict chaos agent network access:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: chaos-agent-policy
spec:
  podSelector:
    matchLabels:
      app: chaos-agent
  policyTypes:
  - Ingress
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
    ports:
    - protocol: TCP
      port: 443  # Kubernetes API
  - to:
    - podSelector:
        matchLabels:
          app: prometheus
    ports:
    - protocol: TCP  
      port: 9090  # Metrics export

Audit Logging

Track all chaos activities with an API server audit policy (supplied to kube-apiserver via --audit-policy-file):

apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: Metadata
  namespaces: ["default", "production"]
  resources:
  - group: "litmuschaos.io"
    resources: ["chaosengines", "chaosexperiments"]
  verbs: ["create", "update", "patch", "delete"]

Troubleshooting Common Issues

Experiment Stuck in Running State

Check experiment pod logs:

kubectl logs -l experiment=pod-delete -n litmus

Common causes:

Issue | Cause | Solution
----- | ----- | --------
Permission denied | Insufficient RBAC | Check service account permissions
No targets found | Wrong label selectors | Verify app labels and selectors
Connection timeout | Network issues | Check cluster connectivity

Probe Failures

Debug probe configuration:

kubectl describe chaosresult experiment-name

Check probe endpoints:

kubectl run debug --image=curlimages/curl --rm -it -- \
  curl -v http://service-name:80/health

Resource Cleanup

Remove stuck experiments:

kubectl patch chaosengine experiment-name \
  --type merge -p '{"spec":{"engineState":"stop"}}'
  
kubectl delete chaosengine experiment-name --force --grace-period=0

Best Practices for SRE Teams

Practice | Actions | Benefits
-------- | ------- | --------
Start Small | Begin with non-production, single pod failures, low-traffic periods | Minimize risk, build confidence
Automate Everything | Version control experiments, GitOps deployment, automated analysis | Consistent execution, repeatable results
Document Learnings | Record outcomes, document weaknesses, share knowledge | Team learning, improved runbooks
Measure Impact | Track MTTR, monitor availability, quantify improvements | Data-driven reliability gains

Getting Started with LitmusChaos

Installation Options

Helm Installation (Recommended for production):

helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update

# Install with custom values
helm install litmus litmuschaos/litmus \
  --namespace=litmus \
  --create-namespace \
  --set portal.server.service.type=LoadBalancer \
  --set mongodb.service.port=27017

Kubectl Installation (Quick start):

kubectl apply -f https://litmuschaos.github.io/litmus/2.14.0/litmus-2.14.0.yaml

Verify Installation:

kubectl get pods -n litmus
kubectl get crds | grep chaos
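
The CRD listing should include the three core resources used throughout this post (the exact list can vary slightly by version):

chaosengines.litmuschaos.io
chaosexperiments.litmuschaos.io
chaosresults.litmuschaos.io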

First Experiment

Create a simple pod deletion experiment:

# Install experiment definition
kubectl apply -f https://hub.litmuschaos.io/api/chaos/2.14.0?file=charts/generic/pod-delete/experiment.yaml

# Create chaos engine
cat <<EOF | kubectl apply -f -
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  appinfo:
    appns: 'default'
    applabel: 'app=nginx'
    appkind: 'deployment'
  chaosServiceAccount: litmus-admin
  experiments:
  - name: pod-delete
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION
          value: '30'
        - name: CHAOS_INTERVAL
          value: '10'
        - name: FORCE
          value: 'false'
EOF

Check experiment results:

kubectl get chaosresult
kubectl describe chaosresult nginx-chaos-pod-delete

Hands-on Demo: Litmus on Azure AKS

I built a complete demo to get you started fast:

🔗 Litmus Chaos Azure Demo

What’s included:

Component | Description | Value for SRE Teams
--------- | ----------- | -------------------
AKS Setup | Complete cluster configuration | Production-ready environment
Installation Guide | Step-by-step Litmus deployment | Quick start implementation
SRE Experiments | Pod failures, network chaos, node disruptions | Real-world test scenarios
Azure Monitor | Observability integration | Enterprise monitoring
Incident Scenarios | Based on actual production incidents | Proven failure patterns
Templates | Infrastructure and experiment templates | Accelerated deployment

Built for SRE teams who want to start chaos engineering today.

LitmusChaos transforms how SRE teams approach reliability testing. By systematically introducing failures, teams build confidence in their systems and discover weaknesses before customers do. Start small, automate everything, and measure the impact on your reliability metrics.

This post is licensed under CC BY 4.0 by the author.