Litmus Chaos - Enterprise-Grade Chaos Engineering for Kubernetes

Chaos engineering validates system resilience by introducing controlled failures. LitmusChaos makes this practice accessible for Kubernetes environments, giving SRE teams the tools to test failure scenarios before they happen in production.

Why Chaos Engineering Matters for SREs

Real failures don’t wait for convenient times. Networks partition during peak traffic. Nodes fail during deployments. Memory leaks surface under load. Traditional testing catches functional bugs but misses reliability gaps.

LitmusChaos helps answer critical SRE questions:

  • Does your app recover when pods get terminated?
  • How does network latency affect user experience?
  • Can your system handle node failures gracefully?
  • Do circuit breakers work under actual load?

Architecture Deep Dive

LitmusChaos follows a control plane and execution plane architecture:

Control Plane (Chaos Center)

The control plane manages experiment lifecycle:

  • Portal: Web UI for experiment design and scheduling
  • Authentication: RBAC integration with Kubernetes
  • Workflow Engine: Orchestrates complex experiment sequences
  • Database: Stores experiment results and metrics

Execution Plane

The execution plane runs experiments:

  • Chaos Agent: Deployed per cluster, executes experiments
  • Chaos Operator: Manages Custom Resource lifecycle
  • Experiment Pods: Run actual chaos injection logic
  • Chaos Exporter: Pushes metrics to monitoring systems

This separation means you can manage experiments centrally while running them across multiple clusters.

Core Components Explained

ChaosExperiment: The Experiment Template

ChaosExperiment defines what chaos to inject. Think of it as a reusable template:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-memory-hog
  labels:
    instance: litmus
spec:
  definition:
    scope: Namespaced
    permissions:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["create","delete","get","list","patch","update"]
    image: "litmuschaos/go-runner:latest"
    args:
    - -c
    - ./experiments -name pod-memory-hog
    command:
    - /bin/bash
    env:
    - name: MEMORY_CONSUMPTION
      value: '500'
    - name: TOTAL_CHAOS_DURATION
      value: '60'
    labels:
      experiment: pod-memory-hog

Key fields:

Field | Purpose | Description
----- | ------- | -----------
permissions | RBAC rules | Kubernetes permissions the experiment needs
image | Container image | Docker image that runs the chaos logic
env variables | Configuration | Parameters to configure experiment behavior
scope | Access level | Cluster or Namespace level permissions
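
Before wiring a template into an engine, it helps to confirm it is actually installed. A quick check, assuming experiment definitions were installed into the litmus namespace:

# List the experiment templates available in the cluster
kubectl get chaosexperiments -n litmus

# Inspect the tunables a template exposes
kubectl describe chaosexperiment pod-memory-hog -n litmus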

ChaosEngine: The Experiment Executor

ChaosEngine links experiments to target workloads:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-memory-chaos
spec:
  engineState: 'active'
  appinfo:
    appns: 'default'
    applabel: 'app=nginx'
    appkind: 'deployment'
  chaosServiceAccount: litmus-admin
  experiments:
  - name: pod-memory-hog
    spec:
      components:
        env:
        - name: MEMORY_CONSUMPTION
          value: '1000'
        - name: PODS_AFFECTED_PERC
          value: '50'
      probe:
      - name: "nginx-health-check"
        type: "httpProbe"
        mode: "Continuous"
        runProperties:
          probeTimeout: 10s
          interval: 5s
          retry: 3
        httpProbe/inputs:
          url: "http://nginx-service:80/health"
          method:
            get:
              criteria: "=="
              responseCode: "200"

Important concepts:

Concept | Purpose | Details
------- | ------- | -------
appinfo | Target selection | Selects target workload using labels and namespace
chaosServiceAccount | Permissions | Service account with required RBAC permissions
probe | Health monitoring | Continuous health checks during experiments
components.env | Parameter override | Override default experiment parameters
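
Applying the engine triggers the Chaos Operator to launch a runner pod, which in turn starts the experiment pod. A rough way to follow that progress, using the resource names from the example above:

# Apply the engine and watch the runner/experiment pods appear
kubectl apply -f nginx-memory-chaos.yaml
kubectl get pods -n default -w

# Overall status is reported on the engine resource itself
kubectl describe chaosengine nginx-memory-chaos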

ChaosResult: The Experiment Outcome

ChaosResult captures experiment execution data:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosResult
metadata:
  name: nginx-memory-chaos-pod-memory-hog
status:
  experimentStatus:
    phase: "Completed"
    verdict: "Pass"
    probeSuccessPercentage: "100"
  history:
    passedRuns: 1
    failedRuns: 0
    stoppedRuns: 0
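
In scripts you usually only need the verdict; assuming the status layout above, it can be pulled with a jsonpath query:

kubectl get chaosresult nginx-memory-chaos-pod-memory-hog \
  -o jsonpath='{.status.experimentStatus.verdict}'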

Experiment Categories for SRE Teams

Resource Chaos

Test how apps behave when compute resources are constrained:

CPU Hog: Consume CPU cycles to simulate high load

- name: TOTAL_CHAOS_DURATION
  value: '120'
- name: CPU_CORES  
  value: '2'
- name: PODS_AFFECTED_PERC
  value: '25'

Memory Hog: Fill up pod memory to test OOM handling

- name: MEMORY_CONSUMPTION
  value: '500'  # MB
- name: PODS_AFFECTED_PERC  
  value: '50'

Network Chaos

Simulate network issues that cause real outages:

Network Loss: Drop packets to test retry logic

- name: NETWORK_PACKET_LOSS_PERCENTAGE
  value: '5'
- name: DESTINATION_IPS
  value: 'service-b.namespace.svc.cluster.local'

Network Latency: Add delays to test timeout handling

- name: NETWORK_LATENCY
  value: '2000'  # milliseconds
- name: JITTER
  value: '200'

Network Partition: Block traffic between services

- name: DESTINATION_IPS
  value: 'database.production.svc.cluster.local'
- name: NETWORK_PACKET_LOSS_PERCENTAGE
  value: '100'

Pod Chaos

Test application resilience patterns:

Pod Delete: Verify graceful shutdown and startup

- name: FORCE
  value: 'false'  # Graceful termination
- name: CHAOS_INTERVAL
  value: '30'     # Seconds between deletions

Container Kill: Test signal handling

- name: TARGET_CONTAINER
  value: 'app-container'
- name: SIGNAL
  value: 'SIGKILL'

Node Chaos

Test cluster-level resilience:

Node Drain: Simulate node maintenance

- name: TARGET_NODE
  value: 'worker-node-1'
- name: FORCE
  value: 'false'

Disk Fill: Test disk space monitoring

- name: FILL_PERCENTAGE
  value: '80'
- name: EPHEMERAL_STORAGE_MEBIBYTES
  value: '1000'
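
Any of the env snippets above plugs into the components.env block of a ChaosEngine, exactly as in the nginx example earlier. A minimal sketch for the CPU hog, assuming the pod-cpu-hog experiment template is installed:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: app-cpu-chaos
spec:
  engineState: 'active'
  appinfo:
    appns: 'default'
    applabel: 'app=nginx'
    appkind: 'deployment'
  chaosServiceAccount: litmus-admin
  experiments:
  - name: pod-cpu-hog
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION
          value: '120'
        - name: CPU_CORES
          value: '2'
        - name: PODS_AFFECTED_PERC
          value: '25'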

Advanced SRE Use Cases

Game Day Scenarios

Create complex failure scenarios that mirror real incidents:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: black-friday-gameday
spec:
  entrypoint: gameday-scenario
  templates:
  - name: gameday-scenario
    steps:
    - - name: baseline-check
        template: health-probe
    - - name: traffic-spike
        template: load-generator
      - name: database-latency
        template: network-chaos
    - - name: pod-failures
        template: pod-delete-chaos
    - - name: recovery-check  
        template: health-probe

Progressive Chaos

Start small and increase blast radius:

# Week 1: Single pod
- name: PODS_AFFECTED_PERC
  value: '10'

# Week 2: Multiple pods  
- name: PODS_AFFECTED_PERC
  value: '25'

# Week 3: Majority of pods
- name: PODS_AFFECTED_PERC
  value: '60'

Multi-Service Dependencies

Test cascade failures across service boundaries:

spec:
  experiments:
  - name: service-a-pod-delete
    spec:
      probe:
      - name: service-b-health
        type: httpProbe
      - name: service-c-health  
        type: httpProbe

Observability Integration

Prometheus Metrics

LitmusChaos exports key metrics:

# Experiment results
litmuschaos_experiment_passed_total
litmuschaos_experiment_failed_total
litmuschaos_experiment_awaited_total

# Experiment duration
litmuschaos_experiment_duration_seconds

# Probe success rate
litmuschaos_probe_success_percentage

Query experiment success rate:

sum(rate(litmuschaos_experiment_passed_total[5m])) /
  (sum(rate(litmuschaos_experiment_passed_total[5m])) + sum(rate(litmuschaos_experiment_failed_total[5m]))) * 100

Grafana Dashboards

Essential panels for SRE teams:

Panel | Metric | Purpose
----- | ------ | -------
Success Rate | Experiment pass/fail ratio | Track chaos experiment reliability
MTTR | Recovery time during chaos | Measure system recovery speed
Availability | Service uptime percentage | Monitor service health during tests
Resource Usage | CPU/Memory utilization | Track resource impact of chaos
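
The availability and resource-usage panels can be sketched from the exported probe metric plus the standard cAdvisor metrics; the label selectors below are placeholders for your target workload:

# Availability proxy: probe success during experiments
avg(litmuschaos_probe_success_percentage)

# Resource usage of the target workload during chaos
sum(rate(container_cpu_usage_seconds_total{namespace="default", pod=~"nginx.*"}[5m]))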

Alert Rules

Monitor chaos experiment health:

groups:
- name: chaos-engineering
  rules:
  - alert: ChaosExperimentFailed
    expr: litmuschaos_experiment_failed_total > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "Chaos experiment  failed"
      
  - alert: ChaosExperimentStuck
    expr: litmuschaos_experiment_awaited_total > 300
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Chaos experiment stuck for >5 minutes"

Production Deployment Patterns

Namespace Isolation

Deploy the chaos operator per namespace to control blast radius:

# Install operator in staging namespace
helm install litmus-staging litmuschaos/litmus \
  --namespace=staging-chaos \
  --set chaos.enabled=true \
  --set portal.enabled=false

# Install operator in production namespace  
helm install litmus-prod litmuschaos/litmus \
  --namespace=production-chaos \
  --set chaos.enabled=true \
  --set portal.enabled=false

RBAC Configuration

Limit chaos scope with proper permissions:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: chaos-engineer
rules:
- apiGroups: ["litmuschaos.io"]
  resources: ["chaosengines", "chaosexperiments", "chaosresults"]
  verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["delete", "get", "list"]
  # Kubernetes RBAC does not support wildcards in resourceNames, so scope chaos
  # targets via the ChaosEngine appinfo labels and namespace instead

Multi-Cluster Setup

Central control plane with distributed execution:

# Control plane cluster
chaos-center:
  enabled: true
  server:
    service:
      type: LoadBalancer

# Execution plane clusters  
chaos-agent:
  enabled: true
  controlPlane:
    endpoint: "https://chaos-center.example.com"
    accessKey: "agent-access-key"

CI/CD Integration

GitOps Workflow

Store experiments in Git for version control:

chaos-experiments/
├── staging/
│   ├── pod-delete.yaml
│   ├── memory-hog.yaml
│   └── network-latency.yaml
├── production/
│   ├── pod-delete.yaml
│   └── network-partition.yaml
└── workflows/
    ├── weekly-gameday.yaml
    └── release-validation.yaml
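
If a GitOps controller such as Argo CD watches this repository, the staging experiments are applied automatically on every commit. A minimal Application sketch, with the repo URL and namespaces as placeholders:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: chaos-experiments-staging
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/chaos-experiments.git  # placeholder
    targetRevision: main
    path: staging
  destination:
    server: https://kubernetes.default.svc
    namespace: staging
  syncPolicy:
    automated:
      prune: true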

GitHub Actions Integration

Run chaos tests in CI pipeline:

name: Chaos Testing
on:
  push:
    branches: [main]
  schedule:
    - cron: '0 10 * * 1'  # Weekly on Monday

jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    
    - name: Setup kubectl
      uses: azure/setup-kubectl@v3
      
    - name: Run staging chaos tests
      run: |
        kubectl apply -f chaos-experiments/staging/
        kubectl wait --for=condition=complete \
          chaosresult/nginx-chaos-pod-delete \
          --timeout=300s
          
    - name: Check experiment results
      run: |
        RESULT=$(kubectl get chaosresult nginx-chaos-pod-delete \
          -o jsonpath='{.status.experimentStatus.verdict}')
        if [ "$RESULT" != "Pass" ]; then
          echo "Chaos experiment failed"
          exit 1
        fi

Deployment Gates

Block deployments if chaos tests fail:

# Azure DevOps pipeline
- stage: ChaosValidation
  dependsOn: Deployment
  jobs:
  - job: RunChaosTests
    steps:
    - task: Kubernetes@1
      inputs:
        command: apply
        arguments: -f $(System.DefaultWorkingDirectory)/chaos/
        
    - task: Kubernetes@1
      inputs:
        command: wait
        arguments: --for=condition=complete chaosresult/app-chaos-result --timeout=300s
        
    - powershell: |
        $result = kubectl get chaosresult app-chaos-result -o jsonpath='{.status.experimentStatus.verdict}'
        if ($result -ne "Pass") {
          Write-Host "##vso[task.logissue type=error]Chaos test failed: $result"
          exit 1
        }

Security Considerations

Least Privilege Access

Grant minimal permissions for chaos operations:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-runner
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: chaos-runner
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["delete", "get", "list"]
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]  
  verbs: ["get", "list"]

Network Policies

Restrict chaos agent network access:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: chaos-agent-policy
spec:
  podSelector:
    matchLabels:
      app: chaos-agent
  policyTypes:
  - Ingress
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
    ports:
    - protocol: TCP
      port: 443  # Kubernetes API
  - to:
    - podSelector:
        matchLabels:
          app: prometheus
    ports:
    - protocol: TCP  
      port: 9090  # Metrics export

Audit Logging

Track all chaos activities with an API server audit policy (supplied to kube-apiserver via --audit-policy-file):

apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: Metadata
  namespaces: ["default", "production"]
  resources:
  - group: "litmuschaos.io"
    resources: ["chaosengines", "chaosexperiments"]
  verbs: ["create", "update", "patch", "delete"]

Troubleshooting Common Issues

Experiment Stuck in Running State

Check experiment pod logs:

kubectl logs -l experiment=pod-delete -n litmus

Common causes:

Issue | Cause | Solution
----- | ----- | --------
Permission denied | Insufficient RBAC | Check service account permissions
No targets found | Wrong label selectors | Verify app labels and selectors
Connection timeout | Network issues | Check cluster connectivity

Probe Failures

Debug probe configuration:

kubectl describe chaosresult experiment-name

Check probe endpoints:

kubectl run debug --image=curlimages/curl --rm -it -- \
  curl -v http://service-name:80/health

Resource Cleanup

Remove stuck experiments:

kubectl patch chaosengine experiment-name \
  --type merge -p '{"spec":{"engineState":"stop"}}'
  
kubectl delete chaosengine experiment-name --force --grace-period=0

Best Practices for SRE Teams

Practice | Actions | Benefits
-------- | ------- | --------
Start Small | Begin with non-production, single pod failures, low-traffic periods | Minimize risk, build confidence
Automate Everything | Version control experiments, GitOps deployment, automated analysis | Consistent execution, repeatable results
Document Learnings | Record outcomes, document weaknesses, share knowledge | Team learning, improved runbooks
Measure Impact | Track MTTR, monitor availability, quantify improvements | Data-driven reliability gains

Getting Started with LitmusChaos

Installation Options

Helm Installation (Recommended for production):

helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update

# Install with custom values
helm install litmus litmuschaos/litmus \
  --namespace=litmus \
  --create-namespace \
  --set portal.server.service.type=LoadBalancer \
  --set mongodb.service.port=27017

Kubectl Installation (Quick start):

kubectl apply -f https://litmuschaos.github.io/litmus/2.14.0/litmus-2.14.0.yaml

Verify Installation:

kubectl get pods -n litmus
kubectl get crds | grep chaos
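
The CRD listing should include the three core resources used throughout this post (the exact list can vary slightly by version):

chaosengines.litmuschaos.io
chaosexperiments.litmuschaos.io
chaosresults.litmuschaos.io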

First Experiment

Create a simple pod deletion experiment:

# Install experiment definition
kubectl apply -f https://hub.litmuschaos.io/api/chaos/2.14.0?file=charts/generic/pod-delete/experiment.yaml

# Create chaos engine
cat <<EOF | kubectl apply -f -
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  appinfo:
    appns: 'default'
    applabel: 'app=nginx'
    appkind: 'deployment'
  chaosServiceAccount: litmus-admin
  experiments:
  - name: pod-delete
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION
          value: '30'
        - name: CHAOS_INTERVAL
          value: '10'
        - name: FORCE
          value: 'false'
EOF

Check experiment results:

kubectl get chaosresult
kubectl describe chaosresult nginx-chaos-pod-delete

Hands-on Demo: Litmus on Azure AKS

I built a complete demo to get you started fast:

🔗 Litmus Chaos Azure Demo

What’s included:

Component | Description | Value for SRE Teams
--------- | ----------- | -------------------
AKS Setup | Complete cluster configuration | Production-ready environment
Installation Guide | Step-by-step Litmus deployment | Quick start implementation
SRE Experiments | Pod failures, network chaos, node disruptions | Real-world test scenarios
Azure Monitor | Observability integration | Enterprise monitoring
Incident Scenarios | Based on actual production incidents | Proven failure patterns
Templates | Infrastructure and experiment templates | Accelerated deployment

Built for SRE teams who want to start chaos engineering today.

LitmusChaos transforms how SRE teams approach reliability testing. By systematically introducing failures, teams build confidence in their systems and discover weaknesses before customers do. Start small, automate everything, and measure the impact on your reliability metrics.

This post is licensed under CC BY 4.0 by the author.