Litmus Chaos - Enterprise-Grade Chaos Engineering for Kubernetes
~/posts/litmuschaos.md · 9 min · 1770 words


// Discover Litmus, a CNCF incubating project that provides a complete chaos engineering platform with GitOps integration, extensive experiment library, and enterprise features for building resilient cloud-native applications.


Chaos engineering validates system resilience by introducing controlled failures. LitmusChaos makes this practice accessible for Kubernetes environments, giving SRE teams the tools to test failure scenarios before they happen in production.

Why Chaos Engineering Matters for SREs

Real failures don’t wait for convenient times. Networks partition during peak traffic. Nodes fail during deployments. Memory leaks surface under load. Traditional testing catches functional bugs but misses reliability gaps.

LitmusChaos helps answer critical SRE questions:

  • Does your app recover when pods get terminated?
  • How does network latency affect user experience?
  • Can your system handle node failures gracefully?
  • Do circuit breakers work under actual load?

Architecture Deep Dive

LitmusChaos follows a control plane and execution plane architecture:

Control Plane (Chaos Center)

The control plane manages experiment lifecycle:

  • Portal: Web UI for experiment design and scheduling
  • Authentication: RBAC integration with Kubernetes
  • Workflow Engine: Orchestrates complex experiment sequences
  • Database: Stores experiment results and metrics

Execution Plane

The execution plane runs experiments:

  • Chaos Agent: Deployed per cluster, executes experiments
  • Chaos Operator: Manages Custom Resource lifecycle
  • Experiment Pods: Run actual chaos injection logic
  • Chaos Exporter: Pushes metrics to monitoring systems

This separation means you can manage experiments centrally while running them across multiple clusters.

Core Components Explained

ChaosExperiment: The Experiment Template

ChaosExperiment defines what chaos to inject. Think of it as a reusable template:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-memory-hog
  labels:
    instance: litmus
spec:
  definition:
    scope: Namespaced
    permissions:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["create","delete","get","list","patch","update"]
    image: "litmuschaos/go-runner:2.14.0"  # pin a released tag rather than latest
    args:
    - -c
    - ./experiments -name pod-memory-hog
    command:
    - /bin/bash
    env:
    - name: MEMORY_CONSUMPTION
      value: '500'
    - name: TOTAL_CHAOS_DURATION
      value: '60'
    labels:
      experiment: pod-memory-hog

Key fields:

| Field | Purpose | Description |
| --- | --- | --- |
| permissions | RBAC rules | Kubernetes permissions the experiment needs |
| image | Container image | Docker image that runs the chaos logic |
| env variables | Configuration | Parameters to configure experiment behavior |
| scope | Access level | Cluster or Namespace level permissions |

ChaosEngine: The Experiment Executor

ChaosEngine links experiments to target workloads:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-memory-chaos
spec:
  engineState: 'active'
  appinfo:
    appns: 'default'
    applabel: 'app=nginx'
    appkind: 'deployment'
  chaosServiceAccount: litmus-admin
  experiments:
  - name: pod-memory-hog
    spec:
      components:
        env:
        - name: MEMORY_CONSUMPTION
          value: '1000'
        - name: PODS_AFFECTED_PERC
          value: '50'
      probe:
      - name: "nginx-health-check"
        type: "httpProbe"
        mode: "Continuous"
        runProperties:
          probeTimeout: 10s
          interval: 5s
          retry: 3
        httpProbe/inputs:
          url: "http://nginx-service:80/health"
          method:
            get:
              criteria: "=="
              responseCode: "200"

Important concepts:

| Concept | Purpose | Details |
| --- | --- | --- |
| appinfo | Target selection | Selects target workload using labels and namespace |
| chaosServiceAccount | Permissions | Service account with required RBAC permissions |
| probe | Health monitoring | Continuous health checks during experiments |
| components.env | Parameter override | Override default experiment parameters |
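Beyond the httpProbe shown above, Litmus also supports a cmdProbe, which runs a command and compares its output against an expected value. A sketch of a readiness check against a database (the probe name, command, and image here are illustrative, not from the original post):

```yaml
probe:
- name: "postgres-ready-check"
  type: "cmdProbe"
  mode: "Edge"          # run before and after chaos injection
  runProperties:
    probeTimeout: 5s
    interval: 2s
    retry: 1
  cmdProbe/inputs:
    command: "pg_isready -h postgres.default.svc -p 5432"
    source:
      image: "postgres:15"   # pod that runs the probe command
    comparator:
      type: "string"
      criteria: "contains"
      value: "accepting connections"
```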

ChaosResult: The Experiment Outcome

ChaosResult captures experiment execution data:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosResult
metadata:
  name: nginx-memory-chaos-pod-memory-hog
status:
  experimentStatus:
    phase: "Completed"
    verdict: "Pass"
    probeSuccessPercentage: "100"
  history:
    passedRuns: 1
    failedRuns: 0
    stoppedRuns: 0

Experiment Categories for SRE Teams

Resource Chaos

Test how apps behave when compute resources are constrained:

CPU Hog: Consume CPU cycles to simulate high load

- name: TOTAL_CHAOS_DURATION
  value: '120'
- name: CPU_CORES  
  value: '2'
- name: PODS_AFFECTED_PERC
  value: '25'

Memory Hog: Fill up pod memory to test OOM handling

- name: MEMORY_CONSUMPTION
  value: '500'  # MB
- name: PODS_AFFECTED_PERC  
  value: '50'

Network Chaos

Simulate network issues that cause real outages:

Network Loss: Drop packets to test retry logic

- name: NETWORK_PACKET_LOSS_PERCENTAGE
  value: '5'
- name: DESTINATION_IPS
  value: 'service-b.namespace.svc.cluster.local'

Network Latency: Add delays to test timeout handling

- name: NETWORK_LATENCY
  value: '2000'  # milliseconds
- name: JITTER
  value: '200'

Network Partition: Block traffic between services

- name: DESTINATION_IPS
  value: 'database.production.svc.cluster.local'
- name: NETWORK_PACKET_LOSS_PERCENTAGE
  value: '100'

Pod Chaos

Test application resilience patterns:

Pod Delete: Verify graceful shutdown and startup

- name: FORCE
  value: 'false'  # Graceful termination
- name: CHAOS_INTERVAL
  value: '30'     # Seconds between deletions

Container Kill: Test signal handling

- name: TARGET_CONTAINER
  value: 'app-container'
- name: SIGNAL
  value: 'SIGKILL'

Node Chaos

Test cluster-level resilience:

Node Drain: Simulate node maintenance

- name: TARGET_NODE
  value: 'worker-node-1'
- name: FORCE
  value: 'false'

Disk Fill: Test disk space monitoring

- name: FILL_PERCENTAGE
  value: '80'
- name: EPHEMERAL_STORAGE_MEBIBYTES
  value: '1000'

Advanced SRE Use Cases

Game Day Scenarios

Create complex failure scenarios that mirror real incidents:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: black-friday-gameday
spec:
  entrypoint: gameday-scenario
  templates:
  - name: gameday-scenario
    steps:
    - - name: baseline-check
        template: health-probe
    - - name: traffic-spike
        template: load-generator
      - name: database-latency
        template: network-chaos
    - - name: pod-failures
        template: pod-delete-chaos
    - - name: recovery-check  
        template: health-probe

Progressive Chaos

Start small and increase blast radius:

# Week 1: Single pod
- name: PODS_AFFECTED_PERC
  value: '10'

# Week 2: Multiple pods  
- name: PODS_AFFECTED_PERC
  value: '25'

# Week 3: Majority of pods
- name: PODS_AFFECTED_PERC
  value: '60'
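The ramp-up can be scripted so the blast radius follows the schedule automatically. A minimal shell sketch — the week-to-percentage mapping and the engine/experiment names in the patch are illustrative, not from the original post:

```shell
# progressive_perc WEEK -> print the PODS_AFFECTED_PERC for that rollout week
progressive_perc() {
  case "$1" in
    1) perc=10 ;;   # week 1: single pod territory
    2) perc=25 ;;   # week 2: multiple pods
    *) perc=60 ;;   # week 3+: majority of pods
  esac
  echo "$perc"
}

# Example: bump a running engine's blast radius for week 2
PERC=$(progressive_perc 2)
echo "Setting PODS_AFFECTED_PERC=$PERC"
# kubectl patch chaosengine nginx-memory-chaos --type merge -p \
#   "{\"spec\":{\"experiments\":[{\"name\":\"pod-memory-hog\",\"spec\":{\"components\":{\"env\":[{\"name\":\"PODS_AFFECTED_PERC\",\"value\":\"$PERC\"}]}}}]}}"
```

Running this from a weekly scheduled job keeps the escalation deliberate and auditable rather than manual.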

Multi-Service Dependencies

Test cascade failures across service boundaries:

spec:
  experiments:
  - name: service-a-pod-delete
    spec:
      probe:
      - name: service-b-health
        type: httpProbe
      - name: service-c-health  
        type: httpProbe

Observability Integration

Prometheus Metrics

LitmusChaos exports key metrics:

# Experiment results
litmuschaos_experiment_passed_total
litmuschaos_experiment_failed_total
litmuschaos_experiment_awaited_total

# Experiment duration
litmuschaos_experiment_duration_seconds

# Probe success rate
litmuschaos_probe_success_percentage

Query experiment success rate:

rate(litmuschaos_experiment_passed_total[5m]) /
(rate(litmuschaos_experiment_passed_total[5m]) + rate(litmuschaos_experiment_failed_total[5m])) * 100
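To reuse this ratio across dashboards and alerts, it can be captured as a Prometheus recording rule. A sketch — the rule and group names are illustrative:

```yaml
groups:
- name: chaos-recording
  rules:
  - record: chaos:experiment_success_rate:ratio5m
    expr: |
      rate(litmuschaos_experiment_passed_total[5m]) /
      (rate(litmuschaos_experiment_passed_total[5m]) + rate(litmuschaos_experiment_failed_total[5m])) * 100
```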

Grafana Dashboards

Essential panels for SRE teams:

| Panel | Metric | Purpose |
| --- | --- | --- |
| Success Rate | Experiment pass/fail ratio | Track chaos experiment reliability |
| MTTR | Recovery time during chaos | Measure system recovery speed |
| Availability | Service uptime percentage | Monitor service health during tests |
| Resource Usage | CPU/Memory utilization | Track resource impact of chaos |
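As a sketch, the Availability panel can be driven by a crude uptime proxy based on scrape success (the `job` label here is an assumption about your scrape configuration):

```promql
avg_over_time(up{job="nginx"}[5m]) * 100
```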

Alert Rules

Monitor chaos experiment health:

groups:
- name: chaos-engineering
  rules:
  - alert: ChaosExperimentFailed
    expr: litmuschaos_experiment_failed_total > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "Chaos experiment {{ $labels.experiment }} failed"
      
  - alert: ChaosExperimentStuck
    expr: litmuschaos_experiment_awaited_total > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Chaos experiment awaiting completion for more than 5 minutes"

Production Deployment Patterns

Namespace Isolation

Deploy chaos operator per namespace for blast radius control:

# Install operator in staging namespace
helm install litmus-staging litmuschaos/litmus \
  --namespace=staging-chaos \
  --set chaos.enabled=true \
  --set portal.enabled=false

# Install operator in production namespace  
helm install litmus-prod litmuschaos/litmus \
  --namespace=production-chaos \
  --set chaos.enabled=true \
  --set portal.enabled=false

RBAC Configuration

Limit chaos scope with proper permissions:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: chaos-engineer
rules:
- apiGroups: ["litmuschaos.io"]
  resources: ["chaosengines", "chaosexperiments", "chaosresults"]
  verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["delete", "get", "list"]
  resourceNames: ["app-*"]  # Only target app pods

Multi-Cluster Setup

Central control plane with distributed execution:

# Control plane cluster
chaos-center:
  enabled: true
  server:
    service:
      type: LoadBalancer

# Execution plane clusters  
chaos-agent:
  enabled: true
  controlPlane:
    endpoint: "https://chaos-center.example.com"
    accessKey: "agent-access-key"

CI/CD Integration

GitOps Workflow

Store experiments in Git for version control:

chaos-experiments/
├── staging/
│   ├── pod-delete.yaml
│   ├── memory-hog.yaml
│   └── network-latency.yaml
├── production/
│   ├── pod-delete.yaml
│   └── network-partition.yaml
└── workflows/
    ├── weekly-gameday.yaml
    └── release-validation.yaml
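With experiments in Git, a GitOps controller can keep each environment in sync. A sketch using an Argo CD Application — the repo URL, path, and destination namespace are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: chaos-staging
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/chaos-experiments.git
    targetRevision: main
    path: staging            # syncs the staging/ experiments shown above
  destination:
    server: https://kubernetes.default.svc
    namespace: staging-chaos
  syncPolicy:
    automated:
      prune: true            # remove experiments deleted from Git
      selfHeal: true         # revert manual drift
```

Every change to an experiment then goes through review and leaves an audit trail in Git history.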

GitHub Actions Integration

Run chaos tests in CI pipeline:

name: Chaos Testing
on:
  push:
    branches: [main]
  schedule:
    - cron: '0 10 * * 1'  # Weekly on Monday

jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    
    - name: Setup kubectl
      uses: azure/setup-kubectl@v3
      
    - name: Run staging chaos tests
      run: |
        kubectl apply -f chaos-experiments/staging/
        # ChaosResult has no 'complete' condition; wait on the verdict field instead
        kubectl wait --for=jsonpath='{.status.experimentStatus.verdict}'=Pass \
          chaosresult/nginx-chaos-pod-delete \
          --timeout=300s

    - name: Check experiment results
      run: |
        RESULT=$(kubectl get chaosresult nginx-chaos-pod-delete \
          -o jsonpath='{.status.experimentStatus.verdict}')
        if [ "$RESULT" != "Pass" ]; then
          echo "Chaos experiment failed with verdict: $RESULT"
          exit 1
        fi

Deployment Gates

Block deployments if chaos tests fail:

# Azure DevOps pipeline
- stage: ChaosValidation
  dependsOn: Deployment
  jobs:
  - job: RunChaosTests
    steps:
    - task: Kubernetes@1
      inputs:
        command: apply
        arguments: -f $(System.DefaultWorkingDirectory)/chaos/
        
    - task: Kubernetes@1
      inputs:
        command: wait
        arguments: --for=jsonpath='{.status.experimentStatus.verdict}'=Pass chaosresult/app-chaos-result --timeout=300s

    - powershell: |
        $result = kubectl get chaosresult app-chaos-result -o jsonpath='{.status.experimentStatus.verdict}'
        if ($result -ne "Pass") {
          Write-Host "##vso[task.logissue type=error]Chaos test failed: $result"
          exit 1
        }

Security Considerations

Least Privilege Access

Grant minimal permissions for chaos operations:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-runner
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: chaos-runner
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["delete", "get", "list"]
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]  
  verbs: ["get", "list"]

Network Policies

Restrict chaos agent network access:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: chaos-agent-policy
spec:
  podSelector:
    matchLabels:
      app: chaos-agent
  policyTypes:
  - Ingress
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
    ports:
    - protocol: TCP
      port: 443  # Kubernetes API
  - to:
    - podSelector:
        matchLabels:
          app: prometheus
    ports:
    - protocol: TCP  
      port: 9090  # Metrics export

Audit Logging

Track all chaos activities:

apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: Metadata
  namespaces: ["default", "production"]
  resources:
  - group: "litmuschaos.io"
    resources: ["chaosengines", "chaosexperiments"]
  verbs: ["create", "update", "patch", "delete"]
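Note that an audit Policy only takes effect once the API server is started with it. On a self-managed control plane that means wiring it into the kube-apiserver manifest, sketched below (file paths are illustrative); managed services such as AKS surface audit logs through their own diagnostics settings instead:

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (self-managed control plane)
spec:
  containers:
  - command:
    - kube-apiserver
    - --audit-policy-file=/etc/kubernetes/audit-policy.yaml
    - --audit-log-path=/var/log/kubernetes/audit.log
    - --audit-log-maxage=30
```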

Troubleshooting Common Issues

Experiment Stuck in Running State

Check experiment pod logs:

kubectl logs -l experiment=pod-delete -n litmus

Common causes:

| Issue | Cause | Solution |
| --- | --- | --- |
| Permission denied | Insufficient RBAC | Check service account permissions |
| No targets found | Wrong label selectors | Verify app labels and selectors |
| Connection timeout | Network issues | Check cluster connectivity |

Probe Failures

Debug probe configuration:

kubectl describe chaosresult experiment-name

Check probe endpoints:

kubectl run debug --image=curlimages/curl --rm -it -- \
  curl -v http://service-name:80/health

Resource Cleanup

Remove stuck experiments:

kubectl patch chaosengine experiment-name \
  --type merge -p '{"spec":{"engineState":"stop"}}'
  
kubectl delete chaosengine experiment-name --force --grace-period=0
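If the engine still refuses to delete, a pending finalizer is usually the cause. Clearing it is a last resort, since it bypasses the operator's own cleanup:

```shell
# Remove finalizers so the stuck ChaosEngine can be garbage-collected
kubectl patch chaosengine experiment-name --type merge \
  -p '{"metadata":{"finalizers":null}}'
```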

Best Practices for SRE Teams

| Practice | Actions | Benefits |
| --- | --- | --- |
| Start Small | Begin with non-production, single pod failures, low-traffic periods | Minimize risk, build confidence |
| Automate Everything | Version control experiments, GitOps deployment, automated analysis | Consistent execution, repeatable results |
| Document Learnings | Record outcomes, document weaknesses, share knowledge | Team learning, improved runbooks |
| Measure Impact | Track MTTR, monitor availability, quantify improvements | Data-driven reliability gains |

Getting Started with LitmusChaos

Installation Options

Helm Installation (Recommended for production):

helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update

# Install with custom values
helm install litmus litmuschaos/litmus \
  --namespace=litmus \
  --create-namespace \
  --set portal.server.service.type=LoadBalancer \
  --set mongodb.service.port=27017

Kubectl Installation (Quick start):

kubectl apply -f https://litmuschaos.github.io/litmus/2.14.0/litmus-2.14.0.yaml

Verify Installation:

kubectl get pods -n litmus
kubectl get crds | grep chaos

First Experiment

Create a simple pod deletion experiment:

# Install experiment definition
kubectl apply -f https://hub.litmuschaos.io/api/chaos/2.14.0?file=charts/generic/pod-delete/experiment.yaml

# Create chaos engine
cat <<EOF | kubectl apply -f -
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  appinfo:
    appns: 'default'
    applabel: 'app=nginx'
    appkind: 'deployment'
  chaosServiceAccount: litmus-admin
  experiments:
  - name: pod-delete
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION
          value: '30'
        - name: CHAOS_INTERVAL
          value: '10'
        - name: FORCE
          value: 'false'
EOF

Check experiment results:

kubectl get chaosresult
kubectl describe chaosresult nginx-chaos-pod-delete

Hands-on Demo: Litmus on Azure AKS

I built a complete demo to get you started fast:

🔗 Litmus Chaos Azure Demo

What’s included:

| Component | Description | Value for SRE Teams |
| --- | --- | --- |
| AKS Setup | Complete cluster configuration | Production-ready environment |
| Installation Guide | Step-by-step Litmus deployment | Quick start implementation |
| SRE Experiments | Pod failures, network chaos, node disruptions | Real-world test scenarios |
| Azure Monitor | Observability integration | Enterprise monitoring |
| Incident Scenarios | Based on actual production incidents | Proven failure patterns |
| Templates | Infrastructure and experiment templates | Accelerated deployment |

Built for SRE teams who want to start chaos engineering today.

LitmusChaos transforms how SRE teams approach reliability testing. By systematically introducing failures, teams build confidence in their systems and discover weaknesses before customers do. Start small, automate everything, and measure the impact on your reliability metrics.

Litmus Chaos Engineering

// author
Nikos Nikolakakis, Principal SRE & Platform Engineer // Writing about Kubernetes, SRE practices, and cloud-native infrastructure