
Kubernetes Autoscaling Beyond HPA - Event-Driven KEDA and VPA Resource Optimization

I have spent years fighting the same “battle”: pods running out of memory at 3 AM, or paying for idle resources sitting at 10% utilization. HPA works great when your traffic follows a predictable curve, but what happens when you are processing background jobs, running analytics during business hours, or dealing with sudden database load spikes?

The built-in Horizontal Pod Autoscaler does one thing well: it scales pods based on CPU and memory. That’s fine for basic web apps, but completely useless when you need to scale based on queue depth, database connections, or time-based patterns. I have watched teams write custom controllers and hacky scripts just to solve these problems.

KEDA and VPA changed that. KEDA brings event-driven autoscaling to Kubernetes. Finally, you can scale based on what actually matters to your application. VPA tackles the other half of the problem, automatically right-sizing your resource requests so you are not guessing anymore.

Who This Guide Is For

This guide is written for:

  • SRE teams managing production Kubernetes clusters looking to optimize costs and reliability
  • Platform engineers building internal developer platforms with advanced autoscaling
  • DevOps practitioners already familiar with basic HPA who need event-driven scaling
  • Teams running batch jobs, database-heavy applications, or time-sensitive workloads

Prerequisites: Basic understanding of Kubernetes, HPA, and Deployments. If you are new to Kubernetes autoscaling, start with the official HPA documentation first.

Try It Yourself

All examples in this post are available in a working demo repository with load testing and infrastructure templates:

Kubernetes Autoscaling Demo

What’s included:

  • PostgreSQL, Prometheus, Cron, and HTTP metric scalers
  • VPA configurations for all update modes
  • HPA + KEDA + VPA integration examples
  • Terraform templates for AKS (Azure Kubernetes Service)
  • Load testing scripts to validate scaling behavior

Summary

  • HPA scales horizontally based on CPU/memory, but lacks event-driven capabilities
  • KEDA enables scaling based on external events (queues, streams, databases)
  • VPA optimizes resource requests/limits vertically without horizontal scaling
  • Combining HPA, KEDA, and VPA creates a comprehensive autoscaling strategy
  • Try the demo: Kubernetes Autoscaling Demo

Understanding the Autoscaling Landscape

Before diving into KEDA and VPA, let’s clarify the different scaling dimensions:

Three Scaling Dimensions

| Scaling Type | What It Does | Use Case | Tool |
| --- | --- | --- | --- |
| Horizontal | Add/remove pod replicas | Traffic spikes, load balancing | HPA, KEDA |
| Vertical | Adjust CPU/memory requests | Right-sizing resources | VPA |
| Cluster | Add/remove nodes | Node capacity exhaustion | Karpenter, CA |

Why HPA Isn’t Enough

Don’t get me wrong. HPA is solid for what it does. But here’s where it breaks down.

You can’t scale based on queue depth without jumping through hoops with custom metrics adapters. Your batch jobs either run with way too many resources or not enough. Time-based scaling? Forget about it. You will need CronJobs and external scripts. And good luck optimizing those resource requests; everyone just picks numbers and hopes for the best.

I have seen teams set CPU requests at 100m “just in case” when pods actually need 50m, wasting half their cluster capacity. Or they set memory too low and watch OOMKills rain down during peak hours.

That’s the gap KEDA and VPA fill.

KEDA: Event-Driven Autoscaling

KEDA (Kubernetes Event-Driven Autoscaling) is a CNCF graduated project, which means it’s battle-tested and backed by the community. The core idea is simple: scale based on what actually drives your workload, not just CPU percentages.

Need to scale workers based on how many rows are sitting in your database? Done. Want to spin up pods only during business hours? Easy. Have a spike in Prometheus metrics that indicates real user load? KEDA’s got you.

How KEDA Works

KEDA operates through three core components:

  1. Scaler: Connects to external event sources (60+ built-in scalers)
  2. Metrics Adapter: Exposes event metrics to HPA
  3. Controller: Manages ScaledObject and ScaledJob custom resources
External Event Source (e.g., PostgreSQL)
            ↓
      KEDA Scaler (monitors query results)
            ↓
      Metrics Adapter (exposes metrics)
            ↓
      HPA (scales deployment)
            ↓
      Pods (scaled up/down)

Figure: KEDA scaling sequence

KEDA Architecture Deep Dive

KEDA enhances Kubernetes without replacing existing components:

| Component | Role | Description |
| --- | --- | --- |
| Operator | Lifecycle Management | Watches ScaledObject/ScaledJob CRDs |
| Metrics Server | Metric Provider | Implements HPA External Metrics API |
| Admission Webhooks | Validation | Ensures valid ScaledObject configurations |
| Scalers | Event Monitoring | Connects to external sources (AWS SQS, Kafka, etc.) |

Note: KEDA doesn’t replace HPA; it extends it by acting as a metrics adapter. When you create a ScaledObject, KEDA dynamically generates an HPA resource behind the scenes and feeds it custom metrics from external event sources. This means KEDA-managed workloads still use the standard HPA controller, but with event-driven metrics instead of just CPU/memory.
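
To make the relationship concrete, here is a rough sketch of the HPA object KEDA generates for a ScaledObject. The keda-hpa-<scaledobject-name> naming is KEDA’s convention; the external metric name and labels below are illustrative and will differ in your cluster:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: keda-hpa-my-scaler          # KEDA names the generated HPA keda-hpa-<scaledobject-name>
  namespace: demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app                    # same target the ScaledObject points at
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: s0-my-trigger       # illustrative; KEDA generates the real metric name
          selector:
            matchLabels:
              scaledobject.keda.sh/name: my-scaler
        target:
          type: AverageValue
          averageValue: "100"

You can see these generated objects with kubectl get hpa, but treat them as read-only: KEDA owns them and reconciles their spec from the ScaledObject.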

KEDA in Action: Real-World Use Cases

Use Case 1: Database Query Load

Scale API replicas based on pending orders in PostgreSQL:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: postgres-scaler
  namespace: demo
spec:
  scaleTargetRef:
    name: api-gateway-postgres
  minReplicaCount: 3
  maxReplicaCount: 50
  cooldownPeriod: 300  # 5 minutes cooldown before scaling to min
  pollingInterval: 30  # Check every 30 seconds
  fallback:
    failureThreshold: 3
    replicas: 10  # Safe fallback replica count
  triggers:
    - type: postgresql
      metadata:
        host: "postgresql.demo.svc.cluster.local"
        port: "5432"
        userName: "postgres"
        dbName: "demo"
        sslmode: "disable"
        # Query returns number of pending orders divided by 100
        # Each replica can process ~100 orders
        query: "SELECT CEIL(COUNT(*)/100.0) FROM orders WHERE status = 'pending'"
        targetQueryValue: "5"  # Scale up when result > 5 (more than 500 pending orders)
        activationQueryValue: "1"  # Activate scaling when there's at least 1 pending order batch
      authenticationRef:
        name: postgres-trigger-auth # Reference to secret, never hardcode credentials!

Security Note: Never hardcode database credentials or API tokens in YAML manifests. Always use TriggerAuthentication resources that reference Kubernetes secrets. See the “Installation and Setup” section for examples.

This is incredibly useful when your application processes pending database records. Instead of waiting for CPU to spike (which might be too late), you’re scaling proactively based on actual workload. In this demo, each replica can handle approximately 100 pending orders, so KEDA scales appropriately as the queue grows.

Use Case 2: Cron-Based Batch Processing

Scale workloads based on scheduled time windows:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: cron-scaler-utc
  namespace: demo
spec:
  scaleTargetRef:
    name: report-generator
  minReplicaCount: 0  # Scale to zero outside business hours
  maxReplicaCount: 15
  triggers:
    - type: cron
      metadata:
        timezone: UTC
        start: "0 6 * * 1-5"  # 6 AM UTC on weekdays
        end: "0 20 * * 1-5"   # 8 PM UTC on weekdays
        desiredReplicas: "15"

Real cost savings story: We implemented this for a financial reporting system that only runs during business hours. Before KEDA, we kept 15 replicas running 24/7 because the manual process of scaling up/down was too error-prone.

Before KEDA:

  • 15 replicas × 24 hours × 30 days = 10,800 replica-hours/month
  • Average cost per replica-hour: €0.04 (Standard_D2s_v3 spot instances on AKS)
  • Monthly cost: €432

After KEDA with cron scaling:

  • Business hours (10 hours/day): 15 replicas × 10 hours × 22 workdays = 3,300 replica-hours
  • Off-hours and weekends: 0 replicas
  • Monthly cost: €132

Result: 69% cost reduction (€300/month savings) on this single workload. The implementation took 2 hours, including testing. ROI was immediate.

Beyond cost, we also eliminated the manual toil of scaling the deployment up/down, which previously required coordination between teams and occasionally caused incidents when someone forgot to scale it back down.

Use Case 3: HTTP Traffic Metrics

Scale based on custom application metrics from Prometheus:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: http-advanced-scaler
  namespace: demo
spec:
  scaleTargetRef:
    name: web-app
  minReplicaCount: 5
  maxReplicaCount: 100
  cooldownPeriod: 300
  pollingInterval: 15
  triggers:
    # Scale based on HTTP request rate (successful requests only)
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
        metricName: http_success_rate
        threshold: "500"
        query: |
          sum(rate(http_request_duration_seconds_count{service="web-app",status=~"2.."}[2m]))

Scaling on request rate is way more accurate than CPU for stateless services. You catch traffic spikes before they turn into performance issues, and you’re not scaling based on some arbitrary CPU threshold that might mean different things on different node types. This demo scales on an HTTP request duration histogram, which gives you latency data alongside request rates rather than just a raw request counter.
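
Since the service in this example already exposes a request duration histogram, you could also add a second Prometheus trigger on tail latency. A sketch of the query, assuming the matching _bucket series is scraped alongside the _count series used above:

# p95 request latency for the same service over the last 5 minutes
histogram_quantile(0.95,
  sum by (le) (
    rate(http_request_duration_seconds_bucket{service="web-app"}[5m])
  )
)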

Use Case 4: External HTTP Endpoint Monitoring

Scale based on external API metrics:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: metrics-api-scaler
  namespace: demo
spec:
  scaleTargetRef:
    name: data-processor
  minReplicaCount: 2
  maxReplicaCount: 30
  cooldownPeriod: 300  # 5 minutes cooldown before scaling to min
  pollingInterval: 30  # Check metrics every 30 seconds
  fallback:
    failureThreshold: 3
    replicas: 5  # Safe fallback if API is unreachable
  triggers:
    - type: metrics-api
      metadata:
        targetValue: "100"
        url: "http://mock-metrics-api.demo.svc.cluster.local:8080"
        valueLocation: "jobs.pending"
        method: "GET"

This lets you query external monitoring systems or APIs and scale based on business metrics like pending jobs or active user sessions. Perfect for integrating with custom telemetry platforms or bridging legacy systems with Kubernetes autoscaling. The demo includes a mock metrics API that simulates a real external endpoint.
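
For reference, with valueLocation: "jobs.pending" the metrics-api scaler expects the endpoint to return a JSON body shaped roughly like this (the value is illustrative; the demo's mock API defines the real payload):

{
  "jobs": {
    "pending": 42
  }
}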

KEDA Advanced Patterns for SREs

Pattern 1: Scale to Zero

KEDA’s killer feature is scaling deployments to zero during idle periods:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ml-inference-scaler
  namespace: demo
spec:
  scaleTargetRef:
    name: ml-inference
  minReplicaCount: 0  # Allow scale to zero; set to 1 if you need a warm replica to generate initial metrics
  maxReplicaCount: 10
  cooldownPeriod: 120  # Wait 2 minutes of inactivity before scaling to min
  pollingInterval: 30
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
        metricName: prediction_requests
        # Scale up if there's at least 1 request in the last 2 minutes
        threshold: "1"
        # Scale down to 0 if no requests
        activationThreshold: "0.1"
        query: |
          sum(rate(http_request_duration_seconds_count{service="ml-inference"}[2m])) or vector(0)

Benefits:

  • Reduce costs during off-hours
  • Perfect for batch/cron workloads
  • Eliminates idle resource consumption
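
To watch the behavior once the ScaledObject above is applied, something like the following works (plain kubectl, with the names from the example):

# Watch the deployment scale down to zero as traffic stops
kubectl get deployment ml-inference -n demo -w

# Inspect the ScaledObject's Ready/Active conditions to confirm the trigger went inactive
kubectl describe scaledobject ml-inference-scaler -n demo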

Pattern 2: Multiple Triggers (Composite Scaling)

Combine multiple event sources for intelligent scaling:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: multi-trigger-scaler
  namespace: demo
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 2
  maxReplicaCount: 50
  cooldownPeriod: 300  # 5 minutes
  pollingInterval: 30  # Check every 30 seconds
  fallback:
    failureThreshold: 3
    replicas: 10  # Safe replica count during failures
  triggers:
    - type: postgresql
      metadata:
        host: "postgresql.demo.svc.cluster.local"
        port: "5432"
        userName: "postgres"
        dbName: "demo"
        sslmode: "disable"
        query: "SELECT COALESCE(COUNT(*), 0) FROM orders WHERE status = 'pending'"
        targetQueryValue: "100"
        activationQueryValue: "10"
      authenticationRef:
        name: postgres-trigger-auth-multi
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
        query: sum(rate(http_request_duration_seconds_count{service="order-processor"}[2m]))
        threshold: "500"
        activationThreshold: "50"
    - type: cpu
      metadata:
        type: Utilization
        value: "70"

Pattern 3: Fallback Configuration for Scaler Failures

Protect against external service failures with KEDA fallback:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: resilient-scaler
spec:
  scaleTargetRef:
    name: api-service
  minReplicaCount: 3
  maxReplicaCount: 50
  fallback:
    failureThreshold: 3 # Number of consecutive failures before fallback
    replicas: 10 # Safe replica count during scaler failure
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        query: sum(rate(http_requests_total[1m]))
        threshold: "100"

Why fallback matters: If your Prometheus server goes down or your database scaler can’t connect, KEDA will scale to the fallback replica count instead of leaving you stuck at minReplicaCount. This prevents incidents caused by monitoring failures.

Best practices:

  • Set failureThreshold to 3-5 to avoid false positives from transient failures
  • Choose fallback replicas based on historical average load (typically 50-70% of maxReplicas)
  • Monitor keda_scaler_errors_total to detect when fallback activates

Pattern 4: ScaledJob for Batch Processing

Use ScaledJob for one-off job scaling instead of Deployments:

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: video-encoding-scaler
  namespace: demo
spec:
  jobTargetRef:
    parallelism: 1                    # Each job runs 1 pod
    completions: 1                    # Job completes after 1 success
    backoffLimit: 3                   # Retry up to 3 times on failure
    activeDeadlineSeconds: 600        # Job timeout: 10 minutes
    template:
      spec:
        restartPolicy: Never  # Required for Jobs
        containers:
          - name: encoder
            image: mysql:8.0  # image used by the demo; swap in your actual encoder image
            resources:
              requests:
                memory: "1Gi"
                cpu: "1000m"
              limits:
                memory: "2Gi"
                cpu: "2000m"
  pollingInterval: 30                 # Check queue every 30 seconds
  successfulJobsHistoryLimit: 5       # Keep last 5 successful jobs
  failedJobsHistoryLimit: 5           # Keep last 5 failed jobs
  maxReplicaCount: 20                 # Maximum 20 parallel encoding jobs
  scalingStrategy:
    strategy: "default"               # Create jobs immediately when work exists
  triggers:
    - type: mysql
      metadata:
        host: "mysql.demo.svc.cluster.local"
        port: "3306"
        username: "root"
        dbName: "media"
        query: "SELECT COUNT(*) FROM videos WHERE status = 'pending'"
        queryValue: "1"  # Create 1 job per pending video
        activationQueryValue: "0"  # Only create jobs if there are pending videos
      authenticationRef:
        name: mysql-trigger-auth # Reference to secret, never hardcode credentials!

pollingInterval tradeoffs:

  • Lower values (10-30s): Faster response to new work, more frequent queries to external systems
  • Higher values (60-300s): Reduced load on external systems, slower scale-up response
  • Recommendation: Start with 30s for most workloads; increase to 60-120s if external system can’t handle frequent queries

When to use ScaledJob:

  • Video/image processing pipelines
  • Data ETL and transformation jobs
  • Machine learning training workloads
  • Report generation and analytics

VPA: Vertical Pod Autoscaler

VPA solves a different problem: the resource request guessing game. You know the drill. You deploy something, pick some numbers for CPU and memory requests, cross your fingers, and wait to see what happens.

Too high? You’re wasting cluster capacity and money. Too low? OOMKills and CPU throttling. And workloads change over time, so even if you got it right once, it might be wrong six months later.

VPA watches your actual resource usage and adjusts requests/limits automatically. No more guessing, no more manual tuning sessions. It just works.

VPA Components

| Component | Function | Description |
| --- | --- | --- |
| Recommender | Analysis | Monitors resource usage, provides recommendations |
| Updater | Enforcement | Evicts pods that need updated resources |
| Admission Controller | Injection | Applies VPA recommendations to new pods |

VPA Update Modes

VPA supports multiple operational modes:

| Mode | Behavior | Use Case |
| --- | --- | --- |
| Off | Recommendations only (no action) | Testing, observability |
| Initial | Apply only at pod creation | Stateful apps, cautious adoption |
| Recreate | Evict and recreate pods | Stateless apps, aggressive optimization |
| Auto | Apply at creation and update running pods (currently equivalent to Recreate) | General purpose |

VPA in Action: Resource Optimization

Basic VPA Configuration

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
  namespace: demo
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 2
          memory: 4Gi
        controlledResources: ["cpu", "memory"]

VPA with Resource Boundaries

Set guardrails to prevent extreme scaling:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: database-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: postgres
  updatePolicy:
    updateMode: "Initial" # Only on pod creation
  resourcePolicy:
    containerPolicies:
      - containerName: postgres
        minAllowed:
          cpu: 500m
          memory: 1Gi
        maxAllowed:
          cpu: 8
          memory: 16Gi
        controlledResources: ["memory"] # Only manage memory
        mode: Auto

VPA with PodDisruptionBudgets

When using VPA in “Auto” or “Recreate” mode, combine with PodDisruptionBudgets to prevent excessive disruption:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
  namespace: demo
spec:
  minAvailable: 2 # Ensure at least 2 pods remain available during VPA updates
  selector:
    matchLabels:
      app: web-app
---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
  namespace: demo
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 2
          memory: 4Gi

Why this matters: VPA’s “Auto” and “Recreate” modes work by evicting pods to apply new resource requests. Without a PodDisruptionBudget, VPA could evict too many pods simultaneously, causing service disruption.

Best practices:

  • Set minAvailable to at least minReplicas - 1 for your deployment
  • For critical services, use minAvailable: 50% to maintain half capacity during updates
  • Monitor VPA eviction events: kubectl get events --field-selector reason=EvictedByVPA
  • Consider using updateMode: "Initial" for StatefulSets to avoid disruption entirely

VPA Recommendation-Only Mode

Use VPA as an observability tool without automatic changes:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: recommendation-only-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: critical-app
  updatePolicy:
    updateMode: "Off" # Recommendations only

Then query recommendations:

kubectl describe vpa recommendation-only-vpa

# Output shows recommended resources:
# Container Recommendations:
#   Container Name: app
#   Lower Bound:
#     Cpu:     250m
#     Memory:  256Mi
#   Target:
#     Cpu:     500m
#     Memory:  512Mi
#   Upper Bound:
#     Cpu:     1
#     Memory:  1Gi

Combining HPA, KEDA, and VPA: The Ultimate Strategy

The real power comes from using these tools together:

Figure: KEDA event flow

Strategy 1: HPA + VPA (Complementary Scaling)

# HPA for horizontal scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
---
# VPA for resource optimization
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["memory"] # VPA manages memory, HPA uses CPU

The trick here: let HPA handle horizontal scaling based on CPU while VPA handles vertical memory optimization. Keep them on different metrics and they won’t fight each other.

Strategy 2: KEDA + VPA (Event-Driven with Optimization)

# KEDA for event-driven scaling
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-scaler
spec:
  scaleTargetRef:
    name: worker
  minReplicaCount: 1
  maxReplicaCount: 30
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        query: sum(tasks_pending_total)
        threshold: "100"
---
# VPA to optimize resource allocation per worker
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: worker-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  updatePolicy:
    updateMode: "Initial" # Applied when new pods are created, e.g., during KEDA scale-ups

KEDA handles the “how many workers” question, VPA handles the “how big should each worker be” question. Perfect combo for background processing workloads.

Strategy 3: Three-Tier Autoscaling (KEDA + VPA + Karpenter)

For complete infrastructure efficiency:

  1. KEDA: Scales pods based on events
  2. VPA: Optimizes resource requests per pod
  3. Karpenter: Provisions nodes to accommodate pods
# 1. KEDA ScaledObject
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: complete-scaler
spec:
  scaleTargetRef:
    name: processor
  minReplicaCount: 0
  maxReplicaCount: 100
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        query: sum(workload_queue_depth)
        threshold: "50"
---
# 2. VPA for resource optimization
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: processor-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: processor
  updatePolicy:
    updateMode: "Auto"
---
# 3. Karpenter NodePool (from previous setup)
# Automatically provisions nodes as KEDA scales pods

Observability and Monitoring

KEDA Metrics

KEDA exposes Prometheus metrics for monitoring:

# Check if scalers are active (1 = active, 0 = inactive)
keda_scaler_active

# Scaling errors
rate(keda_scaled_object_errors_total[5m])

# Current metric value per scaler
keda_scaler_metrics_value

# Count active scaled objects
count(keda_scaler_active == 1)

# Number of replicas per ScaledObject (use standard Kubernetes metrics)
kube_deployment_status_replicas

Note: KEDA 2.17+ uses updated metric names. Ensure your Prometheus has a ServiceMonitor configured to scrape KEDA operator metrics from the namespace where KEDA runs (kube-system in the example below; if you installed KEDA into the keda namespace as shown later in this post, create the ServiceMonitor there instead).

Setting up KEDA Metrics Scraping:

To enable Prometheus to scrape KEDA metrics, create a ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: keda-operator
  namespace: kube-system
  labels:
    app.kubernetes.io/name: keda-operator
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: keda-operator
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics

This allows Prometheus to discover and scrape KEDA metrics, making them available for Grafana dashboards and alerting rules.

VPA Metrics

Monitor VPA recommendations and updates:

# Get current recommendations
kubectl get vpa -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.recommendation.containerRecommendations[0].target}{"\n"}{end}'

# Monitor VPA updater events
kubectl get events --field-selector involvedObject.kind=VerticalPodAutoscaler

Prometheus Alerting Rules

Deploy these alerts to detect autoscaling issues:

groups:
  - name: keda_alerts
    interval: 30s
    rules:
      - alert: KEDAScalerErrors
        expr: rate(keda_scaled_object_errors_total[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "KEDA scaler experiencing errors"
          description: "A KEDA scaler has been failing repeatedly over the last 5 minutes"

      - alert: KEDAScalerDown
        expr: up{job="keda-metrics-apiserver"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "KEDA metrics server is down"
          description: "KEDA cannot provide metrics to HPA. Autoscaling is disabled."

      - alert: KEDAScalingStuck
        expr: |
          (keda_scaler_metrics_value > 0)
          and (keda_scaler_active == 1)
          and (kube_deployment_status_replicas{namespace="demo"} <= 2)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "ScaledObject stuck at minReplicas"
          description: "Metrics indicate scaling should occur, but replicas remain at minimum"

  - name: vpa_alerts
    interval: 60s
    rules:
      - alert: VPARecommendationDeviation
        expr: |
          (kube_pod_container_resource_requests_memory_bytes /
           vpa_recommendation_target) > 1.5 or
          (kube_pod_container_resource_requests_memory_bytes /
           vpa_recommendation_target) < 0.5
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "VPA recommendation differs significantly from current requests"
          description: "Pod memory requests deviate >50% from VPA target"

      - alert: HighOOMKillRate
        expr: rate(container_memory_failures_total{failure_type="oom"}[10m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Container experiencing OOM kills"
          description: "VPA may need adjustment or application has memory leak"

      - alert: VPAEvictionRateTooHigh
        expr: rate(kube_pod_deletion_timestamp{reason="EvictedByVPA"}[1h]) > 5
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "VPA evicting pods too frequently"
          description: "More than 5 pods/hour evicted. Consider using 'Initial' mode or adjusting boundaries"

SLO/SLI Recommendations for Autoscaling

Define Service Level Objectives for your autoscaling system:

| SLI (Service Level Indicator) | SLO Target | Measurement | Why It Matters |
| --- | --- | --- | --- |
| Scaling Latency | 95% of scale-up events complete within 2 minutes | histogram_quantile(0.95, scaling_duration_seconds) | Users experience degraded performance during slow scale-up |
| Scaler Availability | 99.5% uptime for KEDA metrics server | avg_over_time(up{job="keda-metrics-apiserver"}[30d]) | Scaler failures prevent autoscaling, causing incidents |
| VPA Recommendation Accuracy | Resource recommendations within 20% of actual usage | abs((vpa_target - actual_usage) / actual_usage) | Inaccurate VPA causes OOMKills or wasted resources |
| Scale-to-Zero Success Rate | 99% of idle workloads scale to zero within cooldown period | Count successful vs. stuck scale-downs | Cost optimization goal not met if scale-to-zero fails |
| OOMKill Rate | < 0.1% of pod restarts due to OOM | rate(container_memory_failures_total{failure_type="oom"}[7d]) | Indicates VPA recommendations are too low |

Example SLO Dashboard Query (Prometheus):

# Scaling latency P95
histogram_quantile(0.95,
  rate(keda_scaling_duration_seconds_bucket[5m])
)

# KEDA metrics server uptime (30-day)
avg_over_time(up{job="keda-metrics-apiserver"}[30d]) * 100

# VPA accuracy: percentage of pods within 20% of target
100 * (
  count(
    abs(kube_pod_container_resource_requests_memory_bytes - vpa_recommendation_target)
    / vpa_recommendation_target < 0.2
  )
  / count(kube_pod_container_resource_requests_memory_bytes)
)

Grafana Dashboards

Essential panels for autoscaling observability:

| Panel | Metric | Purpose |
| --- | --- | --- |
| KEDA Scaling Activity | kube_deployment_status_replicas by deployment | Track replica changes |
| KEDA Scaler Status | keda_scaler_active, keda_scaler_metrics_value | Monitor active scalers |
| VPA Recommendations | VPA status API | Monitor resource adjustments |
| Cost Impact | Node cost × resource usage | Calculate savings |
| Scaling Latency | Event timestamp to pod ready | Measure responsiveness |
| SLO Compliance | See SLO table above | Track autoscaling reliability |

Tip: Check the demo repository for a complete KEDA Overview dashboard with working queries for KEDA 2.17+.

Testing Autoscaling in Non-Production Environments

Before deploying autoscaling to production, validate behavior in staging or dev environments:

Load Testing Strategy

# Example: Load test KEDA-scaled application
# Install k6 or hey for HTTP load testing
hey -z 5m -c 100 -q 50 https://staging.example.com/api

# Monitor scaling behavior
watch kubectl get hpa,scaledobject,pods -n staging

# Check KEDA metrics
kubectl get scaledobject my-app -n staging -o jsonpath='{.status.externalMetricNames}'

Validation Checklist

| Test Scenario | What to Validate | Success Criteria |
| --- | --- | --- |
| Scale Up | Pods increase under load | Replicas reach expected count within 60s |
| Scale Down | Pods decrease after load drops | Cooldown period respected, gradual scale down |
| Scale to Zero | Pods terminate during idle | All pods removed after cooldown period |
| VPA Recommendations | Resource adjustments are reasonable | Recommendations within 20% of actual usage |
| Fallback Activation | KEDA switches to fallback replicas | Monitor keda_scaled_object_errors_total for fallback triggers |
| PDB Compliance | VPA respects PodDisruptionBudgets | No more than N pods evicted simultaneously |

Staged Rollout Approach

  1. Dev/Staging: Test with updateMode: "Off" for VPA, validate KEDA scaler connections
  2. Canary: Deploy to 10% of production pods, monitor for 24-48 hours
  3. Production: Gradually increase coverage, starting with non-critical workloads
  4. Critical Services: Only enable after observing behavior on similar workloads

Blast Radius Protection

Limit the impact of aggressive scaling with gradual rollout:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: gradual-scaler
spec:
  scaleTargetRef:
    name: critical-api
  minReplicaCount: 10
  maxReplicaCount: 100
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 60 # Wait 60s before scaling up
          policies:
            - type: Percent
              value: 50 # Scale up by max 50% at a time
              periodSeconds: 60
        scaleDown:
          stabilizationWindowSeconds: 300 # Wait 5 min before scaling down
          policies:
            - type: Pods
              value: 5 # Remove max 5 pods at a time
              periodSeconds: 60

Why this matters: Without rate limiting, a sudden spike in metrics could scale from 10 to 100 replicas instantly, potentially overwhelming downstream dependencies or causing cascading failures.

Common Pitfalls and Best Practices

KEDA Best Practices

| Practice | Why | How | Alert Threshold |
| --- | --- | --- | --- |
| Set Reasonable Limits | Prevent runaway scaling | Use maxReplicaCount conservatively | N/A |
| Configure Cooldown | Avoid scaling thrashing | Set cooldownPeriod: 300 (5 min) | N/A |
| Monitor Scaler Health | Detect connection issues | Alert on keda_scaled_object_errors_total | rate(keda_scaled_object_errors_total[5m]) > 0.1 |
| Use Authentication | Secure external connections | Define TriggerAuthentication resources | N/A |
| Test Scale-to-Zero | Ensure graceful shutdown | Verify application handles SIGTERM (see the sketch below) | N/A |
| Monitor Scaling Lag | Detect slow scale-up | Alert on time from trigger to pod ready | > 120s for critical services |
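
On the "Test Scale-to-Zero" row above: when KEDA scales in, pods receive SIGTERM, so the workload has to drain cleanly. A minimal sketch of the relevant pod-template fields (the image name and timings are placeholders, tune them for your app):

# Excerpt from a Deployment's pod template
spec:
  terminationGracePeriodSeconds: 60          # give in-flight work time to finish after SIGTERM
  containers:
    - name: worker
      image: registry.example.com/worker:1.0 # hypothetical image
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 10"]  # brief delay so load balancers stop routing before shutdown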

VPA Best Practices

| Practice | Why | How | Alert Threshold |
| --- | --- | --- | --- |
| Start with "Off" Mode | Understand recommendations | Review suggestions before auto-apply | N/A |
| Avoid with HPA (CPU) | Prevent scaling conflicts | Use VPA for memory, HPA for CPU | N/A |
| Set Min/Max Boundaries | Prevent extreme values | Define minAllowed and maxAllowed | N/A |
| Use "Initial" for StatefulSets | Avoid data disruption | Only update on pod creation | N/A |
| Monitor OOMKills | Validate VPA effectiveness | Track container_memory_failures_total | rate(container_memory_failures_total[5m]) > 0 |
| Track Eviction Rate | Prevent excessive churn | Monitor VPA-triggered pod restarts | > 5 evictions/hour per deployment |

Anti-Patterns to Avoid

| Anti-Pattern | Why It Fails | Correct Approach |
| --- | --- | --- |
| Using HPA and VPA on same metric | Creates scaling conflicts and resource thrashing | Use HPA for CPU, VPA for memory only |
| No resource boundaries in VPA | Can cause excessive resource allocation or OOMKills | Always define minAllowed and maxAllowed |
| KEDA scaling without monitoring | Scaling failures go unnoticed, SLO violations | Set up alerts on keda_scaled_object_errors_total |
| VPA "Auto" mode on stateful apps | Pod evictions cause data disruption and downtime | Use "Initial" or "Off" mode for databases |
| Ignoring cooldown periods | Rapid scale up/down wastes resources and money | Configure appropriate cooldownPeriod values |
| Missing authentication for scalers | Security vulnerabilities and unauthorized access | Always use TriggerAuthentication for external sources |
| VPA without PodDisruptionBudgets | Excessive pod evictions during updates cause outages | Define PDBs with minAvailable to maintain availability |

Troubleshooting Common Issues

| Issue | Symptom | Root Cause | Solution |
| --- | --- | --- | --- |
| KEDA not scaling | Pods remain at minReplicas despite metrics | Scaler connection failure, incorrect query | Check logs: kubectl logs -n keda deploy/keda-operator; verify scaler config and credentials |
| VPA recommendations ignored | Pods not updated after VPA changes | Admission controller not running | Verify: kubectl get pods -n kube-system \| grep vpa-admission; check admission webhook config |
| Scaling thrashing | Rapid scale up/down cycles | Cooldown period too short, conflicting metrics | Increase cooldownPeriod to 300s; review HPA behavior policies |
| OOMKills after VPA update | Pods crash with out-of-memory | VPA underestimated requirements | Check minAllowed boundaries; review actual usage vs. VPA target |
| KEDA fallback not activating | Pods stuck at minReplicas during scaler failure | failureThreshold not reached | Lower failureThreshold to 3; check keda_scaled_object_errors_total |
| Scale-to-zero not working | Pods remain at 1 replica during idle | Metrics never reach zero, min replicas set | Verify query returns 0 during idle; set minReplicaCount: 0 |
| VPA evicting too many pods | Service degradation during VPA updates | No PodDisruptionBudget configured | Add PDB with minAvailable: 2; use updateMode: "Initial" for critical apps |
| HPA and VPA conflict | Resource thrashing, unstable replica count | Both managing same metric (CPU) | Use HPA for CPU, VPA for memory only; set controlledResources: ["memory"] in VPA |

Debugging Commands

# Check KEDA scaler status
kubectl get scaledobject -n demo -o wide
kubectl describe scaledobject my-app -n demo

# View KEDA operator logs
kubectl logs -n keda deploy/keda-operator --tail=100 -f

# Check HPA created by KEDA
kubectl get hpa -n demo
kubectl describe hpa keda-hpa-my-app -n demo

# View VPA recommendations
kubectl describe vpa my-app-vpa
kubectl get vpa my-app-vpa -o jsonpath='{.status.recommendation.containerRecommendations[0]}'

# Check VPA admission controller logs
kubectl logs -n kube-system deploy/vpa-admission-controller --tail=50

# Monitor scaling events
kubectl get events -n demo --sort-by='.lastTimestamp' | grep -E 'Scaled|Evicted'

# Test scaler connectivity (PostgreSQL example)
kubectl run psql-test --rm -it --image=postgres:15 -- \
  psql "postgresql://user:pass@host:5432/db" -c "SELECT COUNT(*) FROM table"

# Check KEDA metrics server health
kubectl get apiservice v1beta1.external.metrics.k8s.io -o yaml

Common Scaler-Specific Issues

PostgreSQL/MySQL Scaler:

  • Connection refused: Verify network policies allow KEDA pod → database
  • Authentication failed: Check TriggerAuthentication secret exists and is correct
  • Query returns null: Ensure query always returns a numeric value (use COALESCE(COUNT(*), 0))

Prometheus Scaler:

  • No metrics returned: Verify Prometheus is accessible from KEDA namespace
  • Incorrect threshold: Test query in Prometheus UI first to validate results
  • High cardinality: Avoid queries with too many time series (use aggregation)

Cron Scaler:

  • Wrong timezone: Explicitly set timezone metadata (default is UTC)
  • Doesn’t scale down: Verify end schedule is set, not just start
  • Overlapping schedules: Use separate ScaledObjects for different time windows

Installation and Setup

Installing KEDA

Note: The demo repository includes complete Terraform modules and Helm values for production deployments on AKS (Azure Kubernetes Service).

# Add KEDA Helm repository
helm repo add kedacore https://kedacore.github.io/charts
helm repo update

# Install KEDA
helm install keda kedacore/keda \
  --namespace keda \
  --create-namespace \
  --set prometheus.metricServer.enabled=true \
  --set prometheus.operator.enabled=true

Verify installation:

kubectl get pods -n keda
# Expected output:
# keda-operator-*
# keda-metrics-apiserver-*

Installing VPA

# Clone VPA repository
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler

# Install VPA components
./hack/vpa-up.sh

Verify installation:

kubectl get pods -n kube-system | grep vpa
# Expected output:
# vpa-admission-controller-*
# vpa-recommender-*
# vpa-updater-*

Alternatively, use Helm:

helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm install vpa fairwinds-stable/vpa \
  --namespace vpa \
  --create-namespace

Setting Up TriggerAuthentication (Security Best Practice)

For PostgreSQL example:

apiVersion: v1
kind: Secret
metadata:
  name: postgres-secret
  namespace: demo
type: Opaque
stringData:
  connectionString: postgresql://user:password@postgres.production.svc.cluster.local:5432/dbname
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: postgres-trigger-auth
  namespace: demo
spec:
  secretTargetRef:
    - parameter: connectionString
      name: postgres-secret
      key: connectionString

Then reference it in your ScaledObject:

triggers:
  - type: postgresql
    metadata:
      query: "SELECT COUNT(*) FROM ..."
      targetQueryValue: "100"
    authenticationRef:
      name: postgres-trigger-auth # Reference the TriggerAuthentication

This keeps credentials out of your YAML manifests and allows you to rotate secrets without modifying ScaledObjects.
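
On AKS specifically, many scalers can skip secrets entirely by using KEDA's pod identity support with Azure Workload Identity. A hedged sketch (the provider name follows the KEDA 2.x docs; the optional identityId shown here is an assumption to verify for your setup):

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: azure-workload-identity-auth
  namespace: demo
spec:
  podIdentity:
    provider: azure-workload
    # identityId: <managed-identity-client-id>  # optional; defaults to the identity federated with KEDA's service account

Whether a given scaler supports pod identity depends on the scaler, so check its documentation; the secret-based TriggerAuthentication above remains the right approach for scalers that only accept connection strings.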

Use Cases by Workload Type

| Workload Type | Recommended Strategy | Rationale |
| --- | --- | --- |
| Web APIs | HPA (CPU) + VPA (memory) | Traffic-driven horizontal scaling, optimize resource allocation |
| Database-Heavy Apps | KEDA (PostgreSQL/MySQL) + VPA | Scale based on DB load, optimize per-pod resources |
| Batch Jobs | KEDA ScaledJob (MySQL/Postgres) | Process pending records, scale-to-zero capability |
| Time-Based Workloads | KEDA (Cron) + VPA | Business hours scaling, cost optimization |
| Databases | VPA (Initial mode only) | Vertical optimization without disruption |
| ML Inference | KEDA (HTTP metrics) + scale-to-zero | On-demand scaling, eliminate idle costs |
| Microservices | KEDA (Prometheus) + VPA + Karpenter | Complete infrastructure efficiency |

Conclusion

HPA gets you started, but KEDA and VPA take autoscaling to where it needs to be for real production workloads. KEDA lets you scale on actual business metrics like database load, time windows, HTTP traffic, whatever drives your app. VPA stops the resource guessing game and automatically right-sizes your pods.

Here’s how I think about using them together:

  • HPA for basic CPU-driven web traffic
  • KEDA when you need event-driven or scheduled scaling
  • VPA to optimize resource requests across everything
  • Karpenter to tie it all together at the node level

Start small. Enable VPA in “Off” mode first. Just watch what it recommends. Then try KEDA on a non-critical workload, maybe something that processes batch jobs. See how it behaves, tune the thresholds, then expand from there.

The mistake I see teams make is trying to implement everything at once. Pick one use case, get it working well, learn from it, then move to the next. Monitor the impact on both performance and cost. Those numbers will guide your next moves.

If you want to try this out, I have put together a demo repo with working examples for all the patterns we covered. It’s got PostgreSQL scaling, cron-based workloads, Prometheus metrics, and the full HPA+KEDA+VPA integration. Link’s above.

Autoscaling in Kubernetes doesn’t have to be complicated. These tools make it almost boring, and boring is exactly what you want in production.

Ready to implement? Check out the complete demo repository with working examples, load testing, and infrastructure templates:

Kubernetes Autoscaling Demo


This post is licensed under CC BY 4.0 by the author.