PDB Management Operator - Intelligent Pod Disruption Budget Automation for Kubernetes

Pod Disruption Budgets (PDBs) are critical for maintaining application availability during voluntary disruptions like node drains, cluster upgrades, or autoscaler operations. Yet in practice, PDB management often becomes an afterthought - teams either forget to create them, configure them incorrectly, or struggle to maintain consistency across hundreds of deployments.

The PDB Management Operator solves this problem by automating PDB lifecycle management based on availability requirements, organizational policies, and component functions. Originally developed as a personal project, I donated it to the TM Forum ODA Canvas project to help telcos and enterprises manage availability at scale.

What is TM Forum ODA Canvas?

While the PDB Management Operator works standalone in any Kubernetes cluster, I chose to donate it to the TM Forum ODA Canvas project where it addresses a critical need for telecom-grade availability management. Let me explain why this makes sense.

The TM Forum is a global industry association for digital business, with over 850 member companies including major telcos like Vodafone, AT&T, and Deutsche Telekom. They develop open standards and best practices for the telecommunications industry.

The Open Digital Architecture (ODA) is TM Forum’s blueprint for modular, cloud-native telecom software. It defines how telecom applications should be built, deployed, and managed using standardized components and APIs.

ODA Canvas is the reference implementation of the ODA runtime environment - a Kubernetes-based platform that:

  • Manages ODA Components: Lifecycle management for standardized telecom software components
  • Exposes APIs: Automatic API gateway integration for TMF Open APIs
  • Handles Identity: Authentication and authorization for components
  • Ensures Availability: Where the PDB Management Operator comes in

```mermaid
flowchart TB
    subgraph Canvas["ODA Canvas"]
        subgraph Operators["Canvas Operators"]
            CompOp["Component Operator"]
            APIGw["API Operator"]
            Identity["Identity Operator"]
            PDB["PDB Management Operator"]
        end

        subgraph Components["ODA Components"]
            TMF620["TMF620<br/>Product Catalog"]
            TMF622["TMF622<br/>Product Ordering"]
            TMF629["TMF629<br/>Customer Management"]
        end
    end

    CompOp -->|manages lifecycle| Components
    APIGw -->|exposes APIs| Components
    Identity -->|authenticates| Components
    PDB -->|ensures availability| Components
```

The PDB Management Operator is one of several operators that make up the ODA Canvas, specifically focused on ensuring component availability during cluster operations.

Who Should Read This?

This post is for:

  • SREs managing large Kubernetes deployments who want automated PDB management
  • Platform Engineers building internal developer platforms with availability guarantees
  • DevOps Teams in regulated environments needing audit trails and policy enforcement
  • Organizations using ODA Canvas who need enterprise-grade availability management

If you’re tired of manually creating PDBs or dealing with inconsistent availability configurations across your clusters, read on.

What is the PDB Management Operator?

The PDB Management Operator is a Kubernetes operator built with Operator SDK in Go that automatically creates, updates, and manages Pod Disruption Budgets based on deployment annotations and organizational policies.

Unlike simple admission controllers or Helm hooks, this operator provides:

  • Dual Configuration Model: Both annotation-based (developer-friendly) and policy-based (platform team control) approaches
  • Intelligent Enforcement: Three modes that balance flexibility with governance
  • Component-Aware Logic: Automatic availability upgrades for security-critical workloads
  • Full Lifecycle Management: Creates, updates, and deletes PDBs as deployments change
  • Enterprise Observability: Prometheus metrics, OpenTelemetry tracing, and structured audit logs

Why Not Just Use Kyverno or OPA Gatekeeper?

A common question: “Why a dedicated operator instead of policy engines?” Here’s the distinction:

| Aspect | Kyverno/Gatekeeper | PDB Management Operator |
| --- | --- | --- |
| Primary Focus | Admission control (validate/mutate) | Full lifecycle management |
| Policy Conflicts | Block or warn at admission | Priority-based resolution with tie-breaking |
| Maintenance Windows | Not supported | Built-in time-based PDB suspension |
| Component Intelligence | Manual configuration | Automatic function detection |
| Observability | Basic metrics | Full tracing, metrics, audit logs |
| Reconciliation | None (admission only) | Continuous drift detection and correction |

Bottom line: Policy engines are great for validation, but they can’t manage PDB lifecycle, handle maintenance windows, or reconcile drift. In practice, you can use them together: the operator creates PDBs, and policy engines validate them.

What This Operator Is Not

To set clear expectations:

  • Not a policy engine replacement - Use alongside Kyverno or OPA Gatekeeper, not instead of them
  • Not for single-replica workloads - PDBs require 2+ replicas to be meaningful
  • Not a StatefulSet controller (yet) - Today it focuses on Deployments; StatefulSet support is on the roadmap
  • Not a service mesh - Does not handle traffic management or circuit breaking

Core Concepts

```mermaid
flowchart TB
    subgraph Inputs["Configuration Sources"]
        Annotations["Deployment Annotations<br/>oda.tmforum.org/availability-class"]
        Policies["AvailabilityPolicy CRD<br/>Organization-wide rules"]
    end

    subgraph Operator["PDB Management Operator"]
        Controller["Deployment Controller"]
        PolicyCache["Policy Cache"]
        Resolver["Configuration Resolver"]
    end

    subgraph Outputs["Managed Resources"]
        PDB["Pod Disruption Budget<br/>Auto-created & managed"]
        Events["Kubernetes Events<br/>Audit trail"]
        Metrics["Prometheus Metrics<br/>Observability"]
    end

    Annotations --> Controller
    Policies --> PolicyCache
    PolicyCache --> Resolver
    Controller --> Resolver
    Resolver --> PDB
    Controller --> Events
    Controller --> Metrics
```

The operator watches Deployments and AvailabilityPolicy resources, resolves the appropriate configuration based on enforcement rules, and manages the corresponding PDB.

Availability Classes

The operator uses a set of predefined availability classes:

| Class | Core/Management | Security | Description |
| --- | --- | --- | --- |
| non-critical | 20% | 50% | Development, batch jobs, testing workloads |
| standard | 50% | 75% | Typical production microservices |
| high-availability | 75% | 75% | Stateful apps, databases, message queues |
| mission-critical | 90% | 90% | Critical infrastructure, payment systems |
| custom | User-defined | User-defined | Full control over PDB configuration |

The classes are defined as percentages, but the operator converts them to integer minAvailable values based on each workload’s replica count.

Note: Security components (detected via annotations or naming patterns) are automatically upgraded to higher availability levels. In practice, this means an auth-service marked as non-critical will receive standard level protection (50% instead of 20%) - the operator assumes security workloads should never have minimal availability.
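
To make the upgrade rule concrete, here is a minimal Go sketch of the logic (illustrative only, not the operator’s actual source): security components are floored at standard, everything else keeps its declared class.

```go
// Illustrative sketch (not the operator's actual source): how a security
// component's availability class could be floored at "standard".
package main

import "fmt"

// classRank orders the predefined availability classes from least to most strict.
var classRank = map[string]int{
	"non-critical":      0,
	"standard":          1,
	"high-availability": 2,
	"mission-critical":  3,
}

// upgradeForSecurity returns the effective class for a component. Security
// components never drop below "standard", mirroring the behavior described above.
func upgradeForSecurity(class, componentFunction string) string {
	if componentFunction == "security" && classRank[class] < classRank["standard"] {
		return "standard"
	}
	return class
}

func main() {
	fmt.Println(upgradeForSecurity("non-critical", "security")) // standard
	fmt.Println(upgradeForSecurity("non-critical", "core"))     // non-critical
}
```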

Configuration Model

Annotation-Based Configuration

The simplest approach: developers add annotations directly to their deployments:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
  annotations:
    # Required: Availability class
    oda.tmforum.org/availability-class: "high-availability"

    # Optional: Component function (affects PDB calculation)
    oda.tmforum.org/component-function: "core"

    # Optional: Component name for identification
    oda.tmforum.org/componentName: "payment-service"

    # Optional: Maintenance window
    oda.tmforum.org/maintenance-window: "02:00-04:00 UTC"
spec:
  replicas: 6
  # ... rest of deployment spec
```

The operator automatically creates a PDB:

```bash
kubectl get pdb payment-api-pdb
# NAME              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
# payment-api-pdb   5               N/A               1                     5s
```

Note: The operator calculates minAvailable as an integer (5 out of 6 replicas for 75%) to avoid percentage rounding ambiguity.
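
The arithmetic behind that note, as a minimal Go sketch (assuming ceiling rounding, which matches every example in this post):

```go
// Minimal sketch of the percentage-to-integer conversion described above;
// the operator's real implementation may differ in detail.
package main

import (
	"fmt"
	"math"
)

// minAvailable converts a percentage target into an integer minAvailable
// for a given replica count, rounding up so availability is never undershot.
func minAvailable(replicas int, percent float64) int {
	return int(math.Ceil(float64(replicas) * percent / 100.0))
}

func main() {
	fmt.Println(minAvailable(6, 75)) // 5 -> one allowed disruption
	fmt.Println(minAvailable(5, 90)) // 5 -> zero allowed disruptions
}
```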

Policy-Based Configuration

For platform teams who need organization-wide control:

```yaml
apiVersion: availability.oda.tmforum.org/v1alpha1
kind: AvailabilityPolicy
metadata:
  name: production-security-policy
  namespace: canvas
spec:
  # Availability requirement
  availabilityClass: mission-critical

  # Enforcement behavior
  enforcement: strict
  priority: 1000 # Higher number = higher priority

  # Component selection
  componentSelector:
    componentFunctions: [security]
    namespaces: [production, staging]
    matchLabels:
      tier: critical

  # Maintenance windows
  maintenanceWindows:
    - start: "02:00"
      end: "04:00"
      timezone: "UTC"
      daysOfWeek: [0, 6] # Weekends only
```

Enforcement Modes

The operator supports three enforcement modes that balance developer flexibility with platform governance:

Strict Enforcement

The policy always wins; annotations are ignored completely.

```yaml
spec:
  enforcement: strict
  # Deployment annotations have no effect
```

Use Cases: Compliance requirements, security policies, regulated environments where platform teams must maintain control.

Flexible Enforcement

Annotations are accepted if they meet the minimum requirement.

```yaml
spec:
  enforcement: flexible
  minimumClass: standard # Annotations can be standard or higher
```

Behavior:

  • Annotation >= minimum class: Annotation configuration used
  • Annotation < minimum class: Minimum class enforced

Use Cases: Teams can opt for higher availability but can’t go below organizational standards.
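
A small Go sketch of this resolution rule, assuming the class ordering from the availability class table (hypothetical helper, not the operator’s API):

```go
// Hedged sketch of flexible-mode resolution: the annotated class wins only
// if it meets or exceeds the policy's minimumClass.
package main

import "fmt"

var classRank = map[string]int{
	"non-critical":      0,
	"standard":          1,
	"high-availability": 2,
	"mission-critical":  3,
}

// resolveFlexible returns the effective class under flexible enforcement.
func resolveFlexible(annotated, minimumClass string) string {
	if classRank[annotated] >= classRank[minimumClass] {
		return annotated // developer's choice honored
	}
	return minimumClass // organizational floor enforced
}

func main() {
	fmt.Println(resolveFlexible("mission-critical", "standard")) // mission-critical
	fmt.Println(resolveFlexible("non-critical", "standard"))     // standard
}
```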

Advisory Enforcement

Annotations are preferred, with controlled override capability.

```yaml
spec:
  enforcement: advisory
  allowOverride: true
  overrideRequiresReason: true # Must provide justification
```

Use Cases: Development environments, flexible policies with audit trails.

Policy Resolution & Tie-Breaking

When multiple policies match a deployment, the operator uses deterministic resolution:

```yaml
# Policy A: namespace-wide, priority 100
apiVersion: availability.oda.tmforum.org/v1alpha1
kind: AvailabilityPolicy
metadata:
  name: production-baseline
  namespace: canvas
spec:
  availabilityClass: standard
  priority: 100
  componentSelector:
    namespaces: [production]
---
# Policy B: security-focused, priority 1000
apiVersion: availability.oda.tmforum.org/v1alpha1
kind: AvailabilityPolicy
metadata:
  name: security-critical
  namespace: canvas
spec:
  availabilityClass: mission-critical
  priority: 1000
  componentSelector:
    componentFunctions: [security]
    namespaces: [production]
```

Resolution order:

  1. Priority wins - Higher priority policies take precedence (security-critical: 1000 > production-baseline: 100)
  2. Lexicographic tie-break - If priorities are equal, policies are sorted by metadata.name (stable + deterministic)
  3. Audit trail - Conflicts are logged and exposed via metrics (pdb_management_policy_conflicts_total)

For a deployment in production namespace with component-function: security, Policy B wins because it has higher priority, even though both policies match.
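
A sketch of that resolution order in Go, using the two policies above (illustrative, not the operator’s internal types):

```go
// Illustrative Go sketch of the resolution order: sort matching policies by
// priority descending, then by name ascending for a deterministic tie-break.
package main

import (
	"fmt"
	"sort"
)

type policy struct {
	Name     string
	Priority int
}

// pickWinner returns the policy that should apply when several match.
func pickWinner(matching []policy) policy {
	sort.Slice(matching, func(i, j int) bool {
		if matching[i].Priority != matching[j].Priority {
			return matching[i].Priority > matching[j].Priority // 1. higher priority wins
		}
		return matching[i].Name < matching[j].Name // 2. lexicographic tie-break
	})
	return matching[0]
}

func main() {
	winner := pickWinner([]policy{
		{Name: "production-baseline", Priority: 100},
		{Name: "security-critical", Priority: 1000},
	})
	fmt.Println(winner.Name) // security-critical
}
```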

Architecture

```mermaid
flowchart TB
    subgraph Cluster["Kubernetes Cluster"]
        subgraph Controllers["Operator Controllers"]
            DeployCtrl["Deployment Controller"]
            PolicyCtrl["AvailabilityPolicy Controller"]
        end

        subgraph Cache["Caching Layer"]
            PolicyCache["Policy Cache<br/>5-min TTL"]
            StateCache["Deployment State<br/>Change Detection"]
        end

        subgraph Resources["Kubernetes Resources"]
            Deploy["Deployments"]
            PDB["PDBs"]
            Policy["AvailabilityPolicies"]
        end

        subgraph Observability["Observability"]
            Metrics["Prometheus Metrics"]
            Tracing["OpenTelemetry Traces"]
            Audit["Structured Audit Logs"]
        end
    end

    Deploy -->|watches| DeployCtrl
    Policy -->|watches| PolicyCtrl
    PolicyCtrl -->|invalidates| PolicyCache
    DeployCtrl -->|reads| PolicyCache
    DeployCtrl -->|checks| StateCache
    DeployCtrl -->|manages| PDB
    DeployCtrl -->|exports| Metrics
    DeployCtrl -->|sends| Tracing
    DeployCtrl -->|writes| Audit
```

Key Components

| Component | Purpose |
| --- | --- |
| Deployment Controller | Watches deployments, resolves configuration, manages PDBs |
| AvailabilityPolicy Controller | Manages policy CRD lifecycle, updates status |
| Policy Cache | Caches policies for performance (5-min TTL) |
| State Cache | Fingerprint-based change detection to avoid unnecessary reconciliations |
| Event Recorder | Emits Kubernetes events for audit trail |
| Metrics Collector | Exports Prometheus metrics for monitoring |
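
The State Cache deserves a word: fingerprinting means hashing only the inputs that affect the PDB, so unrelated deployment updates don’t trigger reconciliation. A rough Go sketch (the exact field set the operator hashes is an assumption here):

```go
// Illustrative fingerprint for change detection: hash only replica count and
// operator-relevant annotations. Field selection is an assumption, not the
// operator's actual implementation.
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
	"strings"
)

func fingerprint(replicas int32, annotations map[string]string) string {
	keys := make([]string, 0, len(annotations))
	for k := range annotations {
		if strings.HasPrefix(k, "oda.tmforum.org/") { // only PDB-relevant annotations
			keys = append(keys, k)
		}
	}
	sort.Strings(keys) // stable ordering so equal inputs hash equally
	var b strings.Builder
	fmt.Fprintf(&b, "replicas=%d;", replicas)
	for _, k := range keys {
		fmt.Fprintf(&b, "%s=%s;", k, annotations[k])
	}
	return fmt.Sprintf("%x", sha256.Sum256([]byte(b.String())))
}

func main() {
	fp := fingerprint(6, map[string]string{
		"oda.tmforum.org/availability-class": "high-availability",
	})
	fmt.Println(fp[:12]) // compare with the cached value; reconcile only on change
}
```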

PDB Management Operator: Pros and Cons

Before adopting the operator, let’s evaluate it honestly.

Pros

| Advantage | Description |
| --- | --- |
| Kubernetes-Native | Uses CRDs and controller patterns familiar to K8s operators |
| Dual Configuration | Supports both developer annotations and platform policies |
| Intelligent Resolution | Priority-based policy matching with deterministic tie-breaking |
| Security-Aware | Automatic availability upgrades for security components |
| Maintenance Windows | Built-in support for scheduled maintenance periods |
| Full Observability | Prometheus metrics, OpenTelemetry tracing, structured audit logs |
| ODA Canvas Integration | Native support for TM Forum ODA component architecture |
| MCP Integration | AI-powered cluster analysis via Model Context Protocol |

Cons

| Limitation | Description |
| --- | --- |
| Learning Curve | New concepts (AvailabilityPolicy, enforcement modes) require documentation |
| Minimum Replicas | PDBs only created for deployments with 2+ replicas |
| Cache Latency | Policy changes may take up to 5 minutes to propagate |
| Webhook Dependencies | Validation webhooks require cert-manager |

Note on Policy Namespace: AvailabilityPolicies in the canvas namespace can affect deployments cluster-wide. This is by design: only platform administrators with access to the canvas namespace can create or modify cluster-wide availability policies, providing a clear separation between platform governance and application teams.

Operational Footprint

For SREs concerned about control-plane impact:

  • Memory: Typically tens of MB for the controller manager
  • CPU: Minimal - reconciliation is event-driven, not polling-based
  • API Server Load: Low - uses informer caches and 5-minute policy TTL to minimize API calls
  • Reconcile Frequency: Only on deployment changes, policy updates, or PDB drift detection

In practice, this means the operator adds negligible overhead even in clusters with hundreds of deployments. See the demo repository for benchmark snapshots.

When to Use This Operator

The PDB Management Operator is a strong fit when:

  • You have 50+ deployments and manual PDB management is impractical
  • You need policy enforcement across teams with different requirements
  • You’re in a regulated environment requiring audit trails
  • You use ODA Canvas or similar component-based architectures
  • You want consistency without sacrificing developer flexibility

Practical Examples

Example 1: Simple Deployment with Annotation

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: catalog-api
  annotations:
    oda.tmforum.org/availability-class: "high-availability"
spec:
  replicas: 4
  selector:
    matchLabels:
      app: catalog-api
  template:
    metadata:
      labels:
        app: catalog-api
    spec:
      containers:
        - name: api
          image: catalog-api:v1.2.0
```

Result: PDB created with minAvailable: 3 (75% of 4 replicas), allowing 1 disruption.

Example 2: Security Service with Strict Policy

```yaml
# Policy (applied by platform team)
apiVersion: availability.oda.tmforum.org/v1alpha1
kind: AvailabilityPolicy
metadata:
  name: security-strict
  namespace: canvas
spec:
  availabilityClass: mission-critical
  enforcement: strict
  componentSelector:
    componentFunctions: [security]
  priority: 1000
---
# Deployment (created by developer)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: auth-service
  annotations:
    oda.tmforum.org/component-function: "security"
    oda.tmforum.org/availability-class: "standard" # Ignored due to strict policy
spec:
  replicas: 5
  # ...
```

Result: PDB created with minAvailable: 5 (90% of 5 replicas rounds up to 5), meaning 0 allowed disruptions - the operator enforces maximum protection for security workloads under strict policy.

Example 3: Flexible Policy with Developer Override

```yaml
# Policy
apiVersion: availability.oda.tmforum.org/v1alpha1
kind: AvailabilityPolicy
metadata:
  name: production-flexible
  namespace: canvas
spec:
  availabilityClass: standard
  enforcement: flexible
  minimumClass: standard
  componentSelector:
    namespaces: [production]
---
# Deployment with higher availability
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: production
  annotations:
    oda.tmforum.org/availability-class: "mission-critical" # Accepted (higher than minimum)
spec:
  replicas: 10
  # ...
```

Result: PDB created with minAvailable: 9 (90% of 10 replicas) - developer’s choice honored since it exceeds the minimum.

Example 4: TM Forum ODA Component

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tmf620-productcatalog
  namespace: components
  labels:
    oda.tmforum.org/componentName: productcatalog
    oda.tmforum.org/componentType: core
  annotations:
    oda.tmforum.org/availability-class: "high-availability"
    oda.tmforum.org/component-function: "core"
    oda.tmforum.org/api-specification: "TMF620-ProductCatalog-v4.0.0"
    oda.tmforum.org/maintenance-window: "02:00-04:00 UTC"
spec:
  replicas: 5
  # ...
```

Result: PDB created with minAvailable: 4 (75% of 5 replicas), automatically suspended during maintenance window (02:00-04:00 UTC).
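
For intuition on the maintenance-window check, here is a hedged Go sketch that parses the HH:MM-HH:MM UTC annotation format shown above and reports whether the current time falls inside it (same-day windows only; the operator’s real parser may be more permissive):

```go
// Sketch, assuming the "HH:MM-HH:MM UTC" annotation format shown above:
// decide whether a given time falls inside a maintenance window.
package main

import (
	"fmt"
	"strings"
	"time"
)

// inWindow reports whether t falls inside a window like "02:00-04:00 UTC".
// It handles same-day windows only.
func inWindow(window string, t time.Time) (bool, error) {
	parts := strings.SplitN(strings.TrimSuffix(window, " UTC"), "-", 2)
	if len(parts) != 2 {
		return false, fmt.Errorf("malformed window %q", window)
	}
	start, err := time.Parse("15:04", parts[0])
	if err != nil {
		return false, err
	}
	end, err := time.Parse("15:04", parts[1])
	if err != nil {
		return false, err
	}
	now := t.UTC()
	minutes := now.Hour()*60 + now.Minute()
	s := start.Hour()*60 + start.Minute()
	e := end.Hour()*60 + end.Minute()
	return minutes >= s && minutes < e, nil
}

func main() {
	active, _ := inWindow("02:00-04:00 UTC", time.Now())
	fmt.Println("maintenance window active:", active) // suspend the PDB if true
}
```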

Observability

Key Metrics

The operator exports comprehensive Prometheus metrics:

```text
# PDB lifecycle
pdb_management_pdbs_created_total
pdb_management_pdbs_updated_total
pdb_management_pdbs_deleted_total

# Performance
pdb_management_reconciliation_duration_seconds
pdb_management_cache_hits_total
pdb_management_cache_misses_total

# Policy enforcement
pdb_management_enforcement_decisions_total
pdb_management_override_attempts_total
pdb_management_policy_conflicts_total

# Operator health
pdb_management_operator_info
pdb_management_maintenance_window_active
```

Structured Audit Logging

Every PDB operation is logged with full context:

```json
{
  "level": "info",
  "audit": {
    "action": "PDB_CREATED",
    "resource": "payment-service-pdb",
    "result": "success",
    "metadata": {
      "availabilityClass": "mission-critical",
      "enforcement": "strict",
      "policy": "security-policy",
      "durationMs": 142
    }
  }
}
```

Distributed Tracing

Full OpenTelemetry integration for end-to-end visibility:

```bash
# Enable tracing
kubectl set env deployment/pdb-management-controller-manager -n canvas \
  ENABLE_TRACING=true \
  OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
```

AI-Powered Analysis with MCP

The operator includes an integrated Model Context Protocol (MCP) server that enables AI assistants like Claude to analyze your cluster and provide intelligent recommendations.

Note: MCP integration is optional and does not affect core operator functionality. It is intended for platform teams exploring AI-assisted operations. You can safely ignore this feature if it doesn’t fit your workflow.

```mermaid
flowchart LR
    subgraph Clients["AI Clients"]
        Claude["Claude"]
        GPT["GPT"]
    end

    subgraph MCP["MCP Server :8090"]
        Tools["Analysis Tools"]
    end

    subgraph K8s["Kubernetes"]
        Deploy["Deployments"]
        PDB["PDBs"]
        Policy["Policies"]
    end

    Clients <-->|"JSON-RPC"| MCP
    MCP <-->|"K8s API"| K8s
```

Available MCP Tools

| Tool | Purpose |
| --- | --- |
| `analyze_cluster_availability` | Comprehensive PDB coverage analysis |
| `analyze_workload_patterns` | Deployment pattern detection |
| `recommend_availability_classes` | Optimal availability suggestions |
| `recommend_policies` | Policy recommendations |
| `simulate_policy_impact` | Preview policy changes |
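
If the server exposes these tools over HTTP, a call could look like the hypothetical Go client below. The `/mcp` path and plain-HTTP transport are assumptions (MCP servers commonly use stdio or SSE), so check the operator’s docs for the actual endpoint; the `tools/call` method and payload shape follow the standard MCP JSON-RPC format.

```go
// Hypothetical client sketch: calling an MCP tool over JSON-RPC. The HTTP
// endpoint path and transport are assumptions, not documented behavior.
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Standard MCP "tools/call" request for the analyze_cluster_availability tool.
	payload := []byte(`{
	  "jsonrpc": "2.0",
	  "id": 1,
	  "method": "tools/call",
	  "params": {"name": "analyze_cluster_availability", "arguments": {}}
	}`)

	// Port 8090 comes from the diagram above; the "/mcp" path is a placeholder.
	resp, err := http.Post("http://localhost:8090/mcp", "application/json", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```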

Example Interactions

Ask Claude to help with your cluster:

  • “Analyze my cluster’s availability status”
  • “What workload patterns do you see in my deployments?”
  • “Recommend availability classes for my production services”
  • “Simulate the impact of applying high-availability to all production deployments”

Installation

Prerequisites

  • Kubernetes 1.21+
  • cert-manager (for webhook TLS certificates)

Quick Deployment

```bash
# Deploy to canvas namespace
kubectl create namespace canvas
kubectl apply -k config/default/

# Verify installation
kubectl get pods -n canvas
kubectl get crd | grep availability
```

Tip: If you hit kustomize version mismatches, use `kubectl kustomize config/default | kubectl apply -f -` instead.

To enable the validating webhooks, install cert-manager first and then deploy with the webhook configuration:

```bash
# Install cert-manager first (use the latest stable version for your environment)
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/latest/download/cert-manager.yaml

# Deploy operator with webhooks
kubectl apply -k config/certmanager/
kubectl apply -k config/webhook/
kubectl apply -k config/default/

# Enable webhooks
kubectl patch deployment pdb-management-controller-manager -n canvas \
  --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/args", "value": ["--leader-elect", "--enable-webhook=true"]}]'
```

Tip: The demo repository handles cert-manager installation via task setup - use that for a streamlined experience.

Troubleshooting

Common Issues

| Issue | Symptoms | Resolution |
| --- | --- | --- |
| No PDB Created | Deployment has no PDB | Check replicas (must be >= 2), verify annotation spelling |
| Policy Conflicts | Unexpected availability class | Check `kubectl get availabilitypolicy -A -o wide` for priorities |
| Webhook Errors | Admission failures | Verify cert-manager and certificate status |
| Stale Configuration | Changes not applied | Wait for cache TTL (5 minutes) or restart operator |

Debug Mode

```bash
# Enable debug logging
kubectl patch deployment pdb-management-controller-manager -n canvas \
  --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/args", "value": ["--leader-elect", "--log-level=debug"]}]'
```

Hands-On Demo Repository

To see the PDB Management Operator in action within the full ODA Canvas environment, I’ve created a comprehensive demo repository:

[pdb-management-operator-demo](https://github.com/NoNickeD/pdb-management-operator-demo)

What’s Included

The demo deploys the complete ODA Canvas stack, not just the operator in isolation:

| Component | Purpose |
| --- | --- |
| ODA Canvas | Full TM Forum ODA Canvas installation |
| PDB Management Operator | The availability management operator |
| Sample ODA Components | TMF620, TMF622 reference implementations |
| AvailabilityPolicies | Example policies for different scenarios |
| Observability Stack | Prometheus + Grafana for metrics visualization |

Demo Scenarios

The demo walks through real-world scenarios:

  1. Annotation-Based PDB Creation: Deploy a component with availability annotations
  2. Policy Enforcement: See how strict policies override developer annotations
  3. Flexible Mode: Demonstrate minimum class enforcement
  4. Maintenance Windows: Observe PDB suspension during scheduled maintenance
  5. MCP Integration: Use Claude to analyze cluster availability

Quick Start

```bash
# Clone the demo repository
git clone https://github.com/NoNickeD/pdb-management-operator-demo.git
cd pdb-management-operator-demo

# Deploy full ODA Canvas with PDB Management Operator
task setup

# Deploy sample components
task deploy:components

# Create availability policies
task deploy:policies

# Watch PDBs being created
kubectl get pdb -A -w
```

This gives you a complete, working ODA Canvas environment to explore the operator’s capabilities.

Conclusion

Managing Pod Disruption Budgets at scale requires more than just YAML templates - it requires intelligent automation that understands organizational policies, component criticality, and operational contexts like maintenance windows.

The PDB Management Operator provides this automation with:

  • Dual configuration model for both developer flexibility and platform control
  • Three enforcement modes to balance governance with agility
  • Component intelligence that automatically protects security-critical workloads
  • Enterprise observability for compliance and troubleshooting
  • AI integration via MCP for intelligent cluster analysis

Whether you’re managing a handful of services or hundreds of ODA components, this operator ensures consistent, policy-driven availability management without the manual toil.


If you found this useful, you might also enjoy my related posts on Kubernetes operators and platform tooling:


This post is licensed under CC BY 4.0 by the author.