Kratix - Building Self-Service Platform Capabilities for Kubernetes

Platform engineering is about reducing friction while maintaining guardrails. As organizations scale their Kubernetes footprint across multiple clusters and environments, the need for self-service platform capabilities becomes critical. Teams shouldn’t raise tickets to get observability, databases, or development environments; they should be able to provision what they need on demand, within clear boundaries.

Kratix is a framework designed to solve exactly this problem. In this post, I’ll explore what Kratix is, its strengths and limitations, and demonstrate a practical use case: building an on-demand Datadog stack that SREs can install and uninstall across clusters with a simple kubectl command.

Who Should Read This?

This post is for:

  • Platform Engineers building internal developer platforms
  • SREs managing multi-cluster Kubernetes environments
  • DevOps Leads evaluating platform orchestration tools
  • Teams in regulated environments needing audit trails and consistent guardrails

If you’re dealing with multi-cluster complexity and want to offer self-service capabilities without sacrificing control, read on.

What is Kratix?

Kratix is an open-source platform framework built by Syntasso to help you create an internal developer platform exactly the way your organization needs it. It can be thought of as a platform orchestrator that helps platform teams deliver capabilities quickly, safely, and consistently.

Unlike tools that just template resources (Helm) or provision cloud infrastructure (Crossplane), Kratix focuses on platform workflows: the complete lifecycle of deploying, configuring, and managing capabilities across multiple clusters.

Why Not Crossplane?

A common question: “How is Kratix different from Crossplane?” Here’s the key distinction:

| Aspect | Crossplane | Kratix |
|---|---|---|
| Primary Focus | Cloud resource provisioning (RDS, S3, IAM) | Kubernetes workload orchestration |
| Multi-Cluster | Requires additional tooling | Built-in Destinations concept |
| Workflows | Composition-based | Pipeline containers (any logic) |
| Ecosystem | Rich provider ecosystem | Fewer pre-built Promises |
| Best For | Infrastructure-as-Code for cloud | Platform-as-a-Product for K8s |

Bottom line: Use Crossplane to provision cloud resources, and Kratix to orchestrate Kubernetes workloads and platform capabilities across clusters. They complement each other well.

What Kratix is NOT

To set clear expectations:

  • Not a CI/CD replacement - It doesn’t build or deploy your application code
  • Not a cloud provisioning engine - Use Crossplane or Terraform/OpenTofu for AWS/Azure/GCP resources
  • Not a Backstage alternative - It complements Backstage (Backstage for UI, Kratix for backend orchestration)
  • Not a service mesh - It doesn’t handle traffic management or observability collection

Core Concepts

  • Promise: A reusable platform capability, a contract between your platform and your application teams (e.g., “Datadog Stack”, “PostgreSQL Database”, “Developer Environment”)
  • Resource Request: A user’s request to instantiate a Promise (the CRD you kubectl apply)
  • Destination: A target cluster where resources get deployed
  • Pipeline: Workflow that transforms a request into actual Kubernetes resources

```mermaid
flowchart LR
    subgraph Platform["Platform Cluster"]
        User[User] -->|kubectl apply| Promise[Promise]
        Promise --> Pipeline[Pipeline]
        Pipeline -->|generates| Manifests[Manifests]
    end

    Manifests -->|commits to| Git[(Git Repository)]
    Git -->|watched by| ArgoCD[ArgoCD]

    subgraph Workload["Workload Clusters"]
        ArgoCD -->|syncs to| Dev[Dev Cluster]
        ArgoCD -->|syncs to| Test[Test Cluster]
        ArgoCD -->|syncs to| Staging[Staging Cluster]
    end
```
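
To make these concepts concrete, here is a trimmed sketch of a Promise manifest. The field layout follows Kratix’s `platform.kratix.io/v1alpha1` API, but the schema and pipeline image are illustrative placeholders, not the demo’s actual definition:

```yaml
apiVersion: platform.kratix.io/v1alpha1
kind: Promise
metadata:
  name: datadog-stack
spec:
  # The API users consume: a CRD served from the platform cluster
  api:
    apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    metadata:
      name: datadogstacks.platform.srekubecraft.io
    spec:
      group: platform.srekubecraft.io
      names:
        kind: DatadogStack
        plural: datadogstacks
        singular: datadogstack
      scope: Namespaced
      versions:
        - name: v1alpha1
          served: true
          storage: true
          schema:
            openAPIV3Schema:
              type: object
              properties:
                spec:
                  type: object
                  properties:
                    tier:
                      type: string
                      enum: [minimal, standard, full]
  # The workflow that turns each Resource Request into manifests
  workflows:
    resource:
      configure:
        - apiVersion: platform.kratix.io/v1alpha1
          kind: Pipeline
          metadata:
            name: configure
          spec:
            containers:
              - name: generate-manifests
                image: ghcr.io/example/datadog-pipeline:v0.1.0 # placeholder image
```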

Why This Matters: Promises standardize self-service APIs in a way that scales across clusters, teams, and environments. Instead of each team crafting their own Helm values, they consume a consistent API defined by the platform team.

How It Works

  1. Platform team defines a Promise - This includes the API (what users can request), the pipeline (how requests are processed), and dependencies (what gets installed on destinations)
  2. User creates a Resource Request - A simple YAML file specifying what they need
  3. Kratix executes the Pipeline - Generates Kubernetes manifests based on the request
  4. State is committed to Git - Manifests are stored in a Git repository (configured via a state store; see the sketch after this list)
  5. GitOps controller syncs - ArgoCD or Flux deploys the resources to the target cluster
  6. Deletion reverses the process - Removing the Resource Request removes everything
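
Steps 4 and 5 rely on two objects you configure once per platform: a state store that tells Kratix where to write manifests, and one Destination per target cluster. A minimal sketch, assuming a GitHub repository and a pre-created credentials Secret (both placeholders):

```yaml
# Where Kratix writes generated manifests
apiVersion: platform.kratix.io/v1alpha1
kind: GitStateStore
metadata:
  name: default
spec:
  url: https://github.com/example/gitops-state.git # placeholder repository
  branch: main
  secretRef:
    name: git-credentials # placeholder Secret holding Git credentials
    namespace: default
---
# A target cluster, reconciled by the GitOps controller
apiVersion: platform.kratix.io/v1alpha1
kind: Destination
metadata:
  name: dev-cluster
  labels:
    environment: dev # matched by Promise destinationSelectors
spec:
  stateStoreRef:
    name: default
    kind: GitStateStore
```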

Platform Maturity Model

Where does Kratix fit in your platform journey?

| Maturity Level | Description | Tools |
|---|---|---|
| Level 1 | Manual operations | kubectl, scripts |
| Level 2 | Templated deployments | Helm, Kustomize |
| Level 3 | GitOps automation | ArgoCD, Flux |
| Level 4 | Multi-cluster GitOps | ApplicationSets, Fleet |
| Level 5 | Self-service platform APIs | Kratix (backend) + Backstage (UI) |

Kratix is a Level 5 tool: it assumes you already have GitOps in place and builds self-service capabilities on top.

Kratix CLI

The Kratix CLI is a tool designed to help you build Promises and manage your Kratix installation. It streamlines Promise development through three distinct pathways:

  • From Scratch - Initialize Promises with custom specifications, then extend them by adding API properties, dependencies, and workflows
  • From Helm Charts - Auto-generate Promise APIs directly from existing Helm chart values
  • From Operators - Transform existing Kubernetes Operators into Promises by extracting their CustomResourceDefinitions

Key commands include:

| Command | Description |
|---|---|
| `kratix init promise` | Create a new Promise from scratch |
| `kratix update api` | Add or remove API fields |
| `kratix update dependencies` | Add external resources |
| `kratix add container` | Add workflow stages |
| `kratix build container` | Build containers for workflows |

The CLI significantly reduces the boilerplate needed to create Promises, especially when you’re wrapping existing Helm charts or Operators that you already run in production.

Kratix: Pros and Cons

Before diving into implementation, let’s evaluate Kratix honestly.

Pros

| Advantage | Description |
|---|---|
| Kubernetes-Native | Everything is a CRD. If you know kubectl, you know how to use Kratix |
| GitOps Integration | Built-in support for Git-based state stores; works seamlessly with ArgoCD/Flux |
| Multi-Cluster by Design | The Destinations concept makes multi-cluster deployments first-class citizens |
| Flexible Pipelines | Use any container image in your pipeline: Helm, Kustomize, custom scripts |
| Self-Service APIs | Platform teams define the API; users consume it without knowing the implementation |
| Composable Promises | Promises can depend on other Promises, enabling complex platform capabilities |
| CLI Tooling | The Kratix CLI accelerates Promise development from Helm charts or Operators |
| No Vendor Lock-in | Open source, runs anywhere Kubernetes runs |

Cons

| Limitation | Description |
|---|---|
| Learning Curve | New concepts (Promises, Destinations, Pipelines) take time to understand |
| Pipeline Complexity | Writing pipeline containers requires understanding Kratix’s conventions |
| Limited Ecosystem | Fewer pre-built Promises compared to Crossplane’s provider ecosystem |
| State Management | Relies on Git for state; can be tricky with large-scale deployments |
| Debugging | Pipeline failures can be harder to debug than direct Helm/kubectl |
| Not CNCF | Unlike Crossplane, Kratix is not a CNCF project (it’s backed by Syntasso) |

When to Use Kratix

Kratix is a strong fit when your organization needs:

  • Self-service platform capabilities that span multiple clusters
  • Abstraction over complex deployments (hide Helm complexity behind simple APIs)
  • GitOps-native workflows with full audit trails
  • Custom business logic in your provisioning pipelines
  • Consistent guardrails enforced across all environments

When to Consider Alternatives

| Alternative | Use When |
|---|---|
| Crossplane | You primarily need cloud resource provisioning (RDS, S3, IAM, etc.) |
| ArgoCD ApplicationSets | Simple templating across clusters is enough |
| Helm + CI/CD | You don’t need self-service or multi-cluster |
| Backstage Templates | You want a UI-first approach with scaffolding |

Use Cases for Kratix

Kratix shines when you need to offer self-service capabilities across multiple clusters. Here are some practical use cases:

| Use Case | Promise | Value |
|---|---|---|
| Observability Stacks | Datadog, Loki, Tempo | On-demand monitoring for debugging/testing |
| Databases | PostgreSQL, Redis, MongoDB | Self-service database provisioning |
| Chaos Engineering | LitmusChaos, ChaosMesh | Run chaos experiments when needed |
| Load Testing | k6, Locust | Spin up load generators for performance testing |
| Security Scanning | Falco, Trivy | Enable runtime security for specific tests |
| Feature Environments | Full app stack | Temporary environments for feature testing |
| Developer Environments | IDE, tools, dependencies | Pre-configured development setups |

The common thread: capabilities that benefit from self-service provisioning, consistent configuration, and lifecycle management across multiple clusters.

Practical Example: On-Demand Datadog Stack

Let’s walk through a concrete example. We’ll build a Kratix Promise that allows SREs to install and uninstall Datadog across Kubernetes clusters on-demand.

Why This Example?

Running observability tools like Datadog across all environments 24/7 isn’t always necessary. Dev and test clusters might only need monitoring during active debugging or testing. With Kratix, SREs can:

  • Install Datadog when needed for debugging
  • Choose the feature tier (logs only, logs + APM, full stack)
  • Remove it when done to free resources

Why This Matters: On-demand observability means you pay only for what you use. For non-production environments, this can translate to significant savings, potentially hundreds of euros per month depending on your cluster sizes.

Architecture

The demo implements a single-cluster setup where the platform and workload run together:

```mermaid
flowchart TB
    subgraph Cluster["EKS Cluster"]
        subgraph Platform["Platform Components"]
            Kratix[Kratix Controller]
            ArgoCD[ArgoCD]
            ESO[External Secrets Operator]
            Flux[Flux Helm Controller]
        end

        subgraph Workload["Workload Namespace"]
            DD[Datadog Agents<br/>on-demand]
        end
    end

    subgraph External["External Services"]
        Git[(GitHub Repository<br/>GitStateStore)]
        AWS[(AWS Secrets Manager<br/>API Keys)]
    end

    Kratix -->|1. writes manifests| Git
    ArgoCD -->|2. watches & syncs| Git
    ArgoCD -->|3. deploys| Workload
    ESO -->|fetches secrets| AWS
    Flux -->|manages| DD
```

For multi-cluster setups, you would add additional Destinations pointing to workload clusters, and ArgoCD would sync to each based on labels.
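
Scheduling between Promises and Destinations is label-based: the Promise declares destination selectors that Kratix matches against Destination labels. A hedged fragment of what that looks like in the Promise spec:

```yaml
# Promise fragment: send generated resources to matching Destinations
spec:
  destinationSelectors:
    - matchLabels:
        environment: dev
```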

The DatadogStack API

Here’s what the API looks like from the SRE perspective:

```yaml
# SRE applies this to request Datadog
apiVersion: platform.srekubecraft.io/v1alpha1
kind: DatadogStack
metadata:
  name: production
  namespace: default
spec:
  tier: full # minimal, standard, or full
  environment: prod # maps to secret path: datadog/<environment>/api-keys
  clusterName: kratix-demo
```

One simple YAML file to deploy a fully configured Datadog stack. The environment field determines which AWS Secrets Manager path to use for API keys (datadog/prod/api-keys, datadog/dev/api-keys, etc.). The tiers map to different features and cost implications:

| Tier | Features | Resource Usage | Billing Impact | Use Case |
|---|---|---|---|---|
| `minimal` | Basic metrics, Cluster Agent | ~256MB RAM per node | Lowest | Quick debugging |
| `standard` | Metrics + APM + Logs + Service Monitoring | ~512MB RAM per node | Medium (APM costs) | Testing with traces |
| `full` | All features (NPM, Security, Process, etc.) | ~2GB RAM per node | Highest | Production monitoring |
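
Under the hood, each tier maps to a block of Datadog Helm chart values that the pipeline bakes into the generated HelmRelease. As an illustration of the idea (the demo’s exact values may differ), the `standard` tier could enable APM and log collection like this:

```yaml
# Illustrative chart values for tier: standard
datadog:
  apm:
    portEnabled: true # enable the trace agent port
  logs:
    enabled: true
    containerCollectAll: true # collect logs from all containers
clusterAgent:
  enabled: true
```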

Request Lifecycle

```mermaid
sequenceDiagram
    participant SRE
    participant Platform as Platform Cluster
    participant Kratix
    participant Pipeline
    participant Git as Git Repository
    participant ArgoCD
    participant Workload as Workload Cluster

    SRE->>Platform: kubectl apply DatadogStack
    Platform->>Kratix: Resource created
    Kratix->>Pipeline: Execute pipeline
    Pipeline->>Pipeline: Generate Helm values
    Pipeline->>Pipeline: Template Datadog manifests
    Pipeline->>Git: Commit manifests
    Git->>ArgoCD: Change detected
    ArgoCD->>Workload: Sync resources
    Workload->>Workload: Deploy Datadog agents

    Note over SRE,Workload: Datadog now running

    SRE->>Platform: kubectl delete DatadogStack
    Platform->>Kratix: Resource deleted
    Kratix->>Git: Remove manifests
    Git->>ArgoCD: Change detected
    ArgoCD->>Workload: Prune resources
    Workload->>Workload: Remove Datadog agents

    Note over SRE,Workload: Datadog removed
```

Demo Walkthrough

Request Datadog with Full Tier:

```bash
# Apply the DatadogStack resource
kubectl apply -f promises/examples/full-tier.yaml

# Watch Kratix process the request
kubectl get datadogstacks -w
```

Output:

```
NAME         AGE   STATUS
production   0s    Pending
production   5s    Configuring
production   30s   Ready
```

Watch the Pipeline Execute:

```bash
# The pipeline runs as a pod
kubectl get pods | grep kratix-datadog

# Check pipeline logs
kubectl logs -l kratix.io/promise-name=datadog-stack
```

Verify Datadog is Running:

```bash
# Check the namespace was created
kubectl get ns | grep datadog

# Check Datadog pods (5 containers per agent in full tier)
kubectl get pods -n datadog-production
```

Output:

```
NAME                                     READY   STATUS    RESTARTS   AGE
datadog-xxxxx                            5/5     Running   0          2m
datadog-yyyyy                            5/5     Running   0          2m
datadog-cluster-agent-zzzzz              1/1     Running   0          2m
```

Upgrade from Minimal to Standard Tier:

```yaml
spec:
  tier: standard # Changed from minimal
  environment: dev
  clusterName: kratix-demo
```

Apply the change, and Kratix re-runs the pipeline with APM and logs enabled.

Remove Datadog:

```bash
kubectl delete datadogstack production
```

Kratix removes the manifests from Git, ArgoCD prunes the resources, and Datadog is gone.

Security Considerations

When implementing Kratix in production, consider these security aspects:

GitOps Write Permissions

Kratix pipelines write to your Git repository. Ensure:

  • The Git credentials have minimal required permissions (write to specific paths only)
  • Use deploy keys or service accounts rather than personal tokens
  • Enable branch protection on your main branch
  • Consider signed commits for audit compliance

Pipeline Container Trust

Pipeline containers execute arbitrary logic. Mitigate risks by:

  • Using private container registries with image scanning
  • Implementing image signing (Cosign, Notary)
  • Pinning images to specific digests, not just tags (see the fragment after this list)
  • Running pipelines with minimal RBAC permissions
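
For instance, a digest-pinned pipeline container in the Promise removes the ambiguity of mutable tags (image name and digest here are placeholders):

```yaml
# Pipeline fragment: pin by digest so the image cannot silently change
containers:
  - name: generate-manifests
    image: ghcr.io/example/datadog-pipeline@sha256:<digest> # replace with the published digest
```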

Secrets Handling

For our Datadog example, API keys are managed through AWS Secrets Manager and External Secrets Operator:

  • Store Datadog API keys in AWS Secrets Manager at path datadog/<environment>/api-keys
  • The pipeline generates an ExternalSecret that fetches keys automatically
  • Use IRSA (IAM Roles for Service Accounts) for secure AWS authentication
  • Separate secrets per environment (dev, staging, prod) for isolation
  • The HelmRelease references the synced Kubernetes Secret via valuesFrom

```yaml
# ExternalSecret generated by pipeline
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: datadog-api-key
  namespace: datadog-production
spec:
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  data:
    - secretKey: api-key
      remoteRef:
        key: datadog/prod/api-keys
        property: api-key
```
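
On the consuming side, the Flux HelmRelease injects the synced key into the chart through valuesFrom. A hedged fragment, matching the names above (the targetPath assumes the Datadog chart’s `datadog.apiKey` value):

```yaml
# HelmRelease fragment: read the API key from the synced Secret
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: datadog
  namespace: datadog-production
spec:
  valuesFrom:
    - kind: Secret
      name: datadog-api-key
      valuesKey: api-key
      targetPath: datadog.apiKey
```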

Troubleshooting Common Issues

When things go wrong, here’s where to look:

| Issue | Symptoms | Resolution |
|---|---|---|
| Pipeline Failures | DatadogStack stuck in “Configuring” | Check pipeline pod logs: `kubectl logs -l kratix.io/promise-name=datadog-stack` |
| InvalidImageName | Pipeline pod fails with image error | Ensure the image name is lowercase (Docker registry requirement) |
| Git Write Errors | Manifests not appearing in repo | Check `kubectl get workplacements -A` and the GitStateStore secret |
| ArgoCD OutOfSync | Resources not deployed | Force refresh: `kubectl annotate app <name> -n argocd argocd.argoproj.io/refresh=hard` |
| ExternalSecret Errors | Secret not synced | Check the IAM policy allows access to the secret path; verify ClusterSecretStore status |
| OOMKilled Pods | Datadog agents restarting | Increase memory limits in the tier values file and rebuild the pipeline |
| HelmRelease/HelmRepository | API-not-found errors | Install the Flux helm-controller and source-controller |
| Resource Conflicts | Helm errors about existing resources | Ensure the namespace is clean; check for orphaned resources from previous installs |

Benefits for Platform Teams

| Benefit | Description |
|---|---|
| Self-Service Without Tickets | SREs provision what they need without waiting for approvals |
| Consistency | Every installation uses the same configuration and tagging conventions |
| Full Audit Trail | Every change is a Git commit: a complete history of who did what and when |
| Resource Optimization | Install tools only when needed, remove them when done |
| Reduced Overhead | Define the Promise once; SREs consume it across all environments |

Hands-On Demo Repository

I’ve built a complete demo repository that implements everything discussed in this post:

[kratix-demo](https://github.com/NoNickeD/kratix-demo)

Demo Structure

```mermaid
flowchart TD
    subgraph Repo["Demo Repository"]
        TF[iac/]
        K8s[kubernetes/]
        Promises[promises/]
        GitOps[gitops/]
        GH[.github/workflows/]
    end

    TF -->|provisions| Infra[EKS Cluster + IAM]
    K8s -->|installs| Platform[ArgoCD + Kratix + ESO + Flux]
    Promises -->|defines| DD[DatadogStack Promise + Go Pipeline]
    GitOps -->|stores| State[Kratix Generated Manifests]
    GH -->|automates| CI[Pipeline Build + Promise Update]
```

What’s Included

| Directory | Contents |
|---|---|
| `iac/` | OpenTofu for the EKS cluster, VPC, and IRSA |
| `kubernetes/argocd/` | ArgoCD OpenTofu setup and Application manifests |
| `kubernetes/kratix/` | GitStateStore and Destination configuration |
| `kubernetes/external-secrets/` | ClusterSecretStore and ExternalSecret definitions |
| `promises/` | Promise definition and example resources |
| `promises/pipelines/` | Go pipeline with tiered Helm values |
| `gitops/platform/` | Kratix-generated manifests (auto-populated) |
| `.github/workflows/` | CI/CD for pipeline image build and Promise updates |

Repository Architecture

The demo uses a single-cluster setup where the platform cluster also serves as the workload cluster:

  • ArgoCD - GitOps controller for all deployments (see the Application sketch after this list)
  • Kratix - Platform orchestrator for self-service APIs
  • External Secrets Operator - Fetches Datadog API keys from AWS Secrets Manager
  • Flux Helm Controller - Manages HelmReleases generated by Kratix
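
The glue between Kratix’s Git output and the cluster is an ArgoCD Application watching the state path. A minimal sketch mirroring the demo layout (sync options may differ from the actual manifests):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kratix-state
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/NoNickeD/kratix-demo.git
    targetRevision: main
    path: gitops/platform # where Kratix commits manifests
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: true # deleting a Resource Request prunes its manifests
      selfHeal: true
```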

Quick Start

The repository includes a Taskfile for automation:

```bash
# Clone the repository
git clone https://github.com/NoNickeD/kratix-demo.git
cd kratix-demo

# Full setup (infrastructure + ArgoCD + platform)
task setup

# Or step by step:
task infra:apply      # Deploy EKS cluster
task kubeconfig       # Configure kubectl
task argocd:apply     # Deploy ArgoCD
task argocd:apps      # Apply ArgoCD Applications
task kratix:apply     # Configure Kratix
task promise:apply    # Install DatadogStack Promise

# Deploy Datadog (choose a tier)
task datadog:full     # or datadog:standard, datadog:minimal

# Check status
task status

# Remove Datadog
task datadog:delete

# Full teardown
task teardown
```

Taskfile Commands

| Command | Description |
|---|---|
| `task setup` | Full setup (infra + ArgoCD + platform) |
| `task teardown` | Full teardown |
| `task status` | Show status of all components |
| `task datadog:full` | Deploy Datadog with the full tier |
| `task datadog:standard` | Deploy Datadog with the standard tier |
| `task datadog:minimal` | Deploy Datadog with the minimal tier |
| `task datadog:delete` | Remove all DatadogStack resources |
| `task argocd:password` | Get the ArgoCD admin password |
| `task argocd:port-forward` | Port-forward the ArgoCD UI |
| `task pipeline:trigger` | Trigger the pipeline build workflow |

CI/CD Pipeline

The GitHub Actions workflow automatically:

  1. Lints the Go pipeline code with golangci-lint
  2. Builds and pushes the Docker image to ghcr.io
  3. Updates promises/promise.yaml with the new image tag
  4. Commits the change back to the repository

This ensures the Promise always uses the latest pipeline image.
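
A condensed sketch of such a workflow; job names, action versions, and the sed-based update are illustrative, not the demo’s exact pipeline:

```yaml
# .github/workflows/pipeline.yaml (illustrative)
name: build-pipeline
on:
  push:
    paths: ["promises/pipelines/**"]
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: write # commit the Promise update
      packages: write # push to ghcr.io
    steps:
      - uses: actions/checkout@v4
      - uses: golangci/golangci-lint-action@v6
        with:
          working-directory: promises/pipelines
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build and push the pipeline image
        run: |
          IMAGE=ghcr.io/${{ github.repository }}/datadog-pipeline:${{ github.sha }}
          docker build -t "$IMAGE" promises/pipelines
          docker push "$IMAGE"
      - name: Point the Promise at the new image
        run: |
          sed -i "s|\(datadog-pipeline:\).*|\1${{ github.sha }}|" promises/promise.yaml
          git config user.name "github-actions"
          git config user.email "github-actions@users.noreply.github.com"
          git commit -am "chore: bump pipeline image" && git push
```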

Conclusion

Kratix provides a powerful framework for platform teams to offer self-service capabilities across Kubernetes clusters. Its Kubernetes-native approach, GitOps integration, flexible pipeline system, and CLI tooling make it well-suited for organizations building internal developer platforms.

The on-demand Datadog example demonstrates how complex deployments can be abstracted behind simple APIs, giving SREs and developers the tools they need while maintaining consistency and control. The same pattern applies to databases, chaos engineering tools, load testing infrastructure, and any other capability your platform needs to offer.

Whether Kratix is right for your organization depends on your specific needs:

  • Choose Kratix if you need self-service capabilities with GitOps workflows and multi-cluster support
  • Choose Crossplane if your focus is cloud resource provisioning
  • Choose ArgoCD ApplicationSets if simple templating across clusters is enough
  • Combine them if you need both cloud resources and platform capabilities

The key takeaway: platform engineering is about reducing friction while maintaining guardrails. Tools like Kratix help achieve that balance.



This post is licensed under CC BY 4.0 by the author.