Kratix - Building Self-Service Platform Capabilities for Kubernetes

Platform engineering is about reducing friction while maintaining guardrails. As organizations scale their Kubernetes footprint across multiple clusters and environments, the need for self-service platform capabilities becomes critical. Teams shouldn’t raise tickets to get observability, databases, or development environments; they should be able to provision what they need on demand, within clear boundaries.

Kratix is a framework designed to solve exactly this problem. In this post, I’ll explore what Kratix is, its strengths and limitations, and demonstrate a practical use case: building an on-demand Datadog stack that SREs can install and uninstall across clusters with a simple kubectl command.

Who Should Read This?

This post is for:

  • Platform Engineers building internal developer platforms
  • SREs managing multi-cluster Kubernetes environments
  • DevOps Leads evaluating platform orchestration tools
  • Teams in regulated environments needing audit trails and consistent guardrails

If you’re dealing with multi-cluster complexity and want to offer self-service capabilities without sacrificing control, read on.

What is Kratix?

Kratix is an open-source platform framework built by Syntasso to help you create an internal developer platform exactly the way your organization needs it. It can be thought of as a platform orchestrator that helps platform teams deliver capabilities quickly, safely, and consistently.

Unlike tools that just template resources (Helm) or provision cloud infrastructure (Crossplane), Kratix focuses on platform workflows: the complete lifecycle of deploying, configuring, and managing capabilities across multiple clusters.

Why Not Crossplane?

A common question: “How is Kratix different from Crossplane?” Here’s the key distinction:

| Aspect | Crossplane | Kratix |
|---|---|---|
| Primary Focus | Cloud resource provisioning (RDS, S3, IAM) | Kubernetes workload orchestration |
| Multi-Cluster | Requires additional tooling | Built-in Destinations concept |
| Workflows | Composition-based | Pipeline containers (any logic) |
| Ecosystem | Rich provider ecosystem | Fewer pre-built Promises |
| Best For | Infrastructure-as-Code for cloud | Platform-as-a-Product for K8s |

Bottom line: Use Crossplane to provision cloud resources, and Kratix to orchestrate Kubernetes workloads and platform capabilities across clusters. They complement each other well.

What Kratix is NOT

To set clear expectations:

  • Not a CI/CD replacement - It doesn’t build or deploy your application code
  • Not a cloud provisioning engine - Use Crossplane or Terraform/OpenTofu for AWS/Azure/GCP resources
  • Not a Backstage alternative - It complements Backstage (Backstage for UI, Kratix for backend orchestration)
  • Not a service mesh - It doesn’t handle traffic management or observability collection

Core Concepts

  • Promise: A reusable platform capability, a contract between your platform and your application teams (e.g., “Datadog Stack”, “PostgreSQL Database”, “Developer Environment”)
  • Resource Request: A user’s request to instantiate a Promise (the CRD you kubectl apply)
  • Destination: A target cluster where resources get deployed
  • Pipeline: Workflow that transforms a request into actual Kubernetes resources

```mermaid
flowchart LR
    subgraph Platform["Platform Cluster"]
        User[User] -->|kubectl apply| Promise[Promise]
        Promise --> Pipeline[Pipeline]
        Pipeline -->|generates| Manifests[Manifests]
    end

    Manifests -->|commits to| Git[(Git Repository)]
    Git -->|watched by| ArgoCD[ArgoCD]

    subgraph Workload["Workload Clusters"]
        ArgoCD -->|syncs to| Dev[Dev Cluster]
        ArgoCD -->|syncs to| Test[Test Cluster]
        ArgoCD -->|syncs to| Staging[Staging Cluster]
    end
```
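
To make these concepts concrete, here is a trimmed sketch of a Promise manifest. The field layout follows Kratix’s `platform.kratix.io/v1alpha1` API, but the schema and pipeline image are illustrative placeholders, not the demo’s actual definition:

```yaml
apiVersion: platform.kratix.io/v1alpha1
kind: Promise
metadata:
  name: datadog-stack
spec:
  # The API users consume: a CRD served from the platform cluster
  api:
    apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    metadata:
      name: datadogstacks.platform.srekubecraft.io
    spec:
      group: platform.srekubecraft.io
      names:
        kind: DatadogStack
        plural: datadogstacks
        singular: datadogstack
      scope: Namespaced
      versions:
        - name: v1alpha1
          served: true
          storage: true
          schema:
            openAPIV3Schema:
              type: object
              properties:
                spec:
                  type: object
                  properties:
                    tier:
                      type: string
                      enum: [minimal, standard, full]
  # The workflow that turns each Resource Request into manifests
  workflows:
    resource:
      configure:
        - apiVersion: platform.kratix.io/v1alpha1
          kind: Pipeline
          metadata:
            name: configure
          spec:
            containers:
              - name: generate-manifests
                image: ghcr.io/example/datadog-pipeline:v0.1.0 # placeholder image
```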

Why This Matters: Promises standardize self-service APIs in a way that scales across clusters, teams, and environments. Instead of each team crafting their own Helm values, they consume a consistent API defined by the platform team.

How It Works

  1. Platform team defines a Promise - This includes the API (what users can request), the pipeline (how requests are processed), and dependencies (what gets installed on destinations)
  2. User creates a Resource Request - A simple YAML file specifying what they need
  3. Kratix executes the Pipeline - Generates Kubernetes manifests based on the request
  4. State is committed to Git - Manifests are stored in a Git repository (configured via a state store; see the sketch after this list)
  5. GitOps controller syncs - ArgoCD or Flux deploys the resources to the target cluster
  6. Deletion reverses the process - Removing the Resource Request removes everything
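
Steps 4 and 5 rely on two objects you configure once per platform: a state store that tells Kratix where to write manifests, and one Destination per target cluster. A minimal sketch, assuming a GitHub repository and a pre-created credentials Secret (both placeholders):

```yaml
# Where Kratix writes generated manifests
apiVersion: platform.kratix.io/v1alpha1
kind: GitStateStore
metadata:
  name: default
spec:
  url: https://github.com/example/gitops-state.git # placeholder repository
  branch: main
  secretRef:
    name: git-credentials # placeholder Secret holding Git credentials
    namespace: default
---
# A target cluster, reconciled by the GitOps controller
apiVersion: platform.kratix.io/v1alpha1
kind: Destination
metadata:
  name: dev-cluster
  labels:
    environment: dev # matched by Promise destinationSelectors
spec:
  stateStoreRef:
    name: default
    kind: GitStateStore
```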

Platform Maturity Model

Where does Kratix fit in your platform journey?

| Maturity Level | Description | Tools |
|---|---|---|
| Level 1 | Manual operations | kubectl, scripts |
| Level 2 | Templated deployments | Helm, Kustomize |
| Level 3 | GitOps automation | ArgoCD, Flux |
| Level 4 | Multi-cluster GitOps | ApplicationSets, Fleet |
| Level 5 | Self-service platform APIs | Kratix (backend) + Backstage (UI) |

Kratix is a Level 5 tool: it assumes you already have GitOps in place and builds self-service capabilities on top.

Kratix CLI

The Kratix CLI is a tool designed to help you build Promises and manage your Kratix installation. It streamlines Promise development through three distinct pathways:

  • From Scratch - Initialize Promises with custom specifications, then extend them by adding API properties, dependencies, and workflows
  • From Helm Charts - Auto-generate Promise APIs directly from existing Helm chart values
  • From Operators - Transform existing Kubernetes Operators into Promises by extracting their CustomResourceDefinitions

Key commands include:

| Command | Description |
|---|---|
| `kratix init promise` | Create a new Promise from scratch |
| `kratix update api` | Add or remove API fields |
| `kratix update dependencies` | Add external resources |
| `kratix add container` | Add workflow stages |
| `kratix build container` | Build containers for workflows |

The CLI significantly reduces the boilerplate needed to create Promises, especially when you’re wrapping existing Helm charts or Operators that you already run in production.

Kratix: Pros and Cons

Before diving into implementation, let’s evaluate Kratix honestly.

Pros

| Advantage | Description |
|---|---|
| Kubernetes-Native | Everything is a CRD. If you know kubectl, you know how to use Kratix |
| GitOps Integration | Built-in support for Git-based state stores; works seamlessly with ArgoCD/Flux |
| Multi-Cluster by Design | The Destinations concept makes multi-cluster deployments first-class citizens |
| Flexible Pipelines | Use any container image in your pipeline: Helm, Kustomize, custom scripts |
| Self-Service APIs | Platform teams define the API; users consume it without knowing the implementation |
| Composable Promises | Promises can depend on other Promises, enabling complex platform capabilities |
| CLI Tooling | The Kratix CLI accelerates Promise development from Helm charts or Operators |
| No Vendor Lock-in | Open source, runs anywhere Kubernetes runs |

Cons

| Limitation | Description |
|---|---|
| Learning Curve | New concepts (Promises, Destinations, Pipelines) take time to understand |
| Pipeline Complexity | Writing pipeline containers requires understanding Kratix’s conventions |
| Limited Ecosystem | Fewer pre-built Promises compared to Crossplane’s provider ecosystem |
| State Management | Relies on Git for state; can be tricky with large-scale deployments |
| Debugging | Pipeline failures can be harder to debug than direct Helm/kubectl |
| Not CNCF | Unlike Crossplane, Kratix is not a CNCF project (it’s backed by Syntasso) |

When to Use Kratix

Kratix is a strong fit when your organization needs:

  • Self-service platform capabilities that span multiple clusters
  • Abstraction over complex deployments (hide Helm complexity behind simple APIs)
  • GitOps-native workflows with full audit trails
  • Custom business logic in your provisioning pipelines
  • Consistent guardrails enforced across all environments

When to Consider Alternatives

| Alternative | Use When |
|---|---|
| Crossplane | You primarily need cloud resource provisioning (RDS, S3, IAM, etc.) |
| ArgoCD ApplicationSets | Simple templating across clusters is enough |
| Helm + CI/CD | You don’t need self-service or multi-cluster |
| Backstage Templates | You want a UI-first approach with scaffolding |

Use Cases for Kratix

Kratix shines when you need to offer self-service capabilities across multiple clusters. Here are some practical use cases:

| Use Case | Promise | Value |
|---|---|---|
| Observability Stacks | Datadog, Loki, Tempo | On-demand monitoring for debugging/testing |
| Databases | PostgreSQL, Redis, MongoDB | Self-service database provisioning |
| Chaos Engineering | LitmusChaos, ChaosMesh | Run chaos experiments when needed |
| Load Testing | k6, Locust | Spin up load generators for performance testing |
| Security Scanning | Falco, Trivy | Enable runtime security for specific tests |
| Feature Environments | Full app stack | Temporary environments for feature testing |
| Developer Environments | IDE, tools, dependencies | Pre-configured development setups |

The common thread: capabilities that benefit from self-service provisioning, consistent configuration, and lifecycle management across multiple clusters.

Practical Example: On-Demand Datadog Stack

Let’s walk through a concrete example. We’ll build a Kratix Promise that allows SREs to install and uninstall Datadog across Kubernetes clusters on-demand.

Why This Example?

Running observability tools like Datadog across all environments 24/7 isn’t always necessary. Dev and test clusters might only need monitoring during active debugging or testing. With Kratix, SREs can:

  • Install Datadog when needed for debugging
  • Choose the feature tier (logs only, logs + APM, full stack)
  • Remove it when done to free resources

Why This Matters: On-demand observability means you pay only for what you use. For non-production environments, this can translate to significant savings, potentially hundreds of euros per month depending on your cluster sizes.

Architecture

The demo implements a single-cluster setup where the platform and workload run together:

```mermaid
flowchart TB
    subgraph Cluster["EKS Cluster"]
        subgraph Platform["Platform Components"]
            Kratix[Kratix Controller]
            ArgoCD[ArgoCD]
            ESO[External Secrets Operator]
            Flux[Flux Helm Controller]
        end

        subgraph Workload["Workload Namespace"]
            DD[Datadog Agents<br/>on-demand]
        end
    end

    subgraph External["External Services"]
        Git[(GitHub Repository<br/>GitStateStore)]
        AWS[(AWS Secrets Manager<br/>API Keys)]
    end

    Kratix -->|1. writes manifests| Git
    ArgoCD -->|2. watches & syncs| Git
    ArgoCD -->|3. deploys| Workload
    ESO -->|fetches secrets| AWS
    Flux -->|manages| DD
```

For multi-cluster setups, you would add additional Destinations pointing to workload clusters, and ArgoCD would sync to each based on labels.
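
Scheduling between Promises and Destinations is label-based: the Promise declares destination selectors that Kratix matches against Destination labels. A hedged fragment of what that looks like in the Promise spec:

```yaml
# Promise fragment: send generated resources to matching Destinations
spec:
  destinationSelectors:
    - matchLabels:
        environment: dev
```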

The DatadogStack API

Here’s what the API looks like from the SRE perspective:

```yaml
# SRE applies this to request Datadog
apiVersion: platform.srekubecraft.io/v1alpha1
kind: DatadogStack
metadata:
  name: production
  namespace: default
spec:
  tier: full # minimal, standard, or full
  environment: prod # maps to secret path: datadog/<environment>/api-keys
  clusterName: kratix-demo
```

One simple YAML file to deploy a fully configured Datadog stack. The environment field determines which AWS Secrets Manager path to use for API keys (datadog/prod/api-keys, datadog/dev/api-keys, etc.). The tiers map to different features and cost implications:

| Tier | Features | Resource Usage | Billing Impact | Use Case |
|---|---|---|---|---|
| `minimal` | Basic metrics, Cluster Agent | ~256MB RAM per node | Lowest | Quick debugging |
| `standard` | Metrics + APM + Logs + Service Monitoring | ~512MB RAM per node | Medium (APM costs) | Testing with traces |
| `full` | All features (NPM, Security, Process, etc.) | ~2GB RAM per node | Highest | Production monitoring |
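
Under the hood, each tier maps to a block of Datadog Helm chart values that the pipeline bakes into the generated HelmRelease. As an illustration of the idea (the demo’s exact values may differ), the `standard` tier could enable APM and log collection like this:

```yaml
# Illustrative chart values for tier: standard
datadog:
  apm:
    portEnabled: true # enable the trace agent port
  logs:
    enabled: true
    containerCollectAll: true # collect logs from all containers
clusterAgent:
  enabled: true
```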

Request Lifecycle

```mermaid
sequenceDiagram
    participant SRE
    participant Platform as Platform Cluster
    participant Kratix
    participant Pipeline
    participant Git as Git Repository
    participant ArgoCD
    participant Workload as Workload Cluster

    SRE->>Platform: kubectl apply DatadogStack
    Platform->>Kratix: Resource created
    Kratix->>Pipeline: Execute pipeline
    Pipeline->>Pipeline: Generate Helm values
    Pipeline->>Pipeline: Template Datadog manifests
    Pipeline->>Git: Commit manifests
    Git->>ArgoCD: Change detected
    ArgoCD->>Workload: Sync resources
    Workload->>Workload: Deploy Datadog agents

    Note over SRE,Workload: Datadog now running

    SRE->>Platform: kubectl delete DatadogStack
    Platform->>Kratix: Resource deleted
    Kratix->>Git: Remove manifests
    Git->>ArgoCD: Change detected
    ArgoCD->>Workload: Prune resources
    Workload->>Workload: Remove Datadog agents

    Note over SRE,Workload: Datadog removed
```

Demo Walkthrough

Request Datadog with Full Tier:

```bash
# Apply the DatadogStack resource
kubectl apply -f promises/examples/full-tier.yaml

# Watch Kratix process the request
kubectl get datadogstacks -w
```

Output:

```
NAME         AGE   STATUS
production   0s    Pending
production   5s    Configuring
production   30s   Ready
```

Watch the Pipeline Execute:

```bash
# The pipeline runs as a pod
kubectl get pods | grep kratix-datadog

# Check pipeline logs
kubectl logs -l kratix.io/promise-name=datadog-stack
```

Verify Datadog is Running:

```bash
# Check the namespace was created
kubectl get ns | grep datadog

# Check Datadog pods (5 containers per agent in full tier)
kubectl get pods -n datadog-production
```

Output:

```
NAME                                     READY   STATUS    RESTARTS   AGE
datadog-xxxxx                            5/5     Running   0          2m
datadog-yyyyy                            5/5     Running   0          2m
datadog-cluster-agent-zzzzz              1/1     Running   0          2m
```

Upgrade from Minimal to Standard Tier:

```yaml
spec:
  tier: standard # Changed from minimal
  environment: dev
  clusterName: kratix-demo
```

Apply the change, and Kratix re-runs the pipeline with APM and logs enabled.

Remove Datadog:

```bash
kubectl delete datadogstack production
```

Kratix removes the manifests from Git, ArgoCD prunes the resources, and Datadog is gone.

Security Considerations

When implementing Kratix in production, consider these security aspects:

GitOps Write Permissions

Kratix pipelines write to your Git repository. Ensure:

  • The Git credentials have minimal required permissions (write to specific paths only)
  • Use deploy keys or service accounts rather than personal tokens
  • Enable branch protection on your main branch
  • Consider signed commits for audit compliance

Pipeline Container Trust

Pipeline containers execute arbitrary logic. Mitigate risks by:

  • Using private container registries with image scanning
  • Implementing image signing (Cosign, Notary)
  • Pinning images to specific digests, not just tags (see the fragment after this list)
  • Running pipelines with minimal RBAC permissions
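
For instance, a digest-pinned pipeline container in the Promise removes the ambiguity of mutable tags (image name and digest here are placeholders):

```yaml
# Pipeline fragment: pin by digest so the image cannot silently change
containers:
  - name: generate-manifests
    image: ghcr.io/example/datadog-pipeline@sha256:<digest> # replace with the published digest
```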

Secrets Handling

For our Datadog example, API keys are managed through AWS Secrets Manager and External Secrets Operator:

  • Store Datadog API keys in AWS Secrets Manager at path datadog/<environment>/api-keys
  • The pipeline generates an ExternalSecret that fetches keys automatically
  • Use IRSA (IAM Roles for Service Accounts) for secure AWS authentication
  • Separate secrets per environment (dev, staging, prod) for isolation
  • The HelmRelease references the synced Kubernetes Secret via valuesFrom

```yaml
# ExternalSecret generated by pipeline
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: datadog-api-key
  namespace: datadog-production
spec:
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  data:
    - secretKey: api-key
      remoteRef:
        key: datadog/prod/api-keys
        property: api-key
```
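
On the consuming side, the Flux HelmRelease injects the synced key into the chart through valuesFrom. A hedged fragment, matching the names above (the targetPath assumes the Datadog chart’s `datadog.apiKey` value):

```yaml
# HelmRelease fragment: read the API key from the synced Secret
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: datadog
  namespace: datadog-production
spec:
  valuesFrom:
    - kind: Secret
      name: datadog-api-key
      valuesKey: api-key
      targetPath: datadog.apiKey
```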

Troubleshooting Common Issues

When things go wrong, here’s where to look:

| Issue | Symptoms | Resolution |
|---|---|---|
| Pipeline Failures | DatadogStack stuck in “Configuring” | Check pipeline pod logs: `kubectl logs -l kratix.io/promise-name=datadog-stack` |
| InvalidImageName | Pipeline pod fails with image error | Ensure the image name is lowercase (Docker registry requirement) |
| Git Write Errors | Manifests not appearing in repo | Check `kubectl get workplacements -A` and the GitStateStore secret |
| ArgoCD OutOfSync | Resources not deployed | Force refresh: `kubectl annotate app <name> -n argocd argocd.argoproj.io/refresh=hard` |
| ExternalSecret Errors | Secret not synced | Check the IAM policy allows access to the secret path; verify ClusterSecretStore status |
| OOMKilled Pods | Datadog agents restarting | Increase memory limits in the tier values file and rebuild the pipeline |
| HelmRelease/HelmRepository | API-not-found errors | Install the Flux helm-controller and source-controller |
| Resource Conflicts | Helm errors about existing resources | Ensure the namespace is clean; check for orphaned resources from previous installs |

Benefits for Platform Teams

| Benefit | Description |
|---|---|
| Self-Service Without Tickets | SREs provision what they need without waiting for approvals |
| Consistency | Every installation uses the same configuration and tagging conventions |
| Full Audit Trail | Every change is a Git commit: a complete history of who did what and when |
| Resource Optimization | Install tools only when needed, remove them when done |
| Reduced Overhead | Define the Promise once; SREs consume it across all environments |

Hands-On Demo Repository

I’ve built a complete demo repository that implements everything discussed in this post:

[kratix-demo](https://github.com/NoNickeD/kratix-demo)

Demo Structure

```mermaid
flowchart TD
    subgraph Repo["Demo Repository"]
        TF[iac/]
        K8s[kubernetes/]
        Promises[promises/]
        GitOps[gitops/]
        GH[.github/workflows/]
    end

    TF -->|provisions| Infra[EKS Cluster + IAM]
    K8s -->|installs| Platform[ArgoCD + Kratix + ESO + Flux]
    Promises -->|defines| DD[DatadogStack Promise + Go Pipeline]
    GitOps -->|stores| State[Kratix Generated Manifests]
    GH -->|automates| CI[Pipeline Build + Promise Update]
```

What’s Included

| Directory | Contents |
|---|---|
| `iac/` | OpenTofu for the EKS cluster, VPC, and IRSA |
| `kubernetes/argocd/` | ArgoCD OpenTofu setup and Application manifests |
| `kubernetes/kratix/` | GitStateStore and Destination configuration |
| `kubernetes/external-secrets/` | ClusterSecretStore and ExternalSecret definitions |
| `promises/` | Promise definition and example resources |
| `promises/pipelines/` | Go pipeline with tiered Helm values |
| `gitops/platform/` | Kratix-generated manifests (auto-populated) |
| `.github/workflows/` | CI/CD for pipeline image build and Promise updates |

Repository Architecture

The demo uses a single-cluster setup where the platform cluster also serves as the workload cluster:

  • ArgoCD - GitOps controller for all deployments (see the Application sketch after this list)
  • Kratix - Platform orchestrator for self-service APIs
  • External Secrets Operator - Fetches Datadog API keys from AWS Secrets Manager
  • Flux Helm Controller - Manages HelmReleases generated by Kratix
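
The glue between Kratix’s Git output and the cluster is an ArgoCD Application watching the state path. A minimal sketch mirroring the demo layout (sync options may differ from the actual manifests):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kratix-state
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/NoNickeD/kratix-demo.git
    targetRevision: main
    path: gitops/platform # where Kratix commits manifests
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: true # deleting a Resource Request prunes its manifests
      selfHeal: true
```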

Quick Start

The repository includes a Taskfile for automation:

```bash
# Clone the repository
git clone https://github.com/NoNickeD/kratix-demo.git
cd kratix-demo

# Full setup (infrastructure + ArgoCD + platform)
task setup

# Or step by step:
task infra:apply      # Deploy EKS cluster
task kubeconfig       # Configure kubectl
task argocd:apply     # Deploy ArgoCD
task argocd:apps      # Apply ArgoCD Applications
task kratix:apply     # Configure Kratix
task promise:apply    # Install DatadogStack Promise

# Deploy Datadog (choose a tier)
task datadog:full     # or datadog:standard, datadog:minimal

# Check status
task status

# Remove Datadog
task datadog:delete

# Full teardown
task teardown
```

Taskfile Commands

| Command | Description |
|---|---|
| `task setup` | Full setup (infra + ArgoCD + platform) |
| `task teardown` | Full teardown |
| `task status` | Show status of all components |
| `task datadog:full` | Deploy Datadog with the full tier |
| `task datadog:standard` | Deploy Datadog with the standard tier |
| `task datadog:minimal` | Deploy Datadog with the minimal tier |
| `task datadog:delete` | Remove all DatadogStack resources |
| `task argocd:password` | Get the ArgoCD admin password |
| `task argocd:port-forward` | Port-forward the ArgoCD UI |
| `task pipeline:trigger` | Trigger the pipeline build workflow |

CI/CD Pipeline

The GitHub Actions workflow automatically:

  1. Lints the Go pipeline code with golangci-lint
  2. Builds and pushes the Docker image to ghcr.io
  3. Updates promises/promise.yaml with the new image tag
  4. Commits the change back to the repository

This ensures the Promise always uses the latest pipeline image.
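
A condensed sketch of such a workflow; job names, action versions, and the sed-based update are illustrative, not the demo’s exact pipeline:

```yaml
# .github/workflows/pipeline.yaml (illustrative)
name: build-pipeline
on:
  push:
    paths: ["promises/pipelines/**"]
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: write # commit the Promise update
      packages: write # push to ghcr.io
    steps:
      - uses: actions/checkout@v4
      - uses: golangci/golangci-lint-action@v6
        with:
          working-directory: promises/pipelines
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build and push the pipeline image
        run: |
          IMAGE=ghcr.io/${{ github.repository }}/datadog-pipeline:${{ github.sha }}
          docker build -t "$IMAGE" promises/pipelines
          docker push "$IMAGE"
      - name: Point the Promise at the new image
        run: |
          sed -i "s|\(datadog-pipeline:\).*|\1${{ github.sha }}|" promises/promise.yaml
          git config user.name "github-actions"
          git config user.email "github-actions@users.noreply.github.com"
          git commit -am "chore: bump pipeline image" && git push
```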

Conclusion

Kratix provides a powerful framework for platform teams to offer self-service capabilities across Kubernetes clusters. Its Kubernetes-native approach, GitOps integration, flexible pipeline system, and CLI tooling make it well-suited for organizations building internal developer platforms.

The on-demand Datadog example demonstrates how complex deployments can be abstracted behind simple APIs, giving SREs and developers the tools they need while maintaining consistency and control. The same pattern applies to databases, chaos engineering tools, load testing infrastructure, and any other capability your platform needs to offer.

Whether Kratix is right for your organization depends on your specific needs:

  • Choose Kratix if you need self-service capabilities with GitOps workflows and multi-cluster support
  • Choose Crossplane if your focus is cloud resource provisioning
  • Choose ArgoCD ApplicationSets if simple templating across clusters is enough
  • Combine them if you need both cloud resources and platform capabilities

The key takeaway: platform engineering is about reducing friction while maintaining guardrails. Tools like Kratix help achieve that balance.



This post is licensed under CC BY 4.0 by the author.