Kratix - Building Self-Service Platform Capabilities for Kubernetes
Platform engineering is about reducing friction while maintaining guardrails. As organizations scale their Kubernetes footprint across multiple clusters and environments, the need for self-service platform capabilities becomes critical. Teams shouldn’t raise tickets to get observability, databases, or development environments; they should be able to provision what they need on demand, within clear boundaries.
Kratix is a framework designed to solve exactly this problem. In this post, I’ll explore what Kratix is, its strengths and limitations, and demonstrate a practical use case: building an on-demand Datadog stack that SREs can install and uninstall across clusters with a simple kubectl command.
Who Should Read This?
This post is for:
- Platform Engineers building internal developer platforms
- SREs managing multi-cluster Kubernetes environments
- DevOps Leads evaluating platform orchestration tools
- Teams in regulated environments needing audit trails and consistent guardrails
If you’re dealing with multi-cluster complexity and want to offer self-service capabilities without sacrificing control, read on.
What is Kratix?
Kratix is an open-source platform framework built by Syntasso to help you create an internal developer platform exactly the way your organization needs it. It can be thought of as a platform orchestrator that helps platform teams deliver capabilities quickly, safely, and consistently.
Unlike tools that just template resources (Helm) or provision cloud infrastructure (Crossplane), Kratix focuses on platform workflows: the complete lifecycle of deploying, configuring, and managing capabilities across multiple clusters.
Why Not Crossplane?
A common question: “How is Kratix different from Crossplane?” Here’s the key distinction:
| Aspect | Crossplane | Kratix |
|---|---|---|
| Primary Focus | Cloud resource provisioning (RDS, S3, IAM) | Kubernetes workload orchestration |
| Multi-Cluster | Requires additional tooling | Built-in Destinations concept |
| Workflows | Composition-based | Pipeline containers (any logic) |
| Ecosystem | Rich provider ecosystem | Fewer pre-built Promises |
| Best For | Infrastructure-as-Code for cloud | Platform-as-a-Product for K8s |
Bottom line: Use Crossplane to provision cloud resources, and Kratix to orchestrate Kubernetes workloads and platform capabilities across clusters. They complement each other well.
What Kratix is NOT
To set clear expectations:
- Not a CI/CD replacement - It doesn’t build or deploy your application code
- Not a cloud provisioning engine - Use Crossplane or Terraform/OpenTofu for AWS/Azure/GCP resources
- Not a Backstage alternative - It complements Backstage (Backstage for UI, Kratix for backend orchestration)
- Not a service mesh - It doesn’t handle traffic management or observability collection
Core Concepts
- Promise: A reusable platform capability, a contract between your platform and your application teams (e.g., “Datadog Stack”, “PostgreSQL Database”, “Developer Environment”)
- Resource Request: A user’s request to instantiate a Promise (the CRD you `kubectl apply`)
- Destination: A target cluster where resources get deployed
- Pipeline: Workflow that transforms a request into actual Kubernetes resources
```mermaid
flowchart LR
    subgraph Platform["Platform Cluster"]
        User[User] -->|kubectl apply| Promise[Promise]
        Promise --> Pipeline[Pipeline]
        Pipeline -->|generates| Manifests[Manifests]
    end
    Manifests -->|commits to| Git[(Git Repository)]
    Git -->|watched by| ArgoCD[ArgoCD]
    subgraph Workload["Workload Clusters"]
        ArgoCD -->|syncs to| Dev[Dev Cluster]
        ArgoCD -->|syncs to| Test[Test Cluster]
        ArgoCD -->|syncs to| Staging[Staging Cluster]
    end
```
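To make these concepts concrete, here is a trimmed sketch of what a Promise definition looks like. It follows the structure of the Kratix Promise CRD, but the names and pipeline image are illustrative, not the exact definition used later in this post:

```yaml
apiVersion: platform.kratix.io/v1alpha1
kind: Promise
metadata:
  name: datadog-stack
spec:
  # The API users consume: a CRD installed on the platform cluster
  api:
    apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    metadata:
      name: datadogstacks.platform.srekubecraft.io
    spec:
      group: platform.srekubecraft.io
      names:
        kind: DatadogStack
        plural: datadogstacks
        singular: datadogstack
      scope: Namespaced
      versions:
        - name: v1alpha1
          served: true
          storage: true
          schema:
            openAPIV3Schema:
              type: object
              properties:
                spec:
                  type: object
                  properties:
                    tier:
                      type: string
  # The workflow that turns a Resource Request into manifests
  workflows:
    resource:
      configure:
        - apiVersion: platform.kratix.io/v1alpha1
          kind: Pipeline
          metadata:
            name: configure-datadog
          spec:
            containers:
              - name: generate-manifests
                image: ghcr.io/example/datadog-pipeline:v0.1.0  # illustrative image
```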
Why This Matters: Promises standardize self-service APIs in a way that scales across clusters, teams, and environments. Instead of each team crafting their own Helm values, they consume a consistent API defined by the platform team.
How It Works
1. Platform team defines a Promise - This includes the API (what users can request), the pipeline (how requests are processed), and dependencies (what gets installed on destinations)
2. User creates a Resource Request - A simple YAML file specifying what they need
3. Kratix executes the Pipeline - Generates Kubernetes manifests based on the request
4. State is committed to Git - Manifests are stored in a Git repository (see the state-store sketch below)
5. GitOps controller syncs - ArgoCD or Flux deploys the resources to the target cluster
6. Deletion reverses the process - Removing the Resource Request removes everything
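Step 4 is driven by a state store configured on the platform cluster. A minimal GitStateStore sketch might look like this (the repository URL and secret name are hypothetical):

```yaml
apiVersion: platform.kratix.io/v1alpha1
kind: GitStateStore
metadata:
  name: default
spec:
  url: https://github.com/example-org/platform-state.git  # hypothetical repo
  branch: main
  secretRef:
    name: git-credentials   # Secret holding the Git token or username/password
    namespace: default
```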
Platform Maturity Model
Where does Kratix fit in your platform journey?
| Maturity Level | Description | Tools |
|---|---|---|
| Level 1 | Manual operations | kubectl, scripts |
| Level 2 | Templated deployments | Helm, Kustomize |
| Level 3 | GitOps automation | ArgoCD, Flux |
| Level 4 | Multi-cluster GitOps | ApplicationSets, Fleet |
| Level 5 | Self-service platform APIs | Kratix (backend) + Backstage (UI) |
Kratix is a Level 5 tool: it assumes you already have GitOps in place and builds self-service capabilities on top.
Kratix CLI
The Kratix CLI is a tool designed to help you build Promises and manage your Kratix installation. It streamlines Promise development through three distinct pathways:
- From Scratch - Initialize Promises with custom specifications, then extend them by adding API properties, dependencies, and workflows
- From Helm Charts - Auto-generate Promise APIs directly from existing Helm chart values
- From Operators - Transform existing Kubernetes Operators into Promises by extracting their CustomResourceDefinitions
Key commands include:
| Command | Description |
|---|---|
| `kratix init promise` | Create a new Promise from scratch |
| `kratix update api` | Add or remove API fields |
| `kratix update dependencies` | Add external resources |
| `kratix add container` | Add workflow stages |
| `kratix build container` | Build containers for workflows |
The CLI significantly reduces the boilerplate needed to create Promises, especially when you’re wrapping existing Helm charts or Operators that you already run in production.
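For example, scaffolding a Promise from scratch might look like the following (the names and image are illustrative; check `kratix help` for the exact flags in your CLI version):

```bash
# Scaffold a new Promise with its API group and kind
kratix init promise datadog-stack --group platform.srekubecraft.io --kind DatadogStack

# Expose a field on the Promise API
kratix update api --property tier:string

# Add a workflow stage backed by a container image
kratix add container resource/configure/instance --image ghcr.io/example/datadog-pipeline:dev
```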
Kratix: Pros and Cons
Before diving into implementation, let’s evaluate Kratix honestly.
Pros
| Advantage | Description |
|---|---|
| Kubernetes-Native | Everything is a CRD. If you know kubectl, you know how to use Kratix |
| GitOps Integration | Built-in support for Git-based state stores, works seamlessly with ArgoCD/Flux |
| Multi-Cluster by Design | Destinations concept makes multi-cluster deployments first-class citizens |
| Flexible Pipelines | Use any container image in your pipeline—Helm, Kustomize, custom scripts |
| Self-Service APIs | Platform teams define the API, users consume it without knowing the implementation |
| Composable Promises | Promises can depend on other Promises, enabling complex platform capabilities |
| CLI Tooling | Kratix CLI accelerates Promise development from Helm charts or Operators |
| No Vendor Lock-in | Open-source, runs anywhere Kubernetes runs |
Cons
| Limitation | Description |
|---|---|
| Learning Curve | New concepts (Promises, Destinations, Pipelines) take time to understand |
| Pipeline Complexity | Writing pipeline containers requires understanding Kratix’s conventions |
| Limited Ecosystem | Fewer pre-built Promises compared to Crossplane’s provider ecosystem |
| State Management | Relies on Git for state; can be tricky with large-scale deployments |
| Debugging | Pipeline failures can be harder to debug than direct Helm/kubectl |
| Not CNCF | Unlike Crossplane, Kratix is not a CNCF project (it’s backed by Syntasso) |
When to Use Kratix
Kratix is a strong fit when your organization needs:
- Self-service platform capabilities that span multiple clusters
- Abstraction over complex deployments (hide Helm complexity behind simple APIs)
- GitOps-native workflows with full audit trails
- Custom business logic in your provisioning pipelines
- Consistent guardrails enforced across all environments
When to Consider Alternatives
| Alternative | Use When |
|---|---|
| Crossplane | You primarily need cloud resource provisioning (RDS, S3, IAM, etc.) |
| ArgoCD ApplicationSets | Simple templating across clusters is enough |
| Helm + CI/CD | You don’t need self-service or multi-cluster |
| Backstage Templates | You want a UI-first approach with scaffolding |
Use Cases for Kratix
Kratix shines when you need to offer self-service capabilities across multiple clusters. Here are some practical use cases:
| Use Case | Promise | Value |
|---|---|---|
| Observability Stacks | Datadog, Loki, Tempo | On-demand monitoring for debugging/testing |
| Databases | PostgreSQL, Redis, MongoDB | Self-service database provisioning |
| Chaos Engineering | LitmusChaos, ChaosMesh | Run chaos experiments when needed |
| Load Testing | k6, Locust | Spin up load generators for performance testing |
| Security Scanning | Falco, Trivy | Enable runtime security for specific tests |
| Feature Environments | Full app stack | Temporary environments for feature testing |
| Developer Environments | IDE, tools, dependencies | Pre-configured development setups |
The common thread: capabilities that benefit from self-service provisioning, consistent configuration, and lifecycle management across multiple clusters.
Practical Example: On-Demand Datadog Stack
Let’s walk through a concrete example. We’ll build a Kratix Promise that allows SREs to install and uninstall Datadog across Kubernetes clusters on-demand.
Why This Example?
Running observability tools like Datadog across all environments 24/7 isn’t always necessary. Dev and test clusters might only need monitoring during active debugging or testing. With Kratix, SREs can:
- Install Datadog when needed for debugging
- Choose the feature tier (logs only, logs + APM, full stack)
- Remove it when done to free resources
Why This Matters: On-demand observability means you pay only for what you use. For non-production environments, this can translate to significant savings, potentially hundreds of euros per month depending on your cluster sizes.
Architecture
The demo implements a single-cluster setup where the platform and workload run together:
```mermaid
flowchart TB
    subgraph Cluster["EKS Cluster"]
        subgraph Platform["Platform Components"]
            Kratix[Kratix Controller]
            ArgoCD[ArgoCD]
            ESO[External Secrets Operator]
            Flux[Flux Helm Controller]
        end
        subgraph Workload["Workload Namespace"]
            DD[Datadog Agents<br/>on-demand]
        end
    end
    subgraph External["External Services"]
        Git[(GitHub Repository<br/>GitStateStore)]
        AWS[(AWS Secrets Manager<br/>API Keys)]
    end
    Kratix -->|1. writes manifests| Git
    ArgoCD -->|2. watches & syncs| Git
    ArgoCD -->|3. deploys| Workload
    ESO -->|fetches secrets| AWS
    Flux -->|manages| DD
```
For multi-cluster setups, you would add additional Destinations pointing to workload clusters, and ArgoCD would sync to each based on labels.
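A Destination for an extra workload cluster might look like this sketch (names are illustrative); labels let Kratix schedule work to matching clusters:

```yaml
apiVersion: platform.kratix.io/v1alpha1
kind: Destination
metadata:
  name: worker-dev
  labels:
    environment: dev   # used to match scheduling rules
spec:
  stateStoreRef:
    name: default      # the GitStateStore this cluster's ArgoCD watches
    kind: GitStateStore
```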
The DatadogStack API
Here’s what the API looks like from the SRE perspective:
```yaml
# SRE applies this to request Datadog
apiVersion: platform.srekubecraft.io/v1alpha1
kind: DatadogStack
metadata:
  name: production
  namespace: default
spec:
  tier: full          # minimal, standard, or full
  environment: prod   # maps to secret path: datadog/<environment>/api-keys
  clusterName: kratix-demo
```
One simple YAML file to deploy a fully configured Datadog stack. The `environment` field determines which AWS Secrets Manager path to use for API keys (`datadog/prod/api-keys`, `datadog/dev/api-keys`, etc.). Each tier maps to different features and cost implications:
| Tier | Features | Resource Usage | Billing Impact | Use Case |
|---|---|---|---|---|
| minimal | Basic metrics, Cluster Agent | ~256MB RAM per node | Lowest | Quick debugging |
| standard | Metrics + APM + Logs + Service Monitoring | ~512MB RAM per node | Medium (APM costs) | Testing with traces |
| full | All features (NPM, Security, Process, etc.) | ~2GB RAM per node | Highest | Production monitoring |
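Inside the Go pipeline, tier selection can be as simple as mapping the requested tier to a pre-baked Helm values file. A minimal sketch, assuming this file layout (the demo repo’s actual implementation may differ):

```go
package main

import "fmt"

// tierValues maps a requested tier to a pre-baked Helm values file.
func tierValues(tier string) (string, error) {
	switch tier {
	case "minimal":
		return "values/minimal.yaml", nil // basic metrics + Cluster Agent
	case "standard":
		return "values/standard.yaml", nil // adds APM, logs, service monitoring
	case "full":
		return "values/full.yaml", nil // adds NPM, security, process monitoring
	default:
		return "", fmt.Errorf("unknown tier %q: expected minimal, standard, or full", tier)
	}
}

func main() {
	path, err := tierValues("standard")
	if err != nil {
		panic(err)
	}
	fmt.Println("using Helm values:", path)
}
```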
Request Lifecycle
```mermaid
sequenceDiagram
    participant SRE
    participant Platform as Platform Cluster
    participant Kratix
    participant Pipeline
    participant Git as Git Repository
    participant ArgoCD
    participant Workload as Workload Cluster
    SRE->>Platform: kubectl apply DatadogStack
    Platform->>Kratix: Resource created
    Kratix->>Pipeline: Execute pipeline
    Pipeline->>Pipeline: Generate Helm values
    Pipeline->>Pipeline: Template Datadog manifests
    Pipeline->>Git: Commit manifests
    Git->>ArgoCD: Change detected
    ArgoCD->>Workload: Sync resources
    Workload->>Workload: Deploy Datadog agents
    Note over SRE,Workload: Datadog now running
    SRE->>Platform: kubectl delete DatadogStack
    Platform->>Kratix: Resource deleted
    Kratix->>Git: Remove manifests
    Git->>ArgoCD: Change detected
    ArgoCD->>Workload: Prune resources
    Workload->>Workload: Remove Datadog agents
    Note over SRE,Workload: Datadog removed
```
Demo Walkthrough
Request Datadog with Full Tier:
```bash
# Apply the DatadogStack resource
kubectl apply -f promises/examples/full-tier.yaml

# Watch Kratix process the request
kubectl get datadogstacks -w
```
Output:
```
NAME         AGE   STATUS
production   0s    Pending
production   5s    Configuring
production   30s   Ready
```
Watch the Pipeline Execute:
```bash
# The pipeline runs as a pod
kubectl get pods | grep kratix-datadog

# Check pipeline logs
kubectl logs -l kratix.io/promise-name=datadog-stack
```
Verify Datadog is Running:
```bash
# Check the namespace was created
kubectl get ns | grep datadog

# Check Datadog pods (5 containers per agent in full tier)
kubectl get pods -n datadog-production
```
Output:
```
NAME                          READY   STATUS    RESTARTS   AGE
datadog-xxxxx                 5/5     Running   0          2m
datadog-yyyyy                 5/5     Running   0          2m
datadog-cluster-agent-zzzzz   1/1     Running   0          2m
```
Upgrade from Minimal to Standard Tier:
```yaml
spec:
  tier: standard   # Changed from minimal
  environment: dev
  clusterName: kratix-demo
```
Apply the change, and Kratix re-runs the pipeline with APM and logs enabled.
Remove Datadog:
```bash
kubectl delete datadogstack production
```
Kratix removes manifests from Git, ArgoCD prunes resources, Datadog is gone.
Security Considerations
When implementing Kratix in production, consider these security aspects:
GitOps Write Permissions
Kratix pipelines write to your Git repository. Ensure:
- The Git credentials have minimal required permissions (write to specific paths only)
- Use deploy keys or service accounts rather than personal tokens
- Enable branch protection on your main branch
- Consider signed commits for audit compliance
Pipeline Container Trust
Pipeline containers execute arbitrary logic. Mitigate risks by:
- Using private container registries with image scanning
- Implementing image signing (Cosign, Notary)
- Pinning images to specific digests, not just tags (see the sketch after this list)
- Running pipelines with minimal RBAC permissions
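For instance, pinning the workflow container by digest in the Promise might look like this abbreviated sketch (the image name and digest are illustrative):

```yaml
workflows:
  resource:
    configure:
      - apiVersion: platform.kratix.io/v1alpha1
        kind: Pipeline
        metadata:
          name: configure
        spec:
          containers:
            - name: generate-manifests
              # Digest pin ensures the exact scanned image runs, even if the tag moves
              image: ghcr.io/example/datadog-pipeline@sha256:4f53cda18c2baa0c0354bb5f9a3ecbe5ed12ab4d8e11ba873c2f11161202b945
```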
Secrets Handling
For our Datadog example, API keys are managed through AWS Secrets Manager and External Secrets Operator:
- Store Datadog API keys in AWS Secrets Manager at path `datadog/<environment>/api-keys`
- The pipeline generates an ExternalSecret that fetches keys automatically
- Use IRSA (IAM Roles for Service Accounts) for secure AWS authentication
- Separate secrets per environment (dev, staging, prod) for isolation
- The HelmRelease references the synced Kubernetes Secret via `valuesFrom` (sketch below)
```yaml
# ExternalSecret generated by pipeline
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: datadog-api-key
  namespace: datadog-production
spec:
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  data:
    - secretKey: api-key
      remoteRef:
        key: datadog/prod/api-keys
        property: api-key
```
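The generated HelmRelease can then consume the synced Secret without the key ever landing in Git. An abbreviated sketch (field names follow Flux’s HelmRelease API; the target path is an assumption about the Datadog chart’s values layout):

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: datadog
  namespace: datadog-production
spec:
  interval: 10m
  chart:
    spec:
      chart: datadog
      sourceRef:
        kind: HelmRepository
        name: datadog
  valuesFrom:
    - kind: Secret
      name: datadog-api-key   # the Secret created by the ExternalSecret above
      valuesKey: api-key
      targetPath: datadog.apiKey  # assumed chart value for the API key
```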
Troubleshooting Common Issues
When things go wrong, here’s where to look:
| Issue | Symptoms | Resolution |
|---|---|---|
| Pipeline Failures | DatadogStack stuck in “Configuring” | Check pipeline pod logs: kubectl logs -l kratix.io/promise-name=datadog-stack |
| InvalidImageName | Pipeline pod fails with image error | Ensure image name is lowercase (Docker registry requirement) |
| Git Write Errors | Manifests not appearing in repo | Check kubectl get workplacements -A and GitStateStore secret |
| ArgoCD OutOfSync | Resources not deployed | Force refresh: kubectl annotate app <name> -n argocd argocd.argoproj.io/refresh=hard |
| ExternalSecret Errors | Secret not synced | Check IAM policy allows access to the secret path, verify ClusterSecretStore status |
| OOMKilled Pods | Datadog agents restarting | Increase memory limits in tier values file, rebuild pipeline |
| HelmRelease/HelmRepository | API not found errors | Install Flux helm-controller and source-controller |
| Resource Conflicts | Helm errors about existing resources | Ensure namespace is clean, check for orphaned resources from previous installs |
Benefits for Platform Teams
| Benefit | Description |
|---|---|
| Self-Service Without Tickets | SREs provision what they need without waiting for approvals |
| Consistency | Every installation uses the same configuration and tagging conventions |
| Full Audit Trail | Every change is a Git commit—complete history of who did what and when |
| Resource Optimization | Install tools only when needed, remove when done |
| Reduced Overhead | Define the Promise once, SREs consume it across all environments |
Hands-On Demo Repository
I’ve built a complete demo repository that implements everything discussed in this post:
Demo Structure
```mermaid
flowchart TD
    subgraph Repo["Demo Repository"]
        TF[iac/]
        K8s[kubernetes/]
        Promises[promises/]
        GitOps[gitops/]
        GH[.github/workflows/]
    end
    TF -->|provisions| Infra[EKS Cluster + IAM]
    K8s -->|installs| Platform[ArgoCD + Kratix + ESO + Flux]
    Promises -->|defines| DD[DatadogStack Promise + Go Pipeline]
    GitOps -->|stores| State[Kratix Generated Manifests]
    GH -->|automates| CI[Pipeline Build + Promise Update]
```
What’s Included
| Directory | Contents |
|---|---|
| `iac/` | OpenTofu for EKS cluster, VPC, and IRSA |
| `kubernetes/argocd/` | ArgoCD OpenTofu setup and Application manifests |
| `kubernetes/kratix/` | GitStateStore and Destination configuration |
| `kubernetes/external-secrets/` | ClusterSecretStore and ExternalSecret definitions |
| `promises/` | Promise definition and example resources |
| `promises/pipelines/` | Go pipeline with tiered Helm values |
| `gitops/platform/` | Kratix-generated manifests (auto-populated) |
| `.github/workflows/` | CI/CD for pipeline image build and Promise updates |
Repository Architecture
The demo uses a single-cluster setup where the platform cluster also serves as the workload cluster:
- ArgoCD - GitOps controller for all deployments
- Kratix - Platform orchestrator for self-service APIs
- External Secrets Operator - Fetches Datadog API keys from AWS Secrets Manager
- Flux Helm Controller - Manages HelmReleases generated by Kratix
Quick Start
The repository includes a Taskfile for automation:
```bash
# Clone the repository
git clone https://github.com/NoNickeD/kratix-demo.git
cd kratix-demo

# Full setup (infrastructure + ArgoCD + platform)
task setup

# Or step by step:
task infra:apply     # Deploy EKS cluster
task kubeconfig      # Configure kubectl
task argocd:apply    # Deploy ArgoCD
task argocd:apps     # Apply ArgoCD Applications
task kratix:apply    # Configure Kratix
task promise:apply   # Install DatadogStack Promise

# Deploy Datadog (choose a tier)
task datadog:full    # or datadog:standard, datadog:minimal

# Check status
task status

# Remove Datadog
task datadog:delete

# Full teardown
task teardown
```
Taskfile Commands
| Command | Description |
|---|---|
| `task setup` | Full setup (infra + ArgoCD + platform) |
| `task teardown` | Full teardown |
| `task status` | Show status of all components |
| `task datadog:full` | Deploy Datadog with full tier |
| `task datadog:standard` | Deploy Datadog with standard tier |
| `task datadog:minimal` | Deploy Datadog with minimal tier |
| `task datadog:delete` | Remove all DatadogStack resources |
| `task argocd:password` | Get ArgoCD admin password |
| `task argocd:port-forward` | Port forward ArgoCD UI |
| `task pipeline:trigger` | Trigger pipeline build workflow |
CI/CD Pipeline
The GitHub Actions workflow automatically:
- Lints the Go pipeline code with `golangci-lint`
- Builds and pushes the Docker image to `ghcr.io`
- Updates `promises/promise.yaml` with the new image tag
- Commits the change back to the repository
This ensures the Promise always uses the latest pipeline image.
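An abbreviated workflow sketch of this flow (the repository’s actual workflow may differ in job names, actions, and steps):

```yaml
name: pipeline-image
on:
  push:
    paths:
      - "promises/pipelines/**"
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: write
      packages: write
    steps:
      - uses: actions/checkout@v4
      # Lint the Go pipeline code
      - uses: golangci/golangci-lint-action@v6
        with:
          working-directory: promises/pipelines
      # Authenticate to GitHub Container Registry
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build and push pipeline image
        run: |
          docker build -t ghcr.io/${{ github.repository }}/pipeline:${{ github.sha }} promises/pipelines
          docker push ghcr.io/${{ github.repository }}/pipeline:${{ github.sha }}
      - name: Update Promise and commit
        run: |
          sed -i "s|/pipeline:.*|/pipeline:${{ github.sha }}|" promises/promise.yaml
          git config user.name "ci-bot"
          git config user.email "ci-bot@users.noreply.github.com"
          git commit -am "chore: bump pipeline image" && git push
```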
Conclusion
Kratix provides a powerful framework for platform teams to offer self-service capabilities across Kubernetes clusters. Its Kubernetes-native approach, GitOps integration, flexible pipeline system, and CLI tooling make it well-suited for organizations building internal developer platforms.
The on-demand Datadog example demonstrates how complex deployments can be abstracted behind simple APIs, giving SREs and developers the tools they need while maintaining consistency and control. The same pattern applies to databases, chaos engineering tools, load testing infrastructure, and any other capability your platform needs to offer.
Whether Kratix is right for your organization depends on your specific needs:
- Choose Kratix if you need self-service capabilities with GitOps workflows and multi-cluster support
- Choose Crossplane if your focus is cloud resource provisioning
- Choose ArgoCD ApplicationSets if simple templating across clusters is enough
- Combine them if you need both cloud resources and platform capabilities
The key takeaway: platform engineering is about reducing friction while maintaining guardrails. Tools like Kratix help achieve that balance.
If you found this useful, you might also enjoy my related posts on observability and platform tooling.