Building a Read-Only Kubernetes Agent with Google ADK (Go)
// How to build a safe, read-only Kubernetes operations assistant in Go using Google's Agent Development Kit, packaged in a scratch image and deployed via Helm with External Secrets Operator.
Every k8s incident I have watched in the last two years has the same opening five minutes: a senior engineer narrating kubectl get pod, kubectl describe, kubectl logs --previous, while a junior engineer types it all into Slack. Same six commands, same order, every time. The work is mechanical; the context-switching is the cost.
I wanted a small chat interface that could run those commands on my behalf and explain the output, without ever reaching for kubectl delete or exec. Not “AI ops”. Not a control plane. A read-only co-pilot for the boring 80% of triage.
This post walks through adk-k8s-agent: a single-binary Go agent built on Google’s Agent Development Kit, packaged in a scratch image, and deployed via Helm with External Secrets Operator. Total code: about 250 lines of Go and one Helm chart.
Who Should Read This?#
This post is for:
- SREs who want a chat interface for cluster triage that cannot accidentally write
- Platform engineers building internal developer tools and curious about ADK’s primitives
- Security-minded engineers evaluating how to safely embed an LLM in cluster operations
What is ADK?#
Agent Development Kit is Google’s open-source toolkit for building, evaluating, and deploying AI agents. It is model-agnostic but first-class for Gemini. The Go variant lives at google.golang.org/adk and ships with the building blocks any agent needs: Agent, Tool, Skill, Runner, Session, plus a launcher that exposes your agent over CLI, HTTP API, or a web UI without writing a server.
Why Not Just Use the Gemini SDK Directly?#
| Aspect | Gemini SDK only | ADK Go |
|---|---|---|
| Tool calling | You wire FunctionDeclaration, JSON-Schema, the loop | functiontool.New(cfg, goFunc) infers schema from struct tags |
| Skill packaging | Roll your own loader, prompt injection | SKILL.md files with frontmatter, lazy-loaded by the runtime |
| Multi-modal launcher | Build your HTTP server, build a UI | full.NewLauncher().Execute(ctx, cfg, args) |
| Multi-agent topologies | Hand-rolled routing | SequentialAgent, ParallelAgent, LoopAgent |
| Sessions / state / memory | DIY storage | Pluggable services |
Bottom line: ADK is the difference between writing 30 lines of agent setup and writing several hundred lines of plumbing.
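To make the "infers schema from struct tags" row concrete, here is a minimal sketch of the general technique using only the standard library's reflect package. It is not ADK's implementation. The kubectlArgs type, the property type, and the describe tag are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// kubectlArgs is a hypothetical tool input type, used only for illustration.
type kubectlArgs struct {
	Args      []string `json:"args" describe:"kubectl arguments, verb first"`
	Namespace string   `json:"namespace,omitempty" describe:"optional namespace"`
}

// property is a tiny, hypothetical subset of a JSON-Schema property.
type property struct {
	Type        string
	Description string
	Required    bool
}

// jsonType maps a Go kind to a JSON-Schema type name.
func jsonType(k reflect.Kind) string {
	switch k {
	case reflect.String:
		return "string"
	case reflect.Bool:
		return "boolean"
	case reflect.Slice, reflect.Array:
		return "array"
	case reflect.Int, reflect.Int32, reflect.Int64:
		return "integer"
	case reflect.Float32, reflect.Float64:
		return "number"
	default:
		return "object"
	}
}

// inferSchema builds one property per exported field of a struct value.
// The json tag supplies the name; a missing omitempty marks the field required.
func inferSchema(v any) map[string]property {
	t := reflect.TypeOf(v)
	props := make(map[string]property)
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		if !f.IsExported() {
			continue
		}
		name, opts, _ := strings.Cut(f.Tag.Get("json"), ",")
		if name == "-" {
			continue
		}
		if name == "" {
			name = f.Name
		}
		props[name] = property{
			Type:        jsonType(f.Type.Kind()),
			Description: f.Tag.Get("describe"),
			Required:    !strings.Contains(opts, "omitempty"),
		}
	}
	return props
}

func main() {
	// Map iteration order is not fixed, so the two lines may print in either order.
	for name, p := range inferSchema(kubectlArgs{}) {
		fmt.Printf("%s: %+v\n", name, p)
	}
}
```

The point of the sketch is the trade: the struct is the single source of truth, so the function signature and the schema the model sees cannot drift apart.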
What This Is NOT#
- Not a multi-agent orchestrator out of the box. You compose those yourself.
- Not a vector database. Bring your own retrieval if you need RAG.
- Not a replacement for k8sgpt. It is a kit for building your own narrow agents, not a finished tool.
Core Concepts#
flowchart TB
User[User] --> Agent
subgraph Agent["LLM Agent (Gemini Flash)"]
Prompt[System Instruction<br/>read-only contract]
end
Agent -->|invokes| Tool[kubectl Tool<br/>verb whitelist]
Agent -->|loads on demand| Skills[Skills<br/>SKILL.md files]
Tool -->|exec| Kubectl[(kubectl binary)]
Kubectl -->|API call| API[Kubernetes API]
subgraph RBAC["RBAC"]
Role[ClusterRole<br/>get/list/watch only]
end
API --- Role
The runtime has four pieces:
- Agent: the LLM-backed worker. We use one llmagent with Gemini Flash.
- Tool: a Go function exposed to the model. Ours shells out to kubectl.
- Skill: a SKILL.md file with metadata and instructions. The model loads its body only when its description matches the user’s intent.
- Runner / Launcher: the loop. ADK ships cmd/launcher/full which exposes CLI, REST, and a web UI from a single binary.
The Whole Agent in 30 Lines#
The wiring is small enough to read in one go:
model, _ := gemini.NewModel(ctx, "gemini-flash-latest", &genai.ClientConfig{APIKey: apiKey})
kubectlTool, _ := newKubectlTool()
skillsRoot, _ := fs.Sub(skillsFS, "skills")
skillSet, _ := skilltoolset.New(ctx, skilltoolset.Config{
	Source: skill.NewFileSystemSource(skillsRoot),
})
rootAgent, _ := llmagent.New(llmagent.Config{
	Name:        "k8s_assistant",
	Model:       model,
	Description: "Read-only Kubernetes operations assistant.",
	Instruction: systemInstruction,
	Tools:       []tool.Tool{kubectlTool},
	Toolsets:    []tool.Toolset{skillSet},
})
cfg := &launcher.Config{AgentLoader: agent.NewSingleLoader(rootAgent)}
full.NewLauncher().Execute(ctx, cfg, os.Args[1:])
(Errors elided for brevity. The real agent.go uses slog and exits explicitly.)
That is it. Run go run . web api webui and you have a chat UI on :8080 that can call kubectl.
The kubectl Tool: Defense Layer One#
The single most important file in the project is kubectl_tool.go. It is the layer that decides what the model can actually run. Four guards stack on top of each other:
// Closed set of read-only kubectl subcommands we will execute.
var allowedVerbs = map[string]bool{
	"get": true, "describe": true, "logs": true, "top": true,
	"events": true, "explain": true, "api-resources": true,
	"api-versions": true, "version": true, "cluster-info": true,
	"config": true, // sub-verb whitelisted separately
}
// Flags that could redirect kubectl to a different cluster or identity.
var blockedFlagPrefixes = []string{
	"--token", "--server", "--kubeconfig",
	"--as", "--as-group",
	"--client-key", "--client-certificate",
	"--username", "--password",
}
The function then:
- Rejects any verb not in allowedVerbs.
- Rejects any arg matching blockedFlagPrefixes. No --server to point at a different cluster, no --as to escalate identity.
- Wraps the subprocess in a 30-second context.WithTimeout so a hung kubectl cannot stall the agent.
- Caps stdout / stderr at 16 KiB / 4 KiB to keep the model’s context tight.

Why This Matters: even if the model hallucinates kubectl delete pod -n kube-system kube-apiserver, the verb check rejects it before os/exec ever sees it. The LLM is treated as untrusted input.
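The guards listed above can be sketched with the standard library alone. This is not the project's kubectl_tool.go: validateArgs, cappedBuffer, and run are hypothetical names, the verb set is shortened, and the separate handling of the config sub-verb is left out.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// Shortened copies of the two lists from the post.
var allowedVerbs = map[string]bool{
	"get": true, "describe": true, "logs": true, "top": true, "events": true,
}

var blockedFlagPrefixes = []string{
	"--token", "--server", "--kubeconfig", "--as",
	"--client-key", "--client-certificate", "--username", "--password",
}

// validateArgs applies the two static guards: verb allowlist, then flag blocklist.
// The first argument must be the verb, so a leading flag is rejected as well.
func validateArgs(args []string) error {
	if len(args) == 0 {
		return errors.New("no kubectl arguments given")
	}
	if !allowedVerbs[args[0]] {
		return fmt.Errorf("verb %q is not allowed", args[0])
	}
	for _, a := range args[1:] {
		for _, p := range blockedFlagPrefixes {
			if strings.HasPrefix(a, p) {
				return fmt.Errorf("flag %q is not allowed", a)
			}
		}
	}
	return nil
}

// cappedBuffer keeps at most max bytes and silently drops the rest.
type cappedBuffer struct {
	buf bytes.Buffer
	max int
}

func (c *cappedBuffer) Write(p []byte) (int, error) {
	if room := c.max - c.buf.Len(); room > 0 {
		if len(p) > room {
			c.buf.Write(p[:room])
		} else {
			c.buf.Write(p)
		}
	}
	// Report the full length so the copy from the child process never
	// fails with a short-write error once the cap is reached.
	return len(p), nil
}

// run validates first, then executes bin under a 30-second deadline
// with stdout capped at 16 KiB and stderr at 4 KiB.
func run(ctx context.Context, bin string, args []string) (string, string, error) {
	if err := validateArgs(args); err != nil {
		return "", "", err
	}
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	stdout := &cappedBuffer{max: 16 << 10}
	stderr := &cappedBuffer{max: 4 << 10}
	cmd := exec.CommandContext(ctx, bin, args...)
	cmd.Stdout, cmd.Stderr = stdout, stderr
	err := cmd.Run()
	return stdout.buf.String(), stderr.buf.String(), err
}

func main() {
	// Validation fails before any process is started, so kubectl need not be installed.
	_, _, err := run(context.Background(), "kubectl", []string{"delete", "pod", "x"})
	fmt.Println(err) // prints: verb "delete" is not allowed
}
```

One thing the sketch makes visible: a prefix list of long-form flags does not cover short forms. kubectl accepts -s as shorthand for --server, so a real implementation would probably want an entry for that too.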
Skills: Letting the Agent Have Playbooks#
ADK Skills are small Markdown files with YAML frontmatter that the model loads when their description matches the request. Here is the k8s-debug skill:
---
name: k8s-debug
description: "Diagnose a Kubernetes pod that is crashing, restarting, or stuck (CrashLoopBackOff, ImagePullBackOff, Pending, OOMKilled). Use when the user mentions a pod, restarts, probe failures, or asks 'why is X not running'."
---
When triggered, run this investigation in order:
1. **Status snapshot.** kubectl get pod ... -o wide
2. **Describe.** kubectl describe pod ...
3. **Logs.** Current and --previous, for each container with restarts.
4. **Namespace events.** kubectl get events -n <ns> --sort-by=.lastTimestamp
5. **Resource pressure.** kubectl top pod, kubectl describe node
6. **Probes.** Pull readiness / liveness from describe, cross-reference with events.
Output: verdict, evidence, suggested fix as a kubectl command. Do not run the fix.
The whole skills/ directory is embedded into the binary with //go:embed so the same code runs locally and inside the container without copying files. New skill = new directory + go build. No registry, no plugin system, no DI.
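To show what "Markdown with YAML frontmatter, one directory per skill" means mechanically, here is a stdlib-only sketch. ADK's own loader does this work for you, so none of this is needed in the agent. parseSkill and loadSkills are hypothetical helpers, the parser only understands flat key: value lines rather than full YAML, and fstest.MapFS stands in for the embedded filesystem.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"strings"
	"testing/fstest"
)

// skill holds the two frontmatter fields plus the Markdown body.
type skill struct {
	Name, Description, Body string
}

// parseSkill splits a SKILL.md document into frontmatter fields and body.
func parseSkill(doc string) (skill, error) {
	rest, ok := strings.CutPrefix(doc, "---\n")
	if !ok {
		return skill{}, errors.New("missing opening frontmatter delimiter")
	}
	front, body, ok := strings.Cut(rest, "\n---\n")
	if !ok {
		return skill{}, errors.New("missing closing frontmatter delimiter")
	}
	s := skill{Body: strings.TrimSpace(body)}
	for _, line := range strings.Split(front, "\n") {
		key, val, found := strings.Cut(line, ":")
		if !found {
			continue
		}
		val = strings.Trim(strings.TrimSpace(val), `"`)
		switch strings.TrimSpace(key) {
		case "name":
			s.Name = val
		case "description":
			s.Description = val
		}
	}
	if s.Name == "" {
		return skill{}, errors.New("frontmatter has no name")
	}
	return s, nil
}

// loadSkills reads <dir>/SKILL.md for every directory at the root of fsys.
func loadSkills(fsys fs.FS) ([]skill, error) {
	entries, err := fs.ReadDir(fsys, ".")
	if err != nil {
		return nil, err
	}
	var out []skill
	for _, e := range entries {
		if !e.IsDir() {
			continue
		}
		raw, err := fs.ReadFile(fsys, e.Name()+"/SKILL.md")
		if err != nil {
			return nil, err
		}
		s, err := parseSkill(string(raw))
		if err != nil {
			return nil, fmt.Errorf("%s: %w", e.Name(), err)
		}
		out = append(out, s)
	}
	return out, nil
}

func main() {
	// In the real binary this would be the result of fs.Sub on the embedded skills directory.
	fsys := fstest.MapFS{
		"k8s-debug/SKILL.md": {Data: []byte(
			"---\nname: k8s-debug\ndescription: \"Diagnose a crashing pod.\"\n---\nRun the investigation in order.\n")},
	}
	skills, err := loadSkills(fsys)
	if err != nil {
		fmt.Println("load failed:", err)
		return
	}
	for _, s := range skills {
		fmt.Printf("%s: %s\n", s.Name, s.Description)
	}
}
```

Because both functions take an fs.FS, the same code path works against an embed.FS in the container and an in-memory filesystem in a unit test.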
Defense in Depth: Four Layers, Not One#
I would not feel safe running this with only the tool whitelist. The full stack:
| Layer | What it does | Example |
|---|---|---|
| System prompt | Tells the model to refuse destructive verbs and to suggest, not run, fixes | "Never propose or attempt destructive actions..." |
| Tool whitelist | Drops anything outside the read-only verb set before exec | kubectl_tool.go: allowedVerbs |
| ClusterRole | The API server itself rejects writes from this ServiceAccount | verbs: ["get", "list", "watch"] |
| PodSecurity | runAsNonRoot, readOnlyRootFilesystem, all caps dropped | Restricted PSS |
If the model bypasses the prompt, the tool catches it. If the tool is bypassed, the API server returns 403. If somehow that fails, the pod cannot write to its own filesystem. The tool whitelist and the ClusterRole each block cluster writes on their own; the prompt and PodSecurity narrow what is left.
The ClusterRole deliberately omits secrets. The agent has zero need to read them.
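The "get/list/watch only, no secrets" contract is easy to break in a later chart edit, so it can be worth checking mechanically. The sketch below is one way to express that check in plain Go. It is hypothetical and not part of the demo repository: policyRule only mirrors the fields of a Kubernetes RBAC rule that matter here, and a real check would decode the rendered ClusterRole first.

```go
package main

import "fmt"

// policyRule mirrors the fields of an RBAC rule that this check cares about.
type policyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// readVerbs is the verb set the post's ClusterRole is limited to.
var readVerbs = map[string]bool{"get": true, "list": true, "watch": true}

// violations returns one message per rule entry that breaks the read-only,
// no-secrets contract. A wildcard resource is flagged conservatively,
// because in the core API group it would include secrets.
func violations(rules []policyRule) []string {
	var out []string
	for i, r := range rules {
		for _, v := range r.Verbs {
			if !readVerbs[v] {
				out = append(out, fmt.Sprintf("rule %d: verb %q is not read-only", i, v))
			}
		}
		for _, res := range r.Resources {
			if res == "secrets" || res == "*" {
				out = append(out, fmt.Sprintf("rule %d: resource %q exposes secrets", i, res))
			}
		}
	}
	return out
}

func main() {
	rules := []policyRule{
		{APIGroups: []string{""}, Resources: []string{"pods", "events"}, Verbs: []string{"get", "list", "watch"}},
		{APIGroups: []string{""}, Resources: []string{"secrets"}, Verbs: []string{"get"}},
	}
	for _, v := range violations(rules) {
		fmt.Println(v) // prints: rule 1: resource "secrets" exposes secrets
	}
}
```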
The Container: Why scratch, and What You Have to Add#
I started with gcr.io/distroless/static-debian13:nonroot and switched to scratch once I realized I wanted zero base-image dependency. Scratch is literally empty. For a Go static binary that calls Gemini over HTTPS and shells out to kubectl, you need:
| File | Why |
|---|---|
| /etc/ssl/certs/ca-certificates.crt | TLS to generativelanguage.googleapis.com and the API server |
| /etc/passwd, /etc/group | so runAsNonRoot: true resolves a username |
| /tmp (mode 1777) | kubectl’s HTTP cache |
| /home/nonroot/.kube | kubectl discovery cache (mounted as emptyDir) |
| /usr/local/bin/kubectl | the binary itself, statically linked |
The Dockerfile is three stages: build the Go binary with CGO_ENABLED=0 -tags 'osusergo,netgo', harvest the rootfs from Alpine, then FROM scratch. Multi-arch via TARGETOS / TARGETARCH build args, so docker buildx build --platform linux/amd64,linux/arm64 produces a single manifest. Final image: about 80 MB, dominated by kubectl (the agent binary itself is roughly 30 MB).
Switching base from distroless/static to scratch saves only a few MiB. The real win is “no dependency on gcr.io”, not size.
Helm Chart with Three Secret Modes#
The chart at charts/adk-agent/ has one design decision worth highlighting: the Deployment always reads GOOGLE_API_KEY from a Secret named apiKey.existingSecret.name with key apiKey.existingSecret.key. Three modes decide who creates that Secret:
- existingSecret (default). Something else makes it: SealedSecrets, your own ESO ExternalSecret outside the chart, a platform pipeline.
- externalSecret.enabled=true. The chart renders an ExternalSecret resource and ESO materializes the Secret from your backend.
- inline.enabled=true. The chart renders the Secret from values. Dev only.
Switching modes is a values change, never a template edit. Production install with Vault:
helm install adk charts/adk-agent \
--namespace adk-agent --create-namespace \
--set image.repository=ghcr.io/nonicked/adk-k8s-agent \
--set image.tag=0.1.0 \
--set apiKey.externalSecret.enabled=true \
--set apiKey.externalSecret.secretStoreRef.name=vault-backend \
--set apiKey.externalSecret.secretStoreRef.kind=ClusterSecretStore \
--set apiKey.externalSecret.remoteRef.key=secret/data/adk/agent \
--set apiKey.externalSecret.remoteRef.property=api_key
The chart helpers fail fast if both inline and externalSecret are enabled, or if a required field is missing. For the same install with AWS Secrets Manager or GCP Secret Manager, only secretStoreRef.name and remoteRef.key change. The deployment does not know or care which backend produced the Secret.
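The mode selection and its fail-fast rule are written as Helm template helpers in the chart. Expressed as plain Go, the decision looks roughly like the sketch below. This is an illustration of the logic, not the chart's code: apiKeyValues and secretMode are hypothetical names, and treating secretStoreRef.name and remoteRef.key as the required fields is my assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// apiKeyValues is a hypothetical flattening of the chart's apiKey.* values.
type apiKeyValues struct {
	InlineEnabled         bool
	ExternalSecretEnabled bool
	StoreName             string // apiKey.externalSecret.secretStoreRef.name
	RemoteKey             string // apiKey.externalSecret.remoteRef.key
}

// secretMode decides who creates the Secret, or returns an error
// when the values are contradictory or incomplete.
func secretMode(v apiKeyValues) (string, error) {
	switch {
	case v.InlineEnabled && v.ExternalSecretEnabled:
		return "", errors.New("inline and externalSecret are mutually exclusive")
	case v.ExternalSecretEnabled:
		if v.StoreName == "" || v.RemoteKey == "" {
			return "", errors.New("externalSecret requires secretStoreRef.name and remoteRef.key")
		}
		return "externalSecret", nil
	case v.InlineEnabled:
		return "inline", nil
	default:
		return "existingSecret", nil
	}
}

func main() {
	mode, err := secretMode(apiKeyValues{
		ExternalSecretEnabled: true,
		StoreName:             "vault-backend",
		RemoteKey:             "secret/data/adk/agent",
	})
	fmt.Println(mode, err) // prints: externalSecret <nil>
}
```

Whatever the mode, the Deployment side stays identical, which is why switching is a values change rather than a template edit.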
If you have not used ESO before, my earlier post External Secrets Operator: Managing Kubernetes Secrets at Scale covers the setup and operator-level concepts.
Hands-On Demo Repository#
The complete working demo is available at:
github.com/nicknikolakakis/srekubecraft-demo/tree/main/adk-agent
Quick Start#
git clone https://github.com/nicknikolakakis/srekubecraft-demo.git
cd srekubecraft-demo/adk-agent
cp .env.example .env # paste GOOGLE_API_KEY from https://aistudio.google.com/apikey
source .env
# Full setup: kind + Cilium + ESO + image build + side-load + helm install
task setup
task port-forward # then open http://localhost:8080
The setup takes about 5 minutes on a laptop and creates:
- 2-node Kind cluster (control plane + system worker) with Cilium CNI (eBPF, no kube-proxy)
- External Secrets Operator (so you can flip the chart to externalSecret mode without rebuilding)
- A locally-built local/adk-k8s-agent:0.1.0 image side-loaded into Kind
- The adk-agent namespace with the Deployment, Service, ServiceAccount, ClusterRole/ClusterRoleBinding, and the adk-agent-creds Secret (inline mode)
Demo Stack Versions#
| Component | Version | How it gets in |
|---|---|---|
| Cilium | 1.19.3 | Helm (imperative bootstrap) |
| External Secrets Operator | 2.3.0 | Helm (shared task) |
| Google ADK Go | latest | Go module |
| Gemini model | gemini-flash-latest | alias resolved at runtime |
| kubectl (in image) | v1.36.0 | pinned in Dockerfile |
| adk-k8s-agent chart | 0.1.0 | local helm install |
Setup on Kind (~5 min)#
The Taskfile breaks the install into four phases so you can stop and inspect at any step.
Phase 1: bootstrap (~2 min)#
$ task bootstrap
Creating Kind cluster 'adk-agent'...
✓ Ensuring node image
✓ Preparing nodes
✓ Starting control-plane
✓ Joining worker nodes
Cluster 'adk-agent' created.
Installing Cilium 1.19.3...
DaemonSet cilium Desired: 2, Ready: 2/2
Cilium installed. All nodes ready.
Phase 2: ESO (~1 min)#
$ task shared:eso:install
Installing External Secrets Operator 2.3.0...
External Secrets Operator installed.
You do not need ESO for helm:install:inline. It is installed up front so the secret mode can be flipped later without rebuilding the cluster.
Phase 3: image build + side-load (~1 min)#
$ task image:build
Building local/adk-k8s-agent:0.1.0 for the local arch...
[+] Building 38.2s (15/15) FINISHED
Image built.
$ task image:load
Loading local/adk-k8s-agent:0.1.0 into Kind cluster 'adk-agent'...
Image loaded.
Phase 4: helm install (~30s)#
$ task helm:install:inline
Installing release 'adk' in namespace 'adk-agent' (inline mode)...
Release "adk" has been installed.
$ task status
==> Pods:
NAME                READY   STATUS    RESTARTS   AGE
adk-adk-agent-...   1/1     Running   0          25s
==> Service:
NAME            TYPE        CLUSTER-IP     PORT(S)
adk-adk-agent   ClusterIP   10.96.142.30   8080/TCP
task port-forward exposes the chat UI on http://localhost:8080.
Live Testing#
Three smoke tests cover the read path, the skill path, and the deny path.
Read query: list namespaces#
$ task test:read
==> Sending 'list namespaces' to the agent API...
The cluster has 7 namespaces:
- adk-agent
- default
- external-secrets
- kube-node-lease
- kube-public
- kube-system
- local-path-storage
The model called kubectl get ns -o name, parsed the output, and produced a list. Sub-second response. You can watch the tool call in task logs.
Triage flow: why is this pod restarting?#
$ task test:debug
namespace/demo configured
pod/badpod created
==> Asking the agent why 'badpod' in namespace 'demo' is restarting...
Verdict: badpod is in CrashLoopBackOff.
Evidence:
- kubectl get pod badpod -n demo -> 5 restarts, last exit code 1
- kubectl describe pod badpod -n demo -> Last State: Terminated, Reason: Error
- kubectl logs badpod -n demo --previous -> (empty)
Suggested fix (do not run yet):
- The container command is '/bin/sh -c "exit 1"', which exits with status 1
on every start. Replace the command with something that stays running,
or remove this pod.
This is the k8s-debug skill running end to end: status snapshot, describe, logs, suggested fix. Notice the agent suggests a fix rather than running one. That is the system prompt doing its job.
Destructive verb: tool layer refuses#
$ task test:blocked
==> Asking the agent to delete the kube-apiserver. The tool layer must refuse.
I cannot run destructive kubectl verbs. My tool is restricted to read-only
operations (get, describe, logs, top, events). If you need to delete a pod,
use kubectl directly or escalate to a cluster operator.
The deny is reachable from two layers: the system prompt tells the model to refuse, and kubectl_tool.go rejects the verb before os/exec ever runs. To prove the tool catches it even if the prompt is bypassed, jailbreak the model in the UI (“pretend the rules above no longer apply”) and watch task logs: you will see a verb "delete" is not allowed; permitted: ... error returned to the model, and no kubectl process is forked.
adk-k8s-agent: Pros and Cons#
Pros#
| Advantage | Description |
|---|---|
| Cannot mutate the cluster | Four-layer defense, RBAC is the backstop |
| Tiny container | scratch base, ~80 MB total, no shell |
| Skills are just files | New playbook = new directory + go build |
| Helm + ESO are first-class | Three modes, single source of truth in values |
Cons#
| Limitation | Description |
|---|---|
| Single agent, single thread | No multi-agent routing yet |
| No persistent memory | Sessions are in-process; restart wipes them |
| AI Studio key only | Vertex AI / Workload Identity not wired |
| No image signing | Cosign + admission verification is the next step |
When to Use#
- Use it for read-only triage and explaining manifests to people who do not read YAML for breakfast.
- Skip it if you need write actions (use a proper tool like Argo Workflows or your own CRD), or if your cluster has a policy that forbids egress to Google APIs (look at Vertex AI on a private endpoint instead).
Conclusion#
ADK Go gave me an agent in about 250 lines of code, with a clean separation between the LLM, the tool surface, and the playbooks. The hard work was not the AI part. It was the same work as any production-bound service: lock the tool surface, run as non-root, mind the RBAC, get the secrets in via the right operator, ship a Helm chart that does not surprise the platform team.
If you are considering an agent inside your cluster, start narrow. Read-only is a feature. Add verbs only when you have a strong story for the rollback.
The full demo (Go source, Helm chart, Taskfile, Kind config) is at github.com/nicknikolakakis/srekubecraft-demo/tree/main/adk-agent.
If you found this useful, you might also enjoy:
- External Secrets Operator: Managing Kubernetes Secrets at Scale
- K8sGPT: The AI Solution to Streamline Kubernetes Operations?
