Building a Read-Only Kubernetes Agent with Google ADK (Go)

Every k8s incident I have watched in the last two years has the same opening five minutes: a senior engineer narrating kubectl get pod, kubectl describe, kubectl logs --previous, while a junior engineer types it all into Slack. Same six commands, same order, every time. The work is mechanical; the context-switching is the cost.

I wanted a small chat interface that could run those commands on my behalf and explain the output, without ever reaching for kubectl delete or exec. Not “AI ops”. Not a control plane. A read-only co-pilot for the boring 80% of triage.

This post walks through adk-k8s-agent: a single-binary Go agent built on Google’s Agent Development Kit, packaged in a scratch image, and deployed via Helm with External Secrets Operator. Total code: about 250 lines of Go and one Helm chart.

Who Should Read This?

This post is for:

SREs who want a chat interface for cluster triage that cannot accidentally write
Platform engineers building internal developer tools and curious about ADK’s primitives
Security-minded engineers evaluating how to safely embed an LLM in cluster operations

What is ADK?

Agent Development Kit is Google’s open-source toolkit for building, evaluating, and deploying AI agents. It is model-agnostic, with first-class support for Gemini. The Go variant lives at google.golang.org/adk and ships with the building blocks any agent needs: Agent, Tool, Skill, Runner, Session, plus a launcher that exposes your agent over CLI, HTTP API, or a web UI without you writing a server.

Why Not Just Use the Gemini SDK Directly?

Aspect	Gemini SDK only	ADK Go
Tool calling	You wire `FunctionDeclaration`, JSON-Schema, the loop	`functiontool.New(cfg, goFunc)` infers schema from struct tags
Skill packaging	Roll your own loader, prompt injection	`SKILL.md` files with frontmatter, lazy-loaded by the runtime
Multi-mode launcher	Build your HTTP server, build a UI	`full.NewLauncher().Execute(ctx, cfg, args)`
Multi-agent topologies	Hand-rolled routing	`SequentialAgent`, `ParallelAgent`, `LoopAgent`
Sessions / state / memory	DIY storage	Pluggable services

Bottom line: ADK is the difference between writing 30 lines of agent setup and writing several hundred lines of plumbing.

What This Is NOT

Not a multi-agent orchestrator out of the box. You compose those yourself.
Not a vector database. Bring your own retrieval if you need RAG.
Not a replacement for k8sgpt. It is a kit for building your own narrow agents, not a finished tool.

Core Concepts

flowchart TB
    User[User] --> Agent
    subgraph Agent["LLM Agent (Gemini Flash)"]
        Prompt[System Instruction<br/>read-only contract]
    end
    Agent -->|invokes| Tool[kubectl Tool<br/>verb whitelist]
    Agent -->|loads on demand| Skills[Skills<br/>SKILL.md files]
    Tool -->|exec| Kubectl[(kubectl binary)]
    Kubectl -->|API call| API[Kubernetes API]
    subgraph RBAC["RBAC"]
        Role[ClusterRole<br/>get/list/watch only]
    end
    API --- Role

The runtime has four pieces:

Agent: the LLM-backed worker. We use one llmagent with Gemini Flash.
Tool: a Go function exposed to the model. Ours shells out to kubectl.
Skill: a SKILL.md file with metadata and instructions. The model loads its body only when its description matches the user’s intent.
Runner / Launcher: the loop. ADK ships cmd/launcher/full which exposes CLI, REST, and a web UI from a single binary.

The Whole Agent in 30 Lines

The wiring is small enough to read in one go:

model, _ := gemini.NewModel(ctx, "gemini-flash-latest", &genai.ClientConfig{APIKey: apiKey})
kubectlTool, _ := newKubectlTool()
skillsRoot, _ := fs.Sub(skillsFS, "skills")
skillSet, _ := skilltoolset.New(ctx, skilltoolset.Config{
    Source: skill.NewFileSystemSource(skillsRoot),
})

rootAgent, _ := llmagent.New(llmagent.Config{
    Name:        "k8s_assistant",
    Model:       model,
    Description: "Read-only Kubernetes operations assistant.",
    Instruction: systemInstruction,
    Tools:       []tool.Tool{kubectlTool},
    Toolsets:    []tool.Toolset{skillSet},
})

cfg := &launcher.Config{AgentLoader: agent.NewSingleLoader(rootAgent)}
full.NewLauncher().Execute(ctx, cfg, os.Args[1:])

(Errors elided for brevity. The real agent.go uses slog and exits explicitly.)

That is it. Run go run . web api webui and you have a chat UI on :8080 that can call kubectl.

The kubectl Tool: Defense Layer One

The single most important file in the project is kubectl_tool.go. It is the layer that decides what the model can actually run. Four guards stack on top of each other, and the first two are driven by these tables:

// Closed set of read-only kubectl subcommands we will execute.
var allowedVerbs = map[string]bool{
    "get": true, "describe": true, "logs": true, "top": true,
    "events": true, "explain": true, "api-resources": true,
    "api-versions": true, "version": true, "cluster-info": true,
    "config": true, // sub-verb whitelisted separately
}

// Flags that could redirect kubectl to a different cluster or identity.
var blockedFlagPrefixes = []string{
    "--token", "--server", "--kubeconfig",
    "--as", "--as-group",
    "--client-key", "--client-certificate",
    "--username", "--password",
}

The function then:

Rejects any verb not in allowedVerbs.
Rejects any arg matching blockedFlagPrefixes. No --server to point at a different cluster, no --as to escalate identity.
Wraps the subprocess in a 30-second context.WithTimeout so a hung kubectl cannot stall the agent.
Caps stdout / stderr at 16 KiB / 4 KiB to keep the model’s context tight.
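
The first two checks in that list can be sketched as a single validation function. This is a minimal sketch, not the real kubectl_tool.go: the function name validateArgs and the allowedConfigSubverbs set are assumptions made for illustration, since the post does not show which config sub-verbs are allowed.

```go
package main

import (
	"fmt"
	"strings"
)

// Read-only subcommands, copied from the allowedVerbs map shown above.
var allowedVerbs = map[string]bool{
	"get": true, "describe": true, "logs": true, "top": true,
	"events": true, "explain": true, "api-resources": true,
	"api-versions": true, "version": true, "cluster-info": true,
	"config": true,
}

// ASSUMPTION: the post does not list the permitted "config" sub-verbs.
// These two are placeholders for illustration only.
var allowedConfigSubverbs = map[string]bool{
	"current-context": true, "get-contexts": true,
}

// Flags that could redirect kubectl to a different cluster or identity.
var blockedFlagPrefixes = []string{
	"--token", "--server", "--kubeconfig",
	"--as", "--as-group",
	"--client-key", "--client-certificate",
	"--username", "--password",
}

// validateArgs checks everything that would follow "kubectl" on the command
// line. It treats args[0] as the verb, so a leading flag fails closed.
func validateArgs(args []string) error {
	if len(args) == 0 {
		return fmt.Errorf("no kubectl subcommand given")
	}
	verb := args[0]
	if !allowedVerbs[verb] {
		return fmt.Errorf("verb %q is not allowed", verb)
	}
	if verb == "config" && (len(args) < 2 || !allowedConfigSubverbs[args[1]]) {
		return fmt.Errorf("config sub-verb is not allowed")
	}
	for _, a := range args {
		for _, p := range blockedFlagPrefixes {
			if strings.HasPrefix(a, p) {
				return fmt.Errorf("flag %q is blocked", a)
			}
		}
	}
	return nil
}

func main() {
	for _, args := range [][]string{
		{"get", "pods", "-A"},
		{"delete", "pod", "kube-apiserver", "-n", "kube-system"},
		{"get", "pods", "--as=system:admin"},
		{"config", "set-context", "other"},
	} {
		fmt.Printf("%v -> %v\n", args, validateArgs(args))
	}
}
```

Because the check runs on the argument slice before anything is executed, a rejected request never reaches os/exec at all.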

Why This Matters: even if the model hallucinates kubectl delete pod -n kube-system kube-apiserver, the verb check rejects it before os/exec ever sees it. The LLM is treated as untrusted input.
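
The timeout and the output caps from the list above could look roughly like this. It is a sketch under assumptions: cappedBuffer and runKubectl are hypothetical names, and the real tool may truncate or report overflow differently. Only the 30-second, 16 KiB, and 4 KiB figures come from the post.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"os/exec"
	"time"
)

const (
	kubectlTimeout = 30 * time.Second
	maxStdout      = 16 << 10 // 16 KiB
	maxStderr      = 4 << 10  // 4 KiB
)

// cappedBuffer keeps at most limit bytes and discards the rest.
type cappedBuffer struct {
	buf       bytes.Buffer
	limit     int
	truncated bool
}

// Write always reports len(p) bytes as consumed, so the child process never
// sees a short write; bytes past the limit are dropped.
func (c *cappedBuffer) Write(p []byte) (int, error) {
	n := len(p)
	if room := c.limit - c.buf.Len(); n > room {
		p = p[:room]
		c.truncated = true
	}
	c.buf.Write(p)
	return n, nil
}

func (c *cappedBuffer) String() string { return c.buf.String() }

// runKubectl executes kubectl with a hard deadline and bounded output.
// Callers are expected to have validated args already.
func runKubectl(ctx context.Context, args ...string) (stdout, stderr string, err error) {
	ctx, cancel := context.WithTimeout(ctx, kubectlTimeout)
	defer cancel()

	out := &cappedBuffer{limit: maxStdout}
	errOut := &cappedBuffer{limit: maxStderr}

	// CommandContext kills the process once the context deadline passes.
	cmd := exec.CommandContext(ctx, "kubectl", args...)
	cmd.Stdout = out
	cmd.Stderr = errOut
	err = cmd.Run()
	if ctx.Err() == context.DeadlineExceeded {
		err = fmt.Errorf("kubectl timed out after %s", kubectlTimeout)
	}
	return out.String(), errOut.String(), err
}

func main() {
	stdout, stderr, err := runKubectl(context.Background(), "version", "--client")
	fmt.Printf("stdout=%q\nstderr=%q\nerr=%v\n", stdout, stderr, err)
}
```

Capping at the writer, rather than trimming the string afterwards, means a runaway kubectl logs cannot grow the agent's memory before the limit applies.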

Skills: Letting the Agent Have Playbooks

ADK Skills are small Markdown files with YAML frontmatter that the model loads when their description matches the request. Here is the k8s-debug skill:

---
name: k8s-debug
description: "Diagnose a Kubernetes pod that is crashing, restarting, or stuck (CrashLoopBackOff, ImagePullBackOff, Pending, OOMKilled). Use when the user mentions a pod, restarts, probe failures, or asks 'why is X not running'."
---

When triggered, run this investigation in order:

1. **Status snapshot.** kubectl get pod ... -o wide
2. **Describe.** kubectl describe pod ...
3. **Logs.** Current and --previous, for each container with restarts.
4. **Namespace events.** kubectl get events -n <ns> --sort-by=.lastTimestamp
5. **Resource pressure.** kubectl top pod, kubectl describe node
6. **Probes.** Pull readiness / liveness from describe, cross-reference with events.

Output: verdict, evidence, suggested fix as a kubectl command. Do not run the fix.

The whole skills/ directory is embedded into the binary with //go:embed so the same code runs locally and inside the container without copying files. New skill = new directory + go build. No registry, no plugin system, no DI.
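
To make the file layout concrete, here is a small sketch of reading skills out of an fs.FS and splitting frontmatter from body. It is not ADK's loader, which the post does not show: parseSkill and loadSkills are hypothetical helpers, the key/value parsing is a naive stand-in for a YAML parser, and fstest.MapFS stands in for the embed.FS that //go:embed would produce so the example runs without a skills/ directory on disk.

```go
package main

import (
	"fmt"
	"io/fs"
	"strings"
	"testing/fstest"
)

// skillMeta holds the two frontmatter fields from the example plus the body.
type skillMeta struct {
	Name, Description, Body string
}

// parseSkill splits "---\n<frontmatter>\n---\n<body>" and reads simple
// "key: value" lines. A real loader would use a proper YAML parser.
func parseSkill(raw string) (skillMeta, error) {
	var m skillMeta
	if !strings.HasPrefix(raw, "---\n") {
		return m, fmt.Errorf("missing opening frontmatter delimiter")
	}
	front, body, ok := strings.Cut(strings.TrimPrefix(raw, "---\n"), "\n---\n")
	if !ok {
		return m, fmt.Errorf("missing closing frontmatter delimiter")
	}
	for _, line := range strings.Split(front, "\n") {
		key, val, found := strings.Cut(line, ":")
		if !found {
			continue
		}
		val = strings.Trim(strings.TrimSpace(val), `"`)
		switch strings.TrimSpace(key) {
		case "name":
			m.Name = val
		case "description":
			m.Description = val
		}
	}
	m.Body = strings.TrimSpace(body)
	return m, nil
}

// loadSkills walks root and parses every SKILL.md it finds.
func loadSkills(root fs.FS) ([]skillMeta, error) {
	var skills []skillMeta
	err := fs.WalkDir(root, ".", func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() || d.Name() != "SKILL.md" {
			return nil
		}
		raw, err := fs.ReadFile(root, path)
		if err != nil {
			return err
		}
		m, err := parseSkill(string(raw))
		if err != nil {
			return fmt.Errorf("%s: %w", path, err)
		}
		skills = append(skills, m)
		return nil
	})
	return skills, err
}

func main() {
	// Stand-in for: //go:embed skills  +  var skillsFS embed.FS
	skillsFS := fstest.MapFS{
		"skills/k8s-debug/SKILL.md": &fstest.MapFile{Data: []byte(
			"---\nname: k8s-debug\ndescription: \"Diagnose a crashing pod.\"\n---\n\nWhen triggered, run this investigation in order:\n")},
	}
	root, err := fs.Sub(skillsFS, "skills")
	if err != nil {
		panic(err)
	}
	skills, err := loadSkills(root)
	if err != nil {
		panic(err)
	}
	for _, s := range skills {
		fmt.Printf("%s: %s (%d bytes of instructions)\n", s.Name, s.Description, len(s.Body))
	}
}
```

Keeping the description separate from the body is what makes on-demand loading possible: only the short description has to be visible to the model up front.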

Defense in Depth: Four Layers, Not One

I would not feel safe running this with only the tool whitelist. The full stack:

Layer	What it does	Example
System prompt	Tells the model to refuse destructive verbs and to suggest, not run, fixes	`"Never propose or attempt destructive actions..."`
Tool whitelist	Drops anything outside the read-only verb set before exec	`kubectl_tool.go: allowedVerbs`
ClusterRole	The API server itself rejects writes from this ServiceAccount	`verbs: ["get", "list", "watch"]`
PodSecurity	`runAsNonRoot`, `readOnlyRootFilesystem`, all caps dropped	Restricted PSS

If the model bypasses the prompt, the tool catches it. If the tool is bypassed, the API server returns 403. If somehow that fails, the pod still cannot write to its own filesystem. For cluster writes, the tool whitelist and the ClusterRole each block the dangerous case on their own; the prompt and PodSecurity narrow what is left.

The ClusterRole deliberately omits secrets. The agent has zero need to read them.
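
One consequence is worth spelling out: kubectl get secret passes the tool whitelist, because get is an allowed verb, so RBAC is the layer that keeps secrets out of reach. The toy model below illustrates that default-deny evaluation. It is a deliberate simplification (no API groups, no wildcards), and the resource list is an assumption: the post only states that the verbs are get/list/watch and that secrets are omitted.

```go
package main

import "fmt"

// rule is a simplified stand-in for a single ClusterRole rule.
type rule struct {
	Resources []string
	Verbs     []string
}

// ASSUMPTION: illustrative resources only; the chart's real list is not shown.
var readOnlyRules = []rule{
	{
		Resources: []string{"pods", "pods/log", "events", "nodes", "namespaces"},
		Verbs:     []string{"get", "list", "watch"},
	},
}

func contains(xs []string, x string) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

// allowed mimics RBAC's additive, default-deny model: a request passes only
// if some rule names both the resource and the verb.
func allowed(rules []rule, verb, resource string) bool {
	for _, r := range rules {
		if contains(r.Resources, resource) && contains(r.Verbs, verb) {
			return true
		}
	}
	return false
}

func main() {
	for _, req := range [][2]string{
		{"get", "pods"},
		{"delete", "pods"},
		{"get", "secrets"},
	} {
		fmt.Printf("%s %s -> allowed=%v\n", req[0], req[1], allowed(readOnlyRules, req[0], req[1]))
	}
}
```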

The Container: Why `scratch`, and What You Have to Add

I started with gcr.io/distroless/static-debian13:nonroot and switched to scratch once I realized I wanted zero base-image dependency. Scratch is literally empty. For a Go static binary that calls Gemini over HTTPS and shells out to kubectl, you need:

File	Why
`/etc/ssl/certs/ca-certificates.crt`	TLS to `generativelanguage.googleapis.com` and the API server
`/etc/passwd`, `/etc/group`	so `runAsNonRoot: true` resolves a username
`/tmp` (mode 1777)	kubectl’s HTTP cache
`/home/nonroot/.kube`	kubectl discovery cache (mounted as `emptyDir`)
`/usr/local/bin/kubectl`	the binary itself, statically linked

The Dockerfile is three stages: build the Go binary with CGO_ENABLED=0 -tags 'osusergo,netgo', harvest the rootfs from Alpine, then FROM scratch. Multi-arch via TARGETOS / TARGETARCH build args, so docker buildx build --platform linux/amd64,linux/arm64 produces a single manifest. Final image: about 80 MB, dominated by kubectl (the agent binary itself is roughly 30 MB).

Switching base from distroless/static to scratch saves only a few MiB. The real win is “no dependency on gcr.io”, not size.

Helm Chart with Three Secret Modes

The chart at charts/adk-agent/ has one design decision worth highlighting: the Deployment always reads GOOGLE_API_KEY from the Secret whose name and key are set in apiKey.existingSecret.name and apiKey.existingSecret.key. Three modes decide who creates that Secret:

existingSecret (default). Something else makes it: SealedSecrets, your own ESO ExternalSecret outside the chart, a platform pipeline.
externalSecret.enabled=true. The chart renders an ExternalSecret resource and ESO materializes the Secret from your backend.
inline.enabled=true. The chart renders the Secret from values. Dev only.

Switching modes is a values change, never a template edit. Production install with Vault:

helm install adk charts/adk-agent \
  --namespace adk-agent --create-namespace \
  --set image.repository=ghcr.io/nonicked/adk-k8s-agent \
  --set image.tag=0.1.0 \
  --set apiKey.externalSecret.enabled=true \
  --set apiKey.externalSecret.secretStoreRef.name=vault-backend \
  --set apiKey.externalSecret.secretStoreRef.kind=ClusterSecretStore \
  --set apiKey.externalSecret.remoteRef.key=secret/data/adk/agent \
  --set apiKey.externalSecret.remoteRef.property=api_key

The chart helpers fail fast if both inline and externalSecret are enabled, or if a required field is missing. For the same install with AWS Secrets Manager or GCP Secret Manager, only secretStoreRef.name and remoteRef.key change. The Deployment does not know or care which backend produced the Secret.
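
The decision the chart helpers make can be written out as a small function. The real logic lives in Helm templates that the post does not show, so treat this as a sketch of the rules described above: resolveSecretMode is a hypothetical name, InlineValue is a hypothetical field, and which fields count as "required" is my assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// apiKeyValues mirrors the chart's apiKey.* values. Field comments follow
// the --set paths in the post where they exist.
type apiKeyValues struct {
	ExistingSecretName string // apiKey.existingSecret.name
	ExistingSecretKey  string // apiKey.existingSecret.key
	InlineEnabled      bool   // inline.enabled
	InlineValue        string // HYPOTHETICAL: key material for dev installs
	ExternalEnabled    bool   // apiKey.externalSecret.enabled
	StoreName          string // apiKey.externalSecret.secretStoreRef.name
	RemoteKey          string // apiKey.externalSecret.remoteRef.key
}

// resolveSecretMode returns which party creates the Secret, or an error for
// a combination of values the chart would refuse to render.
func resolveSecretMode(v apiKeyValues) (string, error) {
	// The Deployment reads from this Secret in every mode.
	if v.ExistingSecretName == "" || v.ExistingSecretKey == "" {
		return "", errors.New("existingSecret name and key are required in every mode")
	}
	switch {
	case v.InlineEnabled && v.ExternalEnabled:
		return "", errors.New("inline and externalSecret are mutually exclusive")
	case v.ExternalEnabled:
		if v.StoreName == "" || v.RemoteKey == "" {
			return "", errors.New("externalSecret needs secretStoreRef.name and remoteRef.key")
		}
		return "externalSecret", nil
	case v.InlineEnabled:
		if v.InlineValue == "" {
			return "", errors.New("inline mode needs a value")
		}
		return "inline", nil
	default:
		return "existingSecret", nil
	}
}

func main() {
	v := apiKeyValues{
		ExistingSecretName: "adk-agent-creds",
		ExistingSecretKey:  "GOOGLE_API_KEY", // hypothetical key name
		ExternalEnabled:    true,
		StoreName:          "vault-backend",
		RemoteKey:          "secret/data/adk/agent",
	}
	mode, err := resolveSecretMode(v)
	fmt.Println(mode, err)
}
```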

If you have not used ESO before, my earlier post External Secrets Operator: Managing Kubernetes Secrets at Scale covers the setup and operator-level concepts.

Hands-On Demo Repository

The complete working demo is available at:

github.com/nicknikolakakis/srekubecraft-demo/tree/main/adk-agent

Quick Start

git clone https://github.com/nicknikolakakis/srekubecraft-demo.git
cd srekubecraft-demo/adk-agent

cp .env.example .env       # paste GOOGLE_API_KEY from https://aistudio.google.com/apikey
source .env

# Full setup: kind + Cilium + ESO + image build + side-load + helm install
task setup
task port-forward          # then open http://localhost:8080

The setup takes about 5 minutes on a laptop and creates:

2-node Kind cluster (control plane + system worker) with Cilium CNI (eBPF, no kube-proxy)
External Secrets Operator (so you can flip the chart to externalSecret mode without rebuilding)
A locally-built local/adk-k8s-agent:0.1.0 image side-loaded into Kind
The adk-agent namespace with the Deployment, Service, ServiceAccount, ClusterRole/ClusterRoleBinding, and the adk-agent-creds Secret (inline mode)

Demo Stack Versions

Component	Version	How it gets in
Cilium	1.19.3	Helm (imperative bootstrap)
External Secrets Operator	2.3.0	Helm (shared task)
Google ADK Go	latest	Go module
Gemini model	`gemini-flash-latest`	alias resolved at runtime
kubectl (in image)	v1.36.0	pinned in Dockerfile
adk-k8s-agent chart	0.1.0	local `helm install`

Setup on Kind (~5 min)

The Taskfile breaks the install into four phases so you can stop and inspect at any step.

Phase 1: bootstrap (~2 min)

$ task bootstrap
Creating Kind cluster 'adk-agent'...
 ✓ Ensuring node image
 ✓ Preparing nodes
 ✓ Starting control-plane
 ✓ Joining worker nodes
Cluster 'adk-agent' created.

Installing Cilium 1.19.3...
DaemonSet         cilium             Desired: 2, Ready: 2/2
Cilium installed. All nodes ready.

Phase 2: ESO (~1 min)

$ task shared:eso:install
Installing External Secrets Operator 2.3.0...
External Secrets Operator installed.

You do not need ESO for helm:install:inline. It is installed up front so the secret mode can be flipped later without rebuilding the cluster.

Phase 3: image build + side-load (~1 min)

$ task image:build
Building local/adk-k8s-agent:0.1.0 for the local arch...
[+] Building 38.2s (15/15) FINISHED
Image built.

$ task image:load
Loading local/adk-k8s-agent:0.1.0 into Kind cluster 'adk-agent'...
Image loaded.

Phase 4: helm install (~30s)

$ task helm:install:inline
Installing release 'adk' in namespace 'adk-agent' (inline mode)...
Release "adk" has been installed.

$ task status
==> Pods:
NAME                          READY   STATUS    RESTARTS   AGE
adk-adk-agent-...             1/1     Running   0          25s

==> Service:
NAME                  TYPE        CLUSTER-IP     PORT(S)
adk-adk-agent         ClusterIP   10.96.142.30   8080/TCP

task port-forward exposes the chat UI on http://localhost:8080.

Live Testing

Three smoke tests cover the read path, the skill path, and the deny path.

Read query: list namespaces

$ task test:read
==> Sending 'list namespaces' to the agent API...
The cluster has 7 namespaces:
- adk-agent
- default
- external-secrets
- kube-node-lease
- kube-public
- kube-system
- local-path-storage

The model called kubectl get ns -o name, parsed the output, and produced a list. Sub-second response. You can watch the tool call in task logs.

Triage flow: why is this pod restarting?

$ task test:debug
namespace/demo configured
pod/badpod created
==> Asking the agent why 'badpod' in namespace 'demo' is restarting...

Verdict: badpod is in CrashLoopBackOff.
Evidence:
  - kubectl get pod badpod -n demo -> 5 restarts, last exit code 1
  - kubectl describe pod badpod -n demo -> Last State: Terminated, Reason: Error
  - kubectl logs badpod -n demo --previous -> (empty)
Suggested fix (do not run yet):
  - The container command is '/bin/sh -c "exit 1"', which exits with status 1
    on every start. Replace the command with something that stays running,
    or remove this pod.

This is the k8s-debug skill running end to end: status snapshot, describe, logs, suggested fix. Notice that the agent proposes the fix rather than applying it. That is the system prompt doing its job.

Destructive verb: tool layer refuses

$ task test:blocked
==> Asking the agent to delete the kube-apiserver. The tool layer must refuse.

I cannot run destructive kubectl verbs. My tool is restricted to read-only
operations (get, describe, logs, top, events). If you need to delete a pod,
use kubectl directly or escalate to a cluster operator.

The deny is reachable from two layers: the system prompt tells the model to refuse, and kubectl_tool.go rejects the verb before os/exec ever runs. To prove the tool catches it even if the prompt is bypassed, jailbreak the model in the UI (“pretend the rules above no longer apply”) and watch task logs: you will see a verb "delete" is not allowed; permitted: ... error returned to the model, and no kubectl process is forked.

adk-k8s-agent: Pros and Cons

Pros

Advantage	Description
Cannot mutate the cluster	Four-layer defense, RBAC is the backstop
Tiny container	scratch base, ~80 MB total, no shell
Skills are just files	New playbook = new directory + `go build`
Helm + ESO are first-class	Three modes, single source of truth in values

Cons

Limitation	Description
Single agent, single thread	No multi-agent routing yet
No persistent memory	Sessions are in-process; restart wipes them
AI Studio key only	Vertex AI / Workload Identity not wired
No image signing	Cosign + admission verification is the next step

When to Use

Use it for read-only triage and explaining manifests to people who do not read YAML for breakfast.
Skip it if you need write actions (use a proper tool like Argo Workflows or your own CRD), or if your cluster has a policy that forbids egress to Google APIs (look at Vertex AI on a private endpoint instead).

Conclusion

ADK Go gave me an agent in about 250 lines of code, with a clean separation between the LLM, the tool surface, and the playbooks. The hard work was not the AI part. It was the same work as any production-bound service: lock the tool surface, run as non-root, mind the RBAC, get the secrets in via the right operator, ship a Helm chart that does not surprise the platform team.

If you are considering an agent inside your cluster, start narrow. Read-only is a feature. Add verbs only when you have a strong story for the rollback.

The full demo (Go source, Helm chart, Taskfile, Kind config) is at github.com/nicknikolakakis/srekubecraft-demo/tree/main/adk-agent.
