KServe - Production ML Serving on Kubernetes, from sklearn to LLMs
// How to use KServe v0.18 in Serverless mode to serve both classical ML models and LLMs on Kubernetes - with scale-to-zero, OpenAI-compatible APIs, Flux GitOps, and a real demo on Apple Silicon.
In November 2025, KServe joined the CNCF as an incubating project. For a project that started life as KFServing inside Kubeflow back in 2019, this was the formal recognition that ML model serving on Kubernetes had grown up. KServe is now the closest thing the cloud-native ecosystem has to a standard for putting trained models behind an API: scikit-learn, XGBoost, PyTorch, TensorFlow, ONNX, Triton, and increasingly Large Language Models all served through the same InferenceService CRD with the same scale-to-zero, autoscaling, traffic splitting, and observability primitives.
This post walks through what KServe is, what’s new in v0.18, and how to run a real two-model demo on a laptop: a scikit-learn iris classifier (predictive ML, scales to zero) and a Qwen 2.5 0.5B Instruct LLM served through Ollama with an OpenAI-compatible chat API. Everything runs on a Kind cluster on Apple Silicon - no GPU required - and is delivered via Flux GitOps with cert-manager, Istio, Knative Serving, and metrics-server. The full demo lives in srekubecraft-demo/kserve/.
Who Should Read This?#
This post is for:
- Platform Engineers standing up an internal ML serving platform on Kubernetes
- SREs who need to operate model serving like any other production service - probes, metrics, autoscaling, GitOps
- ML Platform Teams evaluating KServe vs alternatives like Seldon Core, BentoML, or rolling their own
- Engineers exploring GenAI on Kubernetes who want to self-host LLMs behind an OpenAI-compatible API instead of paying for hosted inference
- Anyone who saw the CNCF announcement and wants to know what KServe actually does in practice
TL;DR#
Problem: Trained ML models need to be served behind an API. Doing this well at scale means request batching, autoscaling, scale-to-zero, traffic splitting, model versioning, observability, and a unified API across model frameworks. Building this yourself is months of work.
Solution: KServe v0.18 provides an InferenceService CRD that wraps Knative + Istio and serves any of a dozen model formats out of the box. For LLMs it speaks OpenAI’s chat-completions API. For predictive ML it speaks the v1 protocol (:predict). The same controller does both.
Result: A single CRD, two real workloads side by side: sklearn iris classifier scales to zero between requests, Qwen 0.5B LLM stays warm and answers chat completions. Total time on a laptop: ~20 minutes from task setup to live chat. Full demo repo.
The ML Serving Problem#
Putting a trained model behind an API sounds simple. You serialize the model, write a Flask app, expose /predict, deploy it. Done. Until it isn't.
Reality at scale:
- Resource efficiency: a model that gets one request per minute should not run 24/7 on a GPU node. But cold-starting a 7B-parameter LLM takes 30 seconds. So when do you scale down?
- Multiple model formats: your data scientists train in PyTorch, your ETL team uses sklearn, your forecasting team uses XGBoost. Three different serving stacks, three different deployment patterns, three different metric pipelines.
- Autoscaling on the right metric: LLM serving needs autoscaling on token throughput and queue depth, not CPU. Predictive ML autoscales fine on RPS.
- A/B testing and canary rollouts: you want to ship a new model version to 5% of traffic without writing custom routing logic.
- Observability: request count, latency, model load time, prediction histograms - per model, with no extra code in the model server.
- Standard APIs: the world has settled on OpenAI’s chat-completions API for LLMs and a /v1/models/<name>:predict shape for predictive ML. Your serving layer should speak both.
KServe’s pitch is that all of the above is plumbing the platform team should provide once. Data scientists ship models, the platform serves them.
What is KServe?#
KServe is a Kubernetes Custom Resource Definition (InferenceService) plus a controller that creates everything underneath it - Knative Service, Pods, Services, autoscaler config, ingress routes - based on the model format you declare.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: my-model
spec:
predictor:
minReplicas: 0
maxReplicas: 10
model:
modelFormat:
name: sklearn
storageUri: gs://my-bucket/models/iris/v1
This single resource produces:
- A Knative Service running the kserve-sklearnserver runtime container
- An init container that downloads the model from gs://...
- A Knative ingress route at http://my-model.<namespace>.example.com
- Autoscaling based on RPS, with scale-to-zero
- Prometheus metrics on /metrics
- Health probes wired up
For LLMs you swap modelFormat.name: huggingface and storageUri: hf://... and you get a vLLM-backed pod with an OpenAI-compatible chat endpoint at /openai/v1/chat/completions. Same CRD, different runtime.
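To make the "same CRD, different runtime" point concrete, here is a small illustrative sketch. The endpoint_for helper is hypothetical (not a KServe API); the two paths it returns are the ones the built-in runtimes expose, as described above:

```python
# Hypothetical helper mapping a declared modelFormat to the data-plane
# path its default runtime serves. The paths are real; the function is
# only an illustration of the "one CRD, two protocols" idea.
def endpoint_for(model_format: str, name: str) -> str:
    if model_format == "huggingface":
        # vLLM-backed runtime speaks the OpenAI chat-completions API
        return "/openai/v1/chat/completions"
    # predictive runtimes (sklearn, xgboost, ...) speak the v1 protocol
    return f"/v1/models/{name}:predict"

print(endpoint_for("sklearn", "my-model"))    # /v1/models/my-model:predict
print(endpoint_for("huggingface", "my-llm"))  # /openai/v1/chat/completions
```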
What KServe is NOT#
- Not a model training platform: that’s Kubeflow Training, MLflow, or your favorite Jupyter setup. KServe takes already-trained models.
- Not a feature store or data pipeline: Feast, Tecton, and your Spark/Beam jobs do that. KServe consumes the trained model artifact.
- Not a model registry: MLflow, Vertex AI Model Registry, or Hugging Face Hub stores model versions. KServe pulls from them via storageUri.
- Not specifically for LLMs: it serves predictive ML and GenAI through the same controller. The LLM features (vLLM runtime, OpenAI API, KV-cache offload) are additive, not the core.
Architecture#
KServe v0.18 in Serverless mode is a stack. From bottom up:
graph TB
subgraph k8s["Kubernetes"]
subgraph network["Network Layer"]
cilium["Cilium CNI<br/>(eBPF, no kube-proxy)"]
istio["Istio Ingress Gateway<br/>(routes by Host header)"]
end
subgraph serverless["Knative Serving 1.22"]
activator["Activator<br/>(buffers cold-start traffic)"]
autoscaler["Autoscaler / KPA<br/>(decides 0..N replicas)"]
netistio["net-istio<br/>(creates VirtualServices)"]
end
subgraph kserve["KServe v0.18"]
crd["InferenceService<br/>(serving.kserve.io/v1beta1)"]
controller["KServe Controller<br/>(reconciles ISVC -> Knative)"]
runtimes["ClusterServingRuntimes<br/>(sklearn, xgboost, huggingface,<br/>triton, ...)"]
end
subgraph workload["Predictor Pod"]
init["storage-initializer<br/>(downloads model)"]
container["kserve-container<br/>(framework-specific runtime)"]
queue["queue-proxy<br/>(Knative metrics + concurrency)"]
end
end
cilium --> istio
istio --> activator
activator --> netistio
netistio --> queue
queue --> container
crd -.-> controller
controller -.->|creates| workload
controller -.->|configures| autoscaler
runtimes -.->|template for container| container
init -.->|populates volume| container
Control plane vs data plane#
- Control plane: KServe controller + Knative controllers + KServe webhook + ClusterServingRuntimes. They watch your InferenceService and translate it into Knative Services + Deployments + Services + ingress routes.
- Data plane: The actual predictor pod (storage-init init-container, kserve-container, queue-proxy sidecar) and the Istio gateway routing traffic to it.
The InferenceService lifecycle#
1. You apply an InferenceService manifest.
2. The KServe webhook validates it and fills in defaults from the matching ClusterServingRuntime.
3. The KServe controller creates a Knative Service for the predictor.
4. Knative creates a Deployment, a Service, and a Knative Revision.
5. Istio’s net-istio creates a VirtualService routing the public hostname to the Knative Service.
6. The predictor pod’s storage-initializer init-container downloads the model from storageUri.
7. The kserve-container starts, loads the model into memory, and serves predictions on port 8080.
8. The queue-proxy sidecar exposes Knative concurrency metrics so the autoscaler can decide replica count.
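A quick way to watch the lifecycle converge is to poll the ISVC's Ready condition. A minimal sketch, assuming kubectl access to a cluster (the helper name is mine, not KServe's):

```python
import subprocess

def isvc_ready_cmd(name: str, namespace: str) -> list[str]:
    """Build the kubectl command that reads an InferenceService's
    Ready condition. Run it with subprocess.run against a live cluster."""
    jsonpath = '{.status.conditions[?(@.type=="Ready")].status}'
    return [
        "kubectl", "get", "inferenceservice", name,
        "-n", namespace, "-o", f"jsonpath={jsonpath}",
    ]

cmd = isvc_ready_cmd("sklearn-iris", "llm")
# subprocess.run(cmd, capture_output=True, text=True)  # "True" once Ready
print(" ".join(cmd))
```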
Two real workloads, side by side#
The demo deploys two InferenceService resources in the same namespace to show off KServe’s range:
| ISVC | Model | Runtime | Endpoint | Scale-to-zero |
|---|---|---|---|---|
| sklearn-iris | sklearn iris classifier | KServe sklearnserver (built-in) | /v1/models/sklearn-iris:predict | Yes (minReplicas: 0) |
| ollama | Qwen 2.5 0.5B Instruct | Ollama (custom predictor) | /v1/chat/completions (OpenAI) | No (minReplicas: 1) |
Same controller, same Knative ingress, same observability. Different runtimes and different scale profiles because predictive ML and LLM serving have different cost profiles for cold starts.
sklearn-iris - the canonical KServe demo#
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: sklearn-iris
namespace: llm
annotations:
autoscaling.knative.dev/window: "60s"
spec:
predictor:
minReplicas: 0
maxReplicas: 1
nodeSelector:
workload: inference
model:
modelFormat:
name: sklearn
storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
resources:
requests:
cpu: "100m"
memory: 512Mi
limits:
cpu: "1"
memory: 1Gi
The iris model is 4 features in (petal/sepal lengths) and a class out (0, 1, 2). The model file is a few KB. KServe pulls it from a public GCS bucket and the kserve-sklearnserver runtime serves it.
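The v1 protocol body for this model is just an "instances" array of feature rows, and the server answers with a parallel "predictions" array. A minimal sketch of building and decoding it (the feature values are example measurements, not from the demo run):

```python
import json

# KServe v1 protocol request: one feature row per prediction.
# Iris rows are [sepal length, sepal width, petal length, petal width].
payload = {"instances": [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]]}
body = json.dumps(payload)

# The response carries one class index per input row:
response = json.loads('{"predictions": [1, 1]}')
classes = {0: "setosa", 1: "versicolor", 2: "virginica"}
print([classes[p] for p in response["predictions"]])  # ['versicolor', 'versicolor']
```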
Ollama - LLM via the custom-predictor pattern#
KServe ships a built-in kserve-huggingfaceserver runtime that uses vLLM and exposes the OpenAI API. There’s a catch: at v0.18.0 the image is amd64-only. On Apple Silicon (arm64) the predictor pod hits ImagePullBackOff because there’s no arm64 manifest. This is a real, production-relevant limitation right now: if your dev machines and CI runners are M-series Macs, you can’t run the built-in HF runtime locally.
The fix is the custom predictor pattern. KServe lets you bypass the built-in runtimes and bring your own container. We use Ollama, which ships multi-arch images and serves an OpenAI-compatible chat API out of the box.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: ollama
namespace: llm
annotations:
serving.knative.dev/progress-deadline: "10m"
spec:
predictor:
minReplicas: 1
maxReplicas: 1
nodeSelector:
workload: inference
timeout: 600
containers:
- name: kserve-container
image: ollama/ollama:latest
env:
- name: OLLAMA_HOST
value: "0.0.0.0:8080"
command: ["/bin/sh", "-c"]
args:
- |
set -e
ollama serve &
SERVE_PID=$!
until ollama list >/dev/null 2>&1; do sleep 1; done
ollama pull qwen2.5:0.5b
wait "$SERVE_PID"
ports:
- containerPort: 8080
readinessProbe:
httpGet:
path: /
port: 8080
initialDelaySeconds: 60
periodSeconds: 10
failureThreshold: 60
resources:
requests:
cpu: "1"
memory: 2Gi
limits:
cpu: "2"
memory: 4Gi
The container starts the Ollama daemon, waits for it, pulls the qwen2.5:0.5b model (~400 MB), then keeps the daemon as PID 1. The result: an OpenAI-compatible /v1/chat/completions endpoint on port 8080.
We set minReplicas: 1 because cold-starting an LLM ISVC re-pulls the model every time. That’s an acceptable cost for a sklearn pod that loads in 10 seconds; it’s unacceptable for a 400 MB model.
The Stack#
Eight Helm releases delivered via Flux, plus an imperative bootstrap (Cilium + Flux itself + the Knative Operator chart, which doesn’t ship as an OCI artifact). Same pattern as my Dapr citizen-developer post - Flux for everything Flux can consume, plain helm install for the rest.
| Component | Version | Managed by |
|---|---|---|
| Cilium | 1.19.3 | Helm (imperative bootstrap) |
| Flux CD | latest | flux install (imperative bootstrap) |
| metrics-server | 3.13.0 | Flux HelmRelease |
| cert-manager | v1.20.2 | Flux HelmRelease |
| Istio (base + istiod + gateway) | 1.29.2 | Flux HelmRelease |
| Knative Operator | v1.22.0 | Helm (imperative post-Flux) |
| Knative Serving | 1.22.0 | KnativeServing CR (kubectl apply) |
| KServe | v0.18.0 | Flux HelmRelease (3 charts: CRD + resources + runtime-configs) |
| Ollama | latest | Custom predictor container |
| Qwen LLM | qwen2.5:0.5b | Pulled by Ollama on first start |
flowchart LR
subgraph imperative["Imperative Bootstrap (runs once)"]
A["1. Kind Cluster"] --> B["2. Cilium CNI"]
B --> C["3. Flux Install"]
end
subgraph gitops["Flux GitOps"]
C --> D["4. HelmReleases"]
D --> D0["metrics-server"]
D --> D1["cert-manager"]
D --> D2["istio-base"]
D --> D3["istiod"]
D --> D4["istio-ingressgateway"]
D --> D5["kserve-crd"]
D --> D6["kserve-resources"]
D --> D7["kserve-runtime-configs<br/>(ClusterServingRuntimes)"]
end
subgraph post["Post-Flux"]
D --> E["5. Knative Operator + Serving CR"]
E --> F["6. Restart KServe controller<br/>(so it discovers Knative)"]
F --> G["7. Apply sklearn-iris ISVC"]
F --> H["8. Apply ollama ISVC"]
end
style imperative fill:#1a1a2e,color:#e0e0e0
style gitops fill:#16213e,color:#e0e0e0
style post fill:#0f3460,color:#e0e0e0
Three things deserve a callout because they tripped me up while building this.
1. KServe v0.18 splits runtime configs into a separate chart#
In earlier KServe versions, the kserve-resources Helm chart shipped both the controller and the ClusterServingRuntime resources for sklearn, xgboost, HuggingFace, etc. As of v0.18 those runtimes live in a separate kserve-runtime-configs chart that you must install in addition. If you skip it your InferenceService reconciles with error: no runtime found to support predictor with model type: {sklearn <nil>}. The Flux release in this demo installs all three charts: kserve-crd, kserve-resources, and kserve-runtime-configs.
2. KServe controller probes for Knative at startup#
The KServe controller checks for Knative Serving CRDs when it starts. If Knative is installed after the controller (which is what happens when Flux installs everything in parallel), the controller caches “Knative not available” and refuses to reconcile InferenceServices. The error in the controller log is unambiguous:
the resolved deployment mode of InferenceService 'qwen' is Knative,
but Knative Serving is not available
The fix is kubectl rollout restart deployment/kserve-controller-manager -n kserve after Knative comes up. The demo wires this into a flux:wait:kserve task that runs after knative:install.
3. KServe overrides Knative annotations#
Setting autoscaling.knative.dev/min-scale: "0" on the InferenceService metadata does nothing. KServe defaults minReplicas to 1 in its component config and writes that to the underlying Knative Service revision. To get scale-to-zero you must set spec.predictor.minReplicas: 0. I lost a debugging hour on this. The annotation appears to take effect on the ISVC but is silently overridden on the Knative Service.
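A tiny guard in CI can catch this before it costs you the hour it cost me. A hypothetical lint sketch over the rendered ISVC manifest (the function and its message are mine, not KServe tooling):

```python
def lint_min_scale(isvc: dict) -> list[str]:
    """Warn when scale-to-zero is requested via the Knative annotation,
    which KServe silently overrides, instead of spec.predictor.minReplicas."""
    warnings = []
    ann = isvc.get("metadata", {}).get("annotations", {})
    min_replicas = isvc.get("spec", {}).get("predictor", {}).get("minReplicas")
    if ann.get("autoscaling.knative.dev/min-scale") == "0" and min_replicas != 0:
        warnings.append(
            "min-scale annotation is ignored by KServe; "
            "set spec.predictor.minReplicas: 0 instead"
        )
    return warnings

bad = {
    "metadata": {"annotations": {"autoscaling.knative.dev/min-scale": "0"}},
    "spec": {"predictor": {}},
}
print(lint_min_scale(bad))
```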
Setup#
git clone https://github.com/nicknikolakakis/srekubecraft-demo.git
cd srekubecraft-demo/kserve
task setup
End-to-end ~15 minutes on a laptop. Real captured output below.
Phase 1: bootstrap (~5 min)#
$ task bootstrap
Creating Kind cluster 'kserve-llm'...
✓ Ensuring node image (kindest/node:v1.35.0)
✓ Preparing nodes
✓ Starting control-plane
✓ Joining worker nodes
Cluster 'kserve-llm' created.
Installing Cilium 1.19.3...
/¯¯\
/¯¯\__/¯¯\ Cilium: OK
\__/¯¯\__/ Operator: OK
/¯¯\__/¯¯\ Envoy DaemonSet: OK
\__/¯¯\__/ Hubble Relay: OK
DaemonSet cilium Desired: 3, Ready: 3/3
Cilium installed. All nodes ready.
Installing Flux...
✔ all checks passed
Bootstrap complete. Cluster has networking and GitOps.
The Kind cluster has 3 nodes: control plane + a system worker for controllers + an inference worker pinned to model serving via nodeSelector. Cilium replaces kube-proxy with eBPF (which we’ll need anyway if we add network policies later). Flux installs in flux-system.
Phase 2: Flux GitOps (~5 min)#
$ task flux:apply && task flux:wait
Applying Flux sources...
helmrepository.source.toolkit.fluxcd.io/metrics-server created
helmrepository.source.toolkit.fluxcd.io/jetstack created
helmrepository.source.toolkit.fluxcd.io/istio created
helmrepository.source.toolkit.fluxcd.io/kserve created
Applying metrics-server, cert-manager, Istio, KServe HelmReleases...
All HelmReleases applied. Flux is reconciling...
Waiting for cert-manager + Istio + metrics-server...
Waiting for metrics-server in kube-system...
helmrelease.helm.toolkit.fluxcd.io/metrics-server condition met
Waiting for cert-manager in cert-manager...
helmrelease.helm.toolkit.fluxcd.io/cert-manager condition met
Waiting for istio-base in istio-system...
helmrelease.helm.toolkit.fluxcd.io/istio-base condition met
Waiting for istiod in istio-system...
helmrelease.helm.toolkit.fluxcd.io/istiod condition met
Waiting for istio-ingressgateway in istio-system...
helmrelease.helm.toolkit.fluxcd.io/istio-ingressgateway condition met
Six HelmReleases up. kserve-crd is also Ready by now, but kserve-resources and kserve-runtime-configs will wait for Knative.
Phase 3: Knative Serving (~3 min)#
$ task knative:install
Installing Knative Operator v1.22.0...
NAME: knative-operator
LAST DEPLOYED: Sun May 3 08:41:43 2026
STATUS: deployed
Applying KnativeServing CR...
namespace/knative-serving created
knativeserving.operator.knative.dev/knative-serving created
Waiting for Knative Serving to be Ready...
knativeserving.operator.knative.dev/knative-serving condition met
Knative Serving is Ready.
Knative Operator installs from a tarball URL on GitHub Releases (no Helm repo index, no OCI artifact - hence imperative). The KnativeServing CR pins Knative Serving 1.22.0 with Istio as the network layer and enables three feature flags: kubernetes.podspec-nodeselector, kubernetes.podspec-tolerations, and kubernetes.podspec-affinity. Without these, KServe cannot pin pods to specific nodes through the Knative Service.
Phase 4: KServe controller restart + Ready#
$ task flux:wait:kserve
Restarting KServe controller so it discovers Knative Serving...
deployment.apps/kserve-controller-manager restarted
deployment "kserve-controller-manager" successfully rolled out
Forcing kserve-resources reconcile...
helmrelease.helm.toolkit.fluxcd.io/kserve-resources condition met
Forcing kserve-runtime-configs reconcile...
helmrelease.helm.toolkit.fluxcd.io/kserve-runtime-configs condition met
Verifying ClusterServingRuntimes are installed...
NAME DISABLED MODELTYPE CONTAINERS AGE
kserve-huggingfaceserver false huggingface kserve-container 62s
kserve-huggingfaceserver-multinode false huggingface kserve-container 62s
kserve-lgbserver false lightgbm kserve-container 62s
kserve-mlserver false sklearn kserve-container 62s
kserve-paddleserver false paddle kserve-container 62s
kserve-pmmlserver false pmml kserve-container 62s
kserve-predictiveserver false sklearn kserve-container 62s
kserve-sklearnserver false sklearn kserve-container 62s
kserve-tensorflow-serving false tensorflow kserve-container 62s
kserve-torchserve false pytorch kserve-container 62s
kserve-tritonserver false tensorrt kserve-container 62s
kserve-xgbserver false xgboost kserve-container 62s
Controller restart unblocks ISVC reconciliation. The runtime-configs chart installs all 12 ClusterServingRuntime resources at once: sklearn (and its mlserver variant), xgboost, lightgbm, paddle, pmml, the generic predictive runtime, tensorflow, torchserve (pytorch), tritonserver (tensorrt), and the two huggingface variants.
Phase 5: deploy both InferenceServices#
$ task isvc:apply && task isvc:wait
namespace/llm created
inferenceservice.serving.kserve.io/sklearn-iris created
Waiting for InferenceService 'sklearn-iris' to be Ready (timeout 15m)...
inferenceservice.serving.kserve.io/sklearn-iris condition met
InferenceService is Ready.
$ task isvc:apply-ollama && task isvc:wait-ollama
inferenceservice.serving.kserve.io/ollama created
Ollama ISVC created. First start pulls qwen2.5:0.5b (~400 MB) - takes 2-5 min.
Waiting for Ollama InferenceService to be Ready (timeout 15m)...
inferenceservice.serving.kserve.io/ollama condition met
Ollama InferenceService is Ready.
Final state:
$ flux get helmreleases -A
NAMESPACE NAME REVISION READY MESSAGE
cert-manager cert-manager v1.20.2 True Helm install succeeded
istio-system istio-base 1.29.2 True Helm install succeeded
istio-system istio-ingressgateway 1.29.2 True Helm install succeeded
istio-system istiod 1.29.2 True Helm install succeeded
kserve kserve-crd v0.18.0 True Helm install succeeded
kserve kserve-resources v0.18.0 True Helm install succeeded
kserve kserve-runtime-configs v0.18.0 True Helm install succeeded
kube-system metrics-server 3.13.0 True Helm install succeeded
$ kubectl get isvc -n llm
NAME URL READY AGE
ollama http://ollama.llm.example.com True 5m
sklearn-iris http://sklearn-iris.llm.example.com True 7m
$ kubectl top pods -n llm
NAME CPU(cores) MEMORY(bytes)
ollama-predictor-00001-deployment-... 1m 712Mi
sklearn-iris-predictor-00001-deployment-... 6m 197Mi
Total cluster memory at rest: ~5.9 GB on a 7.75 GB Docker allocation. The Ollama pod holds 712 MiB because the model is loaded; the sklearn pod is 197 MiB.
Live testing#
sklearn-iris prediction#
$ task test:predict
==> InferenceService URL: http://sklearn-iris.llm.example.com
==> Sending prediction (2 iris flowers)...
{
"predictions": [
1,
1
]
}
Two iris flowers in (4 floats each: sepal length, sepal width, petal length, petal width). Two class predictions out (1 = versicolor). Sub-second response.
This is the KServe v1 protocol. The endpoint pattern is /v1/models/<name>:predict. The same URL is reachable from outside the cluster through the Istio gateway with a Host: sklearn-iris.llm.example.com header, and from inside the cluster at http://sklearn-iris-predictor.llm.svc.cluster.local.
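Putting the two access paths side by side: a sketch of how a client addresses the model from outside versus inside the cluster. The gateway address is an assumption; with the Kind demo you would port-forward istio-ingressgateway to localhost first.

```python
import json
import urllib.request

# External path: the Istio gateway routes by Host header.
# GATEWAY is an assumption for a local port-forward setup.
GATEWAY = "http://localhost:8080"
req = urllib.request.Request(
    f"{GATEWAY}/v1/models/sklearn-iris:predict",
    data=json.dumps({"instances": [[6.8, 2.8, 4.8, 1.4]]}).encode(),
    headers={
        "Host": "sklearn-iris.llm.example.com",
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(req)  # send once the gateway is reachable

# In-cluster path: the predictor Service resolves directly, no Host
# header needed.
internal = ("http://sklearn-iris-predictor.llm.svc.cluster.local"
            "/v1/models/sklearn-iris:predict")
print(req.get_method(), req.get_header("Host"))
```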
Ollama chat completion (OpenAI-compatible)#
$ task test:chat-ollama
==> Sending chat completion request to Ollama via in-cluster curl...
(CPU LLM inference can take 30-90s for the first token)
{
"id": "chatcmpl-805",
"object": "chat.completion",
"created": 1777822980,
"model": "qwen2.5:0.5b",
"system_fingerprint": "fp_ollama",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "A Kubernetes Pod is a collection of containers that serve as units within a cluster and can be easily managed by the Kubernetes API server."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 31,
"completion_tokens": 27,
"total_tokens": 58
}
}
The response shape matches OpenAI’s chat-completions API exactly. Any client that speaks the OpenAI SDK works against this endpoint with no modifications, just point at http://ollama.llm.example.com/v1/chat/completions instead of https://api.openai.com/v1/chat/completions. Latency on CPU for a 0.5B model is ~5-10s for short responses, ~30-90s for longer ones.
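Consuming that response programmatically is plain chat-completions parsing. The SDK call in the comment is a sketch (it assumes the openai package and a reachable endpoint); the parse below runs against the demo response shown above:

```python
import json

# With the OpenAI SDK, only base_url changes (sketch, not executed here):
#   from openai import OpenAI
#   client = OpenAI(base_url="http://ollama.llm.example.com/v1", api_key="unused")
#   client.chat.completions.create(model="qwen2.5:0.5b", messages=[...])

# Parsing the chat-completion JSON from the demo run above:
raw = """{"model": "qwen2.5:0.5b",
 "choices": [{"index": 0,
   "message": {"role": "assistant", "content": "A Kubernetes Pod is ..."},
   "finish_reason": "stop"}],
 "usage": {"prompt_tokens": 31, "completion_tokens": 27, "total_tokens": 58}}"""
resp = json.loads(raw)
answer = resp["choices"][0]["message"]["content"]
print(answer, resp["usage"]["total_tokens"])
```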
Scale-to-zero#
The sklearn-iris ISVC has minReplicas: 0. After 30 seconds of no traffic (the scale-to-zero-grace-period we set in the KnativeServing CR) the predictor pod terminates:
$ task test:scale-to-zero
18:50:53 active=1 sklearn-iris-predictor-00002-deployment-... 2/2 Running
18:51:29 active=0 (no pods)
SCALED-TO-ZERO at 18:51:29
The Ollama ISVC has minReplicas: 1 so its pod stays running. This is the right pattern for LLM serving: cold-starting a 7B-parameter model is not something you want to do per request.
Cold start#
Send a prediction request to the now-zero-pods sklearn-iris and Knative’s Activator buffers it, scales the deployment from 0 to 1, waits for the pod to be Ready, and forwards the buffered request:
$ task test:cold-start
==> Confirming pod count before request:
(none - scaled to zero)
==> Sending prediction request and timing it...
{
"predictions": [1]
}
real 0m12.072s
user 0m0.000s
sys 0m0.000s
12 seconds end-to-end for a cold-start prediction. That includes pod scheduling, image pull from local cache, storage-initializer downloading the (tiny) sklearn model, kserve-container startup, model load, and the actual inference. For a small classifier this is fast; for an LLM it would be measured in minutes (image pull alone for vLLM is several GB).
For comparison, a warm request through the same in-cluster path completes in ~3.5 seconds (mostly kubectl run overhead; the actual inference is sub-second).
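To compare cold and warm latency yourself, a minimal timing wrapper is enough (the request function is whatever client call you use; the stand-in lambda here is illustrative):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) - handy for timing
    a cold-start request against a warm one with the same client code."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in for the actual HTTP prediction call:
result, elapsed = timed(lambda: {"predictions": [1]})
print(result, f"{elapsed:.3f}s")
```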
What KServe does well#
| Strength | Detail |
|---|---|
| Single CRD, many runtimes | One InferenceService shape works for sklearn, XGBoost, PyTorch, TensorFlow, ONNX, Triton, HuggingFace/vLLM. Switching frameworks is a modelFormat.name change. |
| OpenAI-compatible LLM API | The HF runtime exposes /openai/v1/chat/completions and /openai/v1/completions out of the box. SDKs that speak OpenAI work unchanged. |
| Real scale-to-zero | Knative Serving’s Activator buffers cold-start traffic. No requests dropped while the pod scales from 0 to 1. |
| Pluggable storage | storageUri: accepts gs://, s3://, https://, pvc://, and hf://. Storage-initializer init-container handles credentials via Service Account annotations. |
| Multiple model formats per cluster | Built-in ClusterServingRuntime resources for 12+ frameworks. You can add your own. |
| Custom predictor escape hatch | When a built-in runtime doesn’t fit (or has no arm64 image), drop in any container. KServe just creates the Knative Service around it. |
| InferenceGraph CRD | Chain pre-processor + model + post-processor in a single graph. Useful for ensembles or when you need a transformer in front of a model. |
| Observability for free | Prometheus metrics exposed on /metrics from the queue-proxy. RPS, latency p50/p95/p99, request volume per revision. |
| Traffic splitting | Knative gives you traffic blocks - send 90% to revision-N, 10% to revision-N+1. Standard canary pattern, no service mesh CRDs needed. |
What KServe does less well#
| Limitation | Detail |
|---|---|
| Stack heaviness | KServe Serverless mode requires Knative + Istio + cert-manager + the KServe stack itself. That’s 30+ pods of control plane before you serve a single model. RawDeployment mode trades scale-to-zero for a lighter stack. |
| Three Helm charts | As of v0.18 you install kserve-crd, kserve-resources, and kserve-runtime-configs separately. Easy to skip one and not realize until your ISVC fails to reconcile. |
| Controller startup ordering | KServe controller caches Knative-availability at startup. Install Knative after the controller and reconciliation silently breaks. Fix is a rollout restart. |
| arm64 LLM image gap | kserve/huggingfaceserver is amd64-only at v0.18.0. On Apple Silicon the built-in vLLM runtime hits ImagePullBackOff. The custom-predictor pattern works around this but loses the KServe-native model loader. |
| min-scale annotation override | Annotations on the ISVC metadata are silently overridden by KServe’s component defaults. Use spec.predictor.minReplicas instead. The docs do not flag this clearly. |
| GPU scheduling complexity | KServe doesn’t solve GPU sharing or fractional GPU scheduling. You need Volcano, Kueue, or NVIDIA’s GPU Operator on top. |
| Documentation lag | The docs sometimes lag the code - v0.17 vs v0.18 vs nightly examples are mixed across the website. The kserve-deps.env file in the source is the most reliable version-pin reference. |
| Knative + Istio sticker shock | If you don’t already run Knative or Istio, adopting them just for KServe is a big lift. RawDeployment mode is more accessible. |
When to use KServe#
Use it when:#
- You’re serving multiple model frameworks and want one platform pattern instead of one per framework
- You need scale-to-zero for predictive ML to control cost on rarely-used models
- You want OpenAI-compatible LLM serving without writing your own gateway in front of vLLM
- You’re already running Knative or Istio: the marginal cost is much lower
- You need traffic splitting / canary rollouts for model versions
- You’re on amd64 / GPU nodes: the built-in HF runtime is the smoothest path
Consider alternatives when:#
- You only serve LLMs and want the lightest stack - run Ollama, vLLM, or TGI directly as a Deployment+Service. No KServe, no Knative.
- You’re on Apple Silicon for development and want the built-in HF runtime - wait for arm64 images, or use the custom-predictor pattern with Ollama like this demo does.
- You don’t want Knative or Istio - try Seldon Core v2 (which uses Kafka and Envoy directly) or BentoML.
- Your scale is one model, one team - the value of a multi-runtime, multi-team platform is small. A Flask app behind a Service is fine.
- You need fractional GPU scheduling - layer Volcano or Kueue on top, or use Run:ai as the scheduler.
Production considerations#
The demo runs on a laptop with everything pinned to a single inference node. A real platform looks different in a few key ways.
GPU nodepools#
For LLM serving in production you want a GPU nodepool with nvidia.com/gpu resource limits, the NVIDIA GPU Operator installed, and the predictor pod requesting nvidia.com/gpu: "1". The KServe HF runtime on GPU uses vLLM’s PagedAttention and KV-cache offload, which is dramatically faster than CPU inference. A 7B-parameter model on an A10 typically serves 50-200 tokens/sec; on CPU you’re looking at 1-5 tokens/sec.
Model caching#
Cold-starting an LLM pod re-pulls the model (or downloads it from hf://). KServe v0.17+ added the LocalModelCache CRD which caches models on each node’s local disk so the second cold-start on the same node is fast. For demo purposes, persist models in a PVC mounted at /root/.ollama (Ollama) or /mnt/models (HF runtime).
Autoscaling on the right metric#
Knative’s default autoscaler scales on concurrency (in-flight requests). For LLM serving this is the wrong metric - one user can hold a connection open for 30s while generating tokens. Switch to the HPA-class autoscaler and scale on token throughput or queue depth via Prometheus Adapter.
Authentication#
KServe doesn’t ship auth. Front it with oauth2-proxy (which also has an arm64 image, by the way) for OIDC, or use the Envoy AI Gateway integration that KServe added in v0.18 for AI-aware routing and rate limiting.
Observability#
The queue-proxy exposes /metrics on port 9090. Scrape it with Prometheus, build dashboards on revision_request_count, revision_request_latencies, and request_concurrency. The KServe controller exposes its own metrics on port 8080 - reconcile loop latency, controller errors, ISVC count by status.
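A quick sense of what scraping that endpoint yields: a sketch that aggregates Prometheus text-format counters per revision. The metric name is the one referenced above; the label values and counts are made up for illustration.

```python
# Illustrative lines of the kind queue-proxy exposes on /metrics
# (label values and counts are invented; the metric name is real):
SAMPLE = """\
revision_request_count{revision_name="sklearn-iris-predictor-00001",response_code="200"} 42
revision_request_count{revision_name="sklearn-iris-predictor-00001",response_code="500"} 1
revision_request_count{revision_name="ollama-predictor-00001",response_code="200"} 7
"""

def counts_by_revision(text: str) -> dict[str, float]:
    """Sum counter samples per revision_name label."""
    totals: dict[str, float] = {}
    for line in text.strip().splitlines():
        labels, value = line.rsplit(" ", 1)
        revision = labels.split('revision_name="')[1].split('"')[0]
        totals[revision] = totals.get(revision, 0.0) + float(value)
    return totals

print(counts_by_revision(SAMPLE))
```

In practice you would let Prometheus do this aggregation with a `sum by (revision_name)` query; the sketch just shows the raw data shape.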
Cost#
A laptop demo costs nothing. A production cluster running always-on GPU pods is expensive. Three knobs:
- minReplicas: 0 for predictive ML - even small CPU pods running 24/7 add up. Scale to zero between requests.
- GPU node pools with autoscaling - cluster-autoscaler or Karpenter scaling the node pool itself, not just the pods.
- Right-size the model - serving Llama-70B for a chatbot that mostly asks “what’s the capital of France?” is wasteful. Smaller distilled models often suffice.
Conclusion#
KServe is now a CNCF incubating project, and after running it through a real laptop demo it earns the badge. The InferenceService CRD is the right level of abstraction. Predictive ML and GenAI through the same controller, the same observability, the same autoscaling primitives - that’s exactly what a Kubernetes-native ML platform should look like.
The arm64 gap on the HF runtime is real but the custom-predictor pattern works around it cleanly. The three controller-restart and chart-split gotchas cost a debugging session each but are all in the Taskfile now so the next person doesn’t trip on them. Total time from git clone to live LLM chat: ~20 minutes.
If you’re standing up an ML serving platform on Kubernetes in 2026 and you’re not already committed to Seldon or BentoML, KServe is the obvious starting point. Especially now that it has the CNCF stamp and the LLM features are first-class.
The full demo (Flux manifests, Taskfile, Mermaid diagrams, ISVC manifests for sklearn, Ollama, and the reference HF runtime config for amd64 hosts) is at github.com/nicknikolakakis/srekubecraft-demo/tree/main/kserve.

