
KServe - Production ML Serving on Kubernetes, from sklearn to LLMs

// How to use KServe v0.18 in Serverless mode to serve both classical ML models and LLMs on Kubernetes - with scale-to-zero, OpenAI-compatible APIs, Flux GitOps, and a real demo on Apple Silicon.


In November 2025, KServe joined the CNCF as an incubating project. For a project that started life as KFServing inside Kubeflow back in 2019, this was the formal recognition that ML model serving on Kubernetes had grown up. KServe is now the closest thing the cloud-native ecosystem has to a standard for putting trained models behind an API: scikit-learn, XGBoost, PyTorch, TensorFlow, ONNX, Triton, and increasingly Large Language Models all served through the same InferenceService CRD with the same scale-to-zero, autoscaling, traffic splitting, and observability primitives.

This post walks through what KServe is, what’s new in v0.18, and how to run a real two-model demo on a laptop: a scikit-learn iris classifier (predictive ML, scales to zero) and a Qwen 2.5 0.5B Instruct LLM served through Ollama with an OpenAI-compatible chat API. Everything runs on a Kind cluster on Apple Silicon - no GPU required - and is delivered via Flux GitOps with cert-manager, Istio, Knative Serving, and metrics-server. The full demo lives in srekubecraft-demo/kserve/.

Who Should Read This?#

This post is for:

  • Platform Engineers standing up an internal ML serving platform on Kubernetes
  • SREs who need to operate model serving like any other production service - probes, metrics, autoscaling, GitOps
  • ML Platform Teams evaluating KServe vs alternatives like Seldon Core, BentoML, or rolling their own
  • Engineers exploring GenAI on Kubernetes who want to self-host LLMs behind an OpenAI-compatible API instead of paying for hosted inference
  • Anyone who saw the CNCF announcement and wants to know what KServe actually does in practice

TL;DR#

Problem: Trained ML models need to be served behind an API. Doing this well at scale means request batching, autoscaling, scale-to-zero, traffic splitting, model versioning, observability, and a unified API across model frameworks. Building this yourself is months of work.

Solution: KServe v0.18 provides an InferenceService CRD that wraps Knative + Istio and serves any of a dozen model formats out of the box. For LLMs it speaks OpenAI’s chat-completions API. For predictive ML it speaks the v1 protocol (:predict). The same controller does both.

Result: A single CRD, two real workloads side by side: sklearn iris classifier scales to zero between requests, Qwen 0.5B LLM stays warm and answers chat completions. Total time on a laptop: ~20 minutes from task setup to live chat. Full demo repo.


The ML Serving Problem#

Putting a trained model behind an API sounds simple. You serialize the model, write a Flask app, expose /predict, deploy it. Done. Until it isn’t.

Reality at scale:

  • Resource efficiency: a model that gets one request per minute should not run 24/7 on a GPU node. But cold-starting a 7B-parameter LLM takes 30 seconds. So when do you scale down?
  • Multiple model formats: your data scientists train in PyTorch, your ETL team uses sklearn, your forecasting team uses XGBoost. Three different serving stacks, three different deployment patterns, three different metric pipelines.
  • Autoscaling on the right metric: LLM serving needs autoscaling on token throughput and queue depth, not CPU. Predictive ML autoscales fine on RPS.
  • A/B testing and canary rollouts: you want to ship a new model version to 5% of traffic without writing custom routing logic.
  • Observability: request count, latency, model load time, prediction histograms - per model, with no extra code in the model server.
  • Standard APIs: the world has settled on OpenAI’s chat-completions API for LLMs and a /v1/models/<name>:predict shape for predictive ML. Your serving layer should speak both.

KServe’s pitch is that all of the above is plumbing the platform team should provide once. Data scientists ship models, the platform serves them.

What is KServe?#

KServe is a Kubernetes Custom Resource Definition (InferenceService) plus a controller that creates everything underneath it - Knative Service, Pods, Services, autoscaler config, ingress routes - based on the model format you declare.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model
spec:
  predictor:
    minReplicas: 0
    maxReplicas: 10
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://my-bucket/models/iris/v1

This single resource produces:

  • A Knative Service running the kserve-sklearnserver runtime container
  • An init container that downloads the model from gs://...
  • A Knative ingress route at http://my-model.<namespace>.example.com
  • Autoscaling based on RPS, with scale-to-zero
  • Prometheus metrics on /metrics
  • Health probes wired up

For LLMs you swap modelFormat.name: huggingface and storageUri: hf://... and you get a vLLM-backed pod with an OpenAI-compatible chat endpoint at /openai/v1/chat/completions. Same CRD, different runtime.
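For comparison, here is a minimal sketch of that HuggingFace-runtime variant, modeled on the upstream KServe examples and intended for an amd64 or GPU host. The name, model id, and resource numbers are placeholders, not part of this demo:

kubectl apply -f - <<'EOF'
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: qwen-chat                      # hypothetical name
  namespace: llm
spec:
  predictor:
    minReplicas: 1                     # keep LLMs warm; cold starts are expensive
    model:
      modelFormat:
        name: huggingface              # selects the kserve-huggingfaceserver (vLLM) runtime
      args:
        - --model_name=qwen-chat
        - --model_id=Qwen/Qwen2.5-0.5B-Instruct
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
        limits:
          cpu: "4"
          memory: 8Gi
          nvidia.com/gpu: "1"          # assumes a GPU nodepool; drop for CPU-only tests
EOF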

What KServe is NOT#

  • Not a model training platform: that’s Kubeflow Training, MLflow, or your favorite Jupyter setup. KServe takes already-trained models.
  • Not a feature store or data pipeline: Feast, Tecton, and your Spark/Beam jobs do that. KServe consumes the trained model artifact.
  • Not a model registry: MLflow, Vertex AI Model Registry, or Hugging Face Hub stores model versions. KServe pulls from them via storageUri.
  • Not specifically for LLMs: it serves predictive ML and GenAI through the same controller. The LLM features (vLLM runtime, OpenAI API, KV-cache offload) are additive, not the core.

Architecture#

KServe v0.18 in Serverless mode is a stack. From bottom up:

graph TB
    subgraph k8s["Kubernetes"]
        subgraph network["Network Layer"]
            cilium["Cilium CNI<br/>(eBPF, no kube-proxy)"]
            istio["Istio Ingress Gateway<br/>(routes by Host header)"]
        end

        subgraph serverless["Knative Serving 1.22"]
            activator["Activator<br/>(buffers cold-start traffic)"]
            autoscaler["Autoscaler / KPA<br/>(decides 0..N replicas)"]
            netistio["net-istio<br/>(creates VirtualServices)"]
        end

        subgraph kserve["KServe v0.18"]
            crd["InferenceService<br/>(serving.kserve.io/v1beta1)"]
            controller["KServe Controller<br/>(reconciles ISVC -> Knative)"]
            runtimes["ClusterServingRuntimes<br/>(sklearn, xgboost, huggingface,<br/>triton, ...)"]
        end

        subgraph workload["Predictor Pod"]
            init["storage-initializer<br/>(downloads model)"]
            container["kserve-container<br/>(framework-specific runtime)"]
            queue["queue-proxy<br/>(Knative metrics + concurrency)"]
        end
    end

    cilium --> istio
    istio --> activator
    activator --> netistio
    netistio --> queue
    queue --> container

    crd -.-> controller
    controller -.->|creates| workload
    controller -.->|configures| autoscaler
    runtimes -.->|template for container| container
    init -.->|populates volume| container

Control plane vs data plane#

  • Control plane: KServe controller + Knative controllers + KServe webhook + ClusterServingRuntimes. They watch your InferenceService and translate it into Knative Services + Deployments + Services + ingress routes.
  • Data plane: The actual predictor pod (storage-init init-container, kserve-container, queue-proxy sidecar) and the Istio gateway routing traffic to it.

The InferenceService lifecycle#

  1. You apply an InferenceService manifest.
  2. KServe webhook validates it, fills in defaults from the matching ClusterServingRuntime.
  3. KServe controller creates a Knative Service for the predictor.
  4. Knative creates a Deployment, Service, and a Knative Revision.
  5. Istio’s net-istio creates a VirtualService routing the public hostname to the Knative Service.
  6. The predictor pod’s storage-initializer init-container downloads the model from storageUri.
  7. The kserve-container starts, loads the model into memory, and serves predictions on port 8080.
  8. The queue-proxy sidecar exposes Knative concurrency metrics so the autoscaler can decide replica count.
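All of those intermediates are ordinary Kubernetes resources, so you can watch the lifecycle happen. A few illustrative commands, assuming the my-model example above (pod name placeholder left as-is):

kubectl get isvc my-model                             # Ready condition + public URL
kubectl get ksvc,revision,deploy,pods                 # what the controller created underneath
kubectl logs <predictor-pod> -c storage-initializer   # model download from storageUri
kubectl logs <predictor-pod> -c kserve-container      # model load + serving on :8080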

Two real workloads, side by side#

The demo deploys two InferenceService resources in the same namespace to show off KServe’s range:

| ISVC | Model | Runtime | Endpoint | Scale-to-zero |
|------|-------|---------|----------|---------------|
| sklearn-iris | sklearn iris classifier | KServe sklearnserver (built-in) | /v1/models/sklearn-iris:predict | Yes (minReplicas: 0) |
| ollama | Qwen 2.5 0.5B Instruct | Ollama (custom predictor) | /v1/chat/completions (OpenAI) | No (minReplicas: 1) |

Same controller, same Knative ingress, same observability. Different runtimes and different scale profiles because predictive ML and LLM serving have different cost profiles for cold starts.

sklearn-iris - the canonical KServe demo#

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: llm
  annotations:
    autoscaling.knative.dev/window: "60s"
spec:
  predictor:
    minReplicas: 0
    maxReplicas: 1
    nodeSelector:
      workload: inference
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
      resources:
        requests:
          cpu: "100m"
          memory: 512Mi
        limits:
          cpu: "1"
          memory: 1Gi

The iris model takes 4 features in (sepal and petal length/width) and returns a class out (0, 1, 2). The model file is a few KB. KServe pulls it from a public GCS bucket and the kserve-sklearnserver runtime serves it.

Ollama - LLM via the custom-predictor pattern#

KServe ships a built-in kserve-huggingfaceserver runtime that uses vLLM and exposes the OpenAI API. There’s a catch: at v0.18.0 the image is amd64-only. On Apple Silicon (arm64) the predictor pod hits ImagePullBackOff because there’s no arm64 manifest. This is a real, production-relevant limitation right now: if your dev machines and CI runners are M-series Macs, you can’t run the built-in HF runtime locally.

The fix is the custom predictor pattern. KServe lets you bypass the built-in runtimes and bring your own container. We use Ollama, which ships multi-arch images and serves an OpenAI-compatible chat API out of the box.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: ollama
  namespace: llm
  annotations:
    serving.knative.dev/progress-deadline: "10m"
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 1
    nodeSelector:
      workload: inference
    timeout: 600
    containers:
      - name: kserve-container
        image: ollama/ollama:latest
        env:
          - name: OLLAMA_HOST
            value: "0.0.0.0:8080"
        command: ["/bin/sh", "-c"]
        args:
          - |
            set -e
            ollama serve &
            SERVE_PID=$!
            until ollama list >/dev/null 2>&1; do sleep 1; done
            ollama pull qwen2.5:0.5b
            wait "$SERVE_PID"
        ports:
          - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /
            port: 8080
          initialDelaySeconds: 60
          periodSeconds: 10
          failureThreshold: 60
        resources:
          requests:
            cpu: "1"
            memory: 2Gi
          limits:
            cpu: "2"
            memory: 4Gi

The container starts the Ollama daemon, waits for it, pulls the qwen2.5:0.5b model (~400 MB), then keeps the daemon as PID 1. The result: an OpenAI-compatible /v1/chat/completions endpoint on port 8080.

We set minReplicas: 1 because cold-starting an LLM ISVC re-pulls the model every time. Acceptable cost for a sklearn pod that loads in 10s, unacceptable for a 400 MB model.

The Stack#

Eight Helm releases delivered via Flux, plus an imperative bootstrap (Cilium + Flux itself + the Knative Operator chart, which doesn’t ship as an OCI artifact). Same pattern as my Dapr citizen-developer post - Flux for everything Flux can consume, plain helm install for the rest.

| Component | Version | Managed by |
|-----------|---------|------------|
| Cilium | 1.19.3 | Helm (imperative bootstrap) |
| Flux CD | latest | flux install (imperative bootstrap) |
| metrics-server | 3.13.0 | Flux HelmRelease |
| cert-manager | v1.20.2 | Flux HelmRelease |
| Istio (base + istiod + gateway) | 1.29.2 | Flux HelmRelease |
| Knative Operator | v1.22.0 | Helm (imperative post-Flux) |
| Knative Serving | 1.22.0 | KnativeServing CR (kubectl apply) |
| KServe | v0.18.0 | Flux HelmRelease (3 charts: CRD + resources + runtime-configs) |
| Ollama | latest | Custom predictor container |
| Qwen LLM | qwen2.5:0.5b | Pulled by Ollama on first start |

flowchart LR
    subgraph imperative["Imperative Bootstrap (runs once)"]
        A["1. Kind Cluster"] --> B["2. Cilium CNI"]
        B --> C["3. Flux Install"]
    end

    subgraph gitops["Flux GitOps"]
        C --> D["4. HelmReleases"]
        D --> D0["metrics-server"]
        D --> D1["cert-manager"]
        D --> D2["istio-base"]
        D --> D3["istiod"]
        D --> D4["istio-ingressgateway"]
        D --> D5["kserve-crd"]
        D --> D6["kserve-resources"]
        D --> D7["kserve-runtime-configs<br/>(ClusterServingRuntimes)"]
    end

    subgraph post["Post-Flux"]
        D --> E["5. Knative Operator + Serving CR"]
        E --> F["6. Restart KServe controller<br/>(so it discovers Knative)"]
        F --> G["7. Apply sklearn-iris ISVC"]
        F --> H["8. Apply ollama ISVC"]
    end

    style imperative fill:#1a1a2e,color:#e0e0e0
    style gitops fill:#16213e,color:#e0e0e0
    style post fill:#0f3460,color:#e0e0e0

Three things deserve a callout because they tripped me up while building this.

1. KServe v0.18 splits runtime configs into a separate chart#

In earlier KServe versions, the kserve-resources Helm chart shipped both the controller and the ClusterServingRuntime resources for sklearn, xgboost, HuggingFace, etc. As of v0.18 those runtimes live in a separate kserve-runtime-configs chart that you must install in addition. If you skip it, your InferenceService fails to reconcile with the error no runtime found to support predictor with model type: {sklearn <nil>}. The Flux release in this demo installs all three charts: kserve-crd, kserve-resources, and kserve-runtime-configs.
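If you are wiring this up with Flux yourself, the extra release is small. A sketch of what the third HelmRelease can look like; the chart name and version match this demo, while the HelmRepository namespace and interval are assumptions:

kubectl apply -f - <<'EOF'
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: kserve-runtime-configs
  namespace: kserve
spec:
  interval: 10m
  chart:
    spec:
      chart: kserve-runtime-configs    # the chart that ships the ClusterServingRuntimes
      version: "v0.18.0"
      sourceRef:
        kind: HelmRepository
        name: kserve                   # the HelmRepository created in the Flux phase
        namespace: flux-system
EOF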

2. KServe controller probes for Knative at startup#

The KServe controller checks for Knative Serving CRDs when it starts. If Knative is installed after the controller (which is what happens when Flux installs everything in parallel), the controller caches “Knative not available” and refuses to reconcile InferenceServices. The error in the controller log is unambiguous:

the resolved deployment mode of InferenceService 'qwen' is Knative,
but Knative Serving is not available

The fix is kubectl rollout restart deployment/kserve-controller-manager -n kserve after Knative comes up. The demo wires this into a flux:wait:kserve task that runs after knative:install.

3. KServe overrides Knative annotations#

Setting autoscaling.knative.dev/min-scale: "0" on the InferenceService metadata does nothing. KServe defaults minReplicas to 1 in its component config and writes that to the underlying Knative Service revision. To get scale-to-zero you must set spec.predictor.minReplicas: 0. I lost a debugging hour on this. The annotation appears to take effect on the ISVC but is silently overridden on the Knative Service.
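A quick way to see the override is to compare the annotation on the ISVC with what KServe actually wrote to the generated Knative Service (names follow the demo’s sklearn-iris ISVC):

# annotation on the InferenceService: present, but not authoritative
kubectl get isvc sklearn-iris -n llm \
  -o jsonpath='{.metadata.annotations.autoscaling\.knative\.dev/min-scale}'

# what KServe wrote to the Knative Service revision template (this is what the autoscaler reads)
kubectl get ksvc sklearn-iris-predictor -n llm \
  -o jsonpath='{.spec.template.metadata.annotations.autoscaling\.knative\.dev/min-scale}'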

Setup#

git clone https://github.com/nicknikolakakis/srekubecraft-demo.git
cd srekubecraft-demo/kserve
task setup

End-to-end ~15 minutes on a laptop. Real captured output below.

Phase 1: bootstrap (~5 min)#

$ task bootstrap
Creating Kind cluster 'kserve-llm'...
 ✓ Ensuring node image (kindest/node:v1.35.0)
 ✓ Preparing nodes
 ✓ Starting control-plane
 ✓ Joining worker nodes
Cluster 'kserve-llm' created.

Installing Cilium 1.19.3...
   /¯¯\
 /¯¯\__/¯¯\    Cilium:             OK
 \__/¯¯\__/    Operator:           OK
 /¯¯\__/¯¯\    Envoy DaemonSet:    OK
 \__/¯¯\__/    Hubble Relay:       OK
DaemonSet         cilium             Desired: 3, Ready: 3/3
Cilium installed. All nodes ready.

Installing Flux...
✔ all checks passed
Bootstrap complete. Cluster has networking and GitOps.

The Kind cluster has 3 nodes: control plane + a system worker for controllers + an inference worker pinned to model serving via nodeSelector. Cilium replaces kube-proxy with eBPF (which we’ll need anyway if we add network policies later). Flux installs in flux-system.

Phase 2: Flux GitOps (~5 min)#

$ task flux:apply && task flux:wait
Applying Flux sources...
helmrepository.source.toolkit.fluxcd.io/metrics-server created
helmrepository.source.toolkit.fluxcd.io/jetstack created
helmrepository.source.toolkit.fluxcd.io/istio created
helmrepository.source.toolkit.fluxcd.io/kserve created
Applying metrics-server, cert-manager, Istio, KServe HelmReleases...
All HelmReleases applied. Flux is reconciling...
Waiting for cert-manager + Istio + metrics-server...
  Waiting for metrics-server in kube-system...
helmrelease.helm.toolkit.fluxcd.io/metrics-server condition met
  Waiting for cert-manager in cert-manager...
helmrelease.helm.toolkit.fluxcd.io/cert-manager condition met
  Waiting for istio-base in istio-system...
helmrelease.helm.toolkit.fluxcd.io/istio-base condition met
  Waiting for istiod in istio-system...
helmrelease.helm.toolkit.fluxcd.io/istiod condition met
  Waiting for istio-ingressgateway in istio-system...
helmrelease.helm.toolkit.fluxcd.io/istio-ingressgateway condition met

Six HelmReleases up. kserve-crd is also Ready by now, but kserve-resources and kserve-runtime-configs will wait for Knative.

Phase 3: Knative Serving (~3 min)#

$ task knative:install
Installing Knative Operator v1.22.0...
NAME: knative-operator
LAST DEPLOYED: Sun May  3 08:41:43 2026
STATUS: deployed
Applying KnativeServing CR...
namespace/knative-serving created
knativeserving.operator.knative.dev/knative-serving created
Waiting for Knative Serving to be Ready...
knativeserving.operator.knative.dev/knative-serving condition met
Knative Serving is Ready.

Knative Operator installs from a tarball URL on GitHub Releases (no Helm repo index, no OCI artifact - hence imperative). The KnativeServing CR pins Knative Serving 1.22.0 with Istio as the network layer and enables three feature flags: kubernetes.podspec-nodeselector, kubernetes.podspec-tolerations, and kubernetes.podspec-affinity. Without these, KServe cannot pin pods to specific nodes through the Knative Service.
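For reference, a trimmed sketch of a KnativeServing CR with those flags plus the 30-second scale-to-zero grace period the demo relies on later; the repo’s actual CR may set more options:

kubectl apply -f - <<'EOF'
apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  version: "1.22.0"
  ingress:
    istio:
      enabled: true                              # net-istio as the network layer
  config:
    features:                                    # maps to the config-features ConfigMap
      kubernetes.podspec-nodeselector: "enabled"
      kubernetes.podspec-tolerations: "enabled"
      kubernetes.podspec-affinity: "enabled"
    autoscaler:                                  # maps to the config-autoscaler ConfigMap
      scale-to-zero-grace-period: "30s"
EOF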

Phase 4: KServe controller restart + Ready#

$ task flux:wait:kserve
Restarting KServe controller so it discovers Knative Serving...
deployment.apps/kserve-controller-manager restarted
deployment "kserve-controller-manager" successfully rolled out
Forcing kserve-resources reconcile...
helmrelease.helm.toolkit.fluxcd.io/kserve-resources condition met
Forcing kserve-runtime-configs reconcile...
helmrelease.helm.toolkit.fluxcd.io/kserve-runtime-configs condition met
Verifying ClusterServingRuntimes are installed...
NAME                                 DISABLED   MODELTYPE     CONTAINERS         AGE
kserve-huggingfaceserver             false      huggingface   kserve-container   62s
kserve-huggingfaceserver-multinode   false      huggingface   kserve-container   62s
kserve-lgbserver                     false      lightgbm      kserve-container   62s
kserve-mlserver                      false      sklearn       kserve-container   62s
kserve-paddleserver                  false      paddle        kserve-container   62s
kserve-pmmlserver                    false      pmml          kserve-container   62s
kserve-predictiveserver              false      sklearn       kserve-container   62s
kserve-sklearnserver                 false      sklearn       kserve-container   62s
kserve-tensorflow-serving            false      tensorflow    kserve-container   62s
kserve-torchserve                    false      pytorch       kserve-container   62s
kserve-tritonserver                  false      tensorrt      kserve-container   62s
kserve-xgbserver                     false      xgboost       kserve-container   62s

Controller restart unblocks ISVC reconciliation. The runtime-configs chart installs all 12 ClusterServingRuntime resources at once: sklearn (and its mlserver variant), xgboost, lightgbm, paddle, pmml, the generic predictive runtime, tensorflow, torchserve (pytorch), tritonserver (tensorrt), and the two huggingface variants.

Phase 5: deploy both InferenceServices#

$ task isvc:apply && task isvc:wait
namespace/llm created
inferenceservice.serving.kserve.io/sklearn-iris created
Waiting for InferenceService 'sklearn-iris' to be Ready (timeout 15m)...
inferenceservice.serving.kserve.io/sklearn-iris condition met
InferenceService is Ready.

$ task isvc:apply-ollama && task isvc:wait-ollama
inferenceservice.serving.kserve.io/ollama created
Ollama ISVC created. First start pulls qwen2.5:0.5b (~400 MB) - takes 2-5 min.
Waiting for Ollama InferenceService to be Ready (timeout 15m)...
inferenceservice.serving.kserve.io/ollama condition met
Ollama InferenceService is Ready.

Final state:

$ flux get helmreleases -A
NAMESPACE       NAME                    REVISION   READY   MESSAGE
cert-manager    cert-manager            v1.20.2    True    Helm install succeeded
istio-system    istio-base              1.29.2     True    Helm install succeeded
istio-system    istio-ingressgateway    1.29.2     True    Helm install succeeded
istio-system    istiod                  1.29.2     True    Helm install succeeded
kserve          kserve-crd              v0.18.0    True    Helm install succeeded
kserve          kserve-resources        v0.18.0    True    Helm install succeeded
kserve          kserve-runtime-configs  v0.18.0    True    Helm install succeeded
kube-system     metrics-server          3.13.0     True    Helm install succeeded

$ kubectl get isvc -n llm
NAME           URL                                   READY   AGE
ollama         http://ollama.llm.example.com         True    5m
sklearn-iris   http://sklearn-iris.llm.example.com   True    7m

$ kubectl top pods -n llm
NAME                                                       CPU(cores)   MEMORY(bytes)
ollama-predictor-00001-deployment-...                      1m           712Mi
sklearn-iris-predictor-00001-deployment-...                6m           197Mi

Total cluster memory at rest: ~5.9 GB on a 7.75 GB Docker allocation. The Ollama pod holds 712 MiB because the model is loaded; the sklearn pod is 197 MiB.

Live testing#

sklearn-iris prediction#

$ task test:predict
==> InferenceService URL: http://sklearn-iris.llm.example.com
==> Sending prediction (2 iris flowers)...

{
  "predictions": [
    1,
    1
  ]
}

Two iris flowers in (4 floats each: sepal length, sepal width, petal length, petal width). Two class predictions out (1 = versicolor). Sub-second response.

This is the KServe v1 protocol. The endpoint pattern is /v1/models/<name>:predict. The same URL is reachable from outside the cluster through the Istio gateway with a Host: sklearn-iris.llm.example.com header, and from inside the cluster at http://sklearn-iris-predictor.llm.svc.cluster.local.
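From outside the cluster the call looks like this; a sketch assuming you port-forward the demo’s Istio ingress gateway to localhost (the payload is the standard KServe iris example):

kubectl port-forward -n istio-system svc/istio-ingressgateway 8080:80 &

curl -s \
  -H "Host: sklearn-iris.llm.example.com" \
  -H "Content-Type: application/json" \
  -d '{"instances": [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]]}' \
  http://localhost:8080/v1/models/sklearn-iris:predict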

Ollama chat completion (OpenAI-compatible)#

$ task test:chat-ollama
==> Sending chat completion request to Ollama via in-cluster curl...
    (CPU LLM inference can take 30-90s for the first token)

{
  "id": "chatcmpl-805",
  "object": "chat.completion",
  "created": 1777822980,
  "model": "qwen2.5:0.5b",
  "system_fingerprint": "fp_ollama",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "A Kubernetes Pod is a collection of containers that serve as units within a cluster and can be easily managed by the Kubernetes API server."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 31,
    "completion_tokens": 27,
    "total_tokens": 58
  }
}

The response shape matches OpenAI’s chat-completions API exactly. Any client that speaks the OpenAI SDK works against this endpoint with no modifications, just point at http://ollama.llm.example.com/v1/chat/completions instead of https://api.openai.com/v1/chat/completions. Latency on CPU for a 0.5B model is ~5-10s for short responses, ~30-90s for longer ones.
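The raw request behind that task target is an ordinary OpenAI-shaped POST. A sketch using the in-cluster address, run from any pod inside the cluster (the prompt is illustrative):

# e.g. from a throwaway pod: kubectl run curl --rm -it --image=curlimages/curl -n llm -- sh
curl -s http://ollama-predictor.llm.svc.cluster.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5:0.5b",
        "messages": [
          {"role": "user", "content": "What is a Kubernetes Pod? Answer in one sentence."}
        ]
      }'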

Scale-to-zero#

The sklearn-iris ISVC has minReplicas: 0. After 30 seconds of no traffic (the scale-to-zero-grace-period we set in the KnativeServing CR) the predictor pod terminates:

$ task test:scale-to-zero
18:50:53 active=1 sklearn-iris-predictor-00002-deployment-... 2/2 Running
18:51:29 active=0 (no pods)
SCALED-TO-ZERO at 18:51:29

The Ollama ISVC has minReplicas: 1 so its pod stays running. This is the right pattern for LLM serving: cold-starting a 7B-parameter model is not something you want to do per-request.

Cold start#

Send a prediction request to the now-zero-pods sklearn-iris and Knative’s Activator buffers it, scales the deployment from 0 to 1, waits for the pod to be Ready, and forwards the buffered request:

$ task test:cold-start
==> Confirming pod count before request:
(none - scaled to zero)

==> Sending prediction request and timing it...
{
  "predictions": [1]
}
real    0m12.072s
user    0m0.000s
sys     0m0.000s

12 seconds end-to-end for a cold-start prediction. That includes pod scheduling, image pull from local cache, storage-initializer downloading the (tiny) sklearn model, kserve-container startup, model load, and the actual inference. For a small classifier this is fast; for an LLM it would be measured in minutes (image pull alone for vLLM is several GB).

For comparison, a warm request through the same in-cluster path completes in ~3.5 seconds (mostly kubectl run overhead; the actual inference is sub-second).

What KServe does well#

| Strength | Detail |
|----------|--------|
| Single CRD, many runtimes | One InferenceService shape works for sklearn, XGBoost, PyTorch, TensorFlow, ONNX, Triton, HuggingFace/vLLM. Switching frameworks is a modelFormat.name change. |
| OpenAI-compatible LLM API | The HF runtime exposes /openai/v1/chat/completions and /openai/v1/completions out of the box. SDKs that speak OpenAI work unchanged. |
| Real scale-to-zero | Knative Serving’s Activator buffers cold-start traffic. No requests dropped while the pod scales from 0 to 1. |
| Pluggable storage | storageUri accepts gs://, s3://, https://, pvc://, and hf://. The storage-initializer init-container handles credentials via Service Account annotations. |
| Multiple model formats per cluster | Built-in ClusterServingRuntime resources for 12+ frameworks. You can add your own. |
| Custom predictor escape hatch | When a built-in runtime doesn’t fit (or has no arm64 image), drop in any container. KServe just creates the Knative Service around it. |
| InferenceGraph CRD | Chain pre-processor + model + post-processor in a single graph. Useful for ensembles or when you need a transformer in front of a model. |
| Observability for free | Prometheus metrics exposed on /metrics from the queue-proxy. RPS, latency p50/p95/p99, request volume per revision. |
| Traffic splitting | Knative gives you traffic blocks: send 90% to revision N, 10% to revision N+1. Standard canary pattern, no extra service mesh CRDs needed (see the sketch below). |
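The canary flow on an InferenceService is a single field. A hedged sketch of how it could look for the iris model if a second model version existed (the v2 bucket path is hypothetical):

kubectl apply -f - <<'EOF'
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: llm
spec:
  predictor:
    canaryTrafficPercent: 10           # 10% to this spec, 90% stays on the last Ready revision
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://my-bucket/models/iris/v2"   # hypothetical new model version
EOF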

What KServe does less well#

| Limitation | Detail |
|------------|--------|
| Stack heaviness | KServe Serverless mode requires Knative + Istio + cert-manager + the KServe stack itself. That’s 30+ pods of control plane before you serve a single model. RawDeployment mode trades scale-to-zero for a lighter stack. |
| Three Helm charts | As of v0.18 you install kserve-crd, kserve-resources, and kserve-runtime-configs separately. Easy to skip one and not realize it until your ISVC fails to reconcile. |
| Controller startup ordering | The KServe controller caches Knative availability at startup. Install Knative after the controller and reconciliation silently breaks. The fix is a rollout restart. |
| arm64 LLM image gap | kserve/huggingfaceserver is amd64-only at v0.18.0. On Apple Silicon the built-in vLLM runtime hits ImagePullBackOff. The custom-predictor pattern works around this but loses the KServe-native model loader. |
| min-scale annotation override | Annotations on the ISVC metadata are silently overridden by KServe’s component defaults. Use spec.predictor.minReplicas instead. The docs do not flag this clearly. |
| GPU scheduling complexity | KServe doesn’t solve GPU sharing or fractional GPU scheduling. You need Volcano, Kueue, or NVIDIA’s GPU Operator on top. |
| Documentation lag | The docs sometimes lag the code; v0.17, v0.18, and nightly examples are mixed across the website. The kserve-deps.env file in the source is the most reliable version-pin reference. |
| Knative + Istio sticker shock | If you don’t already run Knative or Istio, adopting them just for KServe is a big lift. RawDeployment mode is more accessible. |

When to use KServe#

Use it when:#

  • You’re serving multiple model frameworks and want one platform pattern instead of one per framework
  • You need scale-to-zero for predictive ML to control cost on rarely-used models
  • You want OpenAI-compatible LLM serving without writing your own gateway in front of vLLM
  • You’re already running Knative or Istio - the marginal cost is much lower
  • You need traffic splitting / canary rollouts for model versions
  • You’re on amd64 / GPU nodes - the built-in HF runtime is the smoothest path

Consider alternatives when:#

  • You only serve LLMs and want the lightest stack - run Ollama, vLLM, or TGI directly as a Deployment+Service. No KServe, no Knative.
  • You’re on Apple Silicon for development and want the built-in HF runtime - wait for arm64 images, or use the custom-predictor pattern with Ollama like this demo does.
  • You don’t want Knative or Istio - try Seldon Core v2 (which uses Kafka and Envoy directly) or BentoML.
  • Your scale is one model, one team - the value of a multi-runtime, multi-team platform is small. A Flask app behind a Service is fine.
  • You need fractional GPU scheduling - layer Volcano or Kueue on top, or use Run:ai as the scheduler.

Production considerations#

The demo runs on a laptop with everything pinned to a single inference node. A real platform looks different in a few key ways.

GPU nodepools#

For LLM serving in production you want a GPU nodepool with nvidia.com/gpu resource limits, the NVIDIA GPU Operator installed, and the predictor pod requesting nvidia.com/gpu: "1". The KServe HF runtime on GPU uses vLLM’s PagedAttention and KV-cache offload, which is dramatically faster than CPU inference. A 7B-parameter model on an A10 typically serves 50-200 tokens/sec; on CPU you’re looking at 1-5 tokens/sec.

Model caching#

Cold-starting an LLM pod re-pulls the model (or downloads it from hf://). KServe v0.17+ added the LocalModelCache CRD which caches models on each node’s local disk so the second cold-start on the same node is fast. For demo purposes, persist models in a PVC mounted at /root/.ollama (Ollama) or /mnt/models (HF runtime).
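A sketch of the PVC variant for the Ollama predictor; the claim name and size are assumptions, and the volume wiring goes straight into the predictor spec because the custom predictor is a plain pod spec:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models                  # hypothetical claim
  namespace: llm
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 5Gi
EOF

# ...then in the ollama InferenceService:
#   spec:
#     predictor:
#       volumes:
#         - name: models
#           persistentVolumeClaim:
#             claimName: ollama-models
#       containers:
#         - name: kserve-container
#           volumeMounts:
#             - name: models
#               mountPath: /root/.ollama    # Ollama's model store survives pod restarts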

Autoscaling on the right metric#

Knative’s default autoscaler scales on concurrency (in-flight requests). For LLM serving this is the wrong metric - one user can hold a connection open for 30s while generating tokens. Switch to the HPA-class autoscaler and scale on token throughput or queue depth via Prometheus Adapter.

Authentication#

KServe doesn’t ship auth. Front it with oauth2-proxy (which also has an arm64 image, by the way) for OIDC, or use the Envoy AI Gateway integration that KServe added in v0.18 for AI-aware routing and rate limiting.

Observability#

The queue-proxy exposes /metrics on port 9090. Scrape it with Prometheus, build dashboards on revision_request_count, revision_request_latencies, and request_concurrency. The KServe controller exposes its own metrics on port 8080 - reconcile loop latency, controller errors, ISVC count by status.

Cost#

A laptop demo costs nothing. A production cluster running GPU pods 24/7 is expensive. Three knobs:

  1. minReplicas: 0 for predictive ML - even small CPU pods running 24/7 add up. Scale to zero between requests.
  2. GPU node pools with autoscaling - cluster-autoscaler or Karpenter scaling the node pool itself, not just the pods.
  3. Right-size the model - serving Llama-70B for a chatbot that mostly asks “what’s the capital of France?” is wasteful. Smaller distilled models often suffice.

Conclusion#

KServe is now a CNCF incubating project, and after running it through a real laptop demo it earns the badge. The InferenceService CRD is the right level of abstraction. Predictive ML and GenAI through the same controller, the same observability, the same autoscaling primitives - that’s exactly what a Kubernetes-native ML platform should look like.

The arm64 gap on the HF runtime is real but the custom-predictor pattern works around it cleanly. The three controller-restart and chart-split gotchas cost a debugging session each but are all in the Taskfile now so the next person doesn’t trip on them. Total time from git clone to live LLM chat: ~20 minutes.

If you’re standing up an ML serving platform on Kubernetes in 2026 and you’re not already committed to Seldon or BentoML, KServe is the obvious starting point. Especially now that it has the CNCF stamp and the LLM features are first-class.

The full demo (Flux manifests, Taskfile, Mermaid diagrams, ISVC manifests for sklearn, Ollama, and the reference HF runtime config for amd64 hosts) is at github.com/nicknikolakakis/srekubecraft-demo/tree/main/kserve.
