Open-Sourcing the OpenClaw Kubernetes Operator
Two weeks ago, OpenClaw had 9,000 GitHub stars. Today it has 183,000. In between, a cottage industry appeared. ClawSimple, Kilo Claw, StartClaw, ShipClaw, GetClaw.ai, LobsterLair. One comparison site counts 33 providers and the list keeps growing. OpenClaw wrappers are invading TrustMRR. DigitalOcean shipped a 1-Click Deploy. Cloudflare adapted it into a Workers runtime. People are buying Mac Minis to run personal AI agents at home.
If you want to self-host in the cloud, the most common setup is a $5 Hetzner VPS with 2 vCPUs and 4 GB RAM, Docker Compose, and an API key. Run the onboarding wizard, point it at Anthropic or OpenRouter, connect Telegram. Done in thirty minutes. Or buy a Mac Mini, run it under your desk, and call it your personal JARVIS. The M4 draws seven watts at idle and doubles as a local inference box if you spring for the 64GB Pro model.
Both of these work for a single agent. But when I started building OpenClaw.rocks, the goal was to offer the most reliable and secure way to host OpenClaw at scale, while keeping it effortless for the user. That required a different foundation.
Why I chose Kubernetes
I used to run blockchain infrastructure at Binance, including securing the Bitcoin nodes. When your job is keeping high-value workloads isolated, observable, and recoverable at scale, you develop strong opinions about how infrastructure should work. Kubernetes is what I trust for that.
OpenClaw is a single-user application. It’s a personal assistant, not a multi-tenant platform. If you want to run agents for ten people, you need ten instances. For a hundred people, a hundred instances. Each with its own config, its own secrets, its own storage, its own network boundaries. The isolation requirements alone rule out anything less than proper container orchestration.
I’m not going to argue that Kubernetes is simple. It isn’t. For a single agent, it’s absurd overkill. But for running many agents for many people, it solves problems that nothing else solves as well. And I think any company that ends up running OpenClaw agents at scale will arrive at the same conclusion.
Isolation that’s actually enforced. Each agent runs in its own namespace with a NetworkPolicy that defaults to deny-all. Agent A can’t talk to Agent B. Agent B can’t reach Agent A’s secrets. This isn’t a convention or a best practice. It’s enforced by the container runtime and the CNI. On a shared VPS with Docker Compose, network isolation between containers requires manual iptables rules that nobody maintains.
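For reference, default-deny is only a few lines of standard Kubernetes. A minimal sketch, with illustrative names rather than the operator's actual rendered output:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all          # illustrative name
  namespace: agent-a              # illustrative; one namespace per agent
spec:
  podSelector: {}                 # applies to every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
  # no ingress or egress rules follow, so all traffic is denied until explicitly allowed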
Resource limits that prevent cascading failures. An OpenClaw agent with browser automation can consume 3 CPU cores and 6 GB of memory if you let it. On a VPS with four agents, one runaway Chromium process kills the other three. Kubernetes enforces CPU and memory limits per container. One agent hitting its ceiling doesn’t affect its neighbors.
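The enforcement is the standard Kubernetes resources stanza on the agent's container. The numbers below are illustrative, not the operator's defaults:

# Container spec fragment: requests are what the scheduler reserves,
# limits are the hard ceiling the runtime enforces.
resources:
  requests:
    cpu: "500m"        # illustrative values
    memory: "1Gi"
  limits:
    cpu: "2000m"       # throttled above this
    memory: "4Gi"      # exceeding this OOM-kills the one container, not its neighbors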
Self-healing without SSH. When a VPS process crashes, something needs to notice and restart it. systemd does this, but only for the host. Docker Compose has restart policies, but they don’t cover the ten other things that can go wrong: OOM kills, node failures, storage issues. Kubernetes restarts failed containers, reschedules pods when nodes die, and runs health probes that detect application-level problems, not just process exits.
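Those application-level checks are ordinary liveness and readiness probes. A sketch, with an assumed health endpoint and port, since the real path depends on the OpenClaw build:

# Container spec fragment. The liveness probe restarts a wedged container;
# the readiness probe takes it out of rotation until it responds again.
livenessProbe:
  httpGet:
    path: /healthz          # assumed endpoint, illustrative only
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5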
Scaling without guessing. We run agent workloads on a dedicated node pool. When demand increases, the cluster autoscaler adds nodes. When it decreases, nodes are drained and removed. We don’t maintain a fleet of pre-provisioned VPS instances hoping we sized it right. The infrastructure matches the actual load.
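Keeping agents on that dedicated pool is plain scheduling config. The node label and taint below are illustrative:

# Pod spec fragment: schedule only onto the agent pool and tolerate its taint,
# so the cluster autoscaler can grow and shrink that pool independently.
nodeSelector:
  workload: openclaw-agents           # illustrative node label
tolerations:
  - key: "openclaw.rocks/agents"      # illustrative taint key
    operator: "Exists"
    effect: "NoSchedule"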
Declarative state with no drift. An agent’s entire configuration lives in one custom resource: the model, the channels, the resource limits, the network rules, the storage, the security context. There’s no SSH history to reconstruct, no manual edits to track, no configuration drift between what you think is running and what’s actually running.
None of these matter for one agent on one machine. All of them matter when you’re responsible for other people’s agents running reliably.
The operator
Kubernetes gives you the primitives. An operator is what makes them usable.
Without an operator, deploying an OpenClaw instance on Kubernetes means writing eleven resources by hand: Deployment, Service, ConfigMap, PVC, ServiceAccount, Role, RoleBinding, NetworkPolicy, PodDisruptionBudget, Ingress, ServiceMonitor. With an operator, it’s one:
apiVersion: openclaw.openclaw.io/v1alpha1
kind: OpenClawInstance
metadata:
  name: my-agent
spec:
  envFrom:
    - secretRef:
        name: my-api-keys
  storage:
    persistence:
      enabled: true
      size: 10Gi
The operator watches for this custom resource and creates everything else. It’s a control loop: on every change to any owned resource, and at least once every five minutes as a safety net, it compares desired state to actual state and reconciles the difference. If someone deletes a NetworkPolicy, it comes back. If a Deployment drifts, it’s corrected. Delete the custom resource, and owner references cascade the cleanup. No orphaned Services, no leftover PVCs.
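The cleanup cascade is plain Kubernetes garbage collection via owner references. Roughly what gets stamped on each child resource, with illustrative values:

# metadata fragment on a generated child (Deployment, Service, PVC, ...)
ownerReferences:
  - apiVersion: openclaw.openclaw.io/v1alpha1
    kind: OpenClawInstance
    name: my-agent
    uid: "0f3c..."                 # illustrative; set by the API server
    controller: true
    blockOwnerDeletion: true
# deleting the OpenClawInstance garbage-collects everything carrying this reference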
Today we’re open-sourcing it: github.com/OpenClaw-rocks/k8s-operator.
Security by default, not by checklist
SecurityScorecard found 135,000 OpenClaw instances exposed to the public internet last week. CVE-2026-25253 demonstrated one-click remote code execution through gateway token exfiltration. Gartner recommended organizations block it entirely. 341 malicious skills were found in the ClawHub registry.
This is the reality of running AI agents in 2026. OpenClaw’s default configuration binds to 0.0.0.0 with no authentication. On a VPS without a properly configured firewall, you’re one port scan away from giving a stranger shell access to your server.
The operator takes the opposite approach. Security is structural, not optional:
- Non-root by default. UID 1000, all Linux capabilities dropped, seccomp RuntimeDefault. A validating webhook rejects any spec that sets runAsUser: 0. You'd have to remove the webhook to run as root. (A sketch of what these defaults look like in a rendered pod spec follows this list.)
- Network isolation by default. Default-deny NetworkPolicy on every instance. Ingress: same namespace only. Egress: DNS and HTTPS only. Everything else is blocked unless you explicitly open it.
- Least-privilege RBAC. Each instance gets its own ServiceAccount with a Role that only grants get and watch on its own ConfigMap. An agent can't read another agent's secrets, config, or state.
- The operator itself runs as UID 65532 (distroless nonroot), read-only root filesystem, all caps dropped, HTTP/2 disabled to mitigate CVE-2023-44487.
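Here is a sketch of what those pod-level defaults translate to in the rendered pod spec. The values follow the list above; the container name and exact field layout are illustrative, not the operator's literal output:

# Pod spec fragment matching the defaults described above
securityContext:                  # pod level
  runAsNonRoot: true
  runAsUser: 1000
  seccompProfile:
    type: RuntimeDefault
containers:
  - name: openclaw                # illustrative container name
    securityContext:
      capabilities:
        drop: ["ALL"]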
All of this is on by default. You get it without thinking about it.
Browser automation as a sidecar
OpenClaw agents can browse the web. On a VPS, that means running a Chromium process alongside the agent and hoping they don’t fight over resources. The operator handles this as a proper sidecar:
spec:
  chromium:
    enabled: true
    resources:
      requests:
        cpu: "250m"
        memory: "512Mi"
      limits:
        cpu: "1000m"
        memory: "2Gi"
The operator adds a Browserless Chromium container to the pod, wires up Chrome DevTools Protocol on port 9222, injects CHROMIUM_URL=ws://localhost:9222 into the main container, and gives Chromium its own security context (UID 999, caps dropped), its own resource limits, and a memory-backed /dev/shm. The two containers communicate over localhost inside the pod. No network hop, no extra Service, no security exposure. On OpenClaw.rocks, we enable the Chromium sidecar by default for every instance.
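For a sense of what the injection looks like, here is a sketch of the relevant pod spec fragments. The image reference is an assumption, not the operator's pinned image; the rest follows the description above:

containers:
  - name: openclaw
    env:
      - name: CHROMIUM_URL
        value: "ws://localhost:9222"            # injected by the operator
  - name: chromium
    image: ghcr.io/browserless/chromium:latest  # assumed image reference
    ports:
      - containerPort: 9222                     # Chrome DevTools Protocol
    securityContext:
      runAsUser: 999
      capabilities:
        drop: ["ALL"]
    volumeMounts:
      - name: dshm
        mountPath: /dev/shm
volumes:
  - name: dshm
    emptyDir:
      medium: Memory                            # memory-backed /dev/shm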
What’s in the repo
Written in Go 1.24 with controller-runtime (Kubebuilder pattern). Apache 2.0 licensed.
- Full CRD with 127KB of OpenAPI validation schema
- Helm chart (also as OCI artifact on GHCR)
- Kustomize overlays for those who prefer it
- Grafana dashboard and Prometheus alerts in docs/monitoring/
- E2E tests on Kind in CI
- Multi-arch builds (amd64/arm64)
Install:
helm install openclaw-operator \
  oci://ghcr.io/openclaw-rocks/charts/openclaw-operator \
  --namespace openclaw-operator-system \
  --create-namespace
Deploy an agent:
apiVersion: openclaw.openclaw.io/v1alpha1
kind: OpenClawInstance
metadata:
  name: my-agent
spec:
  config:
    raw:
      agents:
        defaults:
          model:
            primary: "anthropic/claude-sonnet-4-20250514"
  envFrom:
    - secretRef:
        name: my-api-keys
  chromium:
    enabled: true
  storage:
    persistence:
      enabled: true
Browser automation, persistent storage, network isolation, health monitoring, automatic config rollouts. One resource. kubectl apply.
Why open-source
I built this operator to solve my own problem. I run a hosting platform for OpenClaw agents, and I needed production-grade Kubernetes tooling for it. The operator is the result of that work.
But I also think that any company running OpenClaw agents at scale will end up on Kubernetes, and they’ll face the same problems I did: the security defaults, the NetworkPolicy wiring, the Chromium sidecar, the config rollouts. The ecosystem is two weeks old and already fragmented. Everyone is solving these problems independently.
The cost of building software is approaching zero. The operator is not my moat. If anything is, it’s the brand and the trust I build by sharing work like this. And if not even that holds true, I’m having fun building, writing, and sharing it. That’s enough. Keeping the operator proprietary would mean every infrastructure team rediscovers the same gotchas I did. That’s waste, not competitive advantage.
We run this operator in production. Every agent on OpenClaw.rocks goes through it.
The code is at github.com/OpenClaw-rocks/k8s-operator. Issues and PRs welcome.
If you’d rather not operate it yourself, that’s what OpenClaw.rocks is for.