Monitoring
The OpenClaw Operator exposes Prometheus metrics and ships pre-built Grafana dashboards for fleet-wide and per-instance observability.
Prometheus Metrics
The operator exposes metrics at /metrics on its metrics service. Key metrics include:
- openclaw_reconcile_total: Total reconciliation attempts by result (success/error)
- openclaw_reconcile_duration_seconds: Histogram of reconciliation durations
- openclaw_instance_phase: Current phase gauge per instance
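If Prometheus is not managed by the Prometheus Operator, the /metrics endpoint can be scraped with a static job. The fragment below is a minimal sketch of a standard Prometheus scrape_configs entry; the job name, Service DNS name, namespace, and port are assumptions for illustration, and the real endpoint may also require HTTPS or authentication.

```yaml
# Sketch of a plain Prometheus scrape job for the operator's metrics service.
scrape_configs:
  - job_name: openclaw-operator          # hypothetical job name
    metrics_path: /metrics               # path stated in this document
    scrape_interval: 15s
    static_configs:
      - targets:
          # Hypothetical Service name, namespace, and port; replace with
          # the values from your own installation.
          - openclaw-operator-metrics.openclaw-system.svc:8080
```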
ServiceMonitor
Enable ServiceMonitor for automatic Prometheus discovery:
spec:
  observability:
    metrics:
      enabled: true
      serviceMonitor:
        enabled: true
        interval: 15s
        labels:
          release: prometheus
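For orientation, the settings above map naturally onto a Prometheus Operator ServiceMonitor like the one sketched here. This is an illustration rather than the object the operator actually renders: the resource name, selector labels, and port name are hypothetical. The labels entry matters because a Prometheus instance typically selects ServiceMonitors by label, for example through a serviceMonitorSelector matching release: prometheus.

```yaml
# Sketch only: the operator creates the real object.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: openclaw-example                 # hypothetical name
  labels:
    release: prometheus                  # copied from serviceMonitor.labels
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: openclaw   # hypothetical Service label
  endpoints:
    - port: metrics                      # hypothetical port name
      path: /metrics
      interval: 15s                      # copied from serviceMonitor.interval
```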
Prometheus Alerts
The operator can deploy a PrometheusRule with 7 pre-configured alerts:
| Alert | Severity | Description |
|---|---|---|
| OpenClawReconcileErrors | warning | Reconciliation failures increasing |
| OpenClawInstanceDegraded | warning | Instance in Failed/Degraded for 5+ minutes |
| OpenClawSlowReconciliation | warning | p99 reconciliation > 30 seconds |
| OpenClawPodCrashLooping | critical | Pod restarting 2+ times in 10 minutes |
| OpenClawPodOOMKilled | critical | Container killed by OOM |
| OpenClawPVCNearlyFull | warning | PVC usage > 80% |
| OpenClawAutoUpdateRollback | warning | Auto-update rollback triggered |
Each alert links to a dedicated runbook for diagnosis and mitigation.
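To make the table concrete, here is a sketch of how a rule such as OpenClawSlowReconciliation could be written as a PrometheusRule. It is not the rule the operator ships. The resource name, the 5m rate window, the for duration, and the annotation text are assumptions; only the metric name, the 30-second p99 threshold, and the severity come from this page. The _bucket suffix is the standard series name for Prometheus histograms.

```yaml
# Sketch only: the shipped PrometheusRule may use different expressions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: openclaw-alerts-sketch           # hypothetical name
  labels:
    release: prometheus                  # assumed to match your Prometheus rule selector
spec:
  groups:
    - name: openclaw.sketch
      rules:
        - alert: OpenClawSlowReconciliation
          # p99 of reconciliation duration over an assumed 5-minute window
          expr: |
            histogram_quantile(
              0.99,
              sum by (le) (rate(openclaw_reconcile_duration_seconds_bucket[5m]))
            ) > 30
          for: 5m                        # assumed hold time
          labels:
            severity: warning
          annotations:
            summary: p99 reconciliation duration is above 30 seconds
```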
Grafana Dashboards
The operator ships two Grafana dashboards as ConfigMaps with the grafana_dashboard label for automatic sidecar discovery:
Fleet Overview Dashboard
Provides a bird’s-eye view of all managed instances:
- Reconciliation success/error rates
- Reconciliation duration percentiles
- Instance count by phase
- Workqueue depth and processing rate
- Auto-update status across the fleet
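The panels listed above can be approximated with queries like the ones in this Prometheus rule-file sketch, written as recording rules so they can be reused outside Grafana. The recording rule names, the result label name, and the 5m windows are assumptions. workqueue_depth is the standard controller-runtime workqueue metric and is assumed to be exposed by the operator. The dashboard's actual queries may differ.

```yaml
# Sketch of fleet-level recording rules (Prometheus rule-file format).
groups:
  - name: openclaw-fleet-sketch
    rules:
      # Share of reconciliations that ended in error (label name assumed).
      - record: openclaw:reconcile_error_ratio:rate5m
        expr: |
          sum(rate(openclaw_reconcile_total{result="error"}[5m]))
          /
          sum(rate(openclaw_reconcile_total[5m]))
      # Median reconciliation duration.
      - record: openclaw:reconcile_duration_seconds:p50_5m
        expr: |
          histogram_quantile(
            0.5,
            sum by (le) (rate(openclaw_reconcile_duration_seconds_bucket[5m]))
          )
      # Total items waiting in controller workqueues.
      - record: openclaw:workqueue_depth:sum
        expr: sum(workqueue_depth)
```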
Instance Detail Dashboard
Per-instance deep dive (select instance via variable):
- CPU and memory usage vs limits
- Storage usage and PVC capacity
- Network I/O
- Pod restart count and health
- Container-level resource breakdown (main, chromium, ollama)
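Both dashboards are delivered as ConfigMaps carrying the grafana_dashboard label, which is how the Grafana sidecar finds them. The sketch below shows the general shape of such a ConfigMap. The name, the label value "1", the data key, and the stub dashboard JSON are placeholders for illustration, not the operator's actual output.

```yaml
# Sketch only: shape of a sidecar-discovered dashboard ConfigMap.
apiVersion: v1
kind: ConfigMap
metadata:
  name: openclaw-fleet-overview-sketch   # hypothetical name
  labels:
    grafana_dashboard: "1"               # label key from this page; value assumed
data:
  # Placeholder dashboard body; the shipped JSON is much larger.
  fleet-overview.json: |
    {
      "title": "OpenClaw Fleet Overview (sketch)",
      "panels": []
    }
```

If the sidecar in your Grafana deployment is configured with a different label key or value, the ConfigMap labels need to match that configuration for discovery to work.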
Logging
Configure structured JSON logging for the operator and managed instances:
spec:
  observability:
    logging:
      level: info
      format: json
Log levels: debug, info, warn, error.