Monitoring

The OpenClaw Operator exposes Prometheus metrics and ships pre-built Grafana dashboards for fleet-wide and per-instance observability.

Prometheus Metrics

The operator exposes metrics at /metrics on its metrics service. Key metrics include:

openclaw_reconcile_total - Total reconciliation attempts by result (success/error)
openclaw_reconcile_duration_seconds - Histogram of reconciliation durations
openclaw_instance_phase - Current phase gauge per instance
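As a rough illustration of how these metrics can be combined, the sketch below defines two Prometheus recording rules. It assumes that the result dimension of openclaw_reconcile_total is exposed as a label named result with the values success and error; check the label names on the /metrics endpoint before relying on it. The group and rule names are hypothetical.

```yaml
# Sketch only: recording rules built on the metrics listed above.
# Assumption: the outcome is exposed as a label called "result".
groups:
  - name: openclaw-examples                      # hypothetical group name
    rules:
      # Fraction of reconciliations that ended in error over the last 5 minutes.
      - record: openclaw:reconcile_error_ratio:rate5m        # hypothetical rule name
        expr: |
          sum(rate(openclaw_reconcile_total{result="error"}[5m]))
          /
          sum(rate(openclaw_reconcile_total[5m]))
      # Mean reconciliation duration, derived from the histogram's
      # standard _sum and _count series.
      - record: openclaw:reconcile_duration_seconds:mean5m   # hypothetical rule name
        expr: |
          sum(rate(openclaw_reconcile_duration_seconds_sum[5m]))
          /
          sum(rate(openclaw_reconcile_duration_seconds_count[5m]))
```

The error ratio is undefined (no data) while no reconciliations occur in the window, which is usually acceptable for dashboards but worth remembering when alerting on it.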

ServiceMonitor

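Enable the ServiceMonitor so that Prometheus discovers the metrics endpoint automatically. The labels must match the serviceMonitorSelector of your Prometheus instance: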

spec:
  observability:
    metrics:
      enabled: true
      serviceMonitor:
        enabled: true
        interval: 15s
        labels:
          release: prometheus

Prometheus Alerts

The operator can deploy a PrometheusRule with seven pre-configured alerts:

Alert	Severity	Description
`OpenClawReconcileErrors`	warning	Reconciliation failures increasing
`OpenClawInstanceDegraded`	warning	Instance in Failed or Degraded phase for 5+ minutes
`OpenClawSlowReconciliation`	warning	p99 reconciliation duration > 30 seconds
`OpenClawPodCrashLooping`	critical	Pod restarting 2+ times in 10 minutes
`OpenClawPodOOMKilled`	critical	Container killed by OOM
`OpenClawPVCNearlyFull`	warning	PVC usage > 80%
`OpenClawAutoUpdateRollback`	warning	Auto-update rollback triggered

Each alert links to a dedicated runbook for diagnosis and mitigation.
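If you want an additional alert alongside the shipped ones, a custom PrometheusRule can follow the same pattern. The sketch below is not the operator's own rule definition, which is not reproduced here; it approximates a slow-reconciliation alert from the histogram metric. The resource name, namespace, alert name, for duration, and runbook URL are all placeholders, and the release label is assumed to match your Prometheus rule selector.

```yaml
# Sketch only: a custom alert modeled on the descriptions in the table above.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: openclaw-custom-alerts          # hypothetical name
  namespace: monitoring                 # hypothetical namespace
  labels:
    release: prometheus                 # assumed to match your Prometheus ruleSelector
spec:
  groups:
    - name: openclaw-custom
      rules:
        - alert: OpenClawSlowReconciliationCustom   # hypothetical alert name
          # p99 of reconciliation duration over 5 minutes, in seconds.
          expr: |
            histogram_quantile(
              0.99,
              sum by (le) (rate(openclaw_reconcile_duration_seconds_bucket[5m]))
            ) > 30
          for: 10m                      # assumed hold time, tune as needed
          labels:
            severity: warning
          annotations:
            summary: p99 reconciliation duration has exceeded 30 seconds
            runbook_url: https://example.com/runbooks/slow-reconciliation   # placeholder URL
```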

Grafana Dashboards

The operator ships two Grafana dashboards as ConfigMaps. Each carries the grafana_dashboard label so that the Grafana sidecar discovers and loads it automatically:

Fleet Overview Dashboard

Provides a bird’s-eye view of all managed instances:

Reconciliation success/error rates
Reconciliation duration percentiles
Instance count by phase
Workqueue depth and processing rate
Auto-update status across the fleet

Instance Detail Dashboard

Per-instance deep dive (select the instance with the dashboard variable):

CPU and memory usage vs limits
Storage usage and PVC capacity
Network I/O
Pod restart count and health
Container-level resource breakdown (main, chromium, ollama)
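The same sidecar discovery mechanism can pick up dashboards of your own. The ConfigMap below is a minimal sketch of that pattern, not one of the shipped dashboards. The name, namespace, and dashboard JSON are hypothetical, and the label value "1" is an assumption: some Grafana sidecar configurations match on the label key alone, others require a specific value.

```yaml
# Sketch only: a custom dashboard ConfigMap discoverable by the Grafana sidecar.
apiVersion: v1
kind: ConfigMap
metadata:
  name: openclaw-custom-dashboard       # hypothetical name
  namespace: monitoring                 # hypothetical namespace
  labels:
    grafana_dashboard: "1"              # assumed value; check your sidecar settings
data:
  # Each key becomes a dashboard file; the value is Grafana dashboard JSON.
  openclaw-custom.json: |
    {
      "title": "OpenClaw custom (example)",
      "uid": "openclaw-custom-example",
      "panels": []
    }
```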

Logging

Configure structured JSON logging for the operator and managed instances:

spec:
  observability:
    logging:
      level: info
      format: json

Supported log levels: debug, info, warn, error.