Monitoring

The OpenClaw Operator exposes Prometheus metrics and ships pre-built Grafana dashboards for fleet-wide and per-instance observability.

Prometheus Metrics

The operator exposes metrics at /metrics on its metrics service. Key metrics include:

  • openclaw_reconcile_total - Total reconciliation attempts by result (success/error)
  • openclaw_reconcile_duration_seconds - Histogram of reconciliation durations
  • openclaw_instance_phase - Current phase gauge per instance
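
As an illustration of how these metrics can be consumed, the sketch below defines Prometheus recording rules for the reconciliation rate and error ratio. It is not shipped by the operator. The group and rule names are hypothetical, and the result label with its error value is an assumption taken from the metric description above, so check the actual label names on /metrics before using it.

```yaml
# Hypothetical recording rules, not part of the operator.
# Assumption: openclaw_reconcile_total carries a `result` label whose
# values include "error"; verify against the operator's /metrics output.
groups:
  - name: openclaw-operator.rules        # illustrative group name
    rules:
      # Reconciliations per second over the last 5 minutes, split by result.
      - record: openclaw:reconcile:rate5m
        expr: sum by (result) (rate(openclaw_reconcile_total[5m]))
      # Fraction of reconciliations that ended in error over the last 5 minutes.
      - record: openclaw:reconcile_error_ratio:rate5m
        expr: |
          sum(rate(openclaw_reconcile_total{result="error"}[5m]))
          /
          sum(rate(openclaw_reconcile_total[5m]))
```

Recording the ratio once keeps dashboard panels and alert expressions short and ensures they agree on the same definition.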

ServiceMonitor

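Enable the ServiceMonitor so that a Prometheus Operator installation discovers the metrics endpoint automatically. The labels must match the ServiceMonitor selector of your Prometheus instance: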

spec:
  observability:
    metrics:
      enabled: true
      serviceMonitor:
        enabled: true
        interval: 15s
        labels:
          release: prometheus
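
The release: prometheus label only has an effect if the Prometheus instance selects ServiceMonitors by that label. The sketch below shows the matching side on a Prometheus Operator Prometheus resource. The resource name and namespace are hypothetical, and it assumes the labels configured above are copied onto the generated ServiceMonitor.

```yaml
# Hypothetical Prometheus resource; name and namespace are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  # Only ServiceMonitors carrying this label are scraped.
  # It must match the labels set under spec.observability.metrics.serviceMonitor.
  serviceMonitorSelector:
    matchLabels:
      release: prometheus
  # An empty selector matches ServiceMonitors in every namespace.
  serviceMonitorNamespaceSelector: {}
```

If targets do not appear in Prometheus, comparing these selectors with the labels on the ServiceMonitor is usually the first thing to check.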

Prometheus Alerts

The operator can deploy a PrometheusRule with 7 pre-configured alerts:

Alert                       Severity  Description
OpenClawReconcileErrors     warning   Reconciliation failures increasing
OpenClawInstanceDegraded    warning   Instance in Failed/Degraded for 5+ minutes
OpenClawSlowReconciliation  warning   p99 reconciliation > 30 seconds
OpenClawPodCrashLooping     critical  Pod restarting 2+ times in 10 minutes
OpenClawPodOOMKilled        critical  Container killed by OOM
OpenClawPVCNearlyFull       warning   PVC usage > 80%
OpenClawAutoUpdateRollback  warning   Auto-update rollback triggered

Each alert links to a dedicated runbook for diagnosis and mitigation.
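
To give a sense of what such a rule looks like, here is a minimal sketch of a PrometheusRule expressing the slow-reconciliation condition (p99 above 30 seconds). It is not the rule the operator deploys. The resource name, the 5m rate window, and the for duration are assumptions chosen for illustration.

```yaml
# Illustrative only; the operator's shipped rule may differ.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: openclaw-example-alerts          # hypothetical name
  labels:
    release: prometheus                  # must match your Prometheus ruleSelector
spec:
  groups:
    - name: openclaw.example
      rules:
        - alert: OpenClawSlowReconciliation
          # p99 computed from the histogram buckets of the duration metric.
          expr: |
            histogram_quantile(
              0.99,
              sum by (le) (rate(openclaw_reconcile_duration_seconds_bucket[5m]))
            ) > 30
          for: 5m                        # assumed hold time before firing
          labels:
            severity: warning
          annotations:
            summary: p99 reconciliation duration is above 30 seconds
```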

Grafana Dashboards

The operator ships two Grafana dashboards as ConfigMaps. Each ConfigMap carries the grafana_dashboard label so that the Grafana sidecar discovers and loads it automatically.

Fleet Overview Dashboard

Provides a bird’s-eye view of all managed instances:

  • Reconciliation success/error rates
  • Reconciliation duration percentiles
  • Instance count by phase
  • Workqueue depth and processing rate
  • Auto-update status across the fleet

Instance Detail Dashboard

Provides a per-instance deep dive (select the instance with the dashboard variable):

  • CPU and memory usage vs limits
  • Storage usage and PVC capacity
  • Network I/O
  • Pod restart count and health
  • Container-level resource breakdown (main, chromium, ollama)
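
For reference, a ConfigMap that the Grafana sidecar can discover generally has the shape sketched below. This is a placeholder, not one of the shipped dashboards: the name, namespace, and dashboard JSON are hypothetical, and the label value "1" is an assumption, since the value the sidecar expects depends on how your Grafana installation is configured.

```yaml
# Hypothetical dashboard ConfigMap for checking sidecar discovery.
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-dashboard                # placeholder name
  namespace: monitoring                  # placeholder namespace
  labels:
    grafana_dashboard: "1"               # label key from the text; value assumed
data:
  # Each key becomes a dashboard file loaded by the sidecar.
  example-dashboard.json: |
    {
      "title": "Example Dashboard",
      "panels": []
    }
```

If the shipped dashboards do not show up in Grafana, applying a small ConfigMap like this one is a quick way to confirm that the sidecar watches the expected label and namespace.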

Logging

Configure structured JSON logging for the operator and managed instances:

spec:
  observability:
    logging:
      level: info
      format: json

Supported log levels: debug, info, warn, error.