fix: monitoring stack reliability — cadvisor restart, flask backend resilience, pgwatch race condition
Summary
The production monitoring stack has several reliability issues causing loss of container metrics, empty pg_stat_statements (pgss) panels in Grafana, and intermittent metric gaps. Discovered while troubleshooting missing pg_stat_statements data in Grafana.
Issues
1. self-cadvisor missing restart policy (container stays dead after crash/reboot)
File: docker-compose.yml (self-cadvisor service, ~line 249)
Problem: self-cadvisor is the only long-running service without restart: unless-stopped. All other services (pgwatch-postgres, pgwatch-prometheus, grafana, monitoring_flask_backend, self-node-exporter, self-postgres-exporter) have it. When cadvisor crashes or the host reboots, it stays dead permanently.
Impact: No container CPU/memory/IO metrics in the self-monitoring dashboard. VictoriaMetrics scrape target shows DOWN with DNS error since the container isn't on the network.
Fix:
  self-cadvisor:
    ...
    command:
      - "--housekeeping_interval=30s"
      - "--docker_only=true"
      - "--disable_metrics=percpu,sched,tcp,udp,hugetlb,referenced_memory,cpu_topology,resctrl"
      - "--store_container_labels=false"
    restart: unless-stopped  # <-- ADD THIS
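To verify after redeploying (container name assumed to match the service name; compose may prefix it depending on project settings):

docker inspect -f '{{.HostConfig.RestartPolicy.Name}}' self-cadvisor
# expected output: unless-stopped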
2. Grafana pgss panels have a hard dependency on monitoring_flask_backend
Files: config/grafana/dashboards/Dashboard_2_Aggregated_query_analysis.json, Dashboard_3_Single_query_analysis.json
Problem: Every pg_stat_statements panel uses a mandatory group_left() join with pgwatch_query_info:
topk($top_n, irate(pgwatch_pg_stat_statements_calls{...}[$__rate_interval]))
  * on(queryid) group_left(displayname, ...) pgwatch_query_info
When monitoring_flask_backend (which serves /query_info_metrics) is down, pgwatch_query_info series go stale, and the multiplication returns empty results. This makes ALL pgss panels show nothing — even though the raw pgss data is fully present in VictoriaMetrics.
Impact: Complete loss of pgss visualization whenever the flask backend is unhealthy, even temporarily. This is a silent failure — no errors shown, panels just appear empty.
Fix options (pick one):
- Option A (recommended): Use an `or` fallback — show data without display names when query_info is unavailable: `(topk(...) * on(queryid) group_left(displayname, ...) pgwatch_query_info) or topk(...)`. See the sketch after this list.
- Option B: Use `group_left()` with `ignoring()` and make the join optional via recording rules.
- Option C: Add a health check / readiness probe to `monitoring_flask_backend` and alert when it's down.
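A concrete sketch of Option A applied to the calls query from above (the `{...}` selector and the `group_left()` label list are elided exactly as in the dashboards):

# calls panel example; selector and label list elided as in the existing query
(
  topk($top_n, irate(pgwatch_pg_stat_statements_calls{...}[$__rate_interval]))
    * on(queryid) group_left(displayname, ...) pgwatch_query_info
)
or on(queryid)
topk($top_n, irate(pgwatch_pg_stat_statements_calls{...}[$__rate_interval]))

The `on(queryid)` after `or` matters: with a bare `or`, matching is on the full label set, and since the enriched series carry the extra joined labels, every queryid would appear twice while the backend is healthy. One cosmetic caveat: the fallback series have no displayname, so legends templated on `{{displayname}}` render blank until the backend recovers.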
3. pgwatch prometheus sink has a race condition in Collect() (upstream pgwatch v3.7.0)
File: upstream internal/sinks/prometheus.go (pgwatch v3.7.0)
Problem: The Collect() method iterates the metrics cache (promAsyncMetricCache) without holding the lock, while Write() concurrently modifies the same inner maps under the lock. This is a Go data race that can cause:
- Missed metric families (observed: `[count:54]` instead of the expected `[count:117274]`)
- Duplicate label set collisions (observed with `lock_waits` metrics: "collected before with the same name and label values")
- Potential runtime panic (`concurrent map iteration and map write`)
Relevant code (Collect method, sinks/prometheus.go:170-185):
// Iterates WITHOUT lock:
for dbname, metricsMessages := range promAsyncMetricCache {
	for metric, metricMessages := range metricsMessages {
		// ... processes metrics
	}
	// Only locks for the cache clear:
	promAsyncMetricCacheLock.Lock()
	promAsyncMetricCache[dbname] = make(map[string]metrics.MeasurementEnvelope)
	promAsyncMetricCacheLock.Unlock()
}
Impact: Intermittent — most scrapes produce full data (117K+ metrics), but occasionally a scrape returns drastically fewer metrics, creating gaps in time series. Container recreation temporarily resolves it (new cache, clean state).
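The pattern is easy to demonstrate outside pgwatch. A minimal, self-contained sketch (simplified names and types, not pgwatch's actual code) that the Go race detector flags immediately:

// run with: go run -race main.go
package main

import (
	"sync"
	"time"
)

// Simplified stand-ins for promAsyncMetricCache and its lock.
var (
	cache     = map[string]map[string]int{"db1": {"m": 0}}
	cacheLock sync.Mutex
)

// writer mimics Write(): mutates the inner map under the lock.
func writer() {
	for i := 0; ; i++ {
		cacheLock.Lock()
		cache["db1"]["m"] = i
		cacheLock.Unlock()
	}
}

// collector mimics the buggy Collect(): iterates without the lock,
// locking only to clear each inner map.
func collector() {
	for {
		for db, inner := range cache { // unguarded iteration
			for range inner { // races with writer's map write; can also panic:
			} // "fatal error: concurrent map iteration and map write"
			cacheLock.Lock()
			cache[db] = make(map[string]int)
			cacheLock.Unlock()
		}
	}
}

func main() {
	go writer()
	go collector()
	time.Sleep(2 * time.Second) // -race reports the conflict almost immediately
}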
Fix: In our custom pgwatch image (pgwatch/Dockerfile), patch Collect() to snapshot the cache under the lock before iterating:
func (promw *PrometheusWriter) Collect(ch chan<- prometheus.Metric) {
	// ... setup ...

	// Snapshot and clear the cache under the lock
	promAsyncMetricCacheLock.Lock()
	snapshot := promAsyncMetricCache
	promAsyncMetricCache = make(map[string]map[string]metrics.MeasurementEnvelope)
	for db := range snapshot {
		promAsyncMetricCache[db] = make(map[string]metrics.MeasurementEnvelope)
	}
	promAsyncMetricCacheLock.Unlock()

	// Iterate the snapshot without the lock (no concurrent writes possible)
	for _, metricsMessages := range snapshot {
		for _, metricMessages := range metricsMessages {
			promMetrics := promw.MetricStoreMessageToPromMetrics(metricMessages)
			rows += len(promMetrics)
			for _, pm := range promMetrics {
				ch <- pm
			}
		}
	}
	// ...
}
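An alternative would be to hold promAsyncMetricCacheLock for the whole iteration, but that blocks Write() for the duration of a scrape; the snapshot approach keeps the critical section down to a pointer swap and also preserves the existing clear-after-scrape semantics atomically.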
Also consider reporting this upstream to cybertec-postgresql/pgwatch.
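Before baking the patch into the image, running the sink package under the race detector (e.g. `go test -race ./internal/sinks/...` in the pgwatch source tree) should confirm the report disappears.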
4. monitoring_flask_backend health observability
Problem: When monitoring_flask_backend goes down (crash, OOM, hang), there's no alert or visible indicator. The only symptom is empty Grafana panels, which looks like "no data" rather than "service failure".
Fix: Add a VictoriaMetrics alert rule for the query-info scrape target being down:
# config/prometheus/alerts.yml
groups:
  - name: monitoring-stack-health
    rules:
      - alert: QueryInfoEndpointDown
        expr: up{job="query-info"} == 0
        for: 10m
        annotations:
          summary: "monitoring_flask_backend is down — pgss Grafana panels will be empty"
Priority
- P0 — cadvisor restart policy (trivial one-line fix)
- P0 — Grafana pgss query resilience (prevents silent data loss in dashboards)
- P1 — pgwatch race condition patch (intermittent, self-resolving on restart)
- P2 — flask backend health alerting (defense in depth)
Acceptance Criteria
- `self-cadvisor` service has `restart: unless-stopped` in docker-compose.yml
- Grafana pgss panels use an `or` fallback to show data without display names when `monitoring_flask_backend` is down
- pgwatch Prometheus sink `Collect()` method patched to snapshot the cache under the lock before iterating (race condition fix)
- VictoriaMetrics alert rule added for `QueryInfoEndpointDown` when `up{job="query-info"} == 0` for 10+ minutes
- `monitoring_flask_backend` health is observable via scrape target status
Definition of Done
- P0: cadvisor restart policy added — survives container crashes and host reboots
- P0: Grafana pgss queries resilient — panels show data (with or without display names) regardless of flask backend status
- P1: pgwatch race condition patched in custom Docker image — no more intermittent metric count drops during scrape
- P2: Alert rule deployed — team notified when flask backend is down
- All fixes tested in staging monitoring stack: cadvisor auto-restarts, pgss panels degrade gracefully, metrics consistent across scrapes
- Race condition fix considered for upstream contribution to cybertec-postgresql/pgwatch