fix: monitoring stack reliability — cadvisor restart, flask backend resilience, pgwatch race condition
## Summary

The production monitoring stack has several reliability issues causing container metrics loss, empty Grafana pgss panels, and intermittent metric gaps. They were discovered while troubleshooting missing pg_stat_statements data in Grafana.

## Issues

### 1. `self-cadvisor` missing restart policy (container stays dead after crash/reboot)

**File:** `docker-compose.yml` (self-cadvisor service, ~line 249)

**Problem:** `self-cadvisor` is the **only long-running service** without `restart: unless-stopped`. All other services (pgwatch-postgres, pgwatch-prometheus, grafana, monitoring_flask_backend, self-node-exporter, self-postgres-exporter) have it. When cadvisor crashes or the host reboots, it stays dead permanently.

**Impact:** No container CPU/memory/IO metrics in the self-monitoring dashboard. The VictoriaMetrics scrape target shows `DOWN` with a DNS error because the container isn't on the network.

**Fix:**

```yaml
self-cadvisor:
  ...
  command:
    - "--housekeeping_interval=30s"
    - "--docker_only=true"
    - "--disable_metrics=percpu,sched,tcp,udp,hugetlb,referenced_memory,cpu_topology,resctrl"
    - "--store_container_labels=false"
  restart: unless-stopped  # <-- ADD THIS
```

---

### 2. Grafana pgss panels have hard dependency on `monitoring_flask_backend`

**Files:** `config/grafana/dashboards/Dashboard_2_Aggregated_query_analysis.json`, `Dashboard_3_Single_query_analysis.json`

**Problem:** Every pg_stat_statements panel uses a mandatory `group_left()` join with `pgwatch_query_info`:

```promql
topk($top_n, irate(pgwatch_pg_stat_statements_calls{...}[$__rate_interval]))
  * on(queryid) group_left(displayname, ...) pgwatch_query_info
```

When `monitoring_flask_backend` (which serves `/query_info_metrics`) is down, the `pgwatch_query_info` series go stale and the multiplication returns **empty results**. This makes ALL pgss panels show nothing, even though the raw pgss data is fully present in VictoriaMetrics.
**Impact:** Complete loss of pgss visualization whenever the flask backend is unhealthy, even temporarily. This is a silent failure: no errors are shown, the panels just appear empty.

**Fix options (pick one):**

- **Option A (recommended):** Use an `or` fallback to show data without display names when query_info is unavailable. A bare `or` matches on the full label set, and the joined series carry the extra `displayname` label, so the fallback needs `on(queryid)` to avoid returning both the labelled and the unlabelled copy of each series:
  ```promql
  (topk(...) * on(queryid) group_left(displayname, ...) pgwatch_query_info)
    or on(queryid) topk(...)
  ```
- **Option B:** Use `group_left()` with `ignoring()` and make the join optional via recording rules
- **Option C:** Add a health check / readiness probe to `monitoring_flask_backend` and alert when it's down

---

### 3. pgwatch prometheus sink has a race condition in `Collect()` (upstream pgwatch v3.7.0)

**File:** upstream `internal/sinks/prometheus.go` (pgwatch v3.7.0)

**Problem:** The `Collect()` method iterates the metrics cache (`promAsyncMetricCache`) **without holding the lock**, while `Write()` concurrently modifies the same inner maps under the lock. This is a Go data race that can cause:

- Missed metric families (observed: `[count:54]` instead of expected `[count:117274]`)
- Duplicate label set collisions (observed with `lock_waits` metrics: "collected before with the same name and label values")
- Potential runtime panic (`concurrent map iteration and map write`)

**Relevant code (Collect method, sinks/prometheus.go:170-185):**

```go
// Iterates WITHOUT lock:
for dbname, metricsMessages := range promAsyncMetricCache {
    for metric, metricMessages := range metricsMessages {
        // ... processes metrics
    }
    // Only locks for cache clear:
    promAsyncMetricCacheLock.Lock()
    promAsyncMetricCache[dbname] = make(map[string]metrics.MeasurementEnvelope)
    promAsyncMetricCacheLock.Unlock()
}
```

**Impact:** Intermittent. Most scrapes produce full data (117K+ metrics), but occasionally a scrape returns drastically fewer metrics, creating gaps in time series. Recreating the container temporarily resolves it (new cache, clean state).
**Fix:** In our custom pgwatch image (`pgwatch/Dockerfile`), patch `Collect()` to snapshot the cache under the lock before iterating:

```go
func (promw *PrometheusWriter) Collect(ch chan<- prometheus.Metric) {
    // ... setup ...

    // Snapshot and clear cache under lock
    promAsyncMetricCacheLock.Lock()
    snapshot := promAsyncMetricCache
    promAsyncMetricCache = make(map[string]map[string]metrics.MeasurementEnvelope)
    for db := range snapshot {
        promAsyncMetricCache[db] = make(map[string]metrics.MeasurementEnvelope)
    }
    promAsyncMetricCacheLock.Unlock()

    // Iterate snapshot without lock (no concurrent writes possible)
    for _, metricsMessages := range snapshot {
        for _, metricMessages := range metricsMessages {
            promMetrics := promw.MetricStoreMessageToPromMetrics(metricMessages)
            rows += len(promMetrics)
            for _, pm := range promMetrics {
                ch <- pm
            }
        }
    }
    // ...
}
```

Also consider reporting this upstream to [cybertec-postgresql/pgwatch](https://github.com/cybertec-postgresql/pgwatch/issues).

---

### 4. `monitoring_flask_backend` health observability

**Problem:** When `monitoring_flask_backend` goes down (crash, OOM, hang), there's no alert or visible indicator. The only symptom is empty Grafana panels, which looks like "no data" rather than "service failure".

**Fix:** Add a VictoriaMetrics alert rule for the `query-info` scrape target being down:

```yaml
# config/prometheus/alerts.yml
groups:
  - name: monitoring-stack-health
    rules:
      - alert: QueryInfoEndpointDown
        expr: up{job="query-info"} == 0
        for: 10m
        annotations:
          summary: "monitoring_flask_backend is down; pgss Grafana panels will be empty"
```

## Priority

1. **P0**: cadvisor restart policy (trivial one-line fix)
2. **P0**: Grafana pgss query resilience (prevents silent data loss in dashboards)
3. **P1**: pgwatch race condition patch (intermittent, self-resolving on restart)
4. **P2**: flask backend health alerting (defense in depth)

---

## Acceptance Criteria

- `self-cadvisor` service has `restart: unless-stopped` in docker-compose.yml
- Grafana pgss panels use `or` fallback to show data without display names when `monitoring_flask_backend` is down
- pgwatch Prometheus sink `Collect()` method patched to snapshot cache under lock before iterating (race condition fix)
- VictoriaMetrics alert rule added for `QueryInfoEndpointDown` when `up{job="query-info"} == 0` for 10+ minutes
- `monitoring_flask_backend` health is observable via scrape target status

## Definition of Done

- P0: cadvisor restart policy added; the container survives crashes and host reboots
- P0: Grafana pgss queries resilient; panels show data (with or without display names) regardless of flask backend status
- P1: pgwatch race condition patched in custom Docker image; no more intermittent metric count drops during scrape
- P2: Alert rule deployed; team notified when flask backend is down
- All fixes tested in staging monitoring stack: cadvisor auto-restarts, pgss panels degrade gracefully, metrics consistent across scrapes
- Race condition fix considered for upstream contribution to cybertec-postgresql/pgwatch