fix: monitoring stack reliability — cadvisor restart, flask backend resilience, pgwatch race condition
Summary
The production monitoring stack has several reliability issues causing loss of container metrics, empty pg_stat_statements (pgss) panels in Grafana, and intermittent metric gaps. Discovered while troubleshooting missing pg_stat_statements data in Grafana.
Issues
1. self-cadvisor missing restart policy (container stays dead after crash/reboot)
File: docker-compose.yml (self-cadvisor service, ~line 249)
Problem: self-cadvisor is the only long-running service without restart: unless-stopped. All other services (pgwatch-postgres, pgwatch-prometheus, grafana, monitoring_flask_backend, self-node-exporter, self-postgres-exporter) have it. When cadvisor crashes or the host reboots, it stays dead permanently.
Impact: No container CPU/memory/IO metrics in the self-monitoring dashboard. VictoriaMetrics scrape target shows DOWN with DNS error since the container isn't on the network.
Fix:
  self-cadvisor:
    ...
    command:
      - "--housekeeping_interval=30s"
      - "--docker_only=true"
      - "--disable_metrics=percpu,sched,tcp,udp,hugetlb,referenced_memory,cpu_topology,resctrl"
      - "--store_container_labels=false"
    restart: unless-stopped  # <-- ADD THIS
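To verify after redeploying (container name assumed to match the service name; compose may prefix it depending on project settings):

docker inspect -f '{{.HostConfig.RestartPolicy.Name}}' self-cadvisor
# expected output: unless-stopped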
2. Grafana pgss panels have a hard dependency on monitoring_flask_backend
Files: config/grafana/dashboards/Dashboard_2_Aggregated_query_analysis.json, Dashboard_3_Single_query_analysis.json
Problem: Every pg_stat_statements panel uses a mandatory group_left() join with pgwatch_query_info:
topk($top_n, irate(pgwatch_pg_stat_statements_calls{...}[$__rate_interval]))
  * on(queryid) group_left(displayname, ...) pgwatch_query_info
When monitoring_flask_backend (which serves /query_info_metrics) is down, pgwatch_query_info series go stale, and the multiplication returns empty results. This makes ALL pgss panels show nothing — even though the raw pgss data is fully present in VictoriaMetrics.
Impact: Complete loss of pgss visualization whenever the flask backend is unhealthy, even temporarily. This is a silent failure — no errors shown, panels just appear empty.
Fix options (pick one):
- Option A (recommended): Use an `or` fallback — show data without display names when query_info is unavailable: `(topk(...) * on(queryid) group_left(displayname, ...) pgwatch_query_info) or topk(...)`. See the sketch after this list.
- Option B: Use `group_left()` with `ignoring()` and make the join optional via recording rules.
- Option C: Add a health check / readiness probe to `monitoring_flask_backend` and alert when it's down.
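A concrete sketch of Option A applied to the calls query from above (the `{...}` selector and the `group_left()` label list are elided exactly as in the dashboards):

# calls panel example; selector and label list elided as in the existing query
(
  topk($top_n, irate(pgwatch_pg_stat_statements_calls{...}[$__rate_interval]))
    * on(queryid) group_left(displayname, ...) pgwatch_query_info
)
or on(queryid)
topk($top_n, irate(pgwatch_pg_stat_statements_calls{...}[$__rate_interval]))

The `on(queryid)` after `or` matters: with a bare `or`, matching is on the full label set, and since the enriched series carry the extra joined labels, every queryid would appear twice while the backend is healthy. One cosmetic caveat: the fallback series have no displayname, so legends templated on `{{displayname}}` render blank until the backend recovers.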
3. pgwatch prometheus sink has a race condition in Collect() (upstream pgwatch v3.7.0)
File: upstream internal/sinks/prometheus.go (pgwatch v3.7.0)
Problem: The Collect() method iterates the metrics cache (promAsyncMetricCache) without holding the lock, while Write() concurrently modifies the same inner maps under the lock. This is a Go data race that can cause:
- Missed metric families (observed: `[count:54]` instead of the expected `[count:117274]`)
- Duplicate label set collisions (observed with `lock_waits` metrics: "collected before with the same name and label values")
- Potential runtime panic (`concurrent map iteration and map write`)
Relevant code (Collect method, sinks/prometheus.go:170-185):
// Iterates WITHOUT lock:
for dbname, metricsMessages := range promAsyncMetricCache {
	for metric, metricMessages := range metricsMessages {
		// ... processes metrics
	}
	// Only locks for the cache clear:
	promAsyncMetricCacheLock.Lock()
	promAsyncMetricCache[dbname] = make(map[string]metrics.MeasurementEnvelope)
	promAsyncMetricCacheLock.Unlock()
}
Impact: Intermittent — most scrapes produce full data (117K+ metrics), but occasionally a scrape returns drastically fewer metrics, creating gaps in time series. Container recreation temporarily resolves it (new cache, clean state).
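The pattern is easy to demonstrate outside pgwatch. A minimal, self-contained sketch (simplified names and types, not pgwatch's actual code) that the Go race detector flags immediately:

// run with: go run -race main.go
package main

import (
	"sync"
	"time"
)

// Simplified stand-ins for promAsyncMetricCache and its lock.
var (
	cache     = map[string]map[string]int{"db1": {"m": 0}}
	cacheLock sync.Mutex
)

// writer mimics Write(): mutates the inner map under the lock.
func writer() {
	for i := 0; ; i++ {
		cacheLock.Lock()
		cache["db1"]["m"] = i
		cacheLock.Unlock()
	}
}

// collector mimics the buggy Collect(): iterates without the lock,
// locking only to clear each inner map.
func collector() {
	for {
		for db, inner := range cache { // unguarded iteration
			for range inner { // races with writer's map write; can also panic:
			} // "fatal error: concurrent map iteration and map write"
			cacheLock.Lock()
			cache[db] = make(map[string]int)
			cacheLock.Unlock()
		}
	}
}

func main() {
	go writer()
	go collector()
	time.Sleep(2 * time.Second) // -race reports the conflict almost immediately
}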
Fix: In our custom pgwatch image (pgwatch/Dockerfile), patch Collect() to snapshot the cache under the lock before iterating:
func (promw *PrometheusWriter) Collect(ch chan<- prometheus.Metric) {
	// ... setup ...

	// Snapshot and clear the cache under the lock
	promAsyncMetricCacheLock.Lock()
	snapshot := promAsyncMetricCache
	promAsyncMetricCache = make(map[string]map[string]metrics.MeasurementEnvelope)
	for db := range snapshot {
		promAsyncMetricCache[db] = make(map[string]metrics.MeasurementEnvelope)
	}
	promAsyncMetricCacheLock.Unlock()

	// Iterate the snapshot without the lock (no concurrent writes possible)
	for _, metricsMessages := range snapshot {
		for _, metricMessages := range metricsMessages {
			promMetrics := promw.MetricStoreMessageToPromMetrics(metricMessages)
			rows += len(promMetrics)
			for _, pm := range promMetrics {
				ch <- pm
			}
		}
	}
	// ...
}
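An alternative would be to hold promAsyncMetricCacheLock for the whole iteration, but that blocks Write() for the duration of a scrape; the snapshot approach keeps the critical section down to a pointer swap and also preserves the existing clear-after-scrape semantics atomically.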
Also consider reporting this upstream to cybertec-postgresql/pgwatch.
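Before baking the patch into the image, running the sink package under the race detector (e.g. `go test -race ./internal/sinks/...` in the pgwatch source tree) should confirm the report disappears.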
4. monitoring_flask_backend health observability
Problem: When monitoring_flask_backend goes down (crash, OOM, hang), there's no alert or visible indicator. The only symptom is empty Grafana panels, which looks like "no data" rather than "service failure".
Fix: Add a VictoriaMetrics alert rule for the query-info scrape target being down:
# config/prometheus/alerts.yml
groups:
  - name: monitoring-stack-health
    rules:
      - alert: QueryInfoEndpointDown
        expr: up{job="query-info"} == 0
        for: 10m
        annotations:
          summary: "monitoring_flask_backend is down — pgss Grafana panels will be empty"
Priority
- P0 — cadvisor restart policy (trivial one-line fix)
- P0 — Grafana pgss query resilience (prevents silent data loss in dashboards)
- P1 — pgwatch race condition patch (intermittent, self-resolving on restart)
- P2 — flask backend health alerting (defense in depth)
Acceptance Criteria
- `self-cadvisor` service has `restart: unless-stopped` in docker-compose.yml
- Grafana pgss panels use an `or` fallback to show data without display names when `monitoring_flask_backend` is down
- pgwatch Prometheus sink `Collect()` method patched to snapshot the cache under the lock before iterating (race condition fix)
- VictoriaMetrics alert rule added for `QueryInfoEndpointDown` when `up{job="query-info"} == 0` for 10+ minutes
- `monitoring_flask_backend` health is observable via scrape target status
Definition of Done
- P0: cadvisor restart policy added — survives container crashes and host reboots
- P0: Grafana pgss queries resilient — panels show data (with or without display names) regardless of flask backend status
- P1: pgwatch race condition patched in custom Docker image — no more intermittent metric count drops during scrape
- P2: Alert rule deployed — team notified when flask backend is down
- All fixes tested in staging monitoring stack: cadvisor auto-restarts, pgss panels degrade gracefully, metrics consistent across scrapes
- Race condition fix considered for upstream contribution to cybertec-postgresql/pgwatch