fix: monitoring stack reliability — cadvisor restart, flask backend resilience, pgwatch race condition
## Summary
Production monitoring stack has several reliability issues causing container metrics loss, empty Grafana pgss panels, and intermittent metric gaps. Discovered during troubleshooting of missing pg_stat_statements data in Grafana.
## Issues
### 1. `self-cadvisor` missing restart policy (container stays dead after crash/reboot)
**File:** `docker-compose.yml` (self-cadvisor service, ~line 249)
**Problem:** `self-cadvisor` is the **only long-running service** without `restart: unless-stopped`. All other services (pgwatch-postgres, pgwatch-prometheus, grafana, monitoring_flask_backend, self-node-exporter, self-postgres-exporter) have it. When cadvisor crashes or the host reboots, the container stays down until someone starts it manually.
**Impact:** No container CPU/memory/IO metrics in the self-monitoring dashboard. VictoriaMetrics scrape target shows `DOWN` with DNS error since the container isn't on the network.
**Fix:**
```yaml
self-cadvisor:
  # ...
  command:
    - "--housekeeping_interval=30s"
    - "--docker_only=true"
    - "--disable_metrics=percpu,sched,tcp,udp,hugetlb,referenced_memory,cpu_topology,resctrl"
    - "--store_container_labels=false"
  restart: unless-stopped  # <-- ADD THIS
```
---
### 2. Grafana pgss panels have hard dependency on `monitoring_flask_backend`
**Files:** `config/grafana/dashboards/Dashboard_2_Aggregated_query_analysis.json`, `Dashboard_3_Single_query_analysis.json`
**Problem:** Every pg_stat_statements panel uses a mandatory `group_left()` join with `pgwatch_query_info`:
```promql
topk($top_n, irate(pgwatch_pg_stat_statements_calls{...}[$__rate_interval]))
* on(queryid) group_left(displayname, ...) pgwatch_query_info
```
When `monitoring_flask_backend` (which serves `/query_info_metrics`) is down, `pgwatch_query_info` series go stale, and the multiplication returns **empty results**. This makes ALL pgss panels show nothing — even though the raw pgss data is fully present in VictoriaMetrics.
**Impact:** Complete loss of pgss visualization whenever the flask backend is unhealthy, even temporarily. This is a silent failure — no errors shown, panels just appear empty.
**Fix options (pick one):**
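- **Option A (recommended):** Use an `or` fallback so panels show data without display names when query_info is unavailable. Match the fallback `on(queryid)`, the same label the join uses: a bare `or` compares full label sets, and because the joined series carry the extra `displayname` labels, it would return both the joined and the unjoined series whenever query_info is healthy:
```promql
(topk(...) * on(queryid) group_left(displayname, ...) pgwatch_query_info)
or on(queryid) topk(...)
```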
- **Option B:** Use `group_left()` with `ignoring()` and make the join optional via recording rules
- **Option C:** Add a health check / readiness probe to `monitoring_flask_backend` and alert when it's down
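
To make the failure mode and the Option A behaviour concrete, here is a minimal Go model of the join semantics. It is only a sketch: the `sample` type and both functions are hypothetical, and it assumes `pgwatch_query_info` acts as a pure label carrier, so the joined value equals the pgss value.

```go
package main

import (
	"fmt"
	"sort"
)

// sample is one pgss series reduced to the labels that matter for the join.
type sample struct {
	QueryID     string
	DisplayName string // empty when no pgwatch_query_info series matched
	Value       float64
}

// innerJoin models `pgss * on(queryid) group_left(displayname) pgwatch_query_info`:
// a series survives only if query_info has an entry for its queryid.
func innerJoin(pgss map[string]float64, info map[string]string) []sample {
	var out []sample
	for id, v := range pgss {
		if name, ok := info[id]; ok {
			out = append(out, sample{QueryID: id, DisplayName: name, Value: v})
		}
	}
	return sorted(out)
}

// joinWithFallback models an `or` fallback matched on queryid: joined series
// win, and any queryid missing from the join is passed through unlabeled.
func joinWithFallback(pgss map[string]float64, info map[string]string) []sample {
	out := innerJoin(pgss, info)
	seen := make(map[string]bool, len(out))
	for _, s := range out {
		seen[s.QueryID] = true
	}
	for id, v := range pgss {
		if !seen[id] {
			out = append(out, sample{QueryID: id, Value: v})
		}
	}
	return sorted(out)
}

// sorted orders samples by queryid so results are deterministic.
func sorted(s []sample) []sample {
	sort.Slice(s, func(i, j int) bool { return s[i].QueryID < s[j].QueryID })
	return s
}

func main() {
	pgss := map[string]float64{"101": 12.5, "202": 3.0}
	noInfo := map[string]string{} // flask backend down: no query_info series

	fmt.Println(len(innerJoin(pgss, noInfo)))        // 0: every panel is empty
	fmt.Println(len(joinWithFallback(pgss, noInfo))) // 2: raw data still shown
}
```

The model also shows the partial case: if query_info covers only some queryids, the fallback returns labeled series for those and unlabeled series for the rest, instead of dropping the rest.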
---
### 3. pgwatch prometheus sink has a race condition in `Collect()` (upstream pgwatch v3.7.0)
**File:** upstream `internal/sinks/prometheus.go` (pgwatch v3.7.0)
**Problem:** The `Collect()` method iterates the metrics cache (`promAsyncMetricCache`) **without holding the lock**, while `Write()` concurrently modifies the same inner maps under the lock. This is a Go data race that can cause:
- Missed metric families (observed: `[count:54]` instead of expected `[count:117274]`)
- Duplicate label set collisions (observed with `lock_waits` metrics: "collected before with the same name and label values")
- A process crash: the Go runtime aborts with `fatal error: concurrent map iteration and map write`, which `recover()` cannot catch
**Relevant code (Collect method, `internal/sinks/prometheus.go:170-185`):**
```go
// Iterates WITHOUT lock:
for dbname, metricsMessages := range promAsyncMetricCache {
    for metric, metricMessages := range metricsMessages {
        // ... processes metrics
    }
    // Only locks for cache clear:
    promAsyncMetricCacheLock.Lock()
    promAsyncMetricCache[dbname] = make(map[string]metrics.MeasurementEnvelope)
    promAsyncMetricCacheLock.Unlock()
}
```
**Impact:** Intermittent — most scrapes produce full data (117K+ metrics), but occasionally a scrape returns drastically fewer metrics, creating gaps in time series. Container recreation temporarily resolves it (new cache, clean state).
**Fix:** In our custom pgwatch image (`pgwatch/Dockerfile`), patch `Collect()` to snapshot the cache under the lock before iterating:
```go
func (promw *PrometheusWriter) Collect(ch chan<- prometheus.Metric) {
    // ... setup ...

    // Snapshot and clear cache under lock
    promAsyncMetricCacheLock.Lock()
    snapshot := promAsyncMetricCache
    promAsyncMetricCache = make(map[string]map[string]metrics.MeasurementEnvelope)
    for db := range snapshot {
        promAsyncMetricCache[db] = make(map[string]metrics.MeasurementEnvelope)
    }
    promAsyncMetricCacheLock.Unlock()

    // Iterate snapshot without lock (no concurrent writes possible)
    for _, metricsMessages := range snapshot {
        for _, metricMessages := range metricsMessages {
            promMetrics := promw.MetricStoreMessageToPromMetrics(metricMessages)
            rows += len(promMetrics)
            for _, pm := range promMetrics {
                ch <- pm
            }
        }
    }
    // ...
}
```
Also consider reporting this upstream to [cybertec-postgresql/pgwatch](https://github.com/cybertec-postgresql/pgwatch/issues).
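
The snapshot-and-swap pattern can be exercised in isolation before patching the image. The sketch below is not pgwatch code: `metricCache`, `Write`, `Collect`, and `runStress` are hypothetical stand-ins, with plain `int` values in place of `metrics.MeasurementEnvelope`. It writes unique entries from several goroutines while a collector scrapes concurrently, then checks that every entry was collected exactly once.

```go
package main

import (
	"fmt"
	"sync"
)

// metricCache is a hypothetical stand-in for promAsyncMetricCache:
// db name -> metric name -> latest value.
type metricCache struct {
	mu   sync.Mutex
	data map[string]map[string]int
}

func newMetricCache() *metricCache {
	return &metricCache{data: make(map[string]map[string]int)}
}

// Write stores a value under the lock, creating the inner map on demand.
func (c *metricCache) Write(db, metric string, v int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.data[db] == nil {
		c.data[db] = make(map[string]int)
	}
	c.data[db][metric] = v
}

// Collect swaps the whole cache for a fresh one under the lock and returns
// the old maps. After the swap no writer can reach them, so the caller may
// iterate the result without holding the lock.
func (c *metricCache) Collect() map[string]map[string]int {
	c.mu.Lock()
	defer c.mu.Unlock()
	snapshot := c.data
	c.data = make(map[string]map[string]int, len(snapshot))
	return snapshot
}

// countEntries returns the number of metric entries across all databases.
func countEntries(s map[string]map[string]int) int {
	n := 0
	for _, inner := range s {
		n += len(inner)
	}
	return n
}

// runStress writes writers*perWriter unique entries while a collector
// scrapes concurrently, and returns how many entries were collected in total.
func runStress(writers, perWriter int) int {
	c := newMetricCache()
	var wg sync.WaitGroup
	for w := 0; w < writers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < perWriter; i++ {
				c.Write(fmt.Sprintf("db%d", w%3), fmt.Sprintf("m_%d_%d", w, i), i)
			}
		}(w)
	}
	done := make(chan struct{})
	collected := make(chan int)
	go func() { // concurrent "scraper"
		total := 0
		for {
			select {
			case <-done:
				collected <- total
				return
			default:
				total += countEntries(c.Collect())
			}
		}
	}()
	wg.Wait()
	close(done)
	// Add whatever was written after the scraper's last pass.
	return <-collected + countEntries(c.Collect())
}

func main() {
	fmt.Println(runStress(8, 1000)) // 8000: nothing lost, nothing duplicated
}
```

Running a sketch like this with `go run -race` is a cheap way to confirm the pattern is race-free; the same flag on the unpatched iteration pattern should report the data race described above.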
---
### 4. `monitoring_flask_backend` health observability
**Problem:** When `monitoring_flask_backend` goes down (crash, OOM, hang), there's no alert or visible indicator. The only symptom is empty Grafana panels, which looks like "no data" rather than "service failure".
**Fix:** Add a VictoriaMetrics alert rule for the `query-info` scrape target being down:
```yaml
# config/prometheus/alerts.yml
groups:
  - name: monitoring-stack-health
    rules:
      - alert: QueryInfoEndpointDown
        expr: up{job="query-info"} == 0
        for: 10m
        annotations:
          summary: "monitoring_flask_backend is down; pgss Grafana panels will be empty"
```
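
For verifying the rule's expression by hand (for example during staging tests), the same `up{job="query-info"}` query can be run against the Prometheus-compatible `/api/v1/query` endpoint. The Go sketch below is an illustration only: the function names, the base URL, and the instance label in the sample payload are assumptions, not values taken from this stack.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/url"
)

// queryResponse covers the parts of a Prometheus-compatible
// /api/v1/query instant-vector response that this check needs.
type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		Result []struct {
			Metric map[string]string `json:"metric"`
			Value  [2]interface{}    `json:"value"` // [unix timestamp, "sample value"]
		} `json:"result"`
	} `json:"data"`
}

// downInstances returns the instance label of every series whose value is "0".
func downInstances(body []byte) ([]string, error) {
	var resp queryResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	if resp.Status != "success" {
		return nil, fmt.Errorf("query failed with status %q", resp.Status)
	}
	var down []string
	for _, r := range resp.Data.Result {
		if v, ok := r.Value[1].(string); ok && v == "0" {
			down = append(down, r.Metric["instance"])
		}
	}
	return down, nil
}

// checkJob runs the instant query up{job="<job>"} against baseURL.
func checkJob(baseURL, job string) ([]string, error) {
	q := url.Values{"query": {fmt.Sprintf("up{job=%q}", job)}}
	resp, err := http.Get(baseURL + "/api/v1/query?" + q.Encode())
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return downInstances(body)
}

func main() {
	// Hypothetical payload; the instance label is an assumption.
	sample := []byte(`{"status":"success","data":{"resultType":"vector","result":[
	  {"metric":{"__name__":"up","job":"query-info","instance":"monitoring_flask_backend:8000"},"value":[1700000000,"0"]}]}}`)
	down, err := downInstances(sample)
	fmt.Println(down, err)

	// Against a live stack, pass the real base URL (assumed here):
	// checkJob("http://localhost:9090", "query-info")
	_ = checkJob
}
```

An empty result yields no down instances, which mirrors a limitation of the rule itself: `up == 0` only fires while the scrape target still exists. If the target could disappear from the scrape config entirely, an `absent(up{job="query-info"})` condition may be worth adding.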
## Priority
1. **P0** — cadvisor restart policy (trivial one-line fix)
2. **P0** — Grafana pgss query resilience (prevents silent data loss in dashboards)
3. **P1** — pgwatch race condition patch (intermittent, self-resolving on restart)
4. **P2** — flask backend health alerting (defense in depth)
---
## Acceptance Criteria
- `self-cadvisor` service has `restart: unless-stopped` in docker-compose.yml
- Grafana pgss panels use `or` fallback to show data without display names when `monitoring_flask_backend` is down
- pgwatch Prometheus sink `Collect()` method patched to snapshot cache under lock before iterating (race condition fix)
- VictoriaMetrics alert rule added for `QueryInfoEndpointDown` when `up{job="query-info"} == 0` for 10+ minutes
- `monitoring_flask_backend` health is observable via scrape target status
## Definition of Done
- P0: cadvisor restart policy added — survives container crashes and host reboots
- P0: Grafana pgss queries resilient — panels show data (with or without display names) regardless of flask backend status
- P1: pgwatch race condition patched in custom Docker image — no more intermittent metric count drops during scrape
- P2: Alert rule deployed — team notified when flask backend is down
- All fixes tested in staging monitoring stack: cadvisor auto-restarts, pgss panels degrade gracefully, metrics consistent across scrapes
- Race condition fix considered for upstream contribution to cybertec-postgresql/pgwatch