Parameterize monitoring stack resource limits and VictoriaMetrics tuning via env vars
## Summary
Make monitoring stack resource limits, VictoriaMetrics tuning flags, and other production-sensitive settings **environment-variable-driven in `docker-compose.yml`**, with laptop-friendly defaults preserved. Provisioning playbooks (Ansible-side, console-side) can then export prod-grade values per VM size class without forking the compose file or editing it in place on every host.
## Why now
On 2026-04-28 the `mon-stardex-6120-...` (us-east-2) monitoring VM wedged: Grafana failed to load, and VictoriaMetrics returned HTTP 422 and 429 responses under real Supabase-scale `pg_stat_statements` cardinality. The tunings that fixed it were applied **directly to the live host's `docker-compose.yml`** and have no upstream home:
| Service / flag | Stock default (laptop-OK) | Production-validated on `mon-stardex` |
|---|---|---|
| `sink-prometheus` `mem_limit` | `1536m` | **`8589934592` (8 GiB)** |
| `sink-prometheus` `cpus` | `0.75` | **`1.5`** |
| `-memory.allowedPercent` | `60` (VictoriaMetrics default) | **`80`** |
| `-search.maxConcurrentRequests` | `2` (VictoriaMetrics default, scales with GOMAXPROCS) | **`8`** |
| `-search.maxQueueDuration` | `10s` (VictoriaMetrics default) | **`30s`** |
| `monitoring_flask_backend` `mem_limit` | `1073741824` (1 GiB, since https://gitlab.com/postgres-ai/postgresai/-/merge_requests/238) | (same, already merged) |
| `self-cadvisor` `mem_limit` | `402653184` (384 MiB, since https://gitlab.com/postgres-ai/postgresai/-/merge_requests/248) | (same, already merged) |
These tunings exist **only on this single VM**. Any newly-provisioned monitoring VM via the console gets stock defaults and will hit the same wedge under the same workload.
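The raw byte counts in the table are easy to misread, so here is a small sanity check of the arithmetic behind them. It is only a sketch: the `humanize` helper and the `LIMITS` labels are illustrative names, not part of the repository.

```python
# Binary units as used by Docker's mem_limit when given a plain byte count.
MIB = 1024 ** 2
GIB = 1024 ** 3

# Values taken from the table above; the dictionary keys are illustrative labels.
LIMITS = {
    "sink-prometheus (stock)": 1536 * MIB,
    "sink-prometheus (mon-stardex)": 8 * GIB,
    "monitoring_flask_backend": 1 * GIB,
    "self-cadvisor": 384 * MIB,
}


def humanize(num_bytes: int) -> str:
    """Render a byte count in the largest binary unit that fits."""
    for unit, size in (("GiB", GIB), ("MiB", MIB)):
        if num_bytes >= size:
            return f"{num_bytes / size:g} {unit}"
    return f"{num_bytes} B"


if __name__ == "__main__":
    for label, value in LIMITS.items():
        print(f"{label}: {value} bytes = {humanize(value)}")
```

Note that `1536m` and `1610612736` describe the same 1.5 GiB limit, so switching the stock default to a byte count does not change behavior.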
## Proposed fix
### Phase 1 (this MR) — parameterize compose
Change `config/docker-compose.yml` so resource caps and VictoriaMetrics tuning flags read env vars with sensible defaults. Pattern (mirror what https://gitlab.com/postgres-ai/postgresai/-/merge_requests/238 did for the helm chart):
```yaml
sink-prometheus:
  cpus: ${SINK_PROMETHEUS_CPUS:-0.75}
  mem_limit: ${SINK_PROMETHEUS_MEM:-1610612736} # 1.5 GiB default
  command:
    - "-storageDataPath=/victoria-metrics-data"
    - "-retentionPeriod=${VM_RETENTION:-336h}"
    - "-httpListenAddr=:9090"
    - "-promscrape.config=/postgres_ai_configs/prometheus/prometheus.yml"
    - "-promscrape.config.strictParse=false"
    - "-promscrape.maxScrapeSize=128000000"
    - "-memory.allowedPercent=${VM_MEM_PCT:-60}"
    - "-search.maxConcurrentRequests=${VM_MAX_CONCURRENT:-2}"
    - "-search.maxQueueDuration=${VM_QUEUE_DURATION:-10s}"
monitoring_flask_backend:
  cpus: ${FLASK_CPUS:-0.5}
  mem_limit: ${FLASK_MEM:-1073741824}
self-cadvisor:
  cpus: ${CADVISOR_CPUS:-0.25}
  mem_limit: ${CADVISOR_MEM:-402653184}
```
Apply the same pattern to `pgwatch-postgres`, `pgwatch-prometheus`, `target-db`, `sink-postgres`, `postgres-reports`, etc., for completeness. Pick env-var names per a consistent convention (`<UPPER_SERVICE>_MEM`, `<UPPER_SERVICE>_CPUS`).
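The `${VAR:-default}` form used in the pattern falls back to the default when the variable is unset or set to an empty string. The sketch below mimics that rule in plain Python to make the resolution explicit; `interpolate` is a hypothetical helper for illustration and is not how Compose itself is implemented.

```python
import re

# Matches only the ${NAME:-default} form used in the pattern above.
_DEFAULTED_VAR = re.compile(
    r"\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*):-(?P<default>[^}]*)\}"
)


def interpolate(text: str, env: dict) -> str:
    """Resolve ${NAME:-default}: use env[NAME] unless it is unset or empty."""

    def _resolve(match: re.Match) -> str:
        value = env.get(match.group("name"))
        # ":-" treats an empty value the same as an unset one.
        return value if value else match.group("default")

    return _DEFAULTED_VAR.sub(_resolve, text)


if __name__ == "__main__":
    line = "mem_limit: ${SINK_PROMETHEUS_MEM:-1610612736}"
    print(interpolate(line, {}))
    print(interpolate(line, {"SINK_PROMETHEUS_MEM": "8589934592"}))
```

One consequence worth keeping in mind: an empty assignment such as `SINK_PROMETHEUS_MEM=` in a `.env` file still yields the laptop default rather than an empty limit.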
Update `.env.example` to enumerate every new variable with its default.
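One way to keep `.env.example` from drifting is to derive its entries from the compose file itself. The following is a rough sketch under that assumption; `env_example_lines` is a hypothetical helper, and it only recognizes the `${NAME:-default}` form.

```python
import re

_DEFAULTED_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*):-([^}]*)\}")


def env_example_lines(compose_text: str) -> list:
    """Collect NAME=default lines for every ${NAME:-default} in the text."""
    defaults = {}
    for name, default in _DEFAULTED_VAR.findall(compose_text):
        previous = defaults.setdefault(name, default)
        if previous != default:
            # The same variable documented with two defaults is a bug.
            raise ValueError(
                f"{name} has conflicting defaults: {previous!r} vs {default!r}"
            )
    return [f"{name}={default}" for name, default in sorted(defaults.items())]


if __name__ == "__main__":
    # Path taken from the issue text; adjust if the file lives elsewhere.
    with open("config/docker-compose.yml", encoding="utf-8") as handle:
        print("\n".join(env_example_lines(handle.read())))
```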
Add contract tests under `tests/compliance_vectors/` (mirror `test_flask_resources.py`, `test_cadvisor_resources.py`) that:
1. Assert each variable resolves to its documented default when unset.
2. Assert each variable is overridable via env (e.g. set `SINK_PROMETHEUS_MEM=8589934592` and confirm `docker compose config` renders it).
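The two assertions above could be sketched roughly as follows. This is not the existing test code: `build_env`, `render_compose`, `mem_limit_bytes`, and `check_mem_limit` are hypothetical names, and the sketch assumes Docker Compose v2 is on `PATH` and that no `.env` file next to the compose file sets the variable under test.

```python
import json
import os
import subprocess

COMPOSE_FILE = "config/docker-compose.yml"  # path from the issue text


def build_env(overrides: dict) -> dict:
    """Copy the process environment; a value of None removes the variable."""
    env = dict(os.environ)
    for name, value in overrides.items():
        if value is None:
            env.pop(name, None)
        else:
            env[name] = value
    return env


def render_compose(overrides: dict) -> dict:
    """Return the interpolated compose model via `docker compose config`."""
    result = subprocess.run(
        ["docker", "compose", "-f", COMPOSE_FILE, "config", "--format", "json"],
        env=build_env(overrides),
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def mem_limit_bytes(rendered: dict, service: str) -> int:
    """Read mem_limit as an int; int() accepts a string or a number."""
    return int(rendered["services"][service]["mem_limit"])


def check_mem_limit(service, variable, default_bytes, override_bytes):
    """Contract check: documented default when unset, override when set."""
    rendered = render_compose({variable: None})
    assert mem_limit_bytes(rendered, service) == default_bytes
    rendered = render_compose({variable: str(override_bytes)})
    assert mem_limit_bytes(rendered, service) == override_bytes
```

A pytest case could then call `check_mem_limit("sink-prometheus", "SINK_PROMETHEUS_MEM", 1610612736, 8589934592)`, with one such call per parameterized service.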
### Phase 2 (separate MR / followup)
In a follow-up, the provisioning playbook (`deploy_monitoring.yml` per [`disaster-recovery/PostgresAI - Infrastructure Description.md`](https://gitlab.com/postgres-ai/infra/-/blob/main/disaster-recovery/PostgresAI%20-%20Infrastructure%20Description.md)) will write a `.env` file to the host with prod values per VM size class. This is out of scope for this issue and will be tracked separately once this lands.
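For orientation only, the `.env` content such a playbook might write for a host like `mon-stardex` can be sketched from the production-validated values in the table. The variable names follow the Phase 1 pattern; `render_env_file` is a hypothetical helper, and the issue does not define any size classes, so none are invented here.

```python
# Production-validated values from the incident table, keyed by the
# env-var names proposed in Phase 1.
MON_STARDEX_OVERRIDES = {
    "SINK_PROMETHEUS_MEM": "8589934592",  # 8 GiB
    "SINK_PROMETHEUS_CPUS": "1.5",
    "VM_MEM_PCT": "80",
    "VM_MAX_CONCURRENT": "8",
    "VM_QUEUE_DURATION": "30s",
}


def render_env_file(overrides: dict) -> str:
    """Render NAME=value lines in sorted order, one per line."""
    return "".join(f"{name}={value}\n" for name, value in sorted(overrides.items()))


if __name__ == "__main__":
    print(render_env_file(MON_STARDEX_OVERRIDES), end="")
```

An Ansible template task could produce the same file; the Python form is used here only to keep the examples in one language.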
## Test plan
- [ ] Existing `tests/compliance_vectors/test_flask_resources.py` and `test_cadvisor_resources.py` still pass against the parameterized compose (defaults unchanged).
- [ ] New tests: each env var overrides the rendered limit/flag.
- [ ] `docker compose config --quiet` exits 0 with no env vars set.
- [ ] `docker compose config` with prod values exported renders the expected limits.
- [ ] No behavioral change for local-dev users who don't set any of the new env vars.
## Related
- https://gitlab.com/postgres-ai/postgresai/-/merge_requests/238 — established the helm-side pattern for `flask.resources`. This issue extends the pattern to compose-side, for all services.
- https://gitlab.com/postgres-ai/postgresai/-/merge_requests/248 — same pattern for cAdvisor (helm-side already merged).
- Production-validated tunings come from the 2026-04-28 incident response (live VM, us-east-2).