Parameterize monitoring stack resource limits and VictoriaMetrics tuning via env vars
## Summary

Make monitoring stack resource limits, VictoriaMetrics tuning flags, and other production-sensitive settings **environment-variable-driven in `docker-compose.yml`**, with laptop-friendly defaults preserved. Provisioning playbooks (Ansible-side, console-side) can then export prod-grade values per VM size class without forking the compose file or editing it in place on every host.

## Why now

On 2026-04-28 the `mon-stardex-6120-...` (us-east-2) monitoring VM wedged: Grafana failed to load, and VictoriaMetrics returned 422/429 under real Supabase-scale `pg_stat_statements` cardinality. The tunings that fixed it were applied **directly to the live host's `docker-compose.yml`** with no upstream home:

| Service / flag | Stock default (laptop-OK) | Production-validated on `mon-stardex` |
|---|---|---|
| `sink-prometheus` `mem_limit` | `1536m` | **`8589934592` (8 GiB)** |
| `sink-prometheus` `cpus` | `0.75` | **`1.5`** |
| `-memory.allowedPercent` | `60` (VictoriaMetrics default) | **`80`** |
| `-search.maxConcurrentRequests` | `2` (VictoriaMetrics default, scales with GOMAXPROCS) | **`8`** |
| `-search.maxQueueDuration` | `10s` (VictoriaMetrics default) | **`30s`** |
| `monitoring_flask_backend` `mem_limit` | `1073741824` (1 GiB, since https://gitlab.com/postgres-ai/postgresai/-/merge_requests/238) | (same, already merged) |
| `self-cadvisor` `mem_limit` | `402653184` (384 MiB, since https://gitlab.com/postgres-ai/postgresai/-/merge_requests/248) | (same, already merged) |

These tunings exist **only on this single VM**. Any monitoring VM newly provisioned via the console gets stock defaults and will hit the same wedge under the same workload.

## Proposed fix

### Phase 1 (this MR): parameterize compose

Change `config/docker-compose.yml` so resource caps and VictoriaMetrics tuning flags read env vars with sensible defaults.
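For reference, Compose's `${VAR:-default}` interpolation falls back to the default when the variable is unset *or empty*, so a host that exports an empty value still gets the laptop default. A minimal Python sketch of that rule follows. `resolve` is a hypothetical helper written for illustration only; it handles just the `:-` form and is not how Compose itself is implemented.

```python
import re

# Matches ${NAME:-default}. Other Compose forms (${NAME}, ${NAME-default},
# ${NAME:?err}) are deliberately not handled in this sketch.
_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*):-([^}]*)\}")


def resolve(template: str, env: dict) -> str:
    """Substitute ${NAME:-default}; unset or empty values fall back to the default."""
    def _sub(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        value = env.get(name, "")
        return value if value != "" else default

    return _PATTERN.sub(_sub, template)


flag = "-memory.allowedPercent=${VM_MEM_PCT:-60}"
print(resolve(flag, {}))                    # unset: falls back to the default
print(resolve(flag, {"VM_MEM_PCT": ""}))    # empty: also falls back
print(resolve(flag, {"VM_MEM_PCT": "80"}))  # set: the override wins
```

The practical consequence for provisioning is that a prod value has to be exported as a non-empty string for it to take effect.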
Pattern (mirror what https://gitlab.com/postgres-ai/postgresai/-/merge_requests/238 did for the helm chart):

```yaml
sink-prometheus:
  cpus: ${SINK_PROMETHEUS_CPUS:-0.75}
  mem_limit: ${SINK_PROMETHEUS_MEM:-1610612736}  # 1.5 GiB default
  command:
    - "-storageDataPath=/victoria-metrics-data"
    - "-retentionPeriod=${VM_RETENTION:-336h}"
    - "-httpListenAddr=:9090"
    - "-promscrape.config=/postgres_ai_configs/prometheus/prometheus.yml"
    - "-promscrape.config.strictParse=false"
    - "-promscrape.maxScrapeSize=128000000"
    - "-memory.allowedPercent=${VM_MEM_PCT:-60}"
    - "-search.maxConcurrentRequests=${VM_MAX_CONCURRENT:-2}"
    - "-search.maxQueueDuration=${VM_QUEUE_DURATION:-10s}"

monitoring_flask_backend:
  cpus: ${FLASK_CPUS:-0.5}
  mem_limit: ${FLASK_MEM:-1073741824}

self-cadvisor:
  cpus: ${CADVISOR_CPUS:-0.25}
  mem_limit: ${CADVISOR_MEM:-402653184}
```

Apply the same pattern to `pgwatch-postgres`, `pgwatch-prometheus`, `target-db`, `sink-postgres`, `postgres-reports`, etc., for completeness. Pick env-var names per a consistent convention (`<UPPER_SERVICE>_MEM`, `<UPPER_SERVICE>_CPUS`).

Update `.env.example` to enumerate every new variable with its default.

Add contract tests under `tests/compliance_vectors/` (mirror `test_flask_resources.py`, `test_cadvisor_resources.py`) that:

1. Assert each variable resolves to its documented default when unset.
2. Assert each variable is overridable via env (e.g. set `SINK_PROMETHEUS_MEM=8589934592` and confirm `docker compose config` renders it).

### Phase 2 (separate MR / followup)

The provisioning playbook (`deploy_monitoring.yml` per [`disaster-recovery/PostgresAI - Infrastructure Description.md`](https://gitlab.com/postgres-ai/infra/-/blob/main/disaster-recovery/PostgresAI%20-%20Infrastructure%20Description.md)) writes a `.env` to the host with prod values per VM size class. Out of scope for this issue; tracked separately when this lands.
## Test plan

- [ ] Existing `tests/compliance_vectors/test_flask_resources.py` and `test_cadvisor_resources.py` still pass against the parameterized compose (defaults unchanged).
- [ ] New tests: each env var overrides the rendered limit/flag.
- [ ] `docker compose config --quiet` exits 0 with no env vars set.
- [ ] `docker compose config` with prod values exported renders the expected limits.
- [ ] No behavioral change for local-dev users who don't set any of the new env vars.

## Related

- https://gitlab.com/postgres-ai/postgresai/-/merge_requests/238: established the helm-side pattern for `flask.resources`. This issue extends the pattern to compose-side, for all services.
- https://gitlab.com/postgres-ai/postgresai/-/merge_requests/248: same pattern for cAdvisor (helm-side already merged).
- Production-validated tunings come from the 2026-04-28 incident response (live VM, us-east-2).
issue