Right-size self-cadvisor resource limits (still cpus: 0.15 / mem_limit: 192m post-!238) (#172) · Issues · PostgresAI / postgresai

## Summary

[!238](https://gitlab.com/postgres-ai/postgresai/-/merge_requests/238) raised `monitoring_flask_backend` from `cpus: 0.1 / mem_limit: 192m` to `cpus: 0.5 / mem_limit: 1Gi` to stop a sustained gunicorn-worker OOM loop. **`self-cadvisor` still has the original `cpus: 0.15 / mem_limit: 192m`** in `docker-compose.yml`:

```yaml
self-cadvisor:
  image: gcr.io/cadvisor/cadvisor:v0.51.0
  container_name: self-cadvisor
  cpus: 0.15
  mem_limit: 192m
  privileged: true
  ...
```

## Why it matters

- cAdvisor walks every cgroup on the host on each housekeeping pass. With 11+ containers on a monitoring VM (and Docker volume metadata for each), 192 MiB is at the edge of what it needs.
- Reproduced on `mon-stardex` (us-east-2): `self-cadvisor` reported `Up 2 days (unhealthy)` during the disk-full incident on 2026-04-28, and dropped out of the docker network on the subsequent host stop/start. Memory pressure under load is the most likely cause of the unhealthy state.
- Symptoms: Grafana panels fed by cAdvisor metrics (per-container CPU/mem, container restart counts) show "No data" or stale data on otherwise healthy hosts.

## Proposed fix

Decide on right-sizing, e.g. align with the `!238` pattern:

```yaml
self-cadvisor:
  cpus: 0.25
  mem_limit: 384m
```

384 MiB is a conservative bump; cAdvisor's RSS scales with container count and metric set. It is worth profiling actual peak RSS on a monitoring VM under steady-state load (11 containers, ~6 metric scrapes/min) before settling on a final number. Also consider adding `--disable_metrics` flags where they reduce the footprint, instead of just bumping limits.
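
If footprint reduction is pursued alongside the bump, one possible shape for the service is sketched below. It assumes the service has no existing `command:` (the definition in this issue is elided, so any flags already present would need to be merged), and the list of disabled metric groups is illustrative only, not a profiled recommendation. Which groups the per-container Grafana panels actually need is also an assumption to verify.

```yaml
self-cadvisor:
  image: gcr.io/cadvisor/cadvisor:v0.51.0
  container_name: self-cadvisor
  cpus: 0.25        # proposed value; confirm against profiled peak CPU
  mem_limit: 384m   # proposed value; confirm against profiled peak RSS
  privileged: true
  # Assumption: these are appended as extra flags to the image's cadvisor entrypoint.
  command:
    # Illustrative list; check that no dashboard uses a group before disabling it.
    - --disable_metrics=percpu,sched,process,tcp,udp,advtcp,hugetlb,referenced_memory,cpu_topology,resctrl,memory_numa
    # Report only Docker containers (plus root stats) rather than every cgroup.
    - --docker_only=true
  # Other keys (volumes, networks, healthcheck) are unchanged and omitted here.
```

Setting `--disable_metrics` is expected to replace cAdvisor's built-in default list rather than extend it, so groups that are already off by default should stay in the list. Comparing RSS before and after the flag change would show whether the higher `mem_limit` is still needed.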

## Test plan

- [ ] Profile `self-cadvisor` RSS over ~1 hour on a representative monitoring VM
- [ ] Bump limits in `docker-compose.yml` and helm chart values
- [ ] Confirm `self-cadvisor` stays `(healthy)` for ≥ 24h, no OOM kills in `journalctl -k --grep "oom-kill"`
- [ ] Confirm Grafana per-container panels populate

## Related

- !238 (merged, fixes flask backend resources)
- postgres-ai/postgresai#XXX (sibling issue: extend !239 restart-policy coverage to `self-cadvisor`)
- Ops-side write-up: postgres-ai/infra#51 (item 5: `flask-pgss-api` OOM, fixed; this issue tracks the next-most-undersized container)
