chore: right-size self-cadvisor resource limits

What

Right-size the self-cadvisor container resources, following the pattern from !238 (merged), which bumped the resources for monitoring_flask_backend.

Docker Compose (docker-compose.yml)

Setting        Before            After
CPU limit      0.15 cores        0.25 cores
Memory limit   192 MiB (192m)    384 MiB (402653184 bytes)
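
The two memory figures in the table use different notations. A minimal sketch of the conversion, assuming the `m` suffix on a Compose memory limit means binary megabytes (MiB); `to_bytes` is a hypothetical helper, not part of any tool in this repo:

```python
# Suffix multipliers, assuming binary units (1m = 1024 * 1024 bytes).
UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def to_bytes(limit: str) -> int:
    """Convert a Compose-style memory string such as '192m' to bytes."""
    text = limit.strip().lower()
    if text.isdigit():  # plain byte count, e.g. '402653184'
        return int(text)
    number, suffix = text[:-1], text[-1]
    return int(number) * UNITS[suffix]


old_limit = to_bytes("192m")
new_limit = to_bytes("402653184")
print(old_limit, new_limit, new_limit // old_limit)  # 201326592 402653184 2
```

Under that assumption the new limit is exactly double the old one.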

Helm (postgres_ai_helm/values.yaml)

Previously resources: {} (no requests or limits). Now:

  • Requests: CPU 100m, memory 192Mi
  • Limits: CPU 250m, memory 384Mi

The existing templates/cadvisor-daemonset.yaml already wires .Values.cadvisor.resources through, so no template change is needed.
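
As a sanity check on the new block, requests should never exceed limits. A minimal sketch, assuming only the `m` (millicore) and `Mi` (mebibyte) suffixes used here; `parse_cpu`, `parse_memory`, and `requests_fit_limits` are hypothetical helpers and do not cover the full Kubernetes quantity syntax:

```python
# Values copied from the Requests/Limits list above.
RESOURCES = {
    "requests": {"cpu": "100m", "memory": "192Mi"},
    "limits": {"cpu": "250m", "memory": "384Mi"},
}


def parse_cpu(quantity: str) -> int:
    """Return millicores for '250m' or for whole cores such as '1'."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def parse_memory(quantity: str) -> int:
    """Return bytes for a quantity with an 'Mi' suffix."""
    if not quantity.endswith("Mi"):
        raise ValueError(f"unsupported memory suffix: {quantity!r}")
    return int(quantity[:-2]) * 1024**2


def requests_fit_limits(resources: dict) -> bool:
    """True when both CPU and memory requests are within their limits."""
    req, lim = resources["requests"], resources["limits"]
    return (
        parse_cpu(req["cpu"]) <= parse_cpu(lim["cpu"])
        and parse_memory(req["memory"]) <= parse_memory(lim["memory"])
    )


print(requests_fit_limits(RESOURCES))  # True
```

The Helm memory limit of 384Mi works out to the same 402653184 bytes as the Compose limit.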

Why

self-cadvisor was observed reporting Up N (unhealthy) on a real monitoring VM (mon-stardex, us-east-2) running 11+ containers under steady-state load. cAdvisor walks every cgroup on each housekeeping pass, so its RSS scales with the container count and the enabled metric set. The 192 MiB cap leaves little headroom on a typical monitoring host; a conservative bump to 384 MiB removes the immediate memory-pressure failure mode without significantly overcommitting the host.

The CPU bump from 0.15 to 0.25 cores follows the same conservative direction: cAdvisor is bursty during housekeeping passes, and the prior 0.15 cap could throttle scrape responses.
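
To put the CPU figures in concrete terms, a fractional CPU limit is enforced as a CFS quota per scheduling period. A rough sketch of the arithmetic, assuming the default 100 ms period is in effect (an assumption, not something verified on the affected VM); `quota_us` is a hypothetical helper:

```python
CFS_PERIOD_US = 100_000  # assumed default CFS period: 100 ms


def quota_us(cpus: float, period_us: int = CFS_PERIOD_US) -> int:
    """CPU time in microseconds a container may use per period."""
    return round(cpus * period_us)


for label, cpus in (("before", 0.15), ("after", 0.25)):
    print(f"{label}: {quota_us(cpus) / 1000:.0f} ms of CPU per 100 ms period")
```

Under that assumption, a housekeeping burst needing more than 15 ms of CPU within one period would have been paused until the next period; the new limit raises that budget to 25 ms.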

Out of scope: HPA/VPA, environment-specific overrides, and cAdvisor flag tuning (e.g. extra --disable_metrics entries). This MR only raises the default resources in Compose and Helm.

Validation

  • python3 -c "import yaml; yaml.safe_load(open('docker-compose.yml'))" — exit 0
  • python3 -c "import yaml; yaml.safe_load(open('postgres_ai_helm/values.yaml'))" — exit 0
  • helm not available locally; templates/cadvisor-daemonset.yaml already references .Values.cadvisor.resources via {{- with ... }} so the new block renders without template changes.
  • Compose syntax check skipped locally (docker compose config --quiet requires VM_AUTH_PASSWORD env in this tree); YAML parse confirms structure.
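
The parse checks above confirm only that the files are well-formed YAML. A slightly stricter check could also assert the new values. A sketch, assuming PyYAML is installed (as in the one-liners above) and that the block lives under `cadvisor.resources`, matching the `.Values.cadvisor.resources` reference in the template; `check_cadvisor_resources` and `check_file` are hypothetical helpers:

```python
EXPECTED = {
    "requests": {"cpu": "100m", "memory": "192Mi"},
    "limits": {"cpu": "250m", "memory": "384Mi"},
}


def check_cadvisor_resources(values: dict) -> list[str]:
    """Return a list of mismatches between values and EXPECTED."""
    actual = (values.get("cadvisor") or {}).get("resources") or {}
    problems = []
    for section, fields in EXPECTED.items():
        for key, want in fields.items():
            got = (actual.get(section) or {}).get(key)
            if str(got) != want:
                problems.append(f"{section}.{key}: expected {want}, got {got}")
    return problems


def check_file(path: str = "postgres_ai_helm/values.yaml") -> list[str]:
    """Load a values file and run the check against it."""
    import yaml  # PyYAML; imported lazily so the checker itself is stdlib-only

    with open(path) as fh:
        return check_cadvisor_resources(yaml.safe_load(fh))


# The previous state (resources: {}) fails every expected field.
for problem in check_cadvisor_resources({"cadvisor": {"resources": {}}}):
    print(problem)
```

Calling `check_file()` from the repository root should return an empty list once the new values are in place.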

Follow-up

Profile cAdvisor's actual peak RSS under steady-state load (11 containers, ~6 metric scrapes/min) on a representative monitoring VM. The 384 MiB / 0.25 CPU values are a conservative bump and should be revisited if profiling shows a different ceiling, or if adding --disable_metrics flags can reduce the footprint instead of raising limits.
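
One low-effort way to start that profiling is to sample `docker stats` periodically and keep the maximum. A sketch, assuming the `MemUsage` field is formatted like `123.4MiB / 384MiB`; `used_bytes` and `peak_fraction` are hypothetical helpers, and the sample strings are made up for illustration, not measurements:

```python
import re

UNIT_BYTES = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}
USAGE_RE = re.compile(r"^\s*([\d.]+)\s*(B|KiB|MiB|GiB)\s*/")


def used_bytes(mem_usage: str) -> int:
    """Parse the 'used' half of a MemUsage string into bytes."""
    match = USAGE_RE.match(mem_usage)
    if match is None:
        raise ValueError(f"unrecognized MemUsage value: {mem_usage!r}")
    value, unit = match.groups()
    return int(float(value) * UNIT_BYTES[unit])


def peak_fraction(samples: list[str], limit_bytes: int) -> float:
    """Return the highest observed usage as a fraction of the limit."""
    return max(used_bytes(s) for s in samples) / limit_bytes


# Illustrative placeholder samples, not real data.
samples = ["96MiB / 384MiB", "192MiB / 384MiB", "144MiB / 384MiB"]
print(f"peak: {peak_fraction(samples, 384 * 1024**2):.0%} of limit")
```

Samples could be collected on a schedule with `docker stats --no-stream --format '{{.MemUsage}}' self-cadvisor`. Point-in-time sampling can miss short spikes, so the result is a lower bound on the true peak.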

Closes #172 (closed)
Related: !238 (merged)
