config-init clobbers user edits to metrics.yml on every compose up (no idempotency)
## Summary

`config/init-configs.sh` does a wholesale `cp -r` from the image into the `postgres_ai_configs` volume on every invocation. Since the script is run by the `config-init` service, which is a `service_completed_successfully` dependency of multiple long-running services, **every `docker compose up` reseeds the volume and silently wipes any user edits to files like `pgwatch/metrics.yml` or `pgwatch-prometheus/metrics.yml`.**

This is a serious operational footgun for anyone applying production-side metrics tuning to a deployed monitoring stack.

## Reproduction

On a 0.14.0 monitoring VM (`mon-stardex-6120-...`, us-east-2) on 2026-04-29:

1. Edit the volume file `/var/lib/docker/volumes/monitoring_postgres_ai_configs/_data/pgwatch-prometheus/metrics.yml`, e.g. apply the `calls >= 3 AND exec_time_total >= 1000` filter from https://gitlab.com/postgres-ai/postgresai/-/merge_requests/220.
2. `docker restart pgwatch-prometheus`: the filter takes effect and the dashboard works.
3. Later, run any `docker compose -f docker-compose.yml -p monitoring up -d sink-prometheus` (e.g. to bump VM memory cap).
4. `config-init` runs as a side effect and logs `Done. Copied: files: 31, directories: 12`.
5. **`metrics.yml` is now back to image defaults.** The filter is silently gone. pgwatch reverts to dumping all 5,023+ time series. Grafana panels start failing with VM 422 / 429 errors as cardinality balloons again.

The current script (`config/init-configs.sh`):

```sh
echo "Copying configs to $TARGET_DIR..."

# Copy all configs preserving structure
cp -r "$SOURCE_DIR"/* "$TARGET_DIR/"

echo "Done. Copied:"
```

No check for whether the target is already populated. No version marker. No backup of existing files.

## Impact

- Production-side metrics.yml tuning has zero persistence guarantee on 0.14.0.
- Anyone who follows the pattern documented in https://gitlab.com/postgres-ai/postgresai/-/merge_requests/220 ("Applied on production (34.203.111.251) on 2026-02-16: ... Patched both metrics.yml files ...
  Result: table repopulated with 851 rows") is at risk of losing the patch the next time they touch compose for any reason.
- Documentation does not mention the gotcha.

## Proposed fix

Make `init-configs.sh` idempotent. Two reasonable designs:

### Option A: version marker, allow upgrades to flow through

```sh
TARGET_VERSION_FILE="${TARGET_DIR}/.pgai-configs-version"
SOURCE_VERSION="$(cat /VERSION)"

# Skip the copy when the volume was already seeded from this image version.
if [ -f "$TARGET_VERSION_FILE" ] && [ "$(cat "$TARGET_VERSION_FILE")" = "$SOURCE_VERSION" ]; then
  echo "Configs already initialized at version $SOURCE_VERSION; skipping."
  exit 0
fi

echo "Initializing configs (target version: $SOURCE_VERSION)..."
cp -r "$SOURCE_DIR"/* "$TARGET_DIR/"

# Record the seeded version so later runs with the same image are no-ops.
echo "$SOURCE_VERSION" > "$TARGET_VERSION_FILE"
```

- **Pros:** preserves user edits during normal operation; bug fixes still flow through on intentional upgrades.
- **Cons:** if a user wants new image defaults *without* clearing the volume, they have to delete `.pgai-configs-version` manually. (Acceptable trade-off; document it.)

### Option B: only initialize when target is empty

```sh
# Any entry in the target (including dotfiles) counts as "already populated".
if [ -n "$(ls -A "$TARGET_DIR" 2>/dev/null)" ]; then
  echo "Target $TARGET_DIR already populated; skipping initialization."
  exit 0
fi

cp -r "$SOURCE_DIR"/* "$TARGET_DIR/"
```

- **Pros:** simpler.
- **Cons:** never auto-updates after first install. Users have to manually wipe the volume on upgrades.

**Recommended: Option A.** It gives better UX on upgrades while still being safe by default. Document the manual-update escape hatch (`rm /target/.pgai-configs-version`) in CHANGELOG.

## Test plan

- [ ] First install on empty volume: copies all files, writes version marker.
- [ ] Re-run with same version: exits early, no overwrite. Hand-edit a file and confirm the edit survives.
- [ ] Re-run after image version bump (different `/VERSION`): copies all files, overwrites edits. Document this in release notes for the version that ships this fix.
- [ ] Add a unit/contract test in `tests/compliance_vectors/test_config_init.py` (or extend in the style of `tests/compliance_vectors/test_flask_resources.py`) that sets up a fake `$SOURCE_DIR` and `$TARGET_DIR` and exercises both branches.

## Related

- https://gitlab.com/postgres-ai/postgresai/-/merge_requests/220: the production-validated metrics filter that this bug actively undermines.
- https://gitlab.com/postgres-ai/infra/-/work_items/51: ops write-up of the 2026-04-28 incident; this issue is a follow-up cause.
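
## Test sketch

The three test-plan cases for the version-marker design (first install, same-version re-run, version bump) can be tried locally with a throwaway harness. This is a minimal sketch, not the actual script or test: the `init_configs` function, its three-argument signature, the temp-directory layout, and the bumped version `0.14.1` are all assumptions made for illustration. The real script reads `/VERSION` and calls `exit 0`, whereas the sketch takes the version file as an argument and uses `return 0` so that it can run several times in one process.

```sh
#!/bin/sh
set -eu

# Hypothetical wrapper around the version-marker logic.
# Usage: init_configs SOURCE_DIR TARGET_DIR VERSION_FILE
init_configs() {
  ic_src="$1"
  ic_dst="$2"
  ic_marker="$ic_dst/.pgai-configs-version"
  ic_version="$(cat "$3")"

  # Same version already seeded: leave the volume (and user edits) alone.
  if [ -f "$ic_marker" ] && [ "$(cat "$ic_marker")" = "$ic_version" ]; then
    echo "Configs already initialized at version $ic_version; skipping."
    return 0
  fi

  echo "Initializing configs (target version: $ic_version)..."
  cp -r "$ic_src"/* "$ic_dst/"
  echo "$ic_version" > "$ic_marker"
}

# Fake image contents and an empty "volume" in a temp dir (assumed layout).
work="$(mktemp -d)"
mkdir -p "$work/src/pgwatch-prometheus" "$work/dst"
echo "image default" > "$work/src/pgwatch-prometheus/metrics.yml"
echo "0.14.0" > "$work/VERSION"

# Case 1: first install on an empty target copies files and writes the marker.
init_configs "$work/src" "$work/dst" "$work/VERSION"

# Case 2: hand-edit a file, re-run at the same version; the edit should survive.
echo "user edit" > "$work/dst/pgwatch-prometheus/metrics.yml"
init_configs "$work/src" "$work/dst" "$work/VERSION"
after_rerun="$(cat "$work/dst/pgwatch-prometheus/metrics.yml")"

# Case 3: version bump (0.14.1 is a made-up version) re-copies and overwrites.
echo "0.14.1" > "$work/VERSION"
init_configs "$work/src" "$work/dst" "$work/VERSION"
after_bump="$(cat "$work/dst/pgwatch-prometheus/metrics.yml")"
marker_after_bump="$(cat "$work/dst/.pgai-configs-version")"

rm -rf "$work"

echo "after same-version re-run: $after_rerun"
echo "after version bump: $after_bump"
```

If the logic behaves as intended, the file still holds the user edit after the same-version re-run and is back to the image default after the bump. The same three cases would carry over to a Python contract test that shells out to the real script with `SOURCE_DIR` and `TARGET_DIR` pointed at temp directories.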
issue