config-init clobbers user edits to metrics.yml on every compose up (no idempotency)
## Summary
`config/init-configs.sh` does a wholesale `cp -r` from the image into the `postgres_ai_configs` volume on every invocation. The script is run by the `config-init` service, which multiple long-running services depend on via `service_completed_successfully`, so **every `docker compose up` reseeds the volume and silently wipes any user edits to files like `pgwatch/metrics.yml` or `pgwatch-prometheus/metrics.yml`.**
This is a serious operational footgun for anyone applying production-side metrics tuning to a deployed monitoring stack.
## Reproduction
On a 0.14.0 monitoring VM (`mon-stardex-6120-...`, us-east-2) on 2026-04-29:
1. Edit volume file `/var/lib/docker/volumes/monitoring_postgres_ai_configs/_data/pgwatch-prometheus/metrics.yml` — e.g. apply the `calls >= 3 AND exec_time_total >= 1000` filter from https://gitlab.com/postgres-ai/postgresai/-/merge_requests/220.
2. `docker restart pgwatch-prometheus` — filter takes effect, dashboard works.
3. Later, run any `docker compose up` against the project, e.g. `docker compose -f docker-compose.yml -p monitoring up -d sink-prometheus` to bump the VM memory cap.
4. `config-init` runs as a side effect, logs `Done. Copied: files: 31, directories: 12`.
5. **`metrics.yml` is now back to image defaults.** The filter is silently gone. pgwatch reverts to dumping all 5,023+ time series. Grafana panels start failing with VM 422 / 429 errors as cardinality balloons again.
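One way to make the silent revert in step 5 visible is to checksum the volume before and after the `compose up`. The sketch below is illustrative only: the helper name `snapshot_configs` is hypothetical, and it assumes `find`, `sha256sum`, and `sort` are available on the host and that the volume path is readable (typically as root).

```sh
#!/bin/sh
# Hypothetical helper: print a checksum manifest of every regular file
# under a directory, sorted by path so two manifests can be diffed.
snapshot_configs() {
    dir="$1"
    # Subshell keeps the caller's working directory unchanged.
    ( cd "$dir" && find . -type f -exec sha256sum {} + | sort -k 2 )
}

# Example usage around the command from step 3 (paths taken from step 1):
#   VOL=/var/lib/docker/volumes/monitoring_postgres_ai_configs/_data
#   snapshot_configs "$VOL" > /tmp/before.sha256
#   docker compose -f docker-compose.yml -p monitoring up -d sink-prometheus
#   snapshot_configs "$VOL" > /tmp/after.sha256
#   diff /tmp/before.sha256 /tmp/after.sha256 || echo "configs changed by compose up"
```

A non-empty diff for `pgwatch-prometheus/metrics.yml` after an unrelated `up` would confirm that `config-init` rewrote the file.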
The current script (`config/init-configs.sh`):
```sh
echo "Copying configs to $TARGET_DIR..."
# Copy all configs preserving structure
cp -r "$SOURCE_DIR"/* "$TARGET_DIR/"
echo "Done. Copied:"
```
No check for whether the target is already populated. No version marker. No backup of existing files.
## Impact
- Production-side `metrics.yml` tuning has zero persistence guarantee on 0.14.0.
- Anyone who follows the pattern documented in https://gitlab.com/postgres-ai/postgresai/-/merge_requests/220 ("Applied on production (34.203.111.251) on 2026-02-16: ... Patched both metrics.yml files ... Result: table repopulated with 851 rows") is at risk of losing the patch the next time they touch compose for any reason.
- Documentation does not mention the gotcha.
## Proposed fix
Make `init-configs.sh` idempotent. Two reasonable designs:
### Option A — version-marker, allow upgrades to flow through
```sh
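TARGET_VERSION_FILE="${TARGET_DIR}/.pgai-configs-version"
# Version of the configs baked into the image (assumes the image ships /VERSION).
SOURCE_VERSION="$(cat /VERSION)"

# Same version already seeded: leave the volume, and any user edits, untouched.
if [ -f "$TARGET_VERSION_FILE" ] && [ "$(cat "$TARGET_VERSION_FILE")" = "$SOURCE_VERSION" ]; then
    echo "Configs already initialized at version $SOURCE_VERSION; skipping."
    exit 0
fi

# First install or version change: copy image defaults, then record the version.
echo "Initializing configs (target version: $SOURCE_VERSION)..."
cp -r "$SOURCE_DIR"/* "$TARGET_DIR/"
echo "$SOURCE_VERSION" > "$TARGET_VERSION_FILE"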
```
- **Pros:** preserves user edits during normal operation; bug fixes still flow through on intentional upgrades.
- **Cons:** if a user wants new image defaults *without* clearing the volume, they have to delete `.pgai-configs-version` manually. (Acceptable trade-off; document it.)
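Option A still overwrites edits whenever the version changes. One possible refinement, not part of the proposal above, is to snapshot the existing files before the `cp -r`. The sketch below is a minimal illustration: the function name `backup_configs` and the `.backup-<version>` directory naming are hypothetical, and it assumes the version string contains no `/`.

```sh
#!/bin/sh
# Hypothetical helper: copy the current contents of the target directory
# into a hidden backup directory named after the version being replaced.
backup_configs() {
    target_dir="$1"
    old_version="$2"    # e.g. contents of the existing marker, or "unknown"
    backup_dir="${target_dir}/.backup-${old_version}"

    # First install: no non-hidden entries, so there is nothing to back up.
    [ -n "$(ls "$target_dir" 2>/dev/null)" ] || return 0

    mkdir -p "$backup_dir"
    # The glob skips dotfiles, so the backup directory is never copied into itself.
    cp -r "$target_dir"/* "$backup_dir/"
}
```

Called just before the copy in Option A, for example as `backup_configs "$TARGET_DIR" "$(cat "$TARGET_VERSION_FILE" 2>/dev/null || echo unknown)"`, this would leave the previous `metrics.yml` recoverable after an upgrade.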
### Option B — only initialize when target is empty
```sh
# Any existing entry (including dotfiles) means the volume was seeded before.
if [ -n "$(ls -A "$TARGET_DIR" 2>/dev/null)" ]; then
    echo "Target $TARGET_DIR already populated; skipping initialization."
    exit 0
fi
cp -r "$SOURCE_DIR"/* "$TARGET_DIR/"
```
- **Pros:** simpler.
- **Cons:** never auto-updates after first install. User has to manually wipe the volume on upgrades.
**Recommended: Option A** — better UX on upgrades, while still safe by default. Document the manual-update escape hatch (`rm /target/.pgai-configs-version`) in CHANGELOG.
## Test plan
- [ ] First install on empty volume — copies all files, writes version marker.
- [ ] Re-run with same version — exits early, no overwrite. Hand-edit a file and confirm the edit survives.
- [ ] Re-run after image version bump (different `/VERSION`) — copies all files, overwrites edits. Document this in release notes for the version that ships this fix.
- [ ] Add a unit/contract test in `tests/compliance_vectors/test_config_init.py` (or extend `tests/compliance_vectors/test_flask_resources.py` style) that sets up a fake `$SOURCE_DIR` and `$TARGET_DIR` and exercises both branches.
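The first three checklist items can be sketched as a shell smoke test against temporary directories. This is an assumption-laden illustration rather than the final test: `init_configs` and `run_config_init_checks` are hypothetical names, the Option A logic is wrapped in a function that takes the version file as a parameter instead of reading `/VERSION`, and `0.14.1` is a made-up version used only to simulate a bump.

```sh
#!/bin/sh
# Option A logic as a function so it can run outside the container.
# Arguments (hypothetical test parameters): source dir, target dir, version file.
init_configs() {
    source_dir="$1"; target_dir="$2"; version_file="$3"
    marker="${target_dir}/.pgai-configs-version"
    source_version="$(cat "$version_file")"
    if [ -f "$marker" ] && [ "$(cat "$marker")" = "$source_version" ]; then
        return 0    # same version: leave user edits alone
    fi
    cp -r "$source_dir"/* "$target_dir/"
    echo "$source_version" > "$marker"
}

# Exercise first install, same-version re-run, and version bump.
run_config_init_checks() {
    work="$(mktemp -d)" || return 1
    cfg="$work/dst/pgwatch/metrics.yml"
    mkdir -p "$work/src/pgwatch" "$work/dst"
    echo "default" > "$work/src/pgwatch/metrics.yml"
    echo "0.14.0" > "$work/VERSION"

    # 1. First install on an empty target copies files and writes the marker.
    init_configs "$work/src" "$work/dst" "$work/VERSION"
    [ "$(cat "$cfg")" = "default" ] || { echo "FAIL: first install"; return 1; }
    [ "$(cat "$work/dst/.pgai-configs-version")" = "0.14.0" ] || { echo "FAIL: marker"; return 1; }

    # 2. Re-run with the same version must keep a hand edit.
    echo "edited" > "$cfg"
    init_configs "$work/src" "$work/dst" "$work/VERSION"
    [ "$(cat "$cfg")" = "edited" ] || { echo "FAIL: edit lost"; return 1; }

    # 3. A version bump re-copies defaults and overwrites the edit.
    echo "0.14.1" > "$work/VERSION"
    init_configs "$work/src" "$work/dst" "$work/VERSION"
    [ "$(cat "$cfg")" = "default" ] || { echo "FAIL: bump did not reseed"; return 1; }

    rm -rf "$work"
    echo "OK"
}
```

Running `run_config_init_checks` prints `OK` when all three cases behave as the checklist describes; a Python test could invoke the real script the same way with `SOURCE_DIR` and `TARGET_DIR` pointed at temporary directories.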
## Related
- https://gitlab.com/postgres-ai/postgresai/-/merge_requests/220 — the production-validated metrics filter that this bug actively undermines.
- https://gitlab.com/postgres-ai/infra/-/work_items/51 — ops write-up of the 2026-04-28 incident; this issue is a follow-up cause.