fix(cli): refresh stale docker-compose.yml on non-git npx upgrade (VM_AUTH wiring) — GA blocker #186
Summary
Fixes the GA-blocking npx-upgrade bug for 0.15.0: the documented 0.14 -> 0.15 in-place upgrade leaves the old docker-compose.yml in place on non-git installs, so VictoriaMetrics (sink-prometheus) crashes and all dashboards go dataless after upgrade.
Issue: #186
This blocks 0.15.0 GA and requires an rc.8. It hits every existing npx / npm install -g self-hosted user on upgrade.
Problem / root cause (with file:line)
docker-compose.yml is a version-coupled, CLI-owned static asset - the npm package does not bundle it; it is fetched from GitLab raw at install time. The compose was only ever refreshed for git checkouts:
- cli/bin/postgres-ai.ts - ensureDefaultMonitoringProject() fetches the compose only if (!fs.existsSync(composeFile)), so an existing install never re-fetches it.
- mon update does git pull only if .git exists; the non-git else branch went "straight to image pull" against the stale compose.
- npx / global-npm installs are non-git, so on upgrade only PGAI_TAG advances (to 0.15) while the 0.14 docker-compose.yml is retained.
0.15 introduced VictoriaMetrics basic auth: the 0.15 configs image ships a prometheus.yml templating %{VM_AUTH_USERNAME}, and the 0.15 compose wires VM_AUTH_* into sink-prometheus. The retained 0.14 compose has no VM_AUTH_* wiring, so on boot:
sink-prometheus Exited (255)
missing "VM_AUTH_USERNAME" env var
-> no metrics ingested -> every dashboard is empty after the upgrade.
(Note: a prior fix already migrates .env to add the VM_AUTH_* keys - but that is necessary, not sufficient: the compose also has to be refreshed to actually consume them on the sink-prometheus service.)
Fix (CLI-only, additive)
Adds one helper refreshBundledComposeIfStale(projectDir, oldTag?) and wires it into the three in-place-upgrade entrypoints (local-install after the .env write; mon update non-git else branch; mon update-config after the env migration). Contract:
- no-op for git checkouts (.git present -> already refreshed via git pull);
- no-op when there is no deployed compose yet (the green-field bootstrap path handles it);
- no-op when the deployed compose content already matches the target (content-compared and whitespace-tolerant, not a tag heuristic; correct even though local-install rewrites PGAI_TAG just before);
- validates the fetched payload before it can replace anything (see hardening below);
- backs up the prior compose before overwriting, never overwriting an existing backup (bak-<oldtag>-<hash8>), so the pristine original survives repeated runs;
- touches only docker-compose.yml - never .env / instances.yml / .pgwatch-config;
- best-effort: a fetch/validation failure warns and keeps the existing compose (the upgrade still proceeds) rather than turning a metrics-only outage into a hard CLI failure.
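For reviewers who want the contract in executable form, here is a minimal sketch of the gating and backup rules. It is not the code in cli/bin/postgres-ai.ts: the names normalizeCompose, decideRefresh and backupCompose are hypothetical, the target compose is passed in as a string instead of being fetched, and SHA-256 is an assumption (the MR only says "8-char hash of the OLD content").

```typescript
import { createHash } from "node:crypto";
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical helper names; an illustration of the contract only.

/** Drop trailing whitespace per line and at EOF so whitespace-only diffs are not "stale". */
function normalizeCompose(text: string): string {
  return text
    .replace(/\r\n/g, "\n")
    .split("\n")
    .map((line) => line.trimEnd())
    .join("\n")
    .trimEnd();
}

type RefreshDecision = "skip-git" | "skip-no-compose" | "skip-up-to-date" | "refresh";

/** Apply the three no-op rules in order; only a real content mismatch yields "refresh". */
function decideRefresh(projectDir: string, targetCompose: string): RefreshDecision {
  if (fs.existsSync(path.join(projectDir, ".git"))) return "skip-git";
  const composeFile = path.join(projectDir, "docker-compose.yml");
  if (!fs.existsSync(composeFile)) return "skip-no-compose";
  const deployed = fs.readFileSync(composeFile, "utf8");
  return normalizeCompose(deployed) === normalizeCompose(targetCompose)
    ? "skip-up-to-date"
    : "refresh";
}

/** Write docker-compose.yml.bak-<oldtag>-<hash8>; an existing backup is left untouched. */
function backupCompose(projectDir: string, oldTag: string): string {
  const composeFile = path.join(projectDir, "docker-compose.yml");
  const oldContent = fs.readFileSync(composeFile, "utf8");
  // Assumption: SHA-256, truncated to 8 hex chars.
  const hash8 = createHash("sha256").update(oldContent).digest("hex").slice(0, 8);
  const backupFile = `${composeFile}.bak-${oldTag}-${hash8}`;
  try {
    // "wx" fails with EEXIST instead of overwriting, which preserves the first backup.
    fs.writeFileSync(backupFile, oldContent, { flag: "wx" });
  } catch (err) {
    if ((err as { code?: string }).code !== "EEXIST") throw err;
  }
  return backupFile;
}
```

Because the backup name is derived from the old content, a repeated run against the same deployed compose maps to the same file name and the "wx" write becomes a harmless no-op.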
No image change required - the compose is fetched from GitLab raw at the target ref, exactly as the green-field bootstrap already does. A PGAI_COMPOSE_SOURCE test-only seam (honored only when NODE_ENV === "test") lets the refresh read a local file for hermetic, offline tests.
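The test-only seam can be pictured roughly as follows. This is a hedged sketch, not the shipped code: resolveComposeOverride and the Env type are hypothetical names, and returning null is an assumed convention meaning "fetch from the remote ref as usual".

```typescript
type Env = Record<string, string | undefined>;

/**
 * Return the local fixture path only when NODE_ENV === "test".
 * In any other environment the variable is ignored and null is returned,
 * so a stray PGAI_COMPOSE_SOURCE in a user's shell cannot redirect the refresh.
 */
function resolveComposeOverride(env: Env): string | null {
  if (env.NODE_ENV !== "test") return null;
  const source = env.PGAI_COMPOSE_SOURCE;
  return source !== undefined && source.length > 0 ? source : null;
}
```

Passing the environment in as a parameter (instead of reading process.env inside the function) is a choice made for this sketch so the gate can be exercised without mutating global state.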
README upgrade section updated to state that mon update / mon update-config / local-install now refresh the compose on non-git installs (validated, with a preserved backup), instead of implying the .env migration alone is sufficient.
Hardening from adversarial review (commits 31c582f, b9d8ef3)
The first revision content-compared the fetched body against the deployed one but did not content-validate it — downloadText only checks response.ok. Adversarial review caught that a non-compose 200 body (HTML login/captcha/proxy/maintenance page) would have silently clobbered a working compose with junk — strictly worse than the original bug. That is now fixed (it was the merge BLOCKER), along with two should-fix issues and three nits:
- [BLOCKER - fixed] Fetched-compose validation. Before overwriting, the payload is validated: reject obvious HTML (<!DOCTYPE / <html / <?xml), parse with js-yaml, and require a services map containing the keystone sink-prometheus service. On failure the refresh behaves exactly like a fetch failure: keep the existing compose, write no backup, warn, no-op.
- [fixed] Backup collision. Was bak-<tag> written with an overwriting copyFileSync, so repeated update-config runs (tag never advances) destroyed the pristine original on the 2nd run. Now uniquified by an 8-char hash of the OLD content and written with flag: "wx" (never overwrite), so the first/pristine backup is always preserved.
- [fixed] local-install mislabeled backup. local-install rewrites .env PGAI_TAG to the new version before the refresh, so the OLD compose's backup was named with the new tag. It now captures the OLD tag before the rewrite and passes it in -> backup is bak-<oldtag>-<hash>.
- [fixed] Seam hardening. PGAI_COMPOSE_SOURCE is honored only when NODE_ENV === "test" (not reachable in a normal user environment).
- [fixed] Cosmetic log. mon update no longer pre-announces a refresh on no-op/failure; the helper logs only when it actually refreshes or warns.
- [fixed] Whitespace-only staleness false-positive. Staleness compare ignores trailing-whitespace-only diffs.
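The validation order described in the BLOCKER item can be sketched like this. It is an illustration under assumptions, not the merged code: validateComposePayload, ParseYaml and ValidationResult are hypothetical names, and the YAML parser is injected as a parameter so the sketch carries no dependency (in the CLI it would presumably be js-yaml's load).

```typescript
type ParseYaml = (text: string) => unknown;
type ValidationResult = { ok: true } | { ok: false; reason: string };

/** Reject empty/HTML/XML bodies, then require services.<keystone> in the parsed document. */
function validateComposePayload(
  body: string,
  parseYaml: ParseYaml,
  keystone = "sink-prometheus",
): ValidationResult {
  const head = body.trimStart().slice(0, 64).toLowerCase();
  if (head.length === 0) return { ok: false, reason: "empty body" };
  // Cheap sniff first: login, captcha, proxy and maintenance pages start like this.
  if (head.startsWith("<!doctype") || head.startsWith("<html") || head.startsWith("<?xml")) {
    return { ok: false, reason: "looks like HTML/XML, not a compose file" };
  }
  let doc: unknown;
  try {
    doc = parseYaml(body);
  } catch {
    return { ok: false, reason: "not parseable as YAML" };
  }
  if (typeof doc !== "object" || doc === null || Array.isArray(doc)) {
    return { ok: false, reason: "top level is not a mapping" };
  }
  const services = (doc as Record<string, unknown>).services;
  if (typeof services !== "object" || services === null || Array.isArray(services)) {
    return { ok: false, reason: "missing services map" };
  }
  if (!Object.prototype.hasOwnProperty.call(services, keystone)) {
    return { ok: false, reason: `missing keystone service ${keystone}` };
  }
  return { ok: true };
}

// Usage with a stand-in parser (JSON.parse) purely for demonstration.
const verdict = validateComposePayload("<!DOCTYPE html><html></html>", JSON.parse);
console.log(verdict.ok); // false
```

A caller would treat any ok: false result the same way as a failed fetch: keep the deployed compose, write no backup, and warn.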
RED -> GREEN test evidence
Hermetic regression tests in cli/test/upgrade.test.ts (offline; NODE_ENV=test + PGAI_COMPOSE_SOURCE -> local fixture).
Original bug - RED (implementation reverted, compose never refreshed):
(fail) non-git upgrade refreshes a stale docker-compose.yml to the target version ...
Expected substring or pattern: /VM_AUTH_USERNAME/
Received: "version: '3.8'\nservices:\n sink-prometheus: ... v1.115.0 ..."

BLOCKER (garbage body) - RED (validation disabled, HTML clobbers the working compose):
(fail) ... an HTML (non-compose) 200 body leaves the deployed compose UNCHANGED ...
expect(received).toBe(expected)
- version: '3.8' (expected: the pristine 0.14 compose)
- sink-prometheus:
+ <!DOCTYPE html> (received: the compose was overwritten with HTML)
+ <html><head><title>Sign in</title></head>

SHOULD-FIX (local-install old-tag) - RED (backup mislabeled with NEW tag):
(fail) ... local-install labels the backup with the OLD tag ...
Expected pattern: /^docker-compose\.yml\.bak-0\.14\.0-[0-9a-f]{8}$/
Received: "docker-compose.yml.bak-0.0.0-dev.0"

GREEN (with the fix): 29 pass / 0 fail for cli/test/upgrade.test.ts (the 23 prior + 6 new: HTML-body, missing-keystone, empty-body, backup-collision, trailing-newline no-op, local-install-old-tag).
Test / build / typecheck
- cli/test/upgrade.test.ts: 29/29 pass.
- bun run build: succeeds, built binary runs.
- bun run typecheck: my changed files have 0 errors. The 20 tsc errors present are all pre-existing in untouched test files (monitoring.test.ts, permission-check-sql.test.ts, reports.test.ts) - unchanged from origin/main.
- Full bun test: 870 pass / 32 skip / 8 fail - the 8 failures are pre-existing environment artifacts (local-install tests short-circuit because real monitoring containers are running on the dev machine; one checkup-api HTTP-guard timing test). They fail identically against origin/main, i.e. not introduced by this MR.
Risk
Low / CLI-only. Additive helper, gated on non-git + content mismatch, validates the fetched payload before any write, never overwrites a prior backup, never touches user-owned files, best-effort on fetch/validation failure. No image or schema changes.
Recommended follow-up (deferred)
A release-readiness guard that diffs deployed-compose vs target-compose would prevent regression at the gate level. Deferred — release-readiness.sh currently has unrelated pre-existing blockers and a bash-level gate would need network/fixture plumbing; adding it here risks destabilizing this GA-blocker fix. The regression invariant is already locked in by the hermetic unit tests above.
Generated with Claude Code
Closes #186