fix(cli): refresh stale docker-compose.yml on non-git npx upgrade (VM_AUTH wiring) — GA blocker #186

Summary

Fixes the GA-blocking npx-upgrade bug for 0.15.0: the documented 0.14 -> 0.15 in-place upgrade leaves the old docker-compose.yml in place on non-git installs, so VictoriaMetrics (sink-prometheus) crashes and all dashboards go dataless after upgrade.

Issue: #186

This blocks 0.15.0 GA and requires an rc.8. It hits every existing npx / npm install -g self-hosted user on upgrade.

Problem / root cause (with file:line)

docker-compose.yml is a version-coupled, CLI-owned static asset - the npm package does not bundle it; it is fetched from GitLab raw at install time. The compose was only ever refreshed for git checkouts:

  • cli/bin/postgres-ai.ts - ensureDefaultMonitoringProject() fetches the compose only if (!fs.existsSync(composeFile)), so an existing install never re-fetches it.
  • mon update does git pull only if .git exists; the non-git else branch went "straight to image pull" against the stale compose.
  • npx / global-npm installs are non-git, so on upgrade only PGAI_TAG advances (to 0.15) while the 0.14 docker-compose.yml is retained.

0.15 introduced VictoriaMetrics basic auth: the 0.15 configs image ships a prometheus.yml templating %{VM_AUTH_USERNAME}, and the 0.15 compose wires VM_AUTH_* into sink-prometheus. The retained 0.14 compose has no VM_AUTH_* wiring, so on boot:

sink-prometheus  Exited (255)
  missing "VM_AUTH_USERNAME" env var

-> no metrics ingested -> every dashboard is empty after the upgrade.

(Note: a prior fix already migrates .env to add the VM_AUTH_* keys - but that is necessary, not sufficient: the compose also has to be refreshed to actually consume them on the sink-prometheus service.)

Fix (CLI-only, additive)

Adds one helper refreshBundledComposeIfStale(projectDir, oldTag?) and wires it into the three in-place-upgrade entrypoints (local-install after the .env write; mon update non-git else branch; mon update-config after the env migration). Contract:

  • no-op for git checkouts (.git present -> already refreshed via git pull);
  • no-op when there is no deployed compose yet (the green-field bootstrap path handles it);
  • no-op when the deployed compose content already matches the target (content-compared, whitespace-tolerant — not a tag heuristic; correct even though local-install rewrites PGAI_TAG just before);
  • validates the fetched payload before it can replace anything (see hardening below);
  • backs up the prior compose before overwriting, never overwriting an existing backup (bak-<oldtag>-<hash8>), so the pristine original survives repeated runs;
  • touches only docker-compose.yml - never .env / instances.yml / .pgwatch-config;
  • best-effort: a fetch/validation failure warns and keeps the existing compose (the upgrade still proceeds) rather than turning a metrics-only outage into a hard CLI failure.
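
For reviewers, the no-op part of this contract can be sketched as a pure decision function. This is a minimal illustration, not the code in cli/bin/postgres-ai.ts: the names decideRefresh, RefreshInputs and normalizeCompose are hypothetical, and the exact normalization (trailing whitespace per line plus trailing blank lines) is an assumption based on the "whitespace-tolerant" wording above.

```typescript
// Hypothetical sketch; names and shapes are assumptions, not the CLI's actual code.
type RefreshDecision = "skip-git" | "skip-no-compose" | "skip-up-to-date" | "refresh";

interface RefreshInputs {
  hasGitDir: boolean;      // true when <projectDir>/.git exists
  deployed: string | null; // current docker-compose.yml content, null if absent
  target: string;          // compose content fetched for the target version
}

// Drop trailing whitespace on every line and trailing blank lines, so that
// whitespace-only differences do not count as staleness.
function normalizeCompose(text: string): string {
  return text
    .split("\n")
    .map((line) => line.replace(/\s+$/, ""))
    .join("\n")
    .replace(/\n+$/, "");
}

function decideRefresh(inputs: RefreshInputs): RefreshDecision {
  // Git checkouts are refreshed by git pull, never by this path.
  if (inputs.hasGitDir) return "skip-git";
  // No deployed compose yet: the green-field bootstrap handles it.
  if (inputs.deployed === null) return "skip-no-compose";
  // Content comparison, not a tag heuristic.
  if (normalizeCompose(inputs.deployed) === normalizeCompose(inputs.target)) {
    return "skip-up-to-date";
  }
  return "refresh";
}
```

Keeping the decision separate from the file I/O is only a presentation choice here; it makes each no-op case easy to assert on in isolation.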

No image change required - the compose is fetched from GitLab raw at the target ref, exactly as the green-field bootstrap already does. A PGAI_COMPOSE_SOURCE test-only seam (honored only when NODE_ENV === "test") lets the refresh read a local file for hermetic, offline tests.
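
The test-only seam described above could look roughly like the following. It is a hedged sketch: resolveComposeSource and the ComposeSource type are hypothetical names, and only the rule itself (PGAI_COMPOSE_SOURCE is honored solely when NODE_ENV === "test") comes from this MR.

```typescript
// Hypothetical sketch of the seam; the function and type names are assumptions.
type ComposeSource =
  | { kind: "local-file"; path: string } // hermetic tests read a local fixture
  | { kind: "remote" };                  // normal path: fetch from GitLab raw

function resolveComposeSource(env: Record<string, string | undefined>): ComposeSource {
  const override = env.PGAI_COMPOSE_SOURCE;
  // The override is ignored unless the process is explicitly in test mode,
  // so a stray variable in a user environment cannot redirect the refresh.
  if (env.NODE_ENV === "test" && override) {
    return { kind: "local-file", path: override };
  }
  return { kind: "remote" };
}
```

In a real call site this would presumably be invoked as resolveComposeSource(process.env); passing the environment in explicitly keeps the sketch testable.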

README upgrade section updated to state that mon update / mon update-config / local-install now refresh the compose on non-git installs (validated, with a preserved backup), instead of implying the .env migration alone is sufficient.

Hardening from adversarial review (commits 31c582f, b9d8ef3)

The first revision content-compared the fetched body against the deployed one but did not content-validate it — downloadText only checks response.ok. Adversarial review caught that a non-compose 200 body (HTML login/captcha/proxy/maintenance page) would have silently clobbered a working compose with junk — strictly worse than the original bug. That is now fixed (it was the merge BLOCKER), along with two should-fix issues and three nits:

  1. [BLOCKER — fixed] Fetched-compose validation. Before overwriting, the payload is validated: reject obvious HTML (<!DOCTYPE/<html/<?xml), parse with js-yaml, and require a services map containing the keystone sink-prometheus service. On failure the refresh behaves exactly like a fetch failure: keep the existing compose, write no backup, warn, no-op.
  2. [fixed] Backup collision. The backup was named bak-<tag> and written with an overwriting copyFileSync, so repeated update-config runs (where the tag never advances) destroyed the pristine original on the second run. It is now uniquified by an 8-char hash of the OLD content and written with flag: "wx" (never overwrite), so the first/pristine backup is always preserved.
  3. [fixed] local-install mislabeled backup. local-install rewrites .env PGAI_TAG to the new version before the refresh, so the OLD compose's backup was named with the new tag. It now captures the OLD tag before the rewrite and passes it in -> backup is bak-<oldtag>-<hash>.
  4. [fixed] Seam hardening. PGAI_COMPOSE_SOURCE is honored only when NODE_ENV === "test" (not reachable in a normal user environment).
  5. [fixed] Cosmetic log. mon update no longer pre-announces a refresh on no-op/failure; the helper logs only when it actually refreshes or warns.
  6. [fixed] Whitespace-only staleness false-positive. Staleness compare ignores trailing-whitespace-only diffs.
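
Items 1 and 2 above can be illustrated with a short sketch. It rests on several assumptions: the function names are hypothetical, the YAML parser is injected so the example has no dependencies (the MR itself parses with js-yaml), and SHA-256 is an assumed choice, since the text only says "8-char hash".

```typescript
import { createHash } from "node:crypto";
import * as fs from "node:fs";
import * as path from "node:path";

// Item 1 (hypothetical sketch): reject obvious HTML/XML, then require a
// services map containing the keystone sink-prometheus service.
function isPlausibleCompose(body: string, parseYaml: (text: string) => unknown): boolean {
  const head = body.trimStart().slice(0, 64).toLowerCase();
  if (head.startsWith("<!doctype") || head.startsWith("<html") || head.startsWith("<?xml")) {
    return false;
  }
  let doc: unknown;
  try {
    doc = parseYaml(body);
  } catch {
    return false; // unparseable payloads are treated like a fetch failure
  }
  if (typeof doc !== "object" || doc === null) return false;
  const services = (doc as Record<string, unknown>).services;
  if (typeof services !== "object" || services === null || Array.isArray(services)) {
    return false;
  }
  return "sink-prometheus" in services;
}

// Item 2 (hypothetical sketch): bak-<oldtag>-<hash8>, hashed from the OLD content.
// SHA-256 is an assumption; the MR does not name the hash function.
function backupPathFor(projectDir: string, oldTag: string, oldContent: string): string {
  const hash8 = createHash("sha256").update(oldContent).digest("hex").slice(0, 8);
  return path.join(projectDir, `docker-compose.yml.bak-${oldTag}-${hash8}`);
}

// Returns true if the backup was written, false if one already existed.
function writeBackupOnce(backupPath: string, oldContent: string): boolean {
  try {
    // "wx" fails with EEXIST instead of overwriting an existing file.
    fs.writeFileSync(backupPath, oldContent, { flag: "wx" });
    return true;
  } catch (err) {
    if ((err as { code?: string }).code === "EEXIST") return false;
    throw err;
  }
}
```

Because the hash is derived from the old content, re-running the refresh against the same deployed compose maps to the same backup name, and the "wx" flag turns that second write into a no-op rather than an overwrite.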

RED -> GREEN test evidence

Hermetic regression tests in cli/test/upgrade.test.ts (offline; NODE_ENV=test + PGAI_COMPOSE_SOURCE -> local fixture).

Original bug — RED (implementation reverted, compose never refreshed):

(fail) non-git upgrade refreshes a stale docker-compose.yml to the target version ...
  Expected substring or pattern: /VM_AUTH_USERNAME/
  Received: "version: '3.8'\nservices:\n  sink-prometheus: ... v1.115.0 ..."

BLOCKER (garbage body) — RED (validation disabled, HTML clobbers the working compose):

(fail) ... an HTML (non-compose) 200 body leaves the deployed compose UNCHANGED ...
  expect(received).toBe(expected)
  - version: '3.8'                                  (expected: the pristine 0.14 compose)
  -   sink-prometheus:
  + <!DOCTYPE html>                                 (received: the compose was overwritten with HTML)
  + <html><head><title>Sign in</title></head>

SHOULD-FIX (local-install old-tag) — RED (backup mislabeled with NEW tag):

(fail) ... local-install labels the backup with the OLD tag ...
  Expected pattern: /^docker-compose\.yml\.bak-0\.14\.0-[0-9a-f]{8}$/
  Received:         "docker-compose.yml.bak-0.0.0-dev.0"

GREEN (with the fix): 29 pass / 0 fail for cli/test/upgrade.test.ts (the 23 prior + 6 new: HTML-body, missing-keystone, empty-body, backup-collision, trailing-newline no-op, local-install-old-tag).

Test / build / typecheck

  • cli/test/upgrade.test.ts: 29/29 pass.
  • bun run build: succeeds, built binary runs.
  • bun run typecheck: my changed files have 0 errors. The 20 tsc errors present are all pre-existing in untouched test files (monitoring.test.ts, permission-check-sql.test.ts, reports.test.ts) - unchanged from origin/main.
  • Full bun test: 870 pass / 32 skip / 8 fail - the 8 failures are pre-existing environment artifacts (local-install tests short-circuit because real monitoring containers are running on the dev machine; one checkup-api HTTP-guard timing test). They fail identically against origin/main, i.e. not introduced by this MR.

Risk

Low / CLI-only. Additive helper, gated on non-git + content mismatch, validates the fetched payload before any write, never overwrites a prior backup, never touches user-owned files, best-effort on fetch/validation failure. No image or schema changes.

A release-readiness guard that diffs deployed-compose vs target-compose would prevent regression at the gate level. Deferred: release-readiness.sh currently has unrelated pre-existing blockers and a bash-level gate would need network/fixture plumbing; adding it here risks destabilizing this GA-blocker fix. The regression invariant is already locked in by the hermetic unit tests above.


Generated with Claude Code

Closes #186

Edited by Maya P
