fix(retrieval): pull image when inspect returns wrapped not-found

Closes #700 (closed).

Summary

tools.PullImage used a direct Go type assertion (err.(errdefs.ErrNotFound)) to detect when an image was missing from the local cache. With the recent Docker SDK v23 → v28 bump (9603f326), the daemon's 404 from ImageInspect is wrapped before reaching the call site, so the assertion silently stopped matching. PullImage then misclassified "image not yet pulled" as a fatal inspect failure and never attempted the pull, surfacing as failed to refresh data: failed to scan pulling image response: failed to inspect image \"...\": Error response from daemon: No such image: ... on every retrieval refresh against a not-yet-cached image.

Changes

  • engine/internal/retrieval/engine/postgres/tools/tools.go

    • Switch the not-found detection from err.(errdefs.ErrNotFound) to cerrdefs.IsNotFound(err) (containerd/errdefs), which walks the error chain via errors.Is and handles wrapping correctly. docker/errdefs.IsNotFound is now a deprecated alias for the same function, so this also resolves the staticcheck SA1019 warning that the v28 bump introduced.
    • Propagate the JSON-stream error from jsonmessage.DisplayJSONMessagesToStream instead of logging-and-returning-nil. Without this, registry "manifest unknown", auth failures, and any other error that surfaces only in the pull stream body would be silently swallowed and the caller would proceed as if the pull succeeded.
    • Introduce a small unexported imagePuller interface (just ImageInspect + ImagePull) and change PullImage's parameter type from *client.Client to that interface, so the function is unit-testable without a real Docker daemon. *client.Client already satisfies the interface, so all 7 call sites compile unchanged.
  • engine/internal/retrieval/engine/postgres/logical/dump.go and engine/internal/runci/handlers.go

    • Replace the misleading errors.Wrap(err, \"failed to scan pulling image response\") wrapper text with fmt.Errorf(\"failed to prepare {dump,runner} image %q: %w\", ...). The 'scan pulling image response' phrase was leftover text from a former JSON-decode loop in PullImage that no longer exists.
  • engine/internal/retrieval/engine/postgres/tools/tools_test.go

    • Add TestPullImage, a table-driven test with 5 subtests covering every branch:
      • cache hit (inspect returns populated InspectResponse) → no pull
      • plain not-found → falls through to pull
      • wrapped not-found (regression guard for this bug — fmt.Errorf(\"...: %w\", fmt.Errorf(\"...: %w\", cerrdefs.ErrNotFound))) → falls through to pull
      • non-not-found inspect error → returns wrapped "failed to inspect image"
      • pull stream contains errorDetail → returns wrapped "failed to pull image"
    • Mocking uses a fakePuller struct with function-typed fields, matching the existing convention from internal/rdsrefresh/rds_test.go.

Verification

  • make fmt clean
  • go vet ./... clean on touched packages
  • make test — full unit suite passes, including all tools.PullImage consumers (tools, logical, physical, snapshot, provision/docker)
  • golangci-lint v2.11.4 run — 0 issues (the v2 lint config is what make run-lint uses)
  • End-to-end smoke test on a live DBLab instance running this branch's dblab_server image:
    • reproduced the original failure scenario via POST /admin/config with a not-yet-cached postgresai/extended-postgres:16-0.5.2
    • confirmed the FullRefresh goroutine now reaches tools.PullImage, the image is pulled successfully (visible as Pulling from postgresai/extended-postgres / Status: Downloaded newer image in dblab_server stdout), and the dump job moves on to the next stage
    • retrieval state moves from pendingrefreshing instead of refresh_failed

Merge request reports

Loading