fix(retrieval): pull image when inspect returns wrapped not-found
Closes #700 (closed).
Summary
tools.PullImage used a direct Go type assertion (err.(errdefs.ErrNotFound)) to detect when an image was missing from the local cache. With the recent Docker SDK v23 → v28 bump (9603f326), the daemon's 404 from ImageInspect is wrapped before reaching the call site, so the assertion silently stopped matching. PullImage then misclassified "image not yet pulled" as a fatal inspect failure and never attempted the pull, surfacing as failed to refresh data: failed to scan pulling image response: failed to inspect image \"...\": Error response from daemon: No such image: ... on every retrieval refresh against a not-yet-cached image.
Changes
-
engine/internal/retrieval/engine/postgres/tools/tools.go- Switch the not-found detection from
err.(errdefs.ErrNotFound)tocerrdefs.IsNotFound(err)(containerd/errdefs), which walks the error chain viaerrors.Isand handles wrapping correctly.docker/errdefs.IsNotFoundis now a deprecated alias for the same function, so this also resolves the staticcheck SA1019 warning that the v28 bump introduced. - Propagate the JSON-stream error from
jsonmessage.DisplayJSONMessagesToStreaminstead of logging-and-returning-nil. Without this, registry "manifest unknown", auth failures, and any other error that surfaces only in the pull stream body would be silently swallowed and the caller would proceed as if the pull succeeded. - Introduce a small unexported
imagePullerinterface (justImageInspect+ImagePull) and changePullImage's parameter type from*client.Clientto that interface, so the function is unit-testable without a real Docker daemon.*client.Clientalready satisfies the interface, so all 7 call sites compile unchanged.
- Switch the not-found detection from
-
engine/internal/retrieval/engine/postgres/logical/dump.goandengine/internal/runci/handlers.go- Replace the misleading
errors.Wrap(err, \"failed to scan pulling image response\")wrapper text withfmt.Errorf(\"failed to prepare {dump,runner} image %q: %w\", ...). The 'scan pulling image response' phrase was leftover text from a former JSON-decode loop inPullImagethat no longer exists.
- Replace the misleading
-
engine/internal/retrieval/engine/postgres/tools/tools_test.go- Add
TestPullImage, a table-driven test with 5 subtests covering every branch:- cache hit (inspect returns populated
InspectResponse) → no pull - plain not-found → falls through to pull
- wrapped not-found (regression guard for this bug —
fmt.Errorf(\"...: %w\", fmt.Errorf(\"...: %w\", cerrdefs.ErrNotFound))) → falls through to pull - non-not-found inspect error → returns wrapped "failed to inspect image"
- pull stream contains
errorDetail→ returns wrapped "failed to pull image"
- cache hit (inspect returns populated
- Mocking uses a
fakePullerstruct with function-typed fields, matching the existing convention frominternal/rdsrefresh/rds_test.go.
- Add
Verification
make fmtcleango vet ./...clean on touched packagesmake test— full unit suite passes, including alltools.PullImageconsumers (tools,logical,physical,snapshot,provision/docker)golangci-lint v2.11.4 run— 0 issues (the v2 lint config is whatmake run-lintuses)- End-to-end smoke test on a live DBLab instance running this branch's
dblab_serverimage:- reproduced the original failure scenario via
POST /admin/configwith a not-yet-cachedpostgresai/extended-postgres:16-0.5.2 - confirmed the FullRefresh goroutine now reaches
tools.PullImage, the image is pulled successfully (visible asPulling from postgresai/extended-postgres / Status: Downloaded newer imagein dblab_server stdout), and the dump job moves on to the next stage - retrieval state moves from
pending→refreshinginstead ofrefresh_failed
- reproduced the original failure scenario via