Geo: fix container repository sync for OCI image indexes

What does this MR do?

Geo::ContainerRepositorySync silently failed to sync container repository tags whose manifest is an OCI image index. Tag counts looked fine, but docker pull from the secondary returned manifest unknown. Confirmed in a customer environment where 52 of 62 tags in one repository were skipped on every sync cycle.

Four coordinated fixes, each in its own commit:

  1. Widen ACCEPTED_TYPES — the default Faraday connection's Accept header now includes application/vnd.docker.distribution.manifest.list.v2+json and application/vnd.oci.image.index.v1+json, so HEAD requests for fat manifests return a real digest. Resolves the customer symptom.
  2. Submanifest mediaType fallback — when the OCI image manifest body omits mediaType (optional per the spec, often omitted by buildkit), fall back to the descriptor's mediaType from the parent index, then to OCI_MANIFEST_V1_TYPE. Also tightens the existing OCI test to assert the specific Content-Type that was previously matched with anything — the gap that hid this bug.
  3. Read secondary_tags from the Docker V2 client — mirror primary_tags's code path so both sides resolve digests through the same client. Eliminates residual digest-format asymmetry that could trigger false MATCH / MISMATCH on primary_tags - secondary_tags. Collapses the now-meaningless "GitLab API is supported / not supported" context splits in the spec.
  4. Clean up orphan tags with unresolvable digests: remove_tag falls back to deleting by tag name when the digest is absent, using the OCI tag-delete endpoint (DELETE /v2/<path>/manifests/<tag>, Container Registry 16.4+). Closes the orphan-tag cleanup gap from #465580 (closed); the Accept-header fix on its own only prevents new orphans going forward.
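
The fallback chain in fix 2 and the delete-by-name fallback in fix 4 can be sketched in plain Ruby. This is a minimal illustration, not the code from this MR: the helper names submanifest_media_type and manifest_delete_path are hypothetical, and the shapes of the descriptor and tag hashes are assumptions. Only the OCI_MANIFEST_V1_TYPE name and the endpoint path come from the description above.

```ruby
require 'json'

# Standard OCI image manifest media type. The constant name follows the
# MR description; its use here is illustrative.
OCI_MANIFEST_V1_TYPE = 'application/vnd.oci.image.manifest.v1+json'

# Hypothetical helper: pick the Content-Type for a submanifest.
# Order: mediaType in the manifest body, then the descriptor from the
# parent index, then the OCI image manifest default.
def submanifest_media_type(manifest_body, descriptor)
  body_type = JSON.parse(manifest_body)['mediaType']
  body_type || descriptor['mediaType'] || OCI_MANIFEST_V1_TYPE
end

# Hypothetical helper: build the DELETE path for a tag on the secondary.
# Uses the digest when one was resolved, otherwise the tag name.
def manifest_delete_path(repository_path, tag)
  reference = tag[:digest] || tag[:name]
  "/v2/#{repository_path}/manifests/#{reference}"
end

# A manifest body that omits mediaType takes the descriptor's value.
puts submanifest_media_type('{"schemaVersion": 2}', { 'mediaType' => OCI_MANIFEST_V1_TYPE })

# A tag with no resolvable digest is addressed by name.
puts manifest_delete_path('group/project', { name: 'latest', digest: nil })
```

The sketch treats only a missing mediaType key as absent; how the real code handles empty or malformed values is not stated in this description.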

How to test

Run the affected specs

bundle exec rspec \
  ee/spec/services/geo/container_repository_sync_spec.rb \
  spec/lib/container_registry/client_spec.rb \
  ee/spec/lib/container_registry/client_spec.rb

All 109 examples pass locally.

Reproduce the bug end-to-end on GDK with Geo

Requires a Geo primary + secondary running locally, container registry enabled on both, and docker buildx available.

  1. Build and push a multi-arch image to a project on the primary, so the resulting manifest is an OCI image index:

    docker buildx create --use --name local-builder
    docker buildx build \
      --platform linux/amd64,linux/arm64 \
      -t <primary-host>:5005/<group>/<project>:latest \
      --push \
      - <<EOF
    FROM alpine:latest
    EOF
  2. Confirm the primary stores an OCI image index:

    docker manifest inspect <primary-host>:5005/<group>/<project>:latest | jq -r .mediaType
    # application/vnd.oci.image.index.v1+json
  3. Trigger Geo container repository sync from the secondary's Rails console:

    cr = ContainerRepository.find_by(path: '<group>/<project>')
    Geo::ContainerRepositorySync.new(cr).execute
  4. Before this MR: docker pull from the secondary fails with manifest unknown:

    docker pull <secondary-host>:5005/<group>/<project>:latest
    # manifest unknown: manifest unknown
  5. Check out this MR's branch, restart the Rails console, then repeat steps 3 and 4. The pull now succeeds.

Confirm the fix on a real Geo deployment

If you have access to an affected Geo secondary, run snippets A–F from the issue description's "Verification snippets" section before and after deploying this MR. Snippet B (narrow vs wide Accept HEAD) is the most direct signal: narrow_digest should go from nil to a real sha256:....
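
For readers without the issue at hand, a narrow versus wide Accept HEAD comparison might look roughly like the sketch below. It is not Snippet B itself. The helper names, the NARROW_ACCEPT and WIDE_ACCEPT constants, and the bearer-token handling are assumptions; the narrow list is a guess at the pre-fix default. The media type strings and the Docker-Content-Digest response header are the standard registry ones.

```ruby
require 'net/http'
require 'uri'

# Assumed pre-fix Accept list: single-platform manifests only.
NARROW_ACCEPT = [
  'application/vnd.docker.distribution.manifest.v2+json',
  'application/vnd.oci.image.manifest.v1+json'
].freeze

# Post-fix list: also accepts manifest lists and OCI image indexes.
WIDE_ACCEPT = (NARROW_ACCEPT + [
  'application/vnd.docker.distribution.manifest.list.v2+json',
  'application/vnd.oci.image.index.v1+json'
]).freeze

# Hypothetical helper: manifest URL for a tag or digest reference.
def manifest_uri(base_url, repository_path, reference)
  URI.join(base_url, "/v2/#{repository_path}/manifests/#{reference}")
end

# Hypothetical helper: HEAD request carrying the given Accept list.
def build_head_request(uri, accept_types, token: nil)
  request = Net::HTTP::Head.new(uri)
  request['Accept'] = accept_types.join(', ')
  request['Authorization'] = "Bearer #{token}" if token
  request
end

# Returns the digest header on success, nil otherwise.
def head_digest(base_url, repository_path, reference, accept_types, token: nil)
  uri = manifest_uri(base_url, repository_path, reference)
  request = build_head_request(uri, accept_types, token: token)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
  response.is_a?(Net::HTTPSuccess) ? response['Docker-Content-Digest'] : nil
end

# Usage against a real registry (placeholders left unfilled on purpose):
#   narrow_digest = head_digest('https://<primary-host>:5005', '<group>/<project>', 'latest', NARROW_ACCEPT, token: '<token>')
#   wide_digest   = head_digest('https://<primary-host>:5005', '<group>/<project>', 'latest', WIDE_ACCEPT, token: '<token>')
```

Separating request construction from the network call keeps the Accept header easy to inspect without a registry.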

The workaround script in the issue description performs the same reconciliation that this MR fixes, so running it provides an additional sanity check that the fix produces the expected end state.
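
As a rough picture of that reconciliation, the comparison amounts to a set difference over tag name and digest pairs. The sketch below is an assumption-laden illustration: the reconcile helper, the hash shape, and the to_remove direction are hypothetical, and the digests are placeholders. Only the primary_tags - secondary_tags expression appears in the description above.

```ruby
# Hypothetical helper: split tags into those to sync to the secondary
# and those to remove from it. Tags are hashes compared by name and digest.
def reconcile(primary_tags, secondary_tags)
  {
    to_sync: primary_tags - secondary_tags,
    to_remove: secondary_tags - primary_tags
  }
end

# Placeholder data, not real digests.
primary = [
  { name: 'latest', digest: 'sha256:aaa' },
  { name: 'v1', digest: 'sha256:bbb' }
]
secondary = [
  { name: 'v1', digest: 'sha256:bbb' },
  { name: 'stale', digest: nil } # orphan tag with no resolvable digest
]

plan = reconcile(primary, secondary)
p plan[:to_sync]
p plan[:to_remove]
```

Because the hashes are compared on both keys, a tag whose digest resolves on one side but not the other never compares equal, which is why both sides need to resolve digests the same way.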

Edited by Douglas Barbosa Alexandre
