Geo: OCI image index tags silently skipped by container repository sync
## Summary

`Geo::ContainerRepositorySync` silently fails to sync container repository tags whose manifest is an **OCI image index** (`application/vnd.oci.image.index.v1+json`). No error is raised. Tag counts look fine. But `docker pull` of the affected tag from the secondary returns `manifest unknown`.

A customer reproduction confirmed the root cause and showed the impact is much wider than a single tag: **52 of 62 tags** in one affected repository were being silently skipped on every sync cycle.

This is the missing-manifest mirror image of #465580 (orphan tags on the secondary that can't be deleted). Both have the same root cause.

## What the user sees

```
$ docker pull <secondary>/<group>/<project>:latest
manifest unknown: manifest unknown
```

A naive tag-count check appears to match (e.g. `primary: 62 | secondary: 62`). The tag names themselves match too. The problem is that the **content** behind the tag on the secondary is stale and points at platform manifests that no longer exist on the secondary registry.

## What's broken

There are two problems, but only the first is what's hitting customers now.

### 1. The `Accept` header is too narrow (primary cause, confirmed)

`ContainerRegistry::Client#repository_tag_digest` sends a `HEAD` request with this `Accept` header:

```
application/vnd.docker.distribution.manifest.v2+json
application/vnd.oci.image.manifest.v1+json
```

It does NOT include `application/vnd.oci.image.index.v1+json`. The newer GitLab container registry enforces `Accept` strictly for fat manifests and refuses to return a digest for an OCI index when that media type isn't on the list. The call returns `nil`.

Both primary and secondary HEADs return `nil` for OCI-index tags. So in the sync code:

```ruby
# ee/app/services/geo/container_repository_sync.rb
primary_tags    # → [{name: "latest", digest: nil}, ...]
secondary_tags  # → [{name: "latest", digest: nil}, ...]

tags_to_sync   = primary_tags - secondary_tags  # → []
tags_to_remove = secondary_tags - primary_tags  # → []
```

Identical `nil` digests cancel out by hash equality. **Nothing is scheduled. No error is logged. The tag is silently skipped.**

The secondary's stored OCI index references platform manifest digests from a previous (now-stale) sync. Those platform manifests no longer exist on the secondary (garbage-collected, or never pushed). `docker pull` finds the index but fails on the inner platform-manifest lookup → `manifest unknown`.

Docker manifest lists are not affected because the registry will content-negotiate to the inner Docker V2 manifest (which IS in `Accept`). OCI index has no such fallback.

#### Downstream impact: Geo verification is also unreliable on legacy installs

The same buggy `repository_tag_digest` is reached through `tag.digest` ([`lib/container_registry/tag.rb#L72-L76`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/container_registry/tag.rb#L72-L76)), which is called by `ContainerRepository#tag_list_digest`, the checksum Geo's verification framework uses to confirm replication integrity ([`ee/app/replicators/geo/container_repository_replicator.rb#L117`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/replicators/geo/container_repository_replicator.rb#L117)).

- **With** the GitLab Container Registry API enabled, `tag.digest` returns `@manifest_digest` populated by `transform_tags_page` ([`app/models/container_repository.rb#L348`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/models/container_repository.rb#L348)), which holds real digests from the API. Verification works correctly and has been flagging affected registries as `verification_failed`.
- **Without** the API (legacy installs), `tag.digest` falls back to the buggy `repository_tag_digest`. Both sides compute nil-filled checksums that happen to match. **Verification silently reports success on broken data.**

Fix 1 cleans this up automatically: once `repository_tag_digest` returns real digests, `tag_list_digest` becomes meaningful again on every install.

### 2. Submanifest push uses an optional field with no fallback (latent)

In [`ee/app/services/geo/container_repository_sync.rb#L53-L63`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/services/geo/container_repository_sync.rb#L53-L63):

```ruby
container_repository.push_manifest(
  submanifest_ref['digest'],
  submanifest_raw,
  submanifest_parsed['mediaType'] # ← can be nil
)
```

Per the [OCI image manifest spec](https://github.com/opencontainers/image-spec/blob/main/manifest.md#image-manifest-property-descriptions), body `mediaType` is **optional**. `docker buildx` / buildkit frequently omit it. When it is nil, the push goes out with no `Content-Type`, the registry rejects it, the per-tag rescue swallows the error, and the parent OCI index push at line 74 never runs.

The parent-manifest path at line 73 already has the correct fallback. The submanifest path was missed.

This wasn't the cause for the customer we investigated (both platform manifests carried body `mediaType`), but it's a real latent bug that should be fixed in the same MR.

## Customer reproduction summary

Full details in [note 3363582817](#note_3363582817).

- Same `HEAD` request, narrow `Accept` → `nil`; wide `Accept` → `sha256:599be818...`. Reproduced live.
- 52 of 62 primary tags and 51 of 62 secondary tags resolve to `digest: nil`.
- `tags_to_sync: []`, `tags_to_remove: []`.
- Tag names match. Symptom is **staleness**, not absence.
- Both platform submanifests for `:latest` carry body `mediaType`, so the bug addressed by Fix 2 is not triggered in this case.

## Proposed fix

Three coordinated changes in a single MR.
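
Before the individual fixes, here is a minimal plain-Ruby sketch of the silent skip described in section 1, and of why real digests bring the tag back into the diff. The tag hashes mirror the shape shown above; the `diff_tags` helper and the digest values are illustrative assumptions, not code from `Geo::ContainerRepositorySync`.

```ruby
# Hypothetical helper mirroring the two set differences in the sync service.
def diff_tags(primary_tags, secondary_tags)
  {
    tags_to_sync: primary_tags - secondary_tags,
    tags_to_remove: secondary_tags - primary_tags
  }
end

# Narrow Accept: HEAD yields nil for the OCI index tag on both sides,
# so the two hashes are equal and Array#- drops them.
narrow = diff_tags(
  [{ name: 'latest', digest: nil }],
  [{ name: 'latest', digest: nil }]
)

# Wide Accept: each side reports the digest it actually stores
# (made-up values standing in for a fresh primary and a stale secondary).
wide = diff_tags(
  [{ name: 'latest', digest: 'sha256:new' }],
  [{ name: 'latest', digest: 'sha256:old' }]
)

puts "narrow: #{narrow.inspect}"
puts "wide:   #{wide.inspect}"
```

With `nil` on both sides, both lists come back empty and nothing is scheduled. With real digests, the stale tag no longer compares equal, so it shows up in the diff and the sync has something to act on.
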
### Fix 1: widen `ACCEPTED_TYPES` (primary fix)

`lib/container_registry/base_client.rb`:

```ruby
ACCEPTED_TYPES = [
  DOCKER_DISTRIBUTION_MANIFEST_V2_TYPE,
  OCI_MANIFEST_V1_TYPE,
  DOCKER_DISTRIBUTION_MANIFEST_LIST_V2_TYPE,
  OCI_DISTRIBUTION_INDEX_TYPE
].freeze
```

This collapses `ACCEPTED_TYPES` and `ACCEPTED_TYPES_RAW` into a single list, matching the existing pattern in `ee/app/models/virtual_registries/container/upstream.rb`.

### Fix 2: submanifest mediaType fallback (latent robustness)

`ee/app/services/geo/container_repository_sync.rb`:

```ruby
manifest_parsed['manifests'].each do |submanifest_ref|
  submanifest_raw = client.repository_raw_manifest(repository_path, submanifest_ref['digest'])
  submanifest_parsed = Gitlab::Json.safe_parse(submanifest_raw)

  sync_manifest_blobs(submanifest_parsed)

  # Prefer the body mediaType, then the index descriptor, then the OCI default.
  submanifest_media_type = submanifest_parsed['mediaType'] ||
    submanifest_ref['mediaType'] ||
    ContainerRegistry::Client::OCI_MANIFEST_V1_TYPE

  container_repository.push_manifest(
    submanifest_ref['digest'],
    submanifest_raw,
    submanifest_media_type
  )
end
```

The descriptor's `mediaType` is REQUIRED by the OCI spec, so it's a reliable fallback.

### Fix 3: read `secondary_tags` from the Docker V2 client

`Geo::ContainerRepositorySync#primary_tags` reads digests from the Docker V2 client (`client.repository_tag_digest`). `#secondary_tags` currently goes through `container_repository.tags`, which prefers the GitLab Container Registry API client (`gitlab_api_client`) when available. The two paths can return digests in different formats for the same manifest content, so even after Fix 1 closes the nil-digest gap, residual asymmetry could still cause false-MATCH / false-MISMATCH on `primary_tags - secondary_tags`.
Mirror the primary client setup on the secondary so both sides resolve digests through the same Docker V2 code path:

```ruby
def secondary_client
  strong_memoize_with_expiration(:secondary_client, ContainerRepository.registry_client_expiration_time) do
    ContainerRegistry::Client.new(
      Gitlab.config.registry.api_url,
      token: ::Auth::ContainerRegistryAuthenticationService.pull_access_token(repository_path)
    )
  end
end

def secondary_tags
  strong_memoize(:secondary_tags) do
    manifest = secondary_client.repository_tags(repository_path)
    next [] unless manifest && manifest['tags']

    manifest['tags'].map do |tag|
      { name: tag, digest: secondary_client.repository_tag_digest(repository_path, tag) }
    end
  end
end
```

### Test gaps to close

- `ee/spec/services/geo/container_repository_sync_spec.rb`: extend the OCI manifest list context with a case where the inner manifest body omits `mediaType`. Assert `push_manifest` is called with a non-nil `Content-Type`. Existing tests pass `anything` for that argument, which is why the bug addressed by Fix 2 slipped through.
- `ee/spec/lib/container_registry/client_spec.rb`: update `Accept`-header expectations to match the merged list.
- Once Fix 3 lands, `container_repository.tags` is no longer called from sync, so the "when the GitLab API is / is not supported" context splits collapse into a single Docker V2 stub path.

## Workaround for affected installs (no upgrade required)

For customers on a version that doesn't yet include Fix 1, the Rails console script below reconciles container repository tags by bypassing the buggy comparison path:

1. Lists tag names on both sides via the standard Docker V2 API.
2. For shared tags, compares digests using `HEAD` with the wide `Accept` header, the same approach the fix applies.
3. Force-resyncs missing and stale tags by calling the private `sync_tag` method directly (which uses the wide `Accept` header internally).
4. Removes orphan tags (only on secondary) via `DELETE /v2/<path>/manifests/<tag>` with the wide `Accept` header, the workaround from [#465580 (note 3325972034)](https://gitlab.com/gitlab-org/gitlab/-/work_items/465580#note_3325972034). Requires GitLab Container Registry 16.4+.
5. Re-verifies post-reconcile using the same wide-`Accept` comparison.
6. Triggers Geo's verification framework so the registry's `verification_state` catches up.

Safe to run in dry-run mode (the default). Pass `execute: true` to apply.

For installs with many `Geo::ContainerRepositoryRegistry` rows, the script ships with a `reconcile_all` wrapper that supports a tunable `batch_size:` and a `start:` ID for resuming an interrupted run. Ctrl+C prints a ready-to-paste resume command containing the last touched registry ID, and `reconcile_all` defaults to a quiet mode that keeps the output proportional to the number of repos that actually need work.

### Check Geo verification status before / after

```ruby
Geo::ContainerRepositoryRegistry.find_each do |reg|
  puts "CR ##{reg.container_repository_id} state=#{reg.state} verification_state=#{reg.verification_state}"
end
```

Registries in `verification_state=3` (failed) before the reconcile are the ones the script should target. After a successful reconcile and the `verify_async` it triggers, they should flip to `verification_state=2` (succeeded).

<details>
<summary>Reconcile script (Rails console, runs on the SECONDARY)</summary>

```ruby
# Workaround for #600486: reconcile container repository tags on a Geo SECONDARY
# Run in the secondary's Rails console.
#
# Detects drift between primary and secondary using HEAD with the wide Accept
# header (the gap that #600486 fixes), force-resyncs missing/stale tags by
# calling the private `sync_tag` method directly, removes orphan tags via the
# OCI tag-delete endpoint (the workaround from #465580 note 3325972034),
# re-verifies with our own wide-Accept check, and then triggers Geo's
# verification framework so the registry's verification_state catches up.
#
# Why our verification check works despite the bug: the buggy code path uses
# the narrow Accept header and is only exercised during the normal sync flow.
# This script uses the wide Accept header directly.
#
# Note on Geo verification (the framework one):
#   - On installs with the GitLab Container Registry API enabled,
#     `tag_list_digest` reads `@manifest_digest` populated by the API client
#     (real digests), so verification works correctly and our `verify_async`
#     trigger is meaningful.
#   - On installs WITHOUT the API, `tag.digest` falls back to the buggy
#     `client.repository_tag_digest`, so both sides compute nil-filled
#     checksums that match. Verification reports success on broken data.
#     Affected installs need the upstream fix (#600486) before Geo verification
#     becomes trustworthy.
#
# Caveats:
#   - The OCI tag-delete endpoint requires GitLab Container Registry 16.4+.
#   - If an OCI submanifest body omits `mediaType` (Fix 2 latent bug), the
#     per-tag push will fail. The script logs the failure and the
#     re-verification will surface the remaining drift.

require 'faraday'

def wide_head_digest(faraday_conn, path, tag_name)
  resp = faraday_conn.head("/v2/#{path}/manifests/#{tag_name}") do |req|
    req.headers['Accept'] = ContainerRegistry::Client::ACCEPTED_TYPES_RAW.join(', ')
  end
  return nil unless resp.success?

  resp.headers[DependencyProxy::Manifest::DIGEST_HEADER]
end

def wide_delete_tag(faraday_conn, path, tag_name)
  faraday_conn.delete("/v2/#{path}/manifests/#{tag_name}") do |req|
    req.headers['Accept'] = ContainerRegistry::Client::ACCEPTED_TYPES_RAW.join(', ')
  end
end

def build_conn(url, token)
  # Bump default Net::HTTP timeouts (60s) for slow registries under load.
  # Tune these higher if you still see Faraday::TimeoutError / Net::ReadTimeout.
  Faraday.new(url, request: { open_timeout: 60, read_timeout: 120, timeout: 120 }) do |f|
    f.request :authorization, :bearer, token
    f.adapter :net_http
  end
end

def check_drift(cr_path, primary_client, secondary_client, primary_conn, secondary_conn)
  primary_names = (primary_client.repository_tags(cr_path)&.dig('tags') || []).sort
  secondary_names = (secondary_client.repository_tags(cr_path)&.dig('tags') || []).sort

  only_on_primary = primary_names - secondary_names
  only_on_secondary = secondary_names - primary_names
  shared = primary_names & secondary_names

  stale = shared.select do |name|
    wide_head_digest(primary_conn, cr_path, name) != wide_head_digest(secondary_conn, cr_path, name)
  end

  {
    primary_count: primary_names.size,
    secondary_count: secondary_names.size,
    only_on_primary: only_on_primary,
    only_on_secondary: only_on_secondary,
    stale: stale
  }
end

def print_state(label, s)
  puts "#{label}:"
  puts "  Primary tags: #{s[:primary_count]}"
  puts "  Secondary tags: #{s[:secondary_count]}"
  puts "  Only on primary: #{s[:only_on_primary].size}"
  puts "  Only on secondary: #{s[:only_on_secondary].size}"
  puts "  Stale on secondary: #{s[:stale].size}"
end

def reconcile_container_repository(container_repository, execute: false, quiet: false)
  cr_path = container_repository.path
  sync = Geo::ContainerRepositorySync.new(container_repository)

  primary_token = Auth::ContainerRegistryAuthenticationService.pull_access_token(cr_path)
  secondary_token = Auth::ContainerRegistryAuthenticationService.full_access_token(cr_path)
  primary_url = Gitlab.config.geo.registry_replication.primary_api_url
  secondary_url = Gitlab.config.registry.api_url

  primary_conn = build_conn(primary_url, primary_token)
  secondary_conn = build_conn(secondary_url, secondary_token)
  primary_client = ContainerRegistry::Client.new(primary_url, token: primary_token)
  secondary_client = ContainerRegistry::Client.new(secondary_url, token: secondary_token)

  before = check_drift(cr_path, primary_client, secondary_client, primary_conn, secondary_conn)
  tags_to_resync = before[:only_on_primary] + before[:stale]
  tags_to_remove = before[:only_on_secondary]
  in_sync = tags_to_resync.empty? && tags_to_remove.empty?

  # Quiet mode: skip output for already-in-sync repos. Drift / errors still print.
  return if in_sync && quiet

  puts "=" * 70
  puts "CR ##{container_repository.id} #{cr_path} (execute: #{execute})"
  print_state("Before", before)

  if in_sync
    puts "Already in sync, nothing to do."
    return
  end

  unless execute
    puts "DRY RUN."
    puts "  Resync candidates (first 10): #{tags_to_resync.first(10).inspect}" if tags_to_resync.any?
    puts "  Remove candidates (first 10): #{tags_to_remove.first(10).inspect}" if tags_to_remove.any?
    return
  end

  resync_ok = resync_failed = 0
  tags_to_resync.each do |name|
    sync.send(:sync_tag, { name: name })
    resync_ok += 1
    puts "  resync ✓ #{name}"
  rescue StandardError => e
    resync_failed += 1
    puts "  resync ✗ #{name} #{e.class}: #{e.message}"
  end

  remove_ok = remove_failed = 0
  tags_to_remove.each do |name|
    resp = wide_delete_tag(secondary_conn, cr_path, name)
    if resp.success? || resp.status == 404
      remove_ok += 1
      puts "  remove ✓ #{name} (status #{resp.status})"
    else
      remove_failed += 1
      puts "  remove ✗ #{name} status=#{resp.status} body=#{resp.body.to_s[0..200]}"
    end
  rescue StandardError => e
    remove_failed += 1
    puts "  remove ✗ #{name} #{e.class}: #{e.message}"
  end

  puts "Done. resync ok=#{resync_ok} failed=#{resync_failed} remove ok=#{remove_ok} failed=#{remove_failed}"
  puts ""

  after = check_drift(cr_path, primary_client, secondary_client, primary_conn, secondary_conn)
  print_state("After", after)

  matched = after[:only_on_primary].empty? && after[:only_on_secondary].empty? && after[:stale].empty?
  if matched
    puts "✓ Verified: secondary fully matches primary."
  else
    remaining = after[:only_on_primary].size + after[:only_on_secondary].size + after[:stale].size
    puts "✗ #{remaining} tag(s) still mismatched after reconcile. Check the failure log above."
  end

  # Trigger Geo's own verification framework so the registry state catches up.
  if matched
    container_repository.replicator.verify_async
    puts "→ Geo verify_async queued. Check verification_state on the registry shortly."
  else
    puts "→ Skipping Geo verify_async because drift remains; address the failures first."
  end
end

# Batch wrapper over `Geo::ContainerRepositoryRegistry.find_each` with
# resumable iteration. Prints a progress checkpoint every `progress_every`
# registries and, on Ctrl+C, prints a ready-to-paste resume command using
# the last touched registry ID. Defaults to `quiet: true` so already-in-sync
# repositories do not flood the console; drift and errors still print.
def reconcile_all(batch_size: 1000, start: nil, execute: false, progress_every: 100, quiet: true)
  last_id = nil
  total = 0
  errors = 0

  begin
    Geo::ContainerRepositoryRegistry.find_each(batch_size: batch_size, start: start) do |reg|
      last_id = reg.id
      total += 1

      begin
        cr = reg.container_repository
        reconcile_container_repository(cr, execute: execute, quiet: quiet) if cr
      rescue StandardError => e
        errors += 1
        puts "✗ Registry ##{reg.id} (CR ##{reg.container_repository_id}): #{e.class}: #{e.message}"
      end

      if (total % progress_every).zero?
        puts ">>> Progress: #{total} processed | last registry ID: #{last_id} | errors: #{errors}"
      end
    end
  rescue Interrupt
    puts ""
    puts ">>> Interrupted at registry ID #{last_id}."
    puts ">>> Resume with: reconcile_all(execute: #{execute}, batch_size: #{batch_size}, start: #{last_id})"
    raise
  end

  puts ""
  puts ">>> Done. Processed: #{total} | errors: #{errors} | last registry ID: #{last_id}"
end

# === Usage ===
#
# Single repo, dry-run:
#   reconcile_container_repository(ContainerRepository.find(<id>))
#
# Single repo, execute:
#   reconcile_container_repository(ContainerRepository.find(<id>), execute: true)
#
# All Geo-tracked repos, dry-run (quiet by default: only drift / errors print):
#   reconcile_all
#
# All Geo-tracked repos, execute mode with a smaller batch and more frequent progress lines:
#   reconcile_all(execute: true, batch_size: 500, progress_every: 50)
#
# Resume after Ctrl+C / dropped session (copy the registry ID from the printed message):
#   reconcile_all(execute: true, batch_size: 500, start: 14567)
#
# Verbose mode (prints header / Before / "Already in sync" for every CR):
#   reconcile_all(execute: true, quiet: false)
```

</details>

## Recovery

Once Fix 1 lands:

- `repository_tag_digest` returns real digests on both sides.
- For currently-stale tags, primary and secondary digests differ → tag enters `tags_to_sync` → sync should heal it on the next pass.
- For severely stale repos where the old platform manifest blobs have already been garbage-collected, a manual registry resync may still be needed.

## Verification snippets (read-only)

<details>
<summary>Rails console snippets to confirm the issue on any environment</summary>

Replace `<container-repository-id>` and the path/tag placeholders with real values from an affected environment. These are read-only: no manifests are pushed, deleted, or modified.

### A. Confirm the count-MATCH is misleading

```ruby
cr = ContainerRepository.find(<container-repository-id>)
sync = Geo::ContainerRepositorySync.new(cr)

primary_names = sync.send(:primary_tags).map { |t| t[:name] }.sort
secondary_names = sync.send(:secondary_tags).map { |t| t[:name] }.sort

puts "Counts: primary=#{primary_names.size} secondary=#{secondary_names.size}"
puts "Only on primary: #{(primary_names - secondary_names).inspect}"
puts "Only on secondary: #{(secondary_names - primary_names).inspect}"
```

### B. Narrow vs wide Accept header (the core hypothesis test)

```ruby
cr_path = '<group>/<project>'
tag_name = 'latest'

token = Auth::ContainerRegistryAuthenticationService.pull_access_token(cr_path)
primary_url = Gitlab.config.geo.registry_replication.primary_api_url

# Current code (narrow Accept: missing list/index types)
client = ContainerRegistry::Client.new(primary_url, token: token)
narrow_digest = client.repository_tag_digest(cr_path, tag_name)

# Proposed fix (wide Accept: includes list/index types)
require 'faraday'
conn = Faraday.new(primary_url) do |f|
  f.request :authorization, :bearer, token
  f.adapter :net_http
end
resp = conn.head("/v2/#{cr_path}/manifests/#{tag_name}") do |req|
  req.headers['Accept'] = ContainerRegistry::Client::ACCEPTED_TYPES_RAW.join(', ')
end
wide_digest = resp.headers[DependencyProxy::Manifest::DIGEST_HEADER]

puts "Narrow Accept (current): #{narrow_digest.inspect}"
puts "Wide Accept (fixed): #{wide_digest.inspect} status: #{resp.status}"
```

Expected: `narrow_digest` is `nil`; `wide_digest` is `sha256:...`.

### C. Show the silent-skip in `tags_to_sync` / `tags_to_remove`

```ruby
cr = ContainerRepository.find(<container-repository-id>)
sync = Geo::ContainerRepositorySync.new(cr)

primary_tags = sync.send(:primary_tags)
secondary_tags = sync.send(:secondary_tags)

nil_primary = primary_tags.select { |t| t[:digest].nil? }.map { |t| t[:name] }
nil_secondary = secondary_tags.select { |t| t[:digest].nil? }.map { |t| t[:name] }

puts "Primary tags with nil digest: #{nil_primary.inspect}"
puts "Secondary tags with nil digest: #{nil_secondary.inspect}"
puts "tags_to_sync: #{sync.send(:tags_to_sync).map { |t| t[:name] }.inspect}"
puts "tags_to_remove: #{sync.send(:tags_to_remove).map { |t| t[:name] }.inspect}"
```

### D. Check if the submanifest body-mediaType bug is also active

```ruby
cr_path = '<group>/<project>'
tag_name = 'latest'

token = Auth::ContainerRegistryAuthenticationService.pull_access_token(cr_path)
primary_client = ContainerRegistry::Client.new(
  Gitlab.config.geo.registry_replication.primary_api_url,
  token: token
)

index_raw = primary_client.repository_raw_manifest(cr_path, tag_name)
index = Gitlab::Json.safe_parse(index_raw)
puts "Index body mediaType: #{index['mediaType'].inspect}"

index['manifests'].each_with_index do |sub_ref, i|
  sub_raw = primary_client.repository_raw_manifest(cr_path, sub_ref['digest'])
  sub_parsed = Gitlab::Json.safe_parse(sub_raw)
  puts "  [#{i}] #{sub_ref['digest']}"
  puts "    descriptor.mediaType: #{sub_ref['mediaType'].inspect}"
  puts "    body.mediaType: #{sub_parsed['mediaType'].inspect}"
end
```

### E. Inspect the secondary's view of the failing tag

```ruby
cr = ContainerRepository.find(<container-repository-id>)
tag_name = 'latest'

secondary_tag_names = cr.tags.map(&:name)
puts "Tag '#{tag_name}' present on secondary? #{secondary_tag_names.include?(tag_name)}"

if secondary_tag_names.include?(tag_name)
  token = Auth::ContainerRegistryAuthenticationService.full_access_token(cr.path)
  secondary_client = ContainerRegistry::Client.new(Gitlab.config.registry.api_url, token: token)

  begin
    raw = secondary_client.repository_raw_manifest(cr.path, tag_name)
    parsed = Gitlab::Json.safe_parse(raw)
    puts "Secondary manifest mediaType: #{parsed['mediaType'].inspect}"
    puts raw[0..500]
  rescue StandardError => e
    puts "Could not fetch manifest from secondary: #{e.class}: #{e.message}"
  end
end
```

### F. Geo log evidence (host shell)

```
grep -E "container_repository_sync|Error while syncing tag" \
  /var/log/gitlab/geo-logcursor/*.log | grep -i '<group>/<project>'
```

Lines like `Error while syncing tag latest: Push manifest error: ...` indicate the bug addressed by Fix 2 is firing in practice (not just latent).

</details>

## Related

- #465580 (inverse symptom): orphan tags on secondary with `digest: nil` that can't be deleted. Same Accept-header root cause; that issue also proposes a name-based `remove_tag` fallback that complements this fix.
- #465580 (note 3325972034): independent identification of the Accept-header gap.

/cc @eakca1