Geo: OCI image index tags silently skipped by container repository sync

Summary

Geo::ContainerRepositorySync silently fails to sync container repository tags whose manifest is an OCI image index (application/vnd.oci.image.index.v1+json). No error is raised. Tag counts look fine. But docker pull of the affected tag from the secondary returns manifest unknown.

A customer reproduction confirmed the root cause and showed the impact is much wider than a single tag — 52 of 62 tags in one affected repository were being silently skipped on every sync cycle.

This is the missing-manifest mirror image of #465580 (closed) (orphan tags on the secondary that can't be deleted). Both have the same root cause.

What the user sees

$ docker pull <secondary>/<group>/<project>:latest
manifest unknown: manifest unknown

A naive tag-count check appears to match (e.g. primary: 62 | secondary: 62). The tag names themselves match too. The problem is that the content behind the tag on the secondary is stale and points at platform manifests that no longer exist on the secondary registry.

What's broken

There are two problems, but only the first is what's hitting customers now.

1. The Accept header is too narrow (primary cause, confirmed)

ContainerRegistry::Client#repository_tag_digest sends a HEAD request with this Accept header:

application/vnd.docker.distribution.manifest.v2+json
application/vnd.oci.image.manifest.v1+json

It does NOT include application/vnd.oci.image.index.v1+json. The newer GitLab container registry enforces Accept strictly for fat manifests and refuses to return a digest for an OCI index when that media type isn't in the Accept list, so the call returns nil.

Both primary and secondary HEADs return nil for OCI-index tags. So in the sync code:

# ee/app/services/geo/container_repository_sync.rb
primary_tags    # → [{name: "latest", digest: nil}, ...]
secondary_tags  # → [{name: "latest", digest: nil}, ...]

tags_to_sync   = primary_tags - secondary_tags   # → []
tags_to_remove = secondary_tags - primary_tags   # → []

Identical nil digests cancel out by hash equality. Nothing is scheduled. No error is logged. The tag is silently skipped.
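
The cancellation can be reproduced in plain Ruby, outside GitLab. This is a minimal sketch with made-up data: the v1 tag and all digest values are illustrative assumptions, not taken from the customer environment. It shows that Array#- treats two hashes with the same name and a nil digest as equal, so the tag drops out of both difference sets.

```ruby
# Illustrative data only: tag names and digests are made up.
primary_tags = [
  { name: "latest", digest: nil },          # OCI index, HEAD returned no digest
  { name: "v1",     digest: "sha256:aaa" }  # plain manifest, digest resolved
]
secondary_tags = [
  { name: "latest", digest: nil },          # stale content, HEAD also returned no digest
  { name: "v1",     digest: "sha256:bbb" }  # stale content, digest resolved
]

# Array#- compares elements with #hash and #eql?, so hashes with identical
# keys and values cancel out, including a pair of nil digests.
tags_to_sync   = primary_tags - secondary_tags
tags_to_remove = secondary_tags - primary_tags

puts "tags_to_sync:   #{tags_to_sync.map { |t| t[:name] }.inspect}"
puts "tags_to_remove: #{tags_to_remove.map { |t| t[:name] }.inspect}"
```

Only v1 shows up in either list. latest is just as stale in this sketch, but its nil digests make the two entries indistinguishable.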

The secondary's stored OCI index references platform manifest digests from a previous (now-stale) sync. Those platform manifests no longer exist on the secondary (garbage-collected, or never pushed). docker pull finds the index but fails on the inner platform-manifest lookup → manifest unknown.

Docker manifest lists are not affected because the registry will content-negotiate to the inner Docker V2 manifest (which IS in Accept). OCI index has no such fallback.
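
To make the difference between the two manifest-list formats concrete, here is a toy model of the negotiation described above. It is not registry code: toy_head_digest, the sample digests, and the fallback rule are assumptions that only mirror the behaviour this issue describes. The media type strings are the real ones.

```ruby
DOCKER_V2   = "application/vnd.docker.distribution.manifest.v2+json"
OCI_V1      = "application/vnd.oci.image.manifest.v1+json"
DOCKER_LIST = "application/vnd.docker.distribution.manifest.list.v2+json"
OCI_INDEX   = "application/vnd.oci.image.index.v1+json"

NARROW_ACCEPT = [DOCKER_V2, OCI_V1].freeze
WIDE_ACCEPT   = [DOCKER_V2, OCI_V1, DOCKER_LIST, OCI_INDEX].freeze

# Hypothetical stand-in for the registry's HEAD handling.
# `stored` describes what the tag points at; returns a digest string or nil.
def toy_head_digest(stored, accept)
  return stored[:digest] if accept.include?(stored[:media_type])

  # Assumed fallback, per the description above: a Docker manifest list can be
  # negotiated down to an inner Docker V2 manifest when the client accepts it.
  if stored[:media_type] == DOCKER_LIST && accept.include?(DOCKER_V2)
    return stored[:inner_digest]
  end

  nil # no acceptable representation, so the client gets no digest
end

oci_index   = { media_type: OCI_INDEX,   digest: "sha256:index", inner_digest: "sha256:inner" }
docker_list = { media_type: DOCKER_LIST, digest: "sha256:list",  inner_digest: "sha256:inner" }

p toy_head_digest(oci_index,   NARROW_ACCEPT) # nil
p toy_head_digest(oci_index,   WIDE_ACCEPT)   # "sha256:index"
p toy_head_digest(docker_list, NARROW_ACCEPT) # "sha256:inner"
```

In this model only the OCI index with the narrow list ends up with nil, which is the case the sync code then skips.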

Downstream impact: Geo verification is also unreliable on legacy installs

The same buggy repository_tag_digest is reached through tag.digest (lib/container_registry/tag.rb#L72-L76), which is called by ContainerRepository#tag_list_digest — the checksum Geo's verification framework uses to confirm replication integrity (ee/app/replicators/geo/container_repository_replicator.rb#L117).

  • With the GitLab Container Registry API enabled, tag.digest returns @manifest_digest populated by transform_tags_page (app/models/container_repository.rb#L348) — real digests from the API. Verification works correctly and has been flagging affected registries as verification_failed.
  • Without the API (legacy installs), tag.digest falls back to the buggy repository_tag_digest. Both sides compute nil-filled checksums that happen to match. Verification silently reports success on broken data.

Fix 1 cleans this up automatically — once repository_tag_digest returns real digests, tag_list_digest becomes meaningful again on every install.
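
The false success on legacy installs follows from hashing nil digests on both sides. The sketch below uses a hypothetical stand-in, toy_tag_list_digest, which hashes sorted "name:digest" pairs. The exact format used by the real tag_list_digest is an assumption here; the point is only that any checksum built from tag names plus nil digests cannot detect content drift.

```ruby
require "digest"

# Hypothetical stand-in for tag_list_digest: hash the sorted "name:digest" pairs.
def toy_tag_list_digest(tags)
  pairs = tags.map { |t| "#{t[:name]}:#{t[:digest]}" }.sort
  Digest::SHA256.hexdigest(pairs.join(","))
end

# Same tag names, every digest nil: checksums match regardless of content.
primary_nil   = [{ name: "latest", digest: nil }, { name: "edge", digest: nil }]
secondary_nil = [{ name: "edge", digest: nil }, { name: "latest", digest: nil }]
puts toy_tag_list_digest(primary_nil) == toy_tag_list_digest(secondary_nil)   # true

# With real digests, stale content on the secondary changes the checksum.
primary_real   = [{ name: "latest", digest: "sha256:new" }]
secondary_real = [{ name: "latest", digest: "sha256:old" }]
puts toy_tag_list_digest(primary_real) == toy_tag_list_digest(secondary_real) # false
```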

2. Submanifest push uses an optional field with no fallback (latent)

In ee/app/services/geo/container_repository_sync.rb#L53-L63:

container_repository.push_manifest(
  submanifest_ref['digest'],
  submanifest_raw,
  submanifest_parsed['mediaType']   # ← can be nil
)

Per the OCI image manifest spec, the body mediaType is optional, and docker buildx / buildkit frequently omit it. When it is nil, the push goes out with no Content-Type, the registry rejects it, the per-tag rescue swallows the error, and the parent OCI index push at line 74 never runs.

The parent-manifest path at line 73 already has the correct fallback. The submanifest path was missed.

This wasn't the cause for the customer we investigated (both platform manifests carried body mediaType), but it's a real latent bug that should be fixed in the same MR.

Customer reproduction summary

Full details in note 3363582817.

  • Same HEAD request: narrow Accept → nil; wide Accept → sha256:599be818.... Reproduced live.
  • 52 of 62 primary tags and 51 of 62 secondary tags resolve to digest: nil.
  • tags_to_sync: [], tags_to_remove: [].
  • Tag names match. Symptom is staleness, not absence.
  • Both platform submanifests for :latest carry body mediaType — Fix 2 not firing for this case.

Proposed fix

Three coordinated changes in a single MR.

Fix 1 — widen ACCEPTED_TYPES (primary fix)

lib/container_registry/base_client.rb:

ACCEPTED_TYPES = [
  DOCKER_DISTRIBUTION_MANIFEST_V2_TYPE,
  OCI_MANIFEST_V1_TYPE,
  DOCKER_DISTRIBUTION_MANIFEST_LIST_V2_TYPE,
  OCI_DISTRIBUTION_INDEX_TYPE
].freeze

This collapses ACCEPTED_TYPES and ACCEPTED_TYPES_RAW into a single list, matching the existing pattern in ee/app/models/virtual_registries/container/upstream.rb.

Fix 2 — submanifest mediaType fallback (latent robustness)

ee/app/services/geo/container_repository_sync.rb:

manifest_parsed['manifests'].each do |submanifest_ref|
  submanifest_raw = client.repository_raw_manifest(repository_path, submanifest_ref['digest'])
  submanifest_parsed = Gitlab::Json.safe_parse(submanifest_raw)
  sync_manifest_blobs(submanifest_parsed)

  submanifest_media_type =
    submanifest_parsed['mediaType'] ||
    submanifest_ref['mediaType'] ||
    ContainerRegistry::Client::OCI_MANIFEST_V1_TYPE

  container_repository.push_manifest(
    submanifest_ref['digest'],
    submanifest_raw,
    submanifest_media_type
  )
end

The descriptor's mediaType is REQUIRED by the OCI spec, so it's a reliable fallback.
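
The fallback order can be exercised in isolation. This is a minimal sketch: resolve_media_type is a hypothetical helper that mirrors the || chain in Fix 2, and the JSON bodies and digest are made up for illustration.

```ruby
require "json"

DOCKER_V2_TYPE       = "application/vnd.docker.distribution.manifest.v2+json"
OCI_MANIFEST_V1_TYPE = "application/vnd.oci.image.manifest.v1+json"

# Hypothetical helper mirroring Fix 2: body first, then the index descriptor,
# then the OCI image manifest type as a last resort.
def resolve_media_type(body, descriptor)
  body["mediaType"] || descriptor["mediaType"] || OCI_MANIFEST_V1_TYPE
end

# Descriptor as it would appear in the parent index's "manifests" array.
descriptor = { "mediaType" => DOCKER_V2_TYPE, "digest" => "sha256:abc" }

# Body that omits mediaType, as buildx / buildkit output may.
body_without = JSON.parse('{"schemaVersion": 2, "config": {}, "layers": []}')
# Body that carries its own mediaType.
body_with = JSON.parse(%({"schemaVersion": 2, "mediaType": "#{OCI_MANIFEST_V1_TYPE}"}))

puts resolve_media_type(body_with, descriptor)     # body wins
puts resolve_media_type(body_without, descriptor)  # descriptor fills the gap
puts resolve_media_type(body_without, {})          # default when both are missing
```

With this order the Content-Type passed to push_manifest is never nil, even for a malformed descriptor.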

Fix 3 — read secondary_tags from the Docker V2 client

Geo::ContainerRepositorySync#primary_tags reads digests from the Docker V2 client (client.repository_tag_digest). #secondary_tags currently goes through container_repository.tags, which prefers the GitLab Container Registry API client (gitlab_api_client) when available. The two paths can return digests in different formats for the same manifest content, so even after Fix 1 closes the nil-digest gap, residual asymmetry could still cause a false match or false mismatch in primary_tags - secondary_tags.

Mirror the primary client setup on the secondary so both sides resolve digests through the same Docker V2 code path:

def secondary_client
  strong_memoize_with_expiration(:secondary_client, ContainerRepository.registry_client_expiration_time) do
    ContainerRegistry::Client.new(
      Gitlab.config.registry.api_url,
      token: ::Auth::ContainerRegistryAuthenticationService.pull_access_token(repository_path)
    )
  end
end

def secondary_tags
  strong_memoize(:secondary_tags) do
    manifest = secondary_client.repository_tags(repository_path)
    next [] unless manifest && manifest['tags']

    manifest['tags'].map do |tag|
      { name: tag, digest: secondary_client.repository_tag_digest(repository_path, tag) }
    end
  end
end

Test gaps to close

  • ee/spec/services/geo/container_repository_sync_spec.rb: extend the OCI manifest list context with a case where the inner manifest body omits mediaType. Assert push_manifest is called with a non-nil Content-Type — existing tests pass anything for that argument, which is why Fix 2 slipped through.
  • ee/spec/lib/container_registry/client_spec.rb: update Accept-header expectations to match the merged list.
  • Once Fix 3 lands, container_repository.tags is no longer called from sync, so the "when the GitLab API is / is not supported" context splits collapse into a single Docker V2 stub path.

Workaround for affected installs (no upgrade required)

For customers on a version that doesn't yet include Fix 1, the Rails console script below reconciles container repository tags by bypassing the buggy comparison path:

  1. Lists tag names on both sides via the standard Docker V2 API.
  2. For shared tags, compares digests using HEAD with the wide Accept header, the same header list that Fix 1 applies.
  3. Force-resyncs missing and stale tags by calling the private sync_tag method directly (which uses the wide Accept header internally).
  4. Removes orphan tags (only on secondary) via DELETE /v2/<path>/manifests/<tag> with the wide Accept header — the workaround from #465580 (note 3325972034). Requires GitLab Container Registry 16.4+.
  5. Re-verifies post-reconcile using the same wide-Accept comparison.
  6. Triggers Geo's verification framework so the registry's verification_state catches up.

Safe to run in dry-run mode (the default). Pass execute: true to apply.

For installs with many Geo::ContainerRepositoryRegistry rows, the script ships with a reconcile_all wrapper that supports tunable batch_size: and a start: ID for resuming an interrupted run. Ctrl+C prints a ready-to-paste resume command containing the last touched registry ID, and reconcile_all defaults to a quiet mode that keeps the output proportional to the number of repos that actually need work.

Check Geo verification status before / after

Geo::ContainerRepositoryRegistry.find_each do |reg|
  puts "CR ##{reg.container_repository_id}  state=#{reg.state}  verification_state=#{reg.verification_state}"
end

Registries in verification_state=3 (failed) before the reconcile are the ones the script should target. After a successful reconcile and the verify_async it triggers, they should flip to verification_state=2 (succeeded).

Reconcile script (Rails console, runs on the SECONDARY)
# Workaround for #600486 — reconcile container repository tags on a Geo SECONDARY
# Run in the secondary's Rails console.
#
# Detects drift between primary and secondary using HEAD with the wide Accept
# header (the gap that #600486 fixes), force-resyncs missing/stale tags by
# calling the private `sync_tag` method directly, removes orphan tags via the
# OCI tag-delete endpoint (the workaround from #465580 note 3325972034),
# re-verifies with our own wide-Accept check, and then triggers Geo's
# verification framework so the registry's verification_state catches up.
#
# Why our verification check works despite the bug: the buggy code path uses
# the narrow Accept header and is only exercised during the normal sync flow.
# This script uses the wide Accept header directly.
#
# Note on Geo verification (the framework one):
# - On installs with the GitLab Container Registry API enabled,
#   `tag_list_digest` reads `@manifest_digest` populated by the API client —
#   real digests — so verification works correctly and our `verify_async`
#   trigger is meaningful.
# - On installs WITHOUT the API, `tag.digest` falls back to the buggy
#   `client.repository_tag_digest`, so both sides compute nil-filled
#   checksums that match. Verification reports success on broken data.
#   Affected installs need the upstream fix (#600486) before Geo verification
#   becomes trustworthy.
#
# Caveats:
# - The OCI tag-delete endpoint requires GitLab Container Registry 16.4+.
# - If an OCI submanifest body omits `mediaType` (Fix 2 latent bug), the
#   per-tag push will fail. The script logs the failure and the
#   re-verification will surface the remaining drift.

require 'faraday'

def wide_head_digest(faraday_conn, path, tag_name)
  resp = faraday_conn.head("/v2/#{path}/manifests/#{tag_name}") do |req|
    req.headers['Accept'] = ContainerRegistry::Client::ACCEPTED_TYPES_RAW.join(', ')
  end
  return nil unless resp.success?
  resp.headers[DependencyProxy::Manifest::DIGEST_HEADER]
end

def wide_delete_tag(faraday_conn, path, tag_name)
  faraday_conn.delete("/v2/#{path}/manifests/#{tag_name}") do |req|
    req.headers['Accept'] = ContainerRegistry::Client::ACCEPTED_TYPES_RAW.join(', ')
  end
end

def build_conn(url, token)
  # Bump default Net::HTTP timeouts (60s) for slow registries under load.
  # Tune these higher if you still see Faraday::TimeoutError / Net::ReadTimeout.
  Faraday.new(url, request: { open_timeout: 60, read_timeout: 120, timeout: 120 }) do |f|
    f.request :authorization, :bearer, token
    f.adapter :net_http
  end
end

def check_drift(cr_path, primary_client, secondary_client, primary_conn, secondary_conn)
  primary_names   = (primary_client.repository_tags(cr_path)&.dig('tags')   || []).sort
  secondary_names = (secondary_client.repository_tags(cr_path)&.dig('tags') || []).sort

  only_on_primary   = primary_names - secondary_names
  only_on_secondary = secondary_names - primary_names
  shared            = primary_names & secondary_names

  stale = shared.select do |name|
    wide_head_digest(primary_conn, cr_path, name) != wide_head_digest(secondary_conn, cr_path, name)
  end

  {
    primary_count:     primary_names.size,
    secondary_count:   secondary_names.size,
    only_on_primary:   only_on_primary,
    only_on_secondary: only_on_secondary,
    stale:             stale
  }
end

def print_state(label, s)
  puts "#{label}:"
  puts "  Primary tags:       #{s[:primary_count]}"
  puts "  Secondary tags:     #{s[:secondary_count]}"
  puts "  Only on primary:    #{s[:only_on_primary].size}"
  puts "  Only on secondary:  #{s[:only_on_secondary].size}"
  puts "  Stale on secondary: #{s[:stale].size}"
end

def reconcile_container_repository(container_repository, execute: false, quiet: false)
  cr_path = container_repository.path

  sync = Geo::ContainerRepositorySync.new(container_repository)

  primary_token   = Auth::ContainerRegistryAuthenticationService.pull_access_token(cr_path)
  secondary_token = Auth::ContainerRegistryAuthenticationService.full_access_token(cr_path)
  primary_url     = Gitlab.config.geo.registry_replication.primary_api_url
  secondary_url   = Gitlab.config.registry.api_url
  primary_conn    = build_conn(primary_url,   primary_token)
  secondary_conn  = build_conn(secondary_url, secondary_token)
  primary_client   = ContainerRegistry::Client.new(primary_url,   token: primary_token)
  secondary_client = ContainerRegistry::Client.new(secondary_url, token: secondary_token)

  before = check_drift(cr_path, primary_client, secondary_client, primary_conn, secondary_conn)

  tags_to_resync = before[:only_on_primary] + before[:stale]
  tags_to_remove = before[:only_on_secondary]
  in_sync = tags_to_resync.empty? && tags_to_remove.empty?

  # Quiet mode: skip output for already-in-sync repos. Drift / errors still print.
  return if in_sync && quiet

  puts "=" * 70
  puts "CR ##{container_repository.id} #{cr_path}  (execute: #{execute})"
  print_state("Before", before)

  if in_sync
    puts "Already in sync — nothing to do."
    return
  end

  unless execute
    puts "DRY RUN."
    puts "  Resync candidates (first 10): #{tags_to_resync.first(10).inspect}" if tags_to_resync.any?
    puts "  Remove candidates (first 10): #{tags_to_remove.first(10).inspect}" if tags_to_remove.any?
    return
  end

  resync_ok = resync_failed = 0
  tags_to_resync.each do |name|
    sync.send(:sync_tag, { name: name })
    resync_ok += 1
    puts "  resync ✓ #{name}"
  rescue StandardError => e
    resync_failed += 1
    puts "  resync ✗ #{name}  #{e.class}: #{e.message}"
  end

  remove_ok = remove_failed = 0
  tags_to_remove.each do |name|
    resp = wide_delete_tag(secondary_conn, cr_path, name)
    if resp.success? || resp.status == 404
      remove_ok += 1
      puts "  remove ✓ #{name}  (status #{resp.status})"
    else
      remove_failed += 1
      puts "  remove ✗ #{name}  status=#{resp.status} body=#{resp.body.to_s[0..200]}"
    end
  rescue StandardError => e
    remove_failed += 1
    puts "  remove ✗ #{name}  #{e.class}: #{e.message}"
  end

  puts "Done. resync ok=#{resync_ok} failed=#{resync_failed}  remove ok=#{remove_ok} failed=#{remove_failed}"

  puts ""
  after = check_drift(cr_path, primary_client, secondary_client, primary_conn, secondary_conn)
  print_state("After", after)

  matched = after[:only_on_primary].empty? && after[:only_on_secondary].empty? && after[:stale].empty?

  if matched
    puts "✓ Verified: secondary fully matches primary."
  else
    remaining = after[:only_on_primary].size + after[:only_on_secondary].size + after[:stale].size
    puts "✗ #{remaining} tag(s) still mismatched after reconcile. Check the failure log above."
  end

  # Trigger Geo's own verification framework so the registry state catches up.
  if matched
    container_repository.replicator.verify_async
    puts "→ Geo verify_async queued. Check verification_state on the registry shortly."
  else
    puts "→ Skipping Geo verify_async because drift remains; address the failures first."
  end
end

# Batch wrapper over `Geo::ContainerRepositoryRegistry.find_each` with
# resumable iteration. Prints a progress checkpoint every `progress_every`
# registries and, on Ctrl+C, prints a ready-to-paste resume command using
# the last touched registry ID. Defaults to `quiet: true` so already-in-sync
# repositories do not flood the console — drift and errors still print.
def reconcile_all(batch_size: 1000, start: nil, execute: false,
                  progress_every: 100, quiet: true)
  last_id = nil
  total = 0
  errors = 0

  begin
    Geo::ContainerRepositoryRegistry.find_each(batch_size: batch_size, start: start) do |reg|
      last_id = reg.id
      total += 1

      begin
        cr = reg.container_repository
        reconcile_container_repository(cr, execute: execute, quiet: quiet) if cr
      rescue StandardError => e
        errors += 1
        puts "✗ Registry ##{reg.id} (CR ##{reg.container_repository_id}): #{e.class}: #{e.message}"
      end

      if (total % progress_every).zero?
        puts ">>> Progress: #{total} processed | last registry ID: #{last_id} | errors: #{errors}"
      end
    end
  rescue Interrupt
    puts ""
    puts ">>> Interrupted at registry ID #{last_id}."
    puts ">>> Resume with: reconcile_all(execute: #{execute}, batch_size: #{batch_size}, start: #{last_id.inspect})"
    raise
  end

  puts ""
  puts ">>> Done. Processed: #{total} | errors: #{errors} | last registry ID: #{last_id}"
end

# === Usage ===
#
# Single repo, dry-run:
#   reconcile_container_repository(ContainerRepository.find(<id>))
#
# Single repo, execute:
#   reconcile_container_repository(ContainerRepository.find(<id>), execute: true)
#
# All Geo-tracked repos, dry-run (quiet by default — only drift / errors print):
#   reconcile_all
#
# All Geo-tracked repos, execute mode with a smaller batch and more frequent progress lines:
#   reconcile_all(execute: true, batch_size: 500, progress_every: 50)
#
# Resume after Ctrl+C / dropped session — copy the registry ID from the printed message:
#   reconcile_all(execute: true, batch_size: 500, start: 14567)
#
# Verbose mode (prints header / Before / "Already in sync" for every CR):
#   reconcile_all(execute: true, quiet: false)

Recovery

Once Fix 1 lands:

  • repository_tag_digest returns real digests on both sides.
  • For currently-stale tags, primary and secondary digests differ → tag enters tags_to_sync → sync should heal it on the next pass.
  • For severely stale repos where the old platform manifest blobs have already been garbage-collected, a manual registry resync may still be needed.

Verification snippets (read-only)

Rails console snippets to confirm the issue on any environment

Replace <container-repository-id> and the path/tag placeholders with real values from an affected environment. These are read-only — no manifests are pushed, deleted, or modified.

A. Confirm the count-MATCH is misleading

cr = ContainerRepository.find(<container-repository-id>)
sync = Geo::ContainerRepositorySync.new(cr)

primary_names   = sync.send(:primary_tags).map   { |t| t[:name] }.sort
secondary_names = sync.send(:secondary_tags).map { |t| t[:name] }.sort

puts "Counts: primary=#{primary_names.size} secondary=#{secondary_names.size}"
puts "Only on primary:   #{(primary_names - secondary_names).inspect}"
puts "Only on secondary: #{(secondary_names - primary_names).inspect}"

B. Narrow vs wide Accept header (the core hypothesis test)

cr_path  = '<group>/<project>'
tag_name = 'latest'

token       = Auth::ContainerRegistryAuthenticationService.pull_access_token(cr_path)
primary_url = Gitlab.config.geo.registry_replication.primary_api_url

# Current code (narrow Accept — missing list/index types)
client = ContainerRegistry::Client.new(primary_url, token: token)
narrow_digest = client.repository_tag_digest(cr_path, tag_name)

# Proposed fix (wide Accept — includes list/index types)
require 'faraday'
conn = Faraday.new(primary_url) do |f|
  f.request :authorization, :bearer, token
  f.adapter :net_http
end
resp = conn.head("/v2/#{cr_path}/manifests/#{tag_name}") do |req|
  req.headers['Accept'] = ContainerRegistry::Client::ACCEPTED_TYPES_RAW.join(', ')
end
wide_digest = resp.headers[DependencyProxy::Manifest::DIGEST_HEADER]

puts "Narrow Accept (current):  #{narrow_digest.inspect}"
puts "Wide Accept   (fixed):    #{wide_digest.inspect}    status: #{resp.status}"

Expected: narrow_digest is nil; wide_digest is sha256:....

C. Show the silent-skip in tags_to_sync / tags_to_remove

cr = ContainerRepository.find(<container-repository-id>)
sync = Geo::ContainerRepositorySync.new(cr)

primary_tags   = sync.send(:primary_tags)
secondary_tags = sync.send(:secondary_tags)

nil_primary   = primary_tags.select   { |t| t[:digest].nil? }.map { |t| t[:name] }
nil_secondary = secondary_tags.select { |t| t[:digest].nil? }.map { |t| t[:name] }

puts "Primary tags with nil digest:   #{nil_primary.inspect}"
puts "Secondary tags with nil digest: #{nil_secondary.inspect}"
puts "tags_to_sync:   #{sync.send(:tags_to_sync).map   { |t| t[:name] }.inspect}"
puts "tags_to_remove: #{sync.send(:tags_to_remove).map { |t| t[:name] }.inspect}"

D. Check if the submanifest body-mediaType bug is also active

cr_path  = '<group>/<project>'
tag_name = 'latest'

token = Auth::ContainerRegistryAuthenticationService.pull_access_token(cr_path)
primary_client = ContainerRegistry::Client.new(
  Gitlab.config.geo.registry_replication.primary_api_url,
  token: token
)

index_raw = primary_client.repository_raw_manifest(cr_path, tag_name)
index     = Gitlab::Json.safe_parse(index_raw)

puts "Index body mediaType: #{index['mediaType'].inspect}"
index['manifests'].each_with_index do |sub_ref, i|
  sub_raw    = primary_client.repository_raw_manifest(cr_path, sub_ref['digest'])
  sub_parsed = Gitlab::Json.safe_parse(sub_raw)
  puts "  [#{i}] #{sub_ref['digest']}"
  puts "      descriptor.mediaType: #{sub_ref['mediaType'].inspect}"
  puts "      body.mediaType:       #{sub_parsed['mediaType'].inspect}"
end

E. Inspect the secondary's view of the failing tag

cr = ContainerRepository.find(<container-repository-id>)
tag_name = 'latest'

secondary_tag_names = cr.tags.map(&:name)
puts "Tag '#{tag_name}' present on secondary? #{secondary_tag_names.include?(tag_name)}"

if secondary_tag_names.include?(tag_name)
  token = Auth::ContainerRegistryAuthenticationService.full_access_token(cr.path)
  secondary_client = ContainerRegistry::Client.new(Gitlab.config.registry.api_url, token: token)
  begin
    raw    = secondary_client.repository_raw_manifest(cr.path, tag_name)
    parsed = Gitlab::Json.safe_parse(raw)
    puts "Secondary manifest mediaType: #{parsed['mediaType'].inspect}"
    puts raw[0..500]
  rescue StandardError => e
    puts "Could not fetch manifest from secondary: #{e.class}: #{e.message}"
  end
end

F. Geo log evidence (host shell)

grep -E "container_repository_sync|Error while syncing tag" \
  /var/log/gitlab/geo-logcursor/*.log | grep -i '<group>/<project>'

Lines like Error while syncing tag latest: Push manifest error: ... indicate Fix 2 is firing in practice (not just latent).

Related issues

  • #465580 (closed): inverse symptom, orphan tags on secondary with digest: nil that can't be deleted. Same Accept-header root cause; that issue also proposes a name-based remove_tag fallback that complements this fix.
  • #465580 (closed) (note 3325972034) — independent identification of the Accept-header gap.

/cc @eakca1

Edited by Douglas Barbosa Alexandre