Geo: OCI image index tags silently skipped by container repository sync
Summary
Geo::ContainerRepositorySync silently fails to sync container repository
tags whose manifest is an OCI image index
(application/vnd.oci.image.index.v1+json). No error is raised. Tag counts
look fine. But docker pull of the affected tag from the secondary returns
manifest unknown.
A customer reproduction confirmed the root cause and showed the impact is much wider than a single tag — 52 of 62 tags in one affected repository were being silently skipped on every sync cycle.
This is the missing-manifest mirror image of #465580 (closed) (orphan tags on the secondary that can't be deleted). Both have the same root cause.
What the user sees
$ docker pull <secondary>/<group>/<project>:latest
manifest unknown: manifest unknown
A naive tag-count check appears to match (e.g. primary: 62 | secondary: 62).
The tag names themselves match too. The problem is that the content
behind the tag on the secondary is stale and points at platform manifests
that no longer exist on the secondary registry.
What's broken
There are two problems, but only the first is what's hitting customers now.
1. The Accept header is too narrow (primary cause, confirmed)
ContainerRegistry::Client#repository_tag_digest sends a HEAD request
with this Accept header:
application/vnd.docker.distribution.manifest.v2+json
application/vnd.oci.image.manifest.v1+json
It does NOT include application/vnd.oci.image.index.v1+json. The newer
GitLab container registry enforces Accept strictly for fat manifests
and refuses to return a digest for an OCI index that isn't on the list.
The call returns nil.
Both primary and secondary HEADs return nil for OCI-index tags. So in
the sync code:
# ee/app/services/geo/container_repository_sync.rb
primary_tags # → [{name: "latest", digest: nil}, ...]
secondary_tags # → [{name: "latest", digest: nil}, ...]
tags_to_sync = primary_tags - secondary_tags # → []
tags_to_remove = secondary_tags - primary_tags # → []
Identical nil digests cancel out by hash equality. Nothing is
scheduled. No error is logged. The tag is silently skipped.
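To make the cancellation concrete, here is a minimal sketch in plain Ruby. The tag names and digest strings are made-up placeholders, not values from the customer environment, and the arrays only imitate the shape of primary_tags and secondary_tags.

```ruby
# Placeholder tag lists shaped like the ones the sync service compares.
# "latest" stands in for an OCI-index tag whose HEAD returned nil on both sides.
primary_tags   = [{ name: "latest", digest: nil }, { name: "v1", digest: "sha256:aaa" }]
secondary_tags = [{ name: "latest", digest: nil }, { name: "v1", digest: "sha256:aaa" }]

# Array#- compares elements via #hash and #eql?, so two hashes with the same
# name and a nil digest are treated as the same element and cancel out.
tags_to_sync   = primary_tags - secondary_tags
tags_to_remove = secondary_tags - primary_tags

# Once both sides report real (and here, differing) digests, the stale tag
# shows up in the difference again.
primary_fixed   = [{ name: "latest", digest: "sha256:new" }, { name: "v1", digest: "sha256:aaa" }]
secondary_fixed = [{ name: "latest", digest: "sha256:old" }, { name: "v1", digest: "sha256:aaa" }]
tags_to_sync_fixed = primary_fixed - secondary_fixed

puts "nil digests:  tags_to_sync=#{tags_to_sync.inspect} tags_to_remove=#{tags_to_remove.inspect}"
puts "real digests: tags_to_sync=#{tags_to_sync_fixed.inspect}"
```

The first comparison is empty even though the content behind "latest" may differ, which is why nothing is scheduled.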
The secondary's stored OCI index references platform manifest digests
from a previous (now-stale) sync. Those platform manifests no longer
exist on the secondary (garbage-collected, or never pushed). docker pull
finds the index but fails on the inner platform-manifest lookup →
manifest unknown.
Docker manifest lists are not affected because the registry will
content-negotiate to the inner Docker V2 manifest (which IS in Accept).
OCI index has no such fallback.
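The negotiation behaviour described above can be illustrated with a toy model. This is only a sketch of the behaviour as this issue describes it, not the registry's actual implementation; head_digest and its parameters are hypothetical names, and the digests are placeholders.

```ruby
DOCKER_V2   = "application/vnd.docker.distribution.manifest.v2+json"
DOCKER_LIST = "application/vnd.docker.distribution.manifest.list.v2+json"
OCI_V1      = "application/vnd.oci.image.manifest.v1+json"
OCI_INDEX   = "application/vnd.oci.image.index.v1+json"

NARROW_ACCEPT = [DOCKER_V2, OCI_V1].freeze
WIDE_ACCEPT   = [DOCKER_V2, OCI_V1, DOCKER_LIST, OCI_INDEX].freeze

# Hypothetical stand-in for a HEAD /v2/<path>/manifests/<tag> request.
# Returns the digest the client would see, or nil when the stored type is
# not acceptable and no fallback applies.
def head_digest(stored_type:, digest:, accept:, inner_docker_v2_digest: nil)
  return digest if accept.include?(stored_type)

  # Assumed fallback: a Docker manifest list can be negotiated down to an
  # inner Docker V2 manifest when that type is in the Accept list.
  return inner_docker_v2_digest if stored_type == DOCKER_LIST && accept.include?(DOCKER_V2)

  nil
end

oci_narrow  = head_digest(stored_type: OCI_INDEX, digest: "sha256:index", accept: NARROW_ACCEPT)
oci_wide    = head_digest(stored_type: OCI_INDEX, digest: "sha256:index", accept: WIDE_ACCEPT)
list_narrow = head_digest(stored_type: DOCKER_LIST, digest: "sha256:list", accept: NARROW_ACCEPT,
                          inner_docker_v2_digest: "sha256:inner")

puts "OCI index, narrow Accept:   #{oci_narrow.inspect}"
puts "OCI index, wide Accept:     #{oci_wide.inspect}"
puts "Docker list, narrow Accept: #{list_narrow.inspect}"
```

Only the OCI index with the narrow Accept list ends up with a nil digest in this model, matching the asymmetry between the two multi-platform formats.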
Downstream impact: Geo verification is also unreliable on legacy installs
The same buggy repository_tag_digest is reached through tag.digest
(lib/container_registry/tag.rb#L72-L76),
which is called by ContainerRepository#tag_list_digest — the checksum
Geo's verification framework uses to confirm replication integrity
(ee/app/replicators/geo/container_repository_replicator.rb#L117).
- With the GitLab Container Registry API enabled, tag.digest returns @manifest_digest populated by transform_tags_page (app/models/container_repository.rb#L348), which holds real digests from the API. Verification works correctly and has been flagging affected registries as verification_failed.
- Without the API (legacy installs), tag.digest falls back to the buggy repository_tag_digest. Both sides compute nil-filled checksums that happen to match. Verification silently reports success on broken data.
Fix 1 cleans this up automatically — once repository_tag_digest returns
real digests, tag_list_digest becomes meaningful again on every install.
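As a rough illustration of why nil-filled checksums agree, the sketch below hashes a sorted list of name/digest pairs. toy_tag_list_digest is a hypothetical stand-in and is not the real ContainerRepository#tag_list_digest; the tag data is made up.

```ruby
require "digest"

# Hypothetical checksum over a tag list: sort by name, join "name:digest"
# pairs, and hash the result. A nil digest interpolates as an empty string.
def toy_tag_list_digest(tags)
  payload = tags.sort_by { |t| t[:name] }
                .map { |t| "#{t[:name]}:#{t[:digest]}" }
                .join(",")
  Digest::SHA256.hexdigest(payload)
end

# Legacy path: both sides resolve every OCI-index tag to nil.
primary_nil   = [{ name: "latest", digest: nil }, { name: "v1", digest: nil }]
secondary_nil = [{ name: "v1", digest: nil }, { name: "latest", digest: nil }]
legacy_match = toy_tag_list_digest(primary_nil) == toy_tag_list_digest(secondary_nil)

# With real digests, stale content on the secondary changes the checksum.
primary_real   = [{ name: "latest", digest: "sha256:new" }, { name: "v1", digest: "sha256:aaa" }]
secondary_real = [{ name: "latest", digest: "sha256:old" }, { name: "v1", digest: "sha256:aaa" }]
fixed_match = toy_tag_list_digest(primary_real) == toy_tag_list_digest(secondary_real)

puts "nil-filled checksums match:  #{legacy_match}"
puts "real-digest checksums match: #{fixed_match}"
```

Under these assumptions the checksum only encodes tag names when digests are nil, so it cannot detect stale content.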
2. Submanifest push uses an optional field with no fallback (latent)
In ee/app/services/geo/container_repository_sync.rb#L53-L63:
container_repository.push_manifest(
  submanifest_ref['digest'],
  submanifest_raw,
  submanifest_parsed['mediaType'] # ← can be nil
)
Per the OCI image manifest spec,
body mediaType is optional. docker buildx / buildkit frequently
omit it. When nil, the push goes out with no Content-Type, the registry
rejects it, the per-tag rescue swallows the error, and the parent OCI
index push at line 74 never runs.
The parent-manifest path at line 73 already has the correct fallback. The submanifest path was missed.
This wasn't the cause for the customer we investigated (both platform
manifests carried body mediaType), but it's a real latent bug that
should be fixed in the same MR.
Customer reproduction summary
Full details in note 3363582817.
- Same HEAD request, narrow Accept → nil; wide Accept → sha256:599be818.... Reproduced live.
- 52 of 62 primary tags and 51 of 62 secondary tags resolve to digest: nil.
- tags_to_sync: [], tags_to_remove: [].
- Tag names match. Symptom is staleness, not absence.
- Both platform submanifests for :latest carry body mediaType, so Fix 2 is not firing for this case.
Proposed fix
Three coordinated changes in a single MR.
Fix 1 — widen ACCEPTED_TYPES (primary fix)
lib/container_registry/base_client.rb:
ACCEPTED_TYPES = [
  DOCKER_DISTRIBUTION_MANIFEST_V2_TYPE,
  OCI_MANIFEST_V1_TYPE,
  DOCKER_DISTRIBUTION_MANIFEST_LIST_V2_TYPE,
  OCI_DISTRIBUTION_INDEX_TYPE
].freeze
Collapses ACCEPTED_TYPES and ACCEPTED_TYPES_RAW to a single list,
matching the existing pattern in
ee/app/models/virtual_registries/container/upstream.rb.
Fix 2 — submanifest mediaType fallback (latent robustness)
ee/app/services/geo/container_repository_sync.rb:
manifest_parsed['manifests'].each do |submanifest_ref|
  submanifest_raw = client.repository_raw_manifest(repository_path, submanifest_ref['digest'])
  submanifest_parsed = Gitlab::Json.safe_parse(submanifest_raw)
  sync_manifest_blobs(submanifest_parsed)
  submanifest_media_type =
    submanifest_parsed['mediaType'] ||
    submanifest_ref['mediaType'] ||
    ContainerRegistry::Client::OCI_MANIFEST_V1_TYPE
  container_repository.push_manifest(
    submanifest_ref['digest'],
    submanifest_raw,
    submanifest_media_type
  )
end
The descriptor's mediaType is REQUIRED by the OCI spec, so it's a
reliable fallback.
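The fallback order can be exercised in isolation with plain hashes. This is a minimal sketch: resolve_media_type is a hypothetical helper name used only for illustration, the constant is redefined locally so the snippet runs outside Rails, and the descriptor data is made up.

```ruby
# Local copy of the OCI image manifest media type so the sketch is standalone.
OCI_MANIFEST_V1_TYPE = "application/vnd.oci.image.manifest.v1+json"

# Hypothetical helper mirroring the fallback chain: manifest body first,
# then the index descriptor, then the OCI image manifest type as a last resort.
def resolve_media_type(submanifest_parsed, submanifest_ref)
  submanifest_parsed["mediaType"] ||
    submanifest_ref["mediaType"] ||
    OCI_MANIFEST_V1_TYPE
end

descriptor = {
  "digest" => "sha256:placeholder",
  "mediaType" => "application/vnd.oci.image.manifest.v1+json"
}

# Body carries mediaType: the body value wins.
with_body = resolve_media_type(
  { "mediaType" => "application/vnd.docker.distribution.manifest.v2+json" }, descriptor
)
# Body omits mediaType (as the issue says buildx / buildkit output often does):
# the descriptor value is used.
without_body = resolve_media_type({ "schemaVersion" => 2 }, descriptor)
# Neither carries it (non-conformant descriptor): the constant is used.
neither = resolve_media_type({}, { "digest" => "sha256:placeholder" })

puts with_body
puts without_body
puts neither
```

In all three cases the result is non-nil, so the push always goes out with a Content-Type.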
Fix 3 — read secondary_tags from the Docker V2 client
Geo::ContainerRepositorySync#primary_tags reads digests from the
Docker V2 client (client.repository_tag_digest). #secondary_tags
currently goes through container_repository.tags, which prefers the GitLab
Container Registry API client (gitlab_api_client) when available.
The two paths can return digests in different formats for the same
manifest content, so even after Fix 1 closes the nil-digest gap,
residual asymmetry could still cause false-MATCH / false-MISMATCH on
primary_tags - secondary_tags.
Mirror the primary client setup on the secondary so both sides resolve digests through the same Docker V2 code path:
def secondary_client
  strong_memoize_with_expiration(:secondary_client, ContainerRepository.registry_client_expiration_time) do
    ContainerRegistry::Client.new(
      Gitlab.config.registry.api_url,
      token: ::Auth::ContainerRegistryAuthenticationService.pull_access_token(repository_path)
    )
  end
end
def secondary_tags
  strong_memoize(:secondary_tags) do
    manifest = secondary_client.repository_tags(repository_path)
    next [] unless manifest && manifest['tags']
    manifest['tags'].map do |tag|
      { name: tag, digest: secondary_client.repository_tag_digest(repository_path, tag) }
    end
  end
end
Test gaps to close
- ee/spec/services/geo/container_repository_sync_spec.rb: extend the OCI manifest list context with a case where the inner manifest body omits mediaType. Assert push_manifest is called with a non-nil Content-Type. Existing tests pass anything for that argument, which is why Fix 2 slipped through.
- ee/spec/lib/container_registry/client_spec.rb: update Accept-header expectations to match the merged list.
- Once Fix 3 lands, container_repository.tags is no longer called from sync, so the "when the GitLab API is / is not supported" context splits collapse into a single Docker V2 stub path.
Workaround for affected installs (no upgrade required)
For customers on a version that doesn't yet include Fix 1, the Rails console script below reconciles container repository tags by bypassing the buggy comparison path:
- Lists tag names on both sides via the standard Docker V2 API.
- For shared tags, compares digests using HEAD with the wide Accept header, the same workaround the fix applies.
- Force-resyncs missing and stale tags by calling the private sync_tag method directly (which uses the wide Accept header internally).
- Removes orphan tags (only on secondary) via DELETE /v2/<path>/manifests/<tag> with the wide Accept header, the workaround from #465580 (note 3325972034). Requires GitLab Container Registry 16.4+.
- Re-verifies post-reconcile using the same wide-Accept comparison.
- Triggers Geo's verification framework so the registry's verification_state catches up.
Safe to run in dry-run mode (the default). Pass execute: true to apply.
For installs with many Geo::ContainerRepositoryRegistry rows, the
script ships with a reconcile_all wrapper that supports tunable
batch_size: and a start: ID for resuming an interrupted run. Ctrl+C
prints a ready-to-paste resume command containing the last touched
registry ID, and reconcile_all defaults to a quiet mode that keeps
the output proportional to the number of repos that actually need work.
Check Geo verification status before / after
Geo::ContainerRepositoryRegistry.find_each do |reg|
  puts "CR ##{reg.container_repository_id} state=#{reg.state} verification_state=#{reg.verification_state}"
end
Registries in verification_state=3 (failed) before the reconcile are
the ones the script should target. After a successful reconcile and the
verify_async it triggers, they should flip to verification_state=2
(succeeded).
Reconcile script (Rails console, runs on the SECONDARY)
# Workaround for #600486 — reconcile container repository tags on a Geo SECONDARY
# Run in the secondary's Rails console.
#
# Detects drift between primary and secondary using HEAD with the wide Accept
# header (the gap that #600486 fixes), force-resyncs missing/stale tags by
# calling the private `sync_tag` method directly, removes orphan tags via the
# OCI tag-delete endpoint (the workaround from #465580 note 3325972034),
# re-verifies with our own wide-Accept check, and then triggers Geo's
# verification framework so the registry's verification_state catches up.
#
# Why our verification check works despite the bug: the buggy code path uses
# the narrow Accept header and is only exercised during the normal sync flow.
# This script uses the wide Accept header directly.
#
# Note on Geo verification (the framework one):
# - On installs with the GitLab Container Registry API enabled,
# `tag_list_digest` reads `@manifest_digest` populated by the API client —
# real digests — so verification works correctly and our `verify_async`
# trigger is meaningful.
# - On installs WITHOUT the API, `tag.digest` falls back to the buggy
# `client.repository_tag_digest`, so both sides compute nil-filled
# checksums that match. Verification reports success on broken data.
# Affected installs need the upstream fix (#600486) before Geo verification
# becomes trustworthy.
#
# Caveats:
# - The OCI tag-delete endpoint requires GitLab Container Registry 16.4+.
# - If an OCI submanifest body omits `mediaType` (Fix 2 latent bug), the
# per-tag push will fail. The script logs the failure and the
# re-verification will surface the remaining drift.
require 'faraday'
def wide_head_digest(faraday_conn, path, tag_name)
  resp = faraday_conn.head("/v2/#{path}/manifests/#{tag_name}") do |req|
    req.headers['Accept'] = ContainerRegistry::Client::ACCEPTED_TYPES_RAW.join(', ')
  end
  return nil unless resp.success?
  resp.headers[DependencyProxy::Manifest::DIGEST_HEADER]
end
def wide_delete_tag(faraday_conn, path, tag_name)
  faraday_conn.delete("/v2/#{path}/manifests/#{tag_name}") do |req|
    req.headers['Accept'] = ContainerRegistry::Client::ACCEPTED_TYPES_RAW.join(', ')
  end
end
def build_conn(url, token)
  # Bump default Net::HTTP timeouts (60s) for slow registries under load.
  # Tune these higher if you still see Faraday::TimeoutError / Net::ReadTimeout.
  Faraday.new(url, request: { open_timeout: 60, read_timeout: 120, timeout: 120 }) do |f|
    f.request :authorization, :bearer, token
    f.adapter :net_http
  end
end
def check_drift(cr_path, primary_client, secondary_client, primary_conn, secondary_conn)
  primary_names = (primary_client.repository_tags(cr_path)&.dig('tags') || []).sort
  secondary_names = (secondary_client.repository_tags(cr_path)&.dig('tags') || []).sort
  only_on_primary = primary_names - secondary_names
  only_on_secondary = secondary_names - primary_names
  shared = primary_names & secondary_names
  stale = shared.select do |name|
    wide_head_digest(primary_conn, cr_path, name) != wide_head_digest(secondary_conn, cr_path, name)
  end
  {
    primary_count: primary_names.size,
    secondary_count: secondary_names.size,
    only_on_primary: only_on_primary,
    only_on_secondary: only_on_secondary,
    stale: stale
  }
end
def print_state(label, s)
  puts "#{label}:"
  puts " Primary tags: #{s[:primary_count]}"
  puts " Secondary tags: #{s[:secondary_count]}"
  puts " Only on primary: #{s[:only_on_primary].size}"
  puts " Only on secondary: #{s[:only_on_secondary].size}"
  puts " Stale on secondary: #{s[:stale].size}"
end
def reconcile_container_repository(container_repository, execute: false, quiet: false)
  cr_path = container_repository.path
  sync = Geo::ContainerRepositorySync.new(container_repository)
  primary_token = Auth::ContainerRegistryAuthenticationService.pull_access_token(cr_path)
  secondary_token = Auth::ContainerRegistryAuthenticationService.full_access_token(cr_path)
  primary_url = Gitlab.config.geo.registry_replication.primary_api_url
  secondary_url = Gitlab.config.registry.api_url
  primary_conn = build_conn(primary_url, primary_token)
  secondary_conn = build_conn(secondary_url, secondary_token)
  primary_client = ContainerRegistry::Client.new(primary_url, token: primary_token)
  secondary_client = ContainerRegistry::Client.new(secondary_url, token: secondary_token)
  before = check_drift(cr_path, primary_client, secondary_client, primary_conn, secondary_conn)
  tags_to_resync = before[:only_on_primary] + before[:stale]
  tags_to_remove = before[:only_on_secondary]
  in_sync = tags_to_resync.empty? && tags_to_remove.empty?
  # Quiet mode: skip output for already-in-sync repos. Drift / errors still print.
  return if in_sync && quiet
  puts "=" * 70
  puts "CR ##{container_repository.id} #{cr_path} (execute: #{execute})"
  print_state("Before", before)
  if in_sync
    puts "Already in sync — nothing to do."
    return
  end
  unless execute
    puts "DRY RUN."
    puts " Resync candidates (first 10): #{tags_to_resync.first(10).inspect}" if tags_to_resync.any?
    puts " Remove candidates (first 10): #{tags_to_remove.first(10).inspect}" if tags_to_remove.any?
    return
  end
  resync_ok = resync_failed = 0
  tags_to_resync.each do |name|
    sync.send(:sync_tag, { name: name })
    resync_ok += 1
    puts " resync ✓ #{name}"
  rescue StandardError => e
    resync_failed += 1
    puts " resync ✗ #{name} #{e.class}: #{e.message}"
  end
  remove_ok = remove_failed = 0
  tags_to_remove.each do |name|
    resp = wide_delete_tag(secondary_conn, cr_path, name)
    if resp.success? || resp.status == 404
      remove_ok += 1
      puts " remove ✓ #{name} (status #{resp.status})"
    else
      remove_failed += 1
      puts " remove ✗ #{name} status=#{resp.status} body=#{resp.body.to_s[0..200]}"
    end
  rescue StandardError => e
    remove_failed += 1
    puts " remove ✗ #{name} #{e.class}: #{e.message}"
  end
  puts "Done. resync ok=#{resync_ok} failed=#{resync_failed} remove ok=#{remove_ok} failed=#{remove_failed}"
  puts ""
  after = check_drift(cr_path, primary_client, secondary_client, primary_conn, secondary_conn)
  print_state("After", after)
  matched = after[:only_on_primary].empty? && after[:only_on_secondary].empty? && after[:stale].empty?
  if matched
    puts "✓ Verified: secondary fully matches primary."
  else
    remaining = after[:only_on_primary].size + after[:only_on_secondary].size + after[:stale].size
    puts "✗ #{remaining} tag(s) still mismatched after reconcile. Check the failure log above."
  end
  # Trigger Geo's own verification framework so the registry state catches up.
  if matched
    container_repository.replicator.verify_async
    puts "→ Geo verify_async queued. Check verification_state on the registry shortly."
  else
    puts "→ Skipping Geo verify_async because drift remains; address the failures first."
  end
end
# Batch wrapper over `Geo::ContainerRepositoryRegistry.find_each` with
# resumable iteration. Prints a progress checkpoint every `progress_every`
# registries and, on Ctrl+C, prints a ready-to-paste resume command using
# the last touched registry ID. Defaults to `quiet: true` so already-in-sync
# repositories do not flood the console — drift and errors still print.
def reconcile_all(batch_size: 1000, start: nil, execute: false,
                  progress_every: 100, quiet: true)
  last_id = nil
  total = 0
  errors = 0
  begin
    Geo::ContainerRepositoryRegistry.find_each(batch_size: batch_size, start: start) do |reg|
      last_id = reg.id
      total += 1
      begin
        cr = reg.container_repository
        reconcile_container_repository(cr, execute: execute, quiet: quiet) if cr
      rescue StandardError => e
        errors += 1
        puts "✗ Registry ##{reg.id} (CR ##{reg.container_repository_id}): #{e.class}: #{e.message}"
      end
      if (total % progress_every).zero?
        puts ">>> Progress: #{total} processed | last registry ID: #{last_id} | errors: #{errors}"
      end
    end
  rescue Interrupt
    puts ""
    puts ">>> Interrupted at registry ID #{last_id}."
    puts ">>> Resume with: reconcile_all(execute: #{execute}, batch_size: #{batch_size}, start: #{last_id})"
    raise
  end
  puts ""
  puts ">>> Done. Processed: #{total} | errors: #{errors} | last registry ID: #{last_id}"
end
# === Usage ===
#
# Single repo, dry-run:
# reconcile_container_repository(ContainerRepository.find(<id>))
#
# Single repo, execute:
# reconcile_container_repository(ContainerRepository.find(<id>), execute: true)
#
# All Geo-tracked repos, dry-run (quiet by default — only drift / errors print):
# reconcile_all
#
# All Geo-tracked repos, execute mode with a smaller batch and more frequent progress lines:
# reconcile_all(execute: true, batch_size: 500, progress_every: 50)
#
# Resume after Ctrl+C / dropped session — copy the registry ID from the printed message:
# reconcile_all(execute: true, batch_size: 500, start: 14567)
#
# Verbose mode (prints header / Before / "Already in sync" for every CR):
# reconcile_all(execute: true, quiet: false)
Recovery
Once Fix 1 lands:
- repository_tag_digest returns real digests on both sides.
- For currently-stale tags, primary and secondary digests differ → tag enters tags_to_sync → sync should heal it on the next pass.
- For severely stale repos where the old platform manifest blobs have already been garbage-collected, a manual registry resync may still be needed.
Verification snippets (read-only)
Rails console snippets to confirm the issue on any environment
Replace <container-repository-id> and the path/tag placeholders with
real values from an affected environment. These are read-only — no
manifests are pushed, deleted, or modified.
A. Confirm the count-MATCH is misleading
cr = ContainerRepository.find(<container-repository-id>)
sync = Geo::ContainerRepositorySync.new(cr)
primary_names = sync.send(:primary_tags).map { |t| t[:name] }.sort
secondary_names = sync.send(:secondary_tags).map { |t| t[:name] }.sort
puts "Counts: primary=#{primary_names.size} secondary=#{secondary_names.size}"
puts "Only on primary: #{(primary_names - secondary_names).inspect}"
puts "Only on secondary: #{(secondary_names - primary_names).inspect}"
B. Narrow vs wide Accept header (the core hypothesis test)
cr_path = '<group>/<project>'
tag_name = 'latest'
token = Auth::ContainerRegistryAuthenticationService.pull_access_token(cr_path)
primary_url = Gitlab.config.geo.registry_replication.primary_api_url
# Current code (narrow Accept — missing list/index types)
client = ContainerRegistry::Client.new(primary_url, token: token)
narrow_digest = client.repository_tag_digest(cr_path, tag_name)
# Proposed fix (wide Accept — includes list/index types)
require 'faraday'
conn = Faraday.new(primary_url) do |f|
  f.request :authorization, :bearer, token
  f.adapter :net_http
end
resp = conn.head("/v2/#{cr_path}/manifests/#{tag_name}") do |req|
  req.headers['Accept'] = ContainerRegistry::Client::ACCEPTED_TYPES_RAW.join(', ')
end
wide_digest = resp.headers[DependencyProxy::Manifest::DIGEST_HEADER]
puts "Narrow Accept (current): #{narrow_digest.inspect}"
puts "Wide Accept (fixed): #{wide_digest.inspect} status: #{resp.status}"
Expected: narrow_digest is nil; wide_digest is sha256:....
C. Show the silent-skip in tags_to_sync / tags_to_remove
cr = ContainerRepository.find(<container-repository-id>)
sync = Geo::ContainerRepositorySync.new(cr)
primary_tags = sync.send(:primary_tags)
secondary_tags = sync.send(:secondary_tags)
nil_primary = primary_tags.select { |t| t[:digest].nil? }.map { |t| t[:name] }
nil_secondary = secondary_tags.select { |t| t[:digest].nil? }.map { |t| t[:name] }
puts "Primary tags with nil digest: #{nil_primary.inspect}"
puts "Secondary tags with nil digest: #{nil_secondary.inspect}"
puts "tags_to_sync: #{sync.send(:tags_to_sync).map { |t| t[:name] }.inspect}"
puts "tags_to_remove: #{sync.send(:tags_to_remove).map { |t| t[:name] }.inspect}"
D. Check if the submanifest body-mediaType bug is also active
cr_path = '<group>/<project>'
tag_name = 'latest'
token = Auth::ContainerRegistryAuthenticationService.pull_access_token(cr_path)
primary_client = ContainerRegistry::Client.new(
  Gitlab.config.geo.registry_replication.primary_api_url,
  token: token
)
index_raw = primary_client.repository_raw_manifest(cr_path, tag_name)
index = Gitlab::Json.safe_parse(index_raw)
puts "Index body mediaType: #{index['mediaType'].inspect}"
index['manifests'].each_with_index do |sub_ref, i|
  sub_raw = primary_client.repository_raw_manifest(cr_path, sub_ref['digest'])
  sub_parsed = Gitlab::Json.safe_parse(sub_raw)
  puts " [#{i}] #{sub_ref['digest']}"
  puts " descriptor.mediaType: #{sub_ref['mediaType'].inspect}"
  puts " body.mediaType: #{sub_parsed['mediaType'].inspect}"
end
E. Inspect the secondary's view of the failing tag
cr = ContainerRepository.find(<container-repository-id>)
tag_name = 'latest'
secondary_tag_names = cr.tags.map(&:name)
puts "Tag '#{tag_name}' present on secondary? #{secondary_tag_names.include?(tag_name)}"
if secondary_tag_names.include?(tag_name)
  token = Auth::ContainerRegistryAuthenticationService.full_access_token(cr.path)
  secondary_client = ContainerRegistry::Client.new(Gitlab.config.registry.api_url, token: token)
  begin
    raw = secondary_client.repository_raw_manifest(cr.path, tag_name)
    parsed = Gitlab::Json.safe_parse(raw)
    puts "Secondary manifest mediaType: #{parsed['mediaType'].inspect}"
    puts raw[0..500]
  rescue StandardError => e
    puts "Could not fetch manifest from secondary: #{e.class}: #{e.message}"
  end
end
F. Geo log evidence (host shell)
grep -E "container_repository_sync|Error while syncing tag" \
  /var/log/gitlab/geo-logcursor/*.log | grep -i '<group>/<project>'
Lines like Error while syncing tag latest: Push manifest error: ...
indicate Fix 2 is firing in practice (not just latent).
Related
- #465580 (closed): the inverse symptom, orphan tags on the secondary with digest: nil that can't be deleted. Same Accept-header root cause; that issue also proposes a name-based remove_tag fallback that complements this fix.
- #465580 (closed) (note 3325972034): independent identification of the Accept-header gap.
/cc @eakca1