Geo: Direct upload object stored job artifacts not replicating
Summary
In 16.1.0 - 16.1.3 and 16.2.0 - 16.2.2, new job artifacts will not be replicated by Geo if job artifacts are configured to be stored in object storage and direct_upload is enabled. There may be no sync failure, so the data loss risk is hidden.
This bug is fixed in GitLab versions 16.1.4, 16.2.3, 16.3.0, and later.
Job artifacts which are already affected will continue to be missing. We are working on fixing the data. In the meantime, you can fix it manually by following the steps under "To resync the affected artifacts" below.
Am I affected?
If you answer yes to all of the following questions, you are affected. If you answer no to any question, you are not affected.
- Do you have a secondary Geo site?
- Did you run GitLab 16.1.0 - 16.1.3 or 16.2.0 - 16.2.2?
- Does your gitlab.rb or gitlab.yaml of a Rails/Sidekiq node or webservice/sidekiq workload have Object Storage configured for all object types or for job artifacts?
- Do you have GitLab managed object storage replication enabled for at least one secondary Geo site?
- In Rails console on those secondary Geo sites, does this script return greater than 0?
  - Replace 2023-07-20T09:00Z with the date and time (or an earlier one) when you upgraded to an affected release.
  - Replace 2023-08-04T13:00Z with the date and time (or a later one) when you upgraded to a fixed release.
synced_after = DateTime.parse("2023-07-20T09:00Z")
synced_before = DateTime.parse("2023-08-04T13:00Z")
registry_rows = Geo::JobArtifactRegistry.synced.where("last_synced_at > ? AND last_synced_at < ?", synced_after, synced_before)
registry_rows.count
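The time-window filter used in the check can be illustrated outside a Rails console. The following is a minimal sketch in plain Ruby, not GitLab code: RegistryRow and rows_in_window are hypothetical stand-ins for Geo::JobArtifactRegistry rows and the where clause, and the sample timestamps are made up. It shows that both bounds are exclusive, which is why choosing a slightly earlier start or a slightly later end only widens the selection.

```ruby
require "date"

# Hypothetical stand-in for a registry row; the real rows live in the database.
RegistryRow = Struct.new(:id, :last_synced_at)

# Mirrors the condition "last_synced_at > ? AND last_synced_at < ?".
# Both comparisons are strict, so rows exactly on a bound are not selected.
def rows_in_window(rows, synced_after, synced_before)
  rows.select do |row|
    row.last_synced_at > synced_after && row.last_synced_at < synced_before
  end
end

synced_after  = DateTime.parse("2023-07-20T09:00Z")
synced_before = DateTime.parse("2023-08-04T13:00Z")

# Made-up sample data: one row before, one inside, one exactly on the upper bound.
rows = [
  RegistryRow.new(1, DateTime.parse("2023-07-19T23:00Z")),
  RegistryRow.new(2, DateTime.parse("2023-07-25T12:00Z")),
  RegistryRow.new(3, DateTime.parse("2023-08-04T13:00Z"))
]

puts rows_in_window(rows, synced_after, synced_before).map(&:id).inspect
```

In this sketch only the row with id 2 falls inside the window.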
To resync the affected artifacts:
Continuing in the same Rails console session from "Am I affected?", run:
"Marking #{registry_rows.count} rows as pending..."
registry_rows.update_all(state: Geo::JobArtifactRegistry.state_value(:pending), last_synced_at: nil)
The affected artifacts will be resynced by Sidekiq workers in the background according to your normal concurrency settings.
More details
Errors: https://sentry.gitlab.net/gitlab/staging-ref/issues/4174439/?query=is%3Aunresolved%20without_bucket_prefix, possibly related to the recent !127017 (merged).
GCP logs: https://cloudlogging.app.goo.gl/KAJr5TLQJA3hG4TT9
In GitLab 16.3 (pre-release as of now, 28 July), when Geo attempts to sync job artifacts in object storage, the sync fails with an ArgumentError in the Geo::EventWorker job. The error in last_sync_failure is: different prefix: "/" and "."
In GitLab 16.1, the "direct upload to final path" flow was enabled by default (the feature flag was removed). In this release, artifacts which go through the new flow on the primary do not replicate to the secondary site. But they may have been marked as successfully synced. They do not raise the different prefix error since that particular error comes from a line added in 16.2.
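The different prefix message has the same shape as the ArgumentError that Ruby's standard-library Pathname#relative_path_from raises when one path is absolute and the other is relative. The sketch below reproduces that behavior in plain Ruby. It assumes, without confirmation from this issue, that without_bucket_prefix strips a bucket prefix in a similar way; relative_to_prefix and describe_prefix_result are hypothetical helper names, and the paths are made up.

```ruby
require "pathname"

# Hypothetical helper: returns `path` relative to `prefix` using the
# standard library. This is an illustration, not GitLab's implementation.
def relative_to_prefix(path, prefix)
  Pathname.new(path).relative_path_from(Pathname.new(prefix)).to_s
end

# Returns the relative path, or a description of the ArgumentError
# raised when an absolute and a relative path are mixed.
def describe_prefix_result(path, prefix)
  relative_to_prefix(path, prefix)
rescue ArgumentError => e
  "#{e.class}: #{e.message}"
end

# Relative path with a matching prefix: the prefix is stripped.
puts describe_prefix_result("my-prefix/ab/cd/file.zip", "my-prefix")

# Relative path with an empty prefix: the path is returned unchanged.
puts describe_prefix_result("ab/cd/file.zip", "")

# Absolute path with an empty prefix: Pathname cannot relate the two.
puts describe_prefix_result("/ab/cd/file.zip", "")
```

If the assumption holds, the "." in the logged message would be consistent with an empty prefix and the "/" with an absolute stored path, but that reading is a hypothesis rather than a confirmed root cause.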
Example log from staging-ref's secondary Geo site with the error:
{
"insertId": "yra1u04ljrwmp5v5",
"jsonPayload": {
"meta.root_caller_id": "Cronjob",
"extra.model_record_id": 49475,
"extra.sidekiq": {
"jid": "323ad64a5305983a2c4050b6",
"args": [
"job_artifact",
"created",
"{\"model_record_id\"=>49475}"
],
"correlation_id": "7a434179583bd1a4ed74b8b4e8c53ffe",
"dead": false,
"version": 0,
"idempotency_key": "resque:gitlab:duplicate:default:3b6b8ed2bc19e576159cfc756191b12fbf9559e8eb21223cae4cc595e2a6d291",
"meta.client_id": "ip/",
"class": "Geo::EventWorker",
"meta.feature_category": "geo_replication",
"queue_namespace": "geo",
"meta.root_caller_id": "Cronjob",
"worker_data_consistency": "always",
"created_at": 1690333750.5597563,
"meta.caller_id": "Geo::RegistrySyncWorker",
"enqueued_at": 1690333750.5616488,
"status_expiration": 1800,
"queue": "default",
"retry": 3,
"size_limiter": "validated"
},
"tags.program": "sidekiq",
"tags.feature_category": "geo_replication",
"tags.locale": "en",
"correlation_id": "7a434179583bd1a4ed74b8b4e8c53ffe",
"meta.caller_id": "Geo::EventWorker",
"meta.client_id": "ip/",
"exception.backtrace": [
"app/uploaders/object_storage.rb:201:in `without_bucket_prefix'",
"app/uploaders/object_storage.rb:34:in `store!'",
"app/uploaders/gitlab_uploader.rb:144:in `replace_file_without_saving!'",
"ee/lib/gitlab/geo/replication/blob_downloader.rb:180:in `download_file'",
"ee/lib/gitlab/geo/replication/blob_downloader.rb:46:in `execute'",
"ee/app/services/geo/blob_download_service.rb:31:in `block in execute'",
"app/services/concerns/exclusive_lease_guard.rb:29:in `try_obtain_lease'",
"ee/app/services/geo/blob_download_service.rb:26:in `execute'",
"ee/app/models/concerns/geo/blob_replicator_strategy.rb:174:in `download'",
"ee/app/models/concerns/geo/blob_replicator_strategy.rb:96:in `resync'",
"ee/app/models/concerns/geo/blob_replicator_strategy.rb:75:in `consume_event_created'",
"ee/lib/gitlab/geo/replicator.rb:269:in `consume'",
"ee/app/services/geo/event_service.rb:18:in `execute'",
"ee/app/workers/geo/event_worker.rb:15:in `perform'",
"lib/gitlab/sidekiq_middleware/skip_jobs.rb:49:in `call'",
"lib/gitlab/database/load_balancing/sidekiq_server_middleware.rb:29:in `call'",
"lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/until_executing.rb:16:in `perform'",
"lib/gitlab/sidekiq_middleware/duplicate_jobs/duplicate_job.rb:44:in `perform'",
"lib/gitlab/sidekiq_middleware/duplicate_jobs/server.rb:8:in `call'",
"lib/gitlab/sidekiq_middleware/worker_context.rb:9:in `wrap_in_optional_context'",
"lib/gitlab/sidekiq_middleware/worker_context/server.rb:19:in `block in call'",
"lib/gitlab/application_context.rb:124:in `block in use'",
"lib/gitlab/application_context.rb:124:in `use'",
"lib/gitlab/application_context.rb:62:in `with_context'",
"lib/gitlab/sidekiq_middleware/worker_context/server.rb:17:in `call'",
"lib/gitlab/sidekiq_status/server_middleware.rb:7:in `call'",
"lib/gitlab/sidekiq_versioning/middleware.rb:9:in `call'",
"lib/gitlab/sidekiq_middleware/query_analyzer.rb:7:in `block in call'",
"lib/gitlab/database/query_analyzer.rb:37:in `within'",
"lib/gitlab/sidekiq_middleware/query_analyzer.rb:7:in `call'",
"lib/gitlab/sidekiq_middleware/admin_mode/server.rb:14:in `call'",
"lib/gitlab/sidekiq_middleware/instrumentation_logger.rb:9:in `call'",
"lib/gitlab/sidekiq_middleware/batch_loader.rb:7:in `call'",
"lib/gitlab/sidekiq_middleware/extra_done_log_metadata.rb:7:in `call'",
"lib/gitlab/sidekiq_middleware/request_store_middleware.rb:10:in `block in call'",
"lib/gitlab/with_request_store.rb:17:in `enabling_request_store'",
"lib/gitlab/with_request_store.rb:10:in `with_request_store'",
"lib/gitlab/sidekiq_middleware/request_store_middleware.rb:9:in `call'",
"lib/gitlab/sidekiq_middleware/server_metrics.rb:84:in `block in call'",
"lib/gitlab/sidekiq_middleware/server_metrics.rb:111:in `block in instrument'",
"lib/gitlab/metrics/background_transaction.rb:33:in `run'",
"lib/gitlab/sidekiq_middleware/server_metrics.rb:111:in `instrument'",
"lib/gitlab/sidekiq_middleware/server_metrics.rb:83:in `call'",
"lib/gitlab/sidekiq_middleware/monitor.rb:10:in `block in call'",
"lib/gitlab/sidekiq_daemon/monitor.rb:46:in `within_job'",
"lib/gitlab/sidekiq_middleware/monitor.rb:9:in `call'",
"lib/gitlab/sidekiq_middleware/size_limiter/server.rb:13:in `call'",
"lib/gitlab/sidekiq_logging/structured_logger.rb:21:in `call'"
],
"tags.correlation_id": "7a434179583bd1a4ed74b8b4e8c53ffe",
"meta.feature_category": "geo_replication",
"subcomponent": "exceptions_json",
"exception.message": "different prefix: \"/\" and \".\"",
"component": "gitlab",
"level": "error",
"exception.class": "ArgumentError",
"extra.replicable_name": "job_artifact",
"user.username": null
},
"resource": {
"type": "k8s_container",
"labels": {
"namespace_name": "default",
"pod_name": "gitlab-sidekiq-all-in-1-v2-5df9c7b9fd-ntw8l",
"project_id": "gitlab-staging-ref",
"container_name": "sidekiq",
"location": "europe-west6-c",
"cluster_name": "staging-ref-3k-hybrid-eu"
}
},
"timestamp": "2023-07-26T01:09:11.802Z",
"severity": "ERROR",
"labels": {
"k8s-pod/app": "sidekiq",
"compute.googleapis.com/resource_name": "gke-staging-ref-3k-h-gl-sidekiq-20230-dbafd31d-v7i8",
"k8s-pod/pod-template-hash": "5df9c7b9fd",
"k8s-pod/queue-pod-name": "all-in-1",
"k8s-pod/release": "gitlab",
"k8s-pod/chart": "sidekiq-7.2.1",
"k8s-pod/heritage": "Helm"
},
"logName": "projects/gitlab-staging-ref/logs/stderr",
"receiveTimestamp": "2023-07-26T01:09:12.376150612Z"
}
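When triaging many entries like the one above, it can help to pull out only the fields that identify the failure. The following is a minimal sketch in plain Ruby, assuming each entry is available as a JSON string with the same keys as the example; summarize_geo_error and SAMPLE_ENTRY are hypothetical names, and SAMPLE_ENTRY is a trimmed copy of the log above.

```ruby
require "json"

# Hypothetical helper: extracts the identifying fields from one log entry
# shaped like the example above. Missing keys yield nil values.
def summarize_geo_error(entry_json)
  payload = JSON.parse(entry_json).fetch("jsonPayload", {})
  {
    replicable: payload["extra.replicable_name"],
    record_id: payload["extra.model_record_id"],
    error: "#{payload['exception.class']}: #{payload['exception.message']}",
    raised_at: Array(payload["exception.backtrace"]).first
  }
end

# Trimmed copy of the example entry; the quoted heredoc keeps backslashes literal.
SAMPLE_ENTRY = <<~'JSON'
  {
    "jsonPayload": {
      "extra.model_record_id": 49475,
      "extra.replicable_name": "job_artifact",
      "exception.class": "ArgumentError",
      "exception.message": "different prefix: \"/\" and \".\"",
      "exception.backtrace": [
        "app/uploaders/object_storage.rb:201:in `without_bucket_prefix'"
      ]
    }
  }
JSON

puts summarize_geo_error(SAMPLE_ENTRY)
```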