Geo: Improve error handling of missing PipelineArtifact files
### Summary
Two problems:

- A customer had `Ci::PipelineArtifact` records with missing files.
- In that situation, Geo's verification feature gets into a very confusing state, and it's not clear what investigation and interventions to take. `Ci::PipelineArtifact.verification_failure` shows `Verification timed out after 28800`, and `verification_state` gets stuck as `verification_started` instead of `verification_failed`.
### Steps to reproduce

1. Configure GitLab Geo.
2. Run a pipeline that creates a pipeline artifact.
3. Delete the artifact file on disk.
4. If `VerificationBatchWorker` already ran, then run `Ci::PipelineArtifact.find(123456).replicator.verify`.
### Example Project

Happens on self-managed instances.
### What is the current bug behavior?

The verification failure reason is recorded as `Verification timed out after 28800`, whereas the actual reason is different.
### What is the expected correct behavior?

The reason for verification failure on the primary should be clear. It should tell you that the file is unexpectedly missing. And with that failure message, a sysadmin should be able to find a corresponding Geo Troubleshooting section.

Note that even if the `Ci::PipelineArtifact` record is "expired", the file is still expected to exist. The record and file are intended to be deleted at the same time by `Ci::PipelineArtifacts::ExpireArtifactsWorker`.
### Relevant logs and/or screenshots

Actual reason in the Sidekiq log:

```json
{"severity":"WARN","time":"2021-10-08T11:31:15.727Z","class":"Geo::VerificationBatchWorker","args":["pipeline_artifact"],"retry":0,"queue":"geo:geo_verification_batch","version":0,"queue_namespace":"geo","jid":"2ab3a2fed4150dafa2108d36","created_at":"2021-10-08T11:31:14.868Z","enqueued_at":"2021-10-08T11:31:14.874Z","meta.caller_id":"Geo::VerificationCronWorker","meta.feature_category":"geo_replication","meta.client_id":"ip/","correlation_id":"1a8ee469f50024bc86e2b8f8ae22f9a2","job_size_bytes":21,"pid":12265,"message":"Geo::VerificationBatchWorker JID-2ab3a2fed4150dafa2108d36: fail: 0.488542 sec","job_status":"fail","scheduling_latency_s":0.36454,"redis_calls":4,"redis_duration_s":0.001689,"redis_read_bytes":4,"redis_write_bytes":444,"redis_queues_calls":4,"redis_queues_duration_s":0.001689,"redis_queues_read_bytes":4,"redis_queues_write_bytes":444,"db_count":13,"db_write_count":8,"db_cached_count":0,"cpu_s":0.018853,"mem_objects":9412,"mem_bytes":534904,"mem_mallocs":2816,"mem_total_bytes":911384,"duration_s":0.488542,"completed_at":"2021-10-08T11:31:15.727Z","error_message":"Cannot transition verification_state via :verification_failed from :verification_started (Reason(s): File can't be blank)","error_class":"StateMachines::InvalidTransition","error_backtrace":["ee/lib/gitlab/geo/verification_state.rb:358:in `verification_failed_with_message!'","ee/lib/gitlab/geo/verification_state.rb:324:in `rescue in track_checksum_attempt!'","ee/lib/gitlab/geo/verification_state.rb:308:in `track_checksum_attempt!'","ee/app/models/concerns/geo/verifiable_replicator.rb:209:in `verify'","ee/app/models/concerns/geo/verifiable_replicator.rb:58:in `each'","ee/app/models/concerns/geo/verifiable_replicator.rb:58:in `verify_batch'","ee/app/workers/geo/verification_batch_worker.rb:19:in `perform_work'","app/workers/concerns/limited_capacity/worker.rb:94:in `perform_registered'","app/workers/concerns/limited_capacity/worker.rb:65:in `perform'","lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/none.rb:14:in `perform'","lib/gitlab/sidekiq_middleware/duplicate_jobs/duplicate_job.rb:41:in `perform'","lib/gitlab/sidekiq_middleware/duplicate_jobs/server.rb:8:in `call'","lib/gitlab/sidekiq_middleware/worker_context.rb:9:in `wrap_in_optional_context'","lib/gitlab/sidekiq_middleware/worker_context/server.rb:17:in `block in call'","lib/gitlab/application_context.rb:74:in `block in use'","lib/gitlab/application_context.rb:74:in `use'","lib/gitlab/application_context.rb:27:in `with_context'","lib/gitlab/sidekiq_middleware/worker_context/server.rb:15:in `call'","lib/gitlab/sidekiq_status/server_middleware.rb:7:in `call'","lib/gitlab/sidekiq_versioning/middleware.rb:9:in `call'","lib/gitlab/sidekiq_middleware/admin_mode/server.rb:14:in `call'","lib/gitlab/sidekiq_middleware/instrumentation_logger.rb:9:in `call'","lib/gitlab/sidekiq_middleware/batch_loader.rb:7:in `call'","lib/gitlab/sidekiq_middleware/extra_done_log_metadata.rb:7:in `call'","lib/gitlab/sidekiq_middleware/request_store_middleware.rb:10:in `block in call'","lib/gitlab/with_request_store.rb:17:in `enabling_request_store'","lib/gitlab/with_request_store.rb:10:in `with_request_store'","lib/gitlab/sidekiq_middleware/request_store_middleware.rb:9:in `call'","lib/gitlab/sidekiq_middleware/server_metrics.rb:29:in `block in call'","lib/gitlab/sidekiq_middleware/server_metrics.rb:52:in `block in instrument'","lib/gitlab/metrics/background_transaction.rb:30:in `run'","lib/gitlab/sidekiq_middleware/server_metrics.rb:52:in `instrument'","lib/gitlab/sidekiq_middleware/server_metrics.rb:28:in `call'","lib/gitlab/sidekiq_middleware/monitor.rb:8:in `block in call'","lib/gitlab/sidekiq_daemon/monitor.rb:49:in `within_job'","lib/gitlab/sidekiq_middleware/monitor.rb:7:in `call'","lib/gitlab/sidekiq_middleware/size_limiter/server.rb:13:in `call'","lib/gitlab/sidekiq_logging/structured_logger.rb:19:in `call'"],"db_duration_s":0.110676}
```
Example flow in the verification system on the Geo primary site:

1. A pipeline artifact file disappears.
2. At this point, the `Ci::PipelineArtifact` record already violates the ActiveRecord validation `File can't be blank`, and calling `save` will never succeed.
3. Geo gets set up for the first time.
4. The pipeline artifact record hasn't ever been verified, so `VerificationBatchWorker` picks it up.
5. `VerificationBatchWorker` uses `update_all` to atomically mark its batch (including this record) as `verification_state == verification_started`.
6. `VerificationBatchWorker` calls `verify` on this record's `replicator`.
7. `calculate_checksum` raises because the file is missing.
8. `verification_failed_with_message!` gets called.
9. `verification_failed!` gets called.
10. `save` fails due to `File can't be blank`, so `StateMachines::InvalidTransition` gets raised.
11. Now the record is stuck in `verification_started`.
12. 8 hours pass.
13. `VerificationTimeoutWorker` uses `update_all` to change the state to `verification_failed` and sets `verification_failure` to `Verification timed out after 28800`.
14. Now we are in a loop: `VerificationBatchWorker` picks up the record again.
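The flow above can be sketched in a self-contained way. This is not GitLab code; all class and method names are illustrative stand-ins. The key asymmetry it models: the failure path persists state through a validated save (which can never succeed for a record with a missing file), while the batch and timeout workers write state with an `update_all`-style query that skips validations entirely.

```ruby
# Minimal stand-in for a Ci::PipelineArtifact whose file is missing.
class StuckArtifact
  attr_reader :file, :verification_state, :verification_failure

  def initialize(file:)
    @file = file
    @verification_state = :verification_pending
  end

  def valid?
    !@file.nil? # stand-in for the "File can't be blank" validation
  end

  # Stand-in for update_all / update_columns: skips validations.
  def write_columns(state:, failure: nil)
    @verification_state = state
    @verification_failure = failure
  end

  # Stand-in for verification_failed!: goes through a validated save,
  # so it raises for an invalid record instead of persisting the state.
  def verification_failed!(message)
    raise "StateMachines::InvalidTransition: File can't be blank" unless valid?

    @verification_state = :verification_failed
    @verification_failure = message
  end

  def verify
    write_columns(state: :verification_started) # VerificationBatchWorker's update_all
    raise "file missing" if @file.nil?          # calculate_checksum blows up
  rescue RuntimeError => e
    verification_failed!(e.message)             # raises again: record is invalid
  end
end

record = StuckArtifact.new(file: nil)

begin
  record.verify
rescue RuntimeError
  # The transition to verification_failed never persisted; the record is
  # stuck in verification_started.
end

# Eight hours later, the timeout worker's update_all succeeds because it
# bypasses validations, masking the real failure reason:
record.write_columns(state: :verification_failed,
                     failure: "Verification timed out after 28800")
```

Because both workers select records by `verification_state`, the record cycles between these two writes indefinitely, which is the loop described above.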
### Results of GitLab environment info

GitLab 14.1.0 through at least GitLab 17.9.0
### Proposal

Save the verification failure by skipping the `File can't be blank` validation. Lead the sysadmin to resolution.
### Implementation Guide

- Add an early guard in `VerificationState#track_checksum_attempt!` like `return verification_failed_invalid_record! unless valid?`
- Define `verification_failed_invalid_record!` to:
  - Collect the offending model validation errors (i.e. `File can't be blank`)
  - Successfully save the record in the `verification_failed` state by skipping validation
  - Make sure to save the model validation errors to the `verification_failure` field
  - Make sure that the validation failure is output to logs
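The guard described above might look roughly like the following self-contained sketch. It is not the actual GitLab implementation: `track_checksum_attempt!` and `verification_failed_invalid_record!` come from the proposal, while the rest of the class is an illustrative stand-in. The invalid record is detected before checksumming, and the failure is persisted with a write that skips validations.

```ruby
# Stand-in model demonstrating the proposed early guard.
class GuardedArtifact
  attr_reader :file, :verification_state, :verification_failure

  def initialize(file:)
    @file = file
    @verification_state = :verification_pending
  end

  def errors
    @file.nil? ? ["File can't be blank"] : []
  end

  def valid?
    errors.empty?
  end

  # Stand-in for update_columns: persists without running validations.
  def write_columns(state:, failure: nil)
    @verification_state = state
    @verification_failure = failure
  end

  def track_checksum_attempt!
    # Early guard proposed in this issue: bail out before checksumming
    # if the record itself can never be saved.
    return verification_failed_invalid_record! unless valid?

    # ... normal checksum path would go here ...
    write_columns(state: :verification_succeeded)
  end

  def verification_failed_invalid_record!
    message = "Error during verification: File is not checksummable: " \
              "#{errors.join(', ')}"
    # Skipping validations lets the failure actually persist:
    write_columns(state: :verification_failed, failure: message)
    warn message # make sure the reason also reaches the logs
  end
end

artifact = GuardedArtifact.new(file: nil)
artifact.track_checksum_attempt!
# verification_state is now :verification_failed, and verification_failure
# carries the real reason instead of a timeout message.
```

The important design point is that the failure write must not go through a validated save, or it hits the same `StateMachines::InvalidTransition` the bug describes.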
The result is that checksumming a pipeline artifact with a missing file now produces the conventional failure `File is not checksummable`, which is addressed in https://docs.gitlab.com/administration/geo/replication/troubleshooting/synchronization_verification/#message-error-during-verificationerrorfile-is-not-checksummable.

This sister issue is equally important: #534985 (closed)