Geo: Improve error handling of missing PipelineArtifact files
## Summary

Two problems:

- A customer had `Ci::PipelineArtifact` records with missing files.
- In that situation, Geo's verification feature gets into a very confusing state, and it's not clear what investigation and interventions to take.

`Ci::PipelineArtifact#verification_failure` shows `Verification timed out after 28800`, and `verification_state` gets stuck as `verification_started` instead of `verification_failed`.
## Steps to reproduce

1. Configure GitLab Geo.
1. Run a pipeline that creates a Pipeline Artifact.
1. Delete the artifact file on disk.
1. If `VerificationBatchWorker` already ran, then run `Ci::PipelineArtifact.find(123456).replicator.verify`.
## Example Project

Happens on self-managed (SM) instances.
## What is the current bug behavior?

The verification failure reason is recorded as `Verification timed out after 28800`, whereas the actual reason is different.
## What is the expected correct behavior?

The reason for verification failure on the primary should be clear. It should tell you that the file is unexpectedly missing. And with that failure message, a sysadmin should be able to find a corresponding Geo Troubleshooting section.

Note that even if the `Ci::PipelineArtifact` record is "expired", the file is still expected to exist. The record and the file are intended to be deleted at the same time by `Ci::PipelineArtifacts::ExpireArtifactsWorker`.
## Relevant logs and/or screenshots

The actual reason, from the Sidekiq log:

```json
{"severity":"WARN","time":"2021-10-08T11:31:15.727Z","class":"Geo::VerificationBatchWorker","args":["pipeline_artifact"],"retry":0,"queue":"geo:geo_verification_batch","version":0,"queue_namespace":"geo","jid":"2ab3a2fed4150dafa2108d36","created_at":"2021-10-08T11:31:14.868Z","enqueued_at":"2021-10-08T11:31:14.874Z","meta.caller_id":"Geo::VerificationCronWorker","meta.feature_category":"geo_replication","meta.client_id":"ip/","correlation_id":"1a8ee469f50024bc86e2b8f8ae22f9a2","job_size_bytes":21,"pid":12265,"message":"Geo::VerificationBatchWorker JID-2ab3a2fed4150dafa2108d36: fail: 0.488542 sec","job_status":"fail","scheduling_latency_s":0.36454,"redis_calls":4,"redis_duration_s":0.001689,"redis_read_bytes":4,"redis_write_bytes":444,"redis_queues_calls":4,"redis_queues_duration_s":0.001689,"redis_queues_read_bytes":4,"redis_queues_write_bytes":444,"db_count":13,"db_write_count":8,"db_cached_count":0,"cpu_s":0.018853,"mem_objects":9412,"mem_bytes":534904,"mem_mallocs":2816,"mem_total_bytes":911384,"duration_s":0.488542,"completed_at":"2021-10-08T11:31:15.727Z","error_message":"Cannot transition verification_state via :verification_failed from :verification_started (Reason(s): File can't be blank)","error_class":"StateMachines::InvalidTransition","error_backtrace":["ee/lib/gitlab/geo/verification_state.rb:358:in `verification_failed_with_message!'","ee/lib/gitlab/geo/verification_state.rb:324:in `rescue in track_checksum_attempt!'","ee/lib/gitlab/geo/verification_state.rb:308:in `track_checksum_attempt!'","ee/app/models/concerns/geo/verifiable_replicator.rb:209:in `verify'","ee/app/models/concerns/geo/verifiable_replicator.rb:58:in `each'","ee/app/models/concerns/geo/verifiable_replicator.rb:58:in `verify_batch'","ee/app/workers/geo/verification_batch_worker.rb:19:in `perform_work'","app/workers/concerns/limited_capacity/worker.rb:94:in `perform_registered'","app/workers/concerns/limited_capacity/worker.rb:65:in `perform'","lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/none.rb:14:in `perform'","lib/gitlab/sidekiq_middleware/duplicate_jobs/duplicate_job.rb:41:in `perform'","lib/gitlab/sidekiq_middleware/duplicate_jobs/server.rb:8:in `call'","lib/gitlab/sidekiq_middleware/worker_context.rb:9:in `wrap_in_optional_context'","lib/gitlab/sidekiq_middleware/worker_context/server.rb:17:in `block in call'","lib/gitlab/application_context.rb:74:in `block in use'","lib/gitlab/application_context.rb:74:in `use'","lib/gitlab/application_context.rb:27:in `with_context'","lib/gitlab/sidekiq_middleware/worker_context/server.rb:15:in `call'","lib/gitlab/sidekiq_status/server_middleware.rb:7:in `call'","lib/gitlab/sidekiq_versioning/middleware.rb:9:in `call'","lib/gitlab/sidekiq_middleware/admin_mode/server.rb:14:in `call'","lib/gitlab/sidekiq_middleware/instrumentation_logger.rb:9:in `call'","lib/gitlab/sidekiq_middleware/batch_loader.rb:7:in `call'","lib/gitlab/sidekiq_middleware/extra_done_log_metadata.rb:7:in `call'","lib/gitlab/sidekiq_middleware/request_store_middleware.rb:10:in `block in call'","lib/gitlab/with_request_store.rb:17:in `enabling_request_store'","lib/gitlab/with_request_store.rb:10:in `with_request_store'","lib/gitlab/sidekiq_middleware/request_store_middleware.rb:9:in `call'","lib/gitlab/sidekiq_middleware/server_metrics.rb:29:in `block in call'","lib/gitlab/sidekiq_middleware/server_metrics.rb:52:in `block in instrument'","lib/gitlab/metrics/background_transaction.rb:30:in `run'","lib/gitlab/sidekiq_middleware/server_metrics.rb:52:in `instrument'","lib/gitlab/sidekiq_middleware/server_metrics.rb:28:in `call'","lib/gitlab/sidekiq_middleware/monitor.rb:8:in `block in call'","lib/gitlab/sidekiq_daemon/monitor.rb:49:in `within_job'","lib/gitlab/sidekiq_middleware/monitor.rb:7:in `call'","lib/gitlab/sidekiq_middleware/size_limiter/server.rb:13:in `call'","lib/gitlab/sidekiq_logging/structured_logger.rb:19:in `call'"],"db_duration_s":0.110676}
```
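Since these Sidekiq logs are one JSON object per line, `jq` can pull out the underlying error directly. This is a sketch; the log file name and path vary by installation:

```shell
# Filter structured Sidekiq logs for the real verification error
# (assumes one JSON object per line, as in the excerpt above).
jq -r 'select(.error_class == "StateMachines::InvalidTransition") | .error_message' sidekiq.log
```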
Example flow in the verification system on the Geo primary site:

1. A pipeline artifact file disappears.
1. At this point, the `Ci::PipelineArtifact` record already violates the ActiveRecord validation `File can't be blank`, and calling `save` will never succeed.
1. Geo gets set up for the first time.
1. The pipeline artifact record hasn't ever been verified, so `VerificationBatchWorker` picks it up.
1. `VerificationBatchWorker` uses `update_all` to atomically mark its batch (including this record) as `verification_state == verification_started`.
1. `VerificationBatchWorker` calls `verify` on this record's `replicator`.
1. `calculate_checksum` raises because the file is missing.
1. `verification_failed_with_message!` gets called.
1. `verification_failed!` gets called.
1. `save` fails due to `File can't be blank`, so `StateMachines::InvalidTransition` gets raised.
1. Now the record is stuck in `verification_started`.
1. 8 hours pass.
1. `VerificationTimeoutWorker` uses `update_all` to change it to `verification_failed` and sets `verification_failure` to `Verification timed out after 28800`.
1. Now we are in a loop: `VerificationBatchWorker` picks up the record again.
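The loop above can be sketched in plain Ruby. Class and method names here are simplified stand-ins, not the real `Geo::VerificationState` API; the point is that `update_all` bypasses validations while `save` does not:

```ruby
class StuckArtifact
  attr_reader :verification_state, :verification_failure

  def initialize(file_exists:)
    @file_exists = file_exists
    @verification_state = :verification_pending
  end

  # Stands in for the ActiveRecord validation `File can't be blank`.
  def valid?
    @file_exists
  end

  # VerificationBatchWorker marks its batch via update_all, which bypasses
  # validations, so this transition always succeeds.
  def start_verification!
    @verification_state = :verification_started
  end

  # verification_failed! persists via save, which runs validations, so it
  # raises for an invalid record (the StateMachines::InvalidTransition case).
  def fail_verification!(message)
    raise "InvalidTransition: #{message}" unless valid?

    @verification_state = :verification_failed
    @verification_failure = message
  end

  # VerificationTimeoutWorker also uses update_all, bypassing validations.
  def time_out!
    @verification_state = :verification_failed
    @verification_failure = 'Verification timed out after 28800'
  end
end

artifact = StuckArtifact.new(file_exists: false)
artifact.start_verification!

begin
  artifact.fail_verification!("File can't be blank")
rescue RuntimeError
  # save failed, so the record stays stuck in verification_started
end

puts artifact.verification_state # => verification_started, until the timeout
artifact.time_out!
puts artifact.verification_failure # => Verification timed out after 28800
```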
## Results of GitLab environment info

GitLab 14.1.0 through at least GitLab 17.9.0.
## Proposal

Save the verification failure by skipping the `File can't be blank` validation, and lead the sysadmin to a resolution.
## Implementation Guide

- Add an early guard in `VerificationState#track_checksum_attempt!`, such as `return verification_failed_invalid_record! unless valid?`
- Define `verification_failed_invalid_record!` to:
  - Collect the offending model validation errors (i.e. `File can't be blank`)
  - Successfully save the record in the `verification_failed` state by skipping validation
  - Save the model validation errors to the `verification_failure` field
  - Make sure the validation failure is output to logs
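A minimal sketch of the proposed guard, modeled in plain Ruby rather than the real `Geo::VerificationState` concern. `track_checksum_attempt!` and `verification_failed_invalid_record!` come from the guide above; everything else (the class, `save_without_validation!`, the exact failure message) is hypothetical, and the real implementation would persist via ActiveRecord's `save(validate: false)`:

```ruby
class RepairedArtifact
  attr_reader :verification_state, :verification_failure

  def initialize(validation_errors)
    @validation_errors = validation_errors # e.g. ["File can't be blank"]
  end

  def valid?
    @validation_errors.empty?
  end

  def track_checksum_attempt!
    # Proposed early guard: never attempt to checksum an invalid record.
    return verification_failed_invalid_record! unless valid?

    # ... calculate the checksum and mark verification_succeeded ...
  end

  private

  def verification_failed_invalid_record!
    @verification_state = :verification_failed
    # Record the real reason so the sysadmin sees it instead of a timeout.
    @verification_failure = "Invalid record: #{@validation_errors.join(', ')}"
    save_without_validation!
    warn "Verification failed: #{@verification_failure}" # surfaced in logs
  end

  def save_without_validation!
    # Stand-in for save(validate: false), which skips model validations.
    true
  end
end

artifact = RepairedArtifact.new(["File can't be blank"])
artifact.track_checksum_attempt!
puts artifact.verification_state   # => verification_failed
puts artifact.verification_failure # => Invalid record: File can't be blank
```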
The result is that checksumming a Pipeline Artifact with a missing file now produces the conventional failure `File is not checksummable`, which is addressed in https://docs.gitlab.com/administration/geo/replication/troubleshooting/synchronization_verification/#message-error-during-verificationerrorfile-is-not-checksummable.
This sister issue is equally important: #534985 (closed)