Geo: Improve error handling of missing PipelineArtifact files

Summary

Two problems:

  1. A customer had Ci::PipelineArtifact records with missing files.
  2. In that situation, Geo's verification feature gets into a very confusing state, and it's not clear what investigation or intervention is needed.

Ci::PipelineArtifact.verification_failure shows Verification timed out after 28800 (seconds, i.e. 8 hours), and verification_state gets stuck as verification_started instead of verification_failed.
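
For illustration, here is a hypothetical Rails console session on the Geo primary showing that combination (the record ID is made up; the attribute readers are the ones named above):

  artifact = Ci::PipelineArtifact.find(123456)

  # Stale failure message left behind by the timeout worker:
  artifact.verification_failure  # => "Verification timed out after 28800"

  # State keeps returning to verification_started because the
  # verification_failed transition can never be saved:
  artifact.verification_state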

Steps to reproduce

  • Configure GitLab Geo.
  • Run a pipeline that creates a Pipeline Artifact.
  • Delete the artifact file on disk.
  • If VerificationBatchWorker already ran, then run Ci::PipelineArtifact.find(123456).replicator.verify (see the console sketch below).
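
A minimal sketch of the last two steps in a Rails console on the primary, assuming local file storage (the ID is illustrative, and reaching the path through the mounted uploader is an assumption, not a prescribed procedure):

  artifact = Ci::PipelineArtifact.find(123456)

  # Simulate the corruption by removing the file from disk
  # (artifact.file is assumed to be the mounted uploader):
  path = artifact.file.path
  File.delete(path) if path && File.exist?(path)

  # Trigger verification directly instead of waiting for VerificationBatchWorker:
  artifact.replicator.verify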

Example Project

Happens on self-managed (SM) instances.

What is the current bug behavior?

The verification failure reason is recorded as Verification timed out after 28800, whereas the actual reason is different: the file is missing on disk.

What is the expected correct behavior?

The reason for verification failure on the primary should be clear. It should tell you that the file is unexpectedly missing.

And with that failure message, a sysadmin should be able to find a corresponding Geo troubleshooting section.

Note that even if the Ci::PipelineArtifact record is "expired", the file is still expected to exist. The record and file are intended to be deleted at the same time by Ci::PipelineArtifacts::ExpireArtifactsWorker.

Relevant logs and/or screenshots

Actual reason in the Sidekiq log:

{"severity":"WARN","time":"2021-10-08T11:31:15.727Z","class":"Geo::VerificationBatchWorker","args":["pipeline_artifact"],"retry":0,"queue":"geo:geo_verification_batch","version":0,"queue_namespace":"geo","jid":"2ab3a2fed4150dafa2108d36","created_at":"2021-10-08T11:31:14.868Z","enqueued_at":"2021-10-08T11:31:14.874Z","meta.caller_id":"Geo::VerificationCronWorker","meta.feature_category":"geo_replication","meta.client_id":"ip/","correlation_id":"1a8ee469f50024bc86e2b8f8ae22f9a2","job_size_bytes":21,"pid":12265,"message":"Geo::VerificationBatchWorker JID-2ab3a2fed4150dafa2108d36: fail: 0.488542 sec","job_status":"fail","scheduling_latency_s":0.36454,"redis_calls":4,"redis_duration_s":0.001689,"redis_read_bytes":4,"redis_write_bytes":444,"redis_queues_calls":4,"redis_queues_duration_s":0.001689,"redis_queues_read_bytes":4,"redis_queues_write_bytes":444,"db_count":13,"db_write_count":8,"db_cached_count":0,"cpu_s":0.018853,"mem_objects":9412,"mem_bytes":534904,"mem_mallocs":2816,"mem_total_bytes":911384,"duration_s":0.488542,"completed_at":"2021-10-08T11:31:15.727Z","error_message":"Cannot transition verification_state via :verification_failed from :verification_started (Reason(s): File can't be blank)","error_class":"StateMachines::InvalidTransition","error_backtrace":["ee/lib/gitlab/geo/verification_state.rb:358:in `verification_failed_with_message!'","ee/lib/gitlab/geo/verification_state.rb:324:in `rescue in track_checksum_attempt!'","ee/lib/gitlab/geo/verification_state.rb:308:in `track_checksum_attempt!'","ee/app/models/concerns/geo/verifiable_replicator.rb:209:in `verify'","ee/app/models/concerns/geo/verifiable_replicator.rb:58:in `each'","ee/app/models/concerns/geo/verifiable_replicator.rb:58:in `verify_batch'","ee/app/workers/geo/verification_batch_worker.rb:19:in `perform_work'","app/workers/concerns/limited_capacity/worker.rb:94:in `perform_registered'","app/workers/concerns/limited_capacity/worker.rb:65:in `perform'","lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/none.rb:14:in `perform'","lib/gitlab/sidekiq_middleware/duplicate_jobs/duplicate_job.rb:41:in `perform'","lib/gitlab/sidekiq_middleware/duplicate_jobs/server.rb:8:in `call'","lib/gitlab/sidekiq_middleware/worker_context.rb:9:in `wrap_in_optional_context'","lib/gitlab/sidekiq_middleware/worker_context/server.rb:17:in `block in call'","lib/gitlab/application_context.rb:74:in `block in use'","lib/gitlab/application_context.rb:74:in `use'","lib/gitlab/application_context.rb:27:in `with_context'","lib/gitlab/sidekiq_middleware/worker_context/server.rb:15:in `call'","lib/gitlab/sidekiq_status/server_middleware.rb:7:in `call'","lib/gitlab/sidekiq_versioning/middleware.rb:9:in `call'","lib/gitlab/sidekiq_middleware/admin_mode/server.rb:14:in `call'","lib/gitlab/sidekiq_middleware/instrumentation_logger.rb:9:in `call'","lib/gitlab/sidekiq_middleware/batch_loader.rb:7:in `call'","lib/gitlab/sidekiq_middleware/extra_done_log_metadata.rb:7:in `call'","lib/gitlab/sidekiq_middleware/request_store_middleware.rb:10:in `block in call'","lib/gitlab/with_request_store.rb:17:in `enabling_request_store'","lib/gitlab/with_request_store.rb:10:in `with_request_store'","lib/gitlab/sidekiq_middleware/request_store_middleware.rb:9:in `call'","lib/gitlab/sidekiq_middleware/server_metrics.rb:29:in `block in call'","lib/gitlab/sidekiq_middleware/server_metrics.rb:52:in `block in instrument'","lib/gitlab/metrics/background_transaction.rb:30:in `run'","lib/gitlab/sidekiq_middleware/server_metrics.rb:52:in 
`instrument'","lib/gitlab/sidekiq_middleware/server_metrics.rb:28:in `call'","lib/gitlab/sidekiq_middleware/monitor.rb:8:in `block in call'","lib/gitlab/sidekiq_daemon/monitor.rb:49:in `within_job'","lib/gitlab/sidekiq_middleware/monitor.rb:7:in `call'","lib/gitlab/sidekiq_middleware/size_limiter/server.rb:13:in `call'","lib/gitlab/sidekiq_logging/structured_logger.rb:19:in `call'"],"db_duration_s":0.110676}

Example flow in the verification system on the Geo primary site:

  • A pipeline artifact file disappears.
  • At this point, the Ci::PipelineArtifact record already violates an ActiveRecord validation File can't be blank, and calling save will never succeed.
  • Geo gets set up for the first time.
  • The pipeline artifact record has never been verified, so VerificationBatchWorker picks it up.
  • VerificationBatchWorker uses update_all to atomically mark its batch (including this record) as verification_state == verification_started.
  • VerificationBatchWorker calls verify on this record's replicator.
  • calculate_checksum raises because the file is missing.
  • verification_failed_with_message! gets called.
  • verification_failed! gets called.
  • save fails due to File can't be blank, so StateMachines::InvalidTransition is raised (see the sketch after this list).
  • Now the record is stuck in verification_started.
  • Eight hours pass.
  • VerificationTimeoutWorker uses update_all to change it to verification_failed and sets verification_failure to Verification timed out after 28800.
  • Now we are in a loop: VerificationBatchWorker picks up the record again, and the cycle repeats.
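
The core asymmetry in this flow is that update_all writes directly in SQL and skips validations, while the state machine's bang events go through save, which refuses to persist an invalid record. The following standalone sketch is not GitLab code: it uses a throwaway sqlite3 schema and simulates the missing file by clearing the column, purely to illustrate that asymmetry.

  require "active_record"

  ActiveRecord::Base.establish_connection(adapter: "sqlite3", database: ":memory:")

  ActiveRecord::Schema.define do
    create_table :artifacts do |t|
      t.string :file
      t.integer :verification_state, default: 0
    end
  end

  class Artifact < ActiveRecord::Base
    validates :file, presence: true # analogous to "File can't be blank"
  end

  artifact = Artifact.create!(file: "artifact.zip")
  artifact.update_column(:file, nil) # simulate the missing file (skips validations)

  # The batch worker's bulk update succeeds even though the record is invalid,
  # so the record lands in verification_started (represented by 1 here):
  Artifact.where(id: artifact.id).update_all(verification_state: 1)

  # A per-record save runs validations and fails, which is why the record can
  # never transition out of verification_started (2 stands in for "failed"):
  artifact.reload
  artifact.verification_state = 2
  artifact.save                  # => false
  artifact.errors.full_messages  # => ["File can't be blank"]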

Results of GitLab environment info

GitLab 14.1.0 through at least GitLab 17.9.0

Proposal

Save the verification failure by skipping the File can't be blank validation, and lead the sysadmin to a resolution.

Implementation Guide

  • Add an early guard in VerificationState#track_checksum_attempt!, such as return verification_failed_invalid_record! unless valid?
  • Define verification_failed_invalid_record! (sketched below) to:
    • Collect the offending model validation errors (i.e. File can't be blank)
    • Successfully save the record in the verification-failed state by skipping validations
    • Make sure the model validation errors are saved to the verification_failure field
    • Make sure the validation failure is output to the logs
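
A rough sketch of that guard and helper, assuming they live in ee/lib/gitlab/geo/verification_state.rb; the method bodies, state writer, and failure wording are illustrative rather than the final implementation:

  def track_checksum_attempt!(&block)
    # Proposed early guard: if the record itself is invalid (e.g. the file is
    # missing, so "File can't be blank"), record the failure instead of letting
    # a later save raise StateMachines::InvalidTransition.
    return verification_failed_invalid_record! unless valid?

    # ... existing checksumming logic ...
  end

  def verification_failed_invalid_record!
    message = "Record is invalid: #{errors.full_messages.to_sentence}"

    # Persist the failure while bypassing the failing model validations so the
    # record leaves verification_started instead of getting stuck there.
    self.verification_failure = message
    self.verification_state = :verification_failed # assumed writer; may need the enum's integer value
    save!(validate: false)

    # Also emit the message to the Geo log stream (exact logger call omitted here).
  end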

The result is that checksumming a Pipeline Artifact with a missing file now produces the conventional failure File is not checksummable, which is addressed in https://docs.gitlab.com/administration/geo/replication/troubleshooting/synchronization_verification/#message-error-during-verificationerrorfile-is-not-checksummable.

This sister issue is equally important: #534985 (closed)
