Geo: Fix concurrent VerificationBatchWorker jobs

Problem

We're getting errors while checksumming package files on the staging.gitlab.com primary.

Today I briefly enabled geo_package_file_verification with /chatops run feature set geo_package_file_verification true --staging. No package files are stored locally on staging.gitlab.com, and sync_object_storage is disabled on the GeoNode record of geo.staging.gitlab.com, so there is nothing to verify. Even so, the primary site raised DB errors in Geo::VerificationBatchWorker:

https://sentry.gitlab.net/gitlab/staginggitlabcom/issues/2465297/?query=is%3Aunresolved

ActiveRecord::Deadlocked

PG::TRDeadlockDetected: ERROR:  deadlock detected
DETAIL:  Process 1306 waits for ShareLock on transaction 2661053543; blocked by process 13203.
Process 13203 waits for ShareLock on transaction 2661053539; blocked by process 1306.
HINT:  See server log for query details.
CONTEXT:  while rechecking updated tuple (28,46) in relation "packages_package_files"

https://sentry.gitlab.net/gitlab/staginggitlabcom/issues/2465298/?query=is%3Aunresolved

ActiveRecord::ValueTooLong

PG::StringDataRightTruncation: ERROR:  value too long for type character varying(255)

https://sentry.gitlab.net/gitlab/staginggitlabcom/issues/2465299/?query=is%3Aunresolved

ActiveRecord::QueryCanceled

PG::QueryCanceled: ERROR:  canceling statement due to statement timeout
CONTEXT:  while updating tuple (445,5) in relation "packages_package_files"

https://sentry.gitlab.net/gitlab/staginggitlabcom/issues/2465308/?query=is%3Aunresolved

ActiveRecord::QueryCanceled

PG::QueryCanceled: ERROR:  canceling statement due to statement timeout
CONTEXT:  while rechecking updated tuple (134,32) in relation "packages_package_files"

Proposal

From @mbobin's excellent comment below:

Use .lock('FOR UPDATE SKIP LOCKED') at https://gitlab.com/gitlab-org/gitlab/-/blob/v13.8.2-ee/ee/lib/gitlab/geo/verification_state.rb#L158.

Existing example: https://gitlab.com/gitlab-org/gitlab/-/blob/v13.8.2-ee/app/models/ci/deleted_object.rb#L17, which is called from https://gitlab.com/gitlab-org/gitlab/-/blob/v13.8.2-ee/app/services/ci/delete_objects_service.rb#L41.
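
To illustrate why SKIP LOCKED should help here, below is a minimal plain-Ruby model of the semantics. It is not the GitLab code and does not touch a database. Row, claim_batch, release, and the worker names are all hypothetical. The assumption is that the errors above come from concurrent Geo::VerificationBatchWorker jobs selecting and updating the same packages_package_files rows. With SKIP LOCKED, a worker skips rows another worker already holds instead of waiting on them, so two workers should end up with disjoint batches and never wait on each other.

```ruby
# Plain-Ruby model of SKIP LOCKED batch claiming (no database involved).
# Row, claim_batch, release, and the worker names are hypothetical.

Row = Struct.new(:id, :locked_by)

# Claim up to `limit` rows for `worker`, skipping any row that is already
# held, the way SELECT ... FOR UPDATE SKIP LOCKED skips locked rows
# instead of waiting for them.
def claim_batch(rows, worker, limit)
  claimed = []
  rows.each do |row|
    break if claimed.size >= limit
    next if row.locked_by # already held: skip, do not wait

    row.locked_by = worker
    claimed << row
  end
  claimed
end

# Release every lock held by `worker`, as a transaction commit would.
def release(rows, worker)
  rows.each { |row| row.locked_by = nil if row.locked_by == worker }
end

rows = (1..6).map { |id| Row.new(id, nil) }

# Two workers claim batches while both "transactions" are still open.
batch_a = claim_batch(rows, :worker_a, 3).map(&:id)
batch_b = claim_batch(rows, :worker_b, 3).map(&:id)

puts "worker_a: #{batch_a.inspect}" # worker_a: [1, 2, 3]
puts "worker_b: #{batch_b.inspect}" # worker_b: [4, 5, 6]
```

In the real worker, the equivalent would be the .lock('FOR UPDATE SKIP LOCKED') call proposed above, applied to the batch query inside a transaction. The ActiveRecord::ValueTooLong error looks unrelated to locking and would presumably need a separate fix.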

Edited by Michael Kozono