Bug: Geo primary verification state backfill job can exceed batch size

Summary

In certain cases, Geo::VerificationStateBackfillWorker can operate on a batch of rows that exceeds the specified batch size, causing or contributing to the memory spikes reported in #429242 (closed).

Steps to reproduce

Suspected steps to reproduce

I noticed this while producing a possible workaround for #429242 (comment 1634287632).

  1. Have many (e.g. 200000) rows of one data type, e.g. uploads
  2. Monitor memory usage of Sidekiq
  3. Set up Geo for the first time
  4. Create a new upload, e.g. add an attachment to an issue

In gdk psql on the primary, or in the Rails console, you should be able to observe that the verification state table (upload_states) has a new row whose upload_id matches the upload you just created.

In that case, the next Geo::VerificationStateBackfillWorker job will start creating records in upload_states for the entire uploads table in a single job, and Sidekiq will consume a lot of memory.

What is the current bug behavior?

Geo::VerificationStateBackfillWorker can start creating records for an entire table in a single job, far exceeding its batch size of 10000. In that case, Sidekiq will consume a lot of memory.
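To make the symptom concrete, here is a toy model in plain Ruby. It is not the real Gitlab::Geo::BaseBatcher, and the root cause has not been confirmed. ToyBatcher, flawed_range, bounded_range and missing_in are hypothetical names, and the "flawed" rule (ending the range at the highest ID seen in either table) is only an assumption about how a single upload_states row with a high upload_id could stretch a batch across the whole uploads table.

```ruby
# Toy model only. NOT the real Gitlab::Geo::BaseBatcher.
# source_ids stands in for uploads.id, destination_ids for upload_states.upload_id.
class ToyBatcher
  def initialize(source_ids, destination_ids, batch_size:)
    @source_ids = source_ids.sort
    @destination_ids = destination_ids.sort
    @batch_size = batch_size
  end

  # Assumed flawed rule: the range ends at the highest "last ID" found in
  # either table, so one far-away destination row stretches the range.
  def flawed_range(first_id)
    first_id..[last_in_batch(@source_ids, first_id),
               last_in_batch(@destination_ids, first_id)].compact.max
  end

  # Bounded rule: the range never extends past batch_size source rows.
  def bounded_range(first_id)
    first_id..last_in_batch(@source_ids, first_id)
  end

  # Source rows inside the range that have no destination row yet,
  # i.e. the rows a single job would have to create.
  def missing_in(range)
    @source_ids.select { |id| range.cover?(id) } - @destination_ids
  end

  private

  # ID of the last row among the first batch_size rows at or after first_id.
  def last_in_batch(ids, first_id)
    ids.select { |id| id >= first_id }.first(@batch_size).last
  end
end

# 200000 uploads, and one state row for the newest upload.
batcher = ToyBatcher.new((1..200_000).to_a, [200_000], batch_size: 10_000)
puts batcher.missing_in(batcher.flawed_range(1)).size  # 199999
puts batcher.missing_in(batcher.bounded_range(1)).size # 10000
```

Under this assumption, the first job would try to create 199999 rows instead of 10000, which matches the memory behavior described above. A unit test for the real batcher could assert the same invariant: no range ever covers more than batch_size source rows.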

What is the expected correct behavior?

Geo::VerificationStateBackfillWorker abides by its batch size of 10000. Sidekiq does not consume as much memory.

Relevant logs and/or screenshots

Output of checks

Possible fixes

  • Gitlab::Geo::BaseBatcher is complex. It needs unit tests that cover this edge case, which should lead to a fix. The RegistryBatcher spec covers only one usage of BaseBatcher. Ideally, we should cover every usage for every data type within VerificationStateBackfillWorker and RegistryConsistencyWorker, since some data types override some of the related methods.
Edited by Michael Kozono