Bug: Geo primary verification state backfill job can exceed batch size
Summary
In certain cases, Geo::VerificationStateBackfillWorker can operate on a batch of rows which exceeds the specified batch size, causing or contributing to the memory spikes reported in #429242 (closed).
Steps to reproduce
Suspected steps to reproduce
I noticed this while producing a possible workaround for #429242 (comment 1634287632).
- Have many (e.g. 200000) rows of one data type, e.g. uploads
- Monitor memory usage of Sidekiq
- Set up Geo for the first time
- Create a new upload e.g. add an attachment to an issue
In gdk psql on the primary, or in the Rails console, you should be able to observe that the verification state table got a new row with upload_id matching the latest upload that you created.
In that case, the next Geo::VerificationStateBackfillWorker job will start creating records in upload_states for the entire uploads table in a single job, and Sidekiq will consume a lot of memory.
What is the current bug behavior?
Geo::VerificationStateBackfillWorker can start creating records for an entire table in a single job, far exceeding its batch size of 10000. In that case, Sidekiq will consume a lot of memory.
What is the expected correct behavior?
Geo::VerificationStateBackfillWorker abides by its batch size of 10000. Sidekiq does not consume as much memory.
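To make the expected property concrete, here is a minimal plain-Ruby sketch of a range calculation that is capped by row count rather than by ID distance. The method name `next_capped_range` and its arguments are hypothetical and are not part of Gitlab::Geo::BaseBatcher; the real batcher works against database tables, not arrays.

```ruby
# Hypothetical helper, not GitLab code: picks the next ID range so that it
# covers at most `batch_size` source rows after `cursor`.
def next_capped_range(source_ids, cursor:, batch_size:)
  # IDs strictly after the cursor, ascending, limited to batch_size rows.
  batch = source_ids.select { |id| id > cursor }.sort.first(batch_size)
  return nil if batch.empty?

  (cursor + 1)..batch.last
end

# Sparse IDs: gaps must not inflate the number of rows in a batch.
source_ids = [1, 2, 5, 8, 13, 21, 34]
range = next_capped_range(source_ids, cursor: 0, batch_size: 3)
rows_in_range = source_ids.count { |id| range.cover?(id) }
```

With this shape, the number of source rows inside any returned range can never exceed the batch size, regardless of what exists in the state table.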
Relevant logs and/or screenshots
Output of checks
Possible fixes
- Gitlab::Geo::BaseBatcher is complex. It needs unit tests which cover this edge case, which will lead to a fix. The RegistryBatcher spec covers one usage of BaseBatcher. We should ideally cover all usages of all data types within VerificationStateBackfillWorker and RegistryConsistencyWorker, since some data types override some related methods.
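As a rough idea of what such a test could assert, the sketch below drives a batcher to completion over the fixture from the steps to reproduce (many source rows, one state row for the newest upload) and reports any range that covers more rows than the batch size. Everything here is hypothetical plain Ruby: `oversized_batches`, `capped`, and `stretched` are illustrative stand-ins, the numbers are scaled down, and `stretched` is an invented failure mode for exercising the harness, not a diagnosis of the real bug in BaseBatcher.

```ruby
# Hypothetical harness, not GitLab code. A "batcher" is any callable taking
# (source_ids, destination_ids, cursor, batch_size) and returning the next
# ID range, or nil when nothing is left to process.
def oversized_batches(batcher, source_ids:, destination_ids:, batch_size:)
  cursor = 0
  oversized = []

  loop do
    range = batcher.call(source_ids, destination_ids, cursor, batch_size)
    break if range.nil?

    # Count the source rows the range actually covers.
    rows = source_ids.count { |id| range.cover?(id) }
    oversized << range if rows > batch_size
    cursor = range.last
  end

  oversized
end

# Stand-in that respects the batch size.
capped = lambda do |source_ids, _destination_ids, cursor, batch_size|
  batch = source_ids.select { |id| id > cursor }.first(batch_size)
  batch.empty? ? nil : (cursor + 1)..batch.last
end

# Deliberately broken stand-in: stretches the range up to the highest
# destination ID beyond the cursor (assumed failure mode, for illustration).
stretched = lambda do |source_ids, destination_ids, cursor, batch_size|
  range = capped.call(source_ids, destination_ids, cursor, batch_size)
  next nil if range.nil?

  highest_destination_id = destination_ids.select { |id| id > cursor }.max
  range.first..[range.last, highest_destination_id].compact.max
end

# Fixture scaled down from the report: many uploads, and a single state row
# whose ID matches the newest upload.
source_ids = (1..2_000).to_a
destination_ids = [source_ids.last]
batch_size = 100

ok  = oversized_batches(capped, source_ids: source_ids, destination_ids: destination_ids, batch_size: batch_size)
bad = oversized_batches(stretched, source_ids: source_ids, destination_ids: destination_ids, batch_size: batch_size)
```

Here `ok` is empty, while `bad` contains a single range spanning the whole table. A real spec would run the same kind of assertion against each data type used by VerificationStateBackfillWorker and RegistryConsistencyWorker, with real records instead of arrays.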