Batched background migration sampling times out if the table is too big
Batched background migration sampling first iterates over the entire table to calculate batches, then randomly chooses from among those batches for a configurable duration (for example, 1 hour).
Unfortunately, for very large tables such as `issues`, the iteration step alone can exceed the 5-hour timeout on the testing pipeline.
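For reference, the current behavior can be sketched roughly as below. This is an illustrative simplification, not the actual implementation; all names (`sample_batches`, `duration:`, and so on) are assumptions.

```ruby
# Illustrative sketch of the current behavior: first walk the whole id
# range to materialize every batch (the expensive full scan), then pick
# random batches until a time budget expires.
def sample_batches(all_ids, batch_size, duration:, rng: Random.new)
  batches = all_ids.each_slice(batch_size).to_a # full table scan up front
  deadline = Time.now + duration
  sampled = []
  # Sample (with replacement) until time runs out; capped here so the
  # sketch terminates even with a generous budget.
  sampled << batches[rng.rand(batches.length)] while Time.now < deadline && sampled.length < batches.length
  sampled
end
```

The up-front `each_slice` pass is the part that scales with table size, which is what times out for tables like `issues`.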
To fix this, we could guess random starting ids for batches and then create only the batch records we need for sampling. This is tricky because we don't know in advance how many batches we need to sample; that depends on how long the sampling runs.
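One way to sidestep not knowing the batch count up front is to generate candidate batches lazily, so the sampler can just take as many as its time budget allows. A minimal sketch, with all names assumed for illustration:

```ruby
# Hypothetical sketch: lazily generate batches starting at random ids,
# so no full-table iteration is needed and the consumer can stop
# whenever the sampling window closes.
def random_batches(min_id, max_id, batch_size, rng: Random.new)
  Enumerator.new do |yielder|
    loop do
      # Clamp the upper bound so a batch never starts past the id range.
      start_id = rng.rand(min_id..[min_id, max_id - batch_size].max)
      yielder << (start_id...(start_id + batch_size))
    end
  end
end
```

A consumer would call something like `random_batches(1, 1_000, 100).first(n)`, creating only the `n` batch records it actually samples. The obvious trade-off is that independently random starting ids can overlap or cluster.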
It might be possible to use a recursive partitioning scheme: first sample in the middle of the id range, then in the first and fourth quartiles, then recursively subdivide to take a progressively more detailed sample of the id range. However, if many ids are missing, this could produce overlapping batches or batches that are much too close together.
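The recursive subdivision idea can be sketched as a breadth-first bisection of the id range: emit the midpoint, then the quarter points, then the eighth points, and so on, each point being a candidate batch start. This is only a sketch of the scheme described above; the function name and interface are assumptions.

```ruby
# Hypothetical sketch: breadth-first bisection of the id range, yielding
# candidate batch starting points at ever-finer granularity. Taking more
# points from the enumerator gives a more detailed sample.
def bisection_points(min_id, max_id)
  Enumerator.new do |yielder|
    queue = [[min_id, max_id]]
    until queue.empty?
      lo, hi = queue.shift
      mid = (lo + hi) / 2
      yielder << mid
      # Subdivide each half further, breadth-first, until ranges close up.
      queue << [lo, mid] << [mid, hi] if hi - lo > 1
    end
  end
end
```

For an id range of 0..8 this yields 4 first (the midpoint), then 2 and 6 (the quartile points). Note this spaces the *candidate starting points* evenly in id space; if many ids in between are deleted, the resulting batches can still end up overlapping or bunched together, as noted above.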
Change the sampling behavior to heuristically determine batches to sample, bypassing the full table scan that would otherwise be required.