Change repository indexing to sorted sets algorithm
Overview
Extracted as a follow-up from #34086 (closed)
Change the repository (blob) indexing to use a sorted sets incremental update model as well. This would require incremental updates to add one sorted set entry per changed file, as described in the last paragraph of #34086 (comment 230326472). We can defer this since the git indexing queues are not growing wildly yet, and we already have some efficiency today because we use the last updated SHA to update only what changed.
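A rough sketch of what the enqueueing side could look like. The key name, the `"project_id:path"` member encoding, and the use of enqueue time as the score are all assumptions for illustration, not the actual implementation:

```ruby
require 'redis'

# Hypothetical sorted set holding one member per file that needs (re)indexing.
QUEUE_KEY = 'elastic:code_indexing:blobs'.freeze

# Add one sorted set entry per file. Scoring by enqueue time lets the
# consumer drain the oldest entries first.
def enqueue_files(redis, project_id, paths)
  score = Time.now.to_f
  entries = paths.map { |path| [score, "#{project_id}:#{path}"] }
  redis.zadd(QUEUE_KEY, entries)
end

redis = Redis.new

# Initial indexing enqueues every file in the repository; an incremental
# update enqueues only the paths changed since the last indexed SHA.
enqueue_files(redis, 42, ['app/models/user.rb', 'doc/index.md'])
```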
Also, as noted in !24298 (comment 286562360), one key benefit is that all indexing work becomes roughly equally sized and is fanned out as early as possible. Fanning out early makes it much easier for SREs or developers watching the queues to tell whether the system is healthy: initial indexing produces an expected large burst of jobs in the queue, followed by a gradual drop over time as we catch up. This contrasts with what we have today, where a few very large jobs hold up the queue for hours while they are processed, so all the queues keep growing for several hours while we catch up; that makes the system look unhealthy and we have no data to tell us it is not. This becomes more important as we scale out to hundreds or thousands of groups at a time, where catching up may go from a couple of hours to a couple of days.
If catching up does stretch to a couple of days, we may also want to ensure initial indexing goes into a different queue from incremental updates so that incremental updates aren't hugely delayed. I'm not sure whether this change will make that harder or easier.
Proposal
Every blob indexing job deals with a single file, but we process them in batches of 1000. During initial indexing we add one job to the sorted set per file that needs to be indexed, and incremental updates add jobs only for the updated files. A cron worker picks these jobs up in batches of 1000.
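A minimal sketch of the cron worker side, reusing the hypothetical key and member encoding from the sketch above. Entries are removed only after indexing succeeds, so a crashed worker does not lose work (at the cost of occasionally re-indexing a file):

```ruby
require 'redis'

QUEUE_KEY = 'elastic:code_indexing:blobs'.freeze # same hypothetical key as above
BATCH_SIZE = 1000

def process_batch(redis)
  # Read up to 1000 of the oldest entries without removing them yet.
  members = redis.zrangebyscore(QUEUE_KEY, '-inf', '+inf', limit: [0, BATCH_SIZE])
  return if members.empty?

  files = members.map do |member|
    project_id, path = member.split(':', 2)
    { project_id: Integer(project_id), path: path }
  end

  # index_blobs is a placeholder for handing the batch to the Go indexer
  # (see the grouping sketch under "Performance considerations").
  index_blobs(files)

  # Only dequeue once indexing has succeeded.
  redis.zrem(QUEUE_KEY, members)
end
```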
Performance considerations
We don't want to regress into doing lots of indexing work in Ruby, so we'd still like to leverage the Go indexer: it is much more CPU efficient at the marshalling necessary for indexing these blobs, and moving that back to Ruby would be a performance regression. As such, we may wish to either push all the file names into the Go indexer and have it load them, or, possibly even more efficient, delegate the Redis queue popping to the Go indexer as well, though that may be problematic if we end up duplicating a lot of queue handling logic in Go.
It may also be problematic that a batch of 1000 can contain files from different repositories. If that turns out to be less performant from a Gitaly perspective, this approach may not be ideal.
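One way to soften the mixed-repository concern would be to group each 1000-file batch by project before invoking the indexer, so each repository (and its Gitaly connection) is touched once per batch rather than once per file. A hedged sketch; `run_indexer` is a hypothetical wrapper around the Go indexer, not an existing helper:

```ruby
# Group the batch so each repository is handled by a single indexer
# invocation, rather than interleaving files from many repositories.
def index_blobs(files)
  files.group_by { |file| file[:project_id] }.each do |project_id, project_files|
    paths = project_files.map { |file| file[:path] }

    # Hypothetical wrapper around the Go indexer: one invocation (and one
    # Gitaly session) per repository represented in the batch.
    run_indexer(project_id, paths)
  end
end
```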