Change all initial importing to push into the sorted sets instead of using the Sidekiq queues as we currently do. This can be left out of this issue, since it is the incremental updates that cause the queues to grow massively and are generally less efficient due to duplicate updates. We should also see if we can put initial indexing into a separate queue that is processed with exactly the same logic. This would allow incremental updates to keep processing rather than being delayed behind the very large queue that forms for newly added groups. This could be a later optimization, though.
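To make the deduplication benefit concrete, here is a minimal sketch of queueing index updates into a sorted set instead of enqueuing one Sidekiq job per record. It uses an in-memory Hash as a stand-in for a Redis ZSET (ZADD semantics: re-adding a member just updates its score, so duplicate updates collapse into one entry). The class and method names are illustrative, not GitLab's actual implementation.

```ruby
# Sketch of a sorted-set backed indexing queue. A real implementation
# would call ZADD / ZRANGEBYSCORE on Redis; a Hash mimics the semantics.
class IndexingQueue
  def initialize
    @zset  = {}   # member => score (insertion order via a logical clock)
    @clock = 0
  end

  # Pushing the same record twice only bumps its score, so repeated
  # incremental updates to one record cost a single index operation.
  def push(ref)
    @zset[ref] = (@clock += 1)
  end

  # Pop up to `limit` oldest entries, as a bulk cron worker would,
  # removing them from the set.
  def pop_batch(limit)
    batch = @zset.sort_by { |_, score| score }.first(limit).map(&:first)
    batch.each { |ref| @zset.delete(ref) }
    batch
  end

  def size
    @zset.size
  end
end
```

Duplicate pushes are deduplicated, which is exactly the property Sidekiq queues lack here.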
UPDATE: I think we can do this without the steps defined in #207494 (closed), since that issue is more about handling the callbacks for Project and the problems with those. This could be special-case handling for create or initial enabling, which I think are different code paths from the Active Record callbacks anyway.
TL;DR: this is probably not one of our highest priorities yet, since initial indexing already uses the bulk API and this change is tricky to accomplish.
Spending more time thinking about this, I want to note that the value added may not outweigh the effort. The other benefit is that an early fan-out of all the records to be indexed makes it much easier to track the overall progress of large initial indexing jobs (i.e. adding many large customers to the index).
Basically, the primary motivation here is that we can move all indexing logic in GitLab through a single code path. This doesn't bring major performance improvements on its own, since initial indexing already uses the bulk API, but it makes larger changes easier to accomplish, such as multi-index support or performance tweaking of the bulk indexing algorithm. The problem is that if we just make this change today, initial importing will end up in the same sorted set as incremental indexing, so large initial imports will add large latencies before updates from incremental changes show up in the index. Separating these queues was a recent benefit of implementing the sorted sets for incremental updates: even when a large initial indexing run is in progress, incremental updates keep flowing smoothly.
So in practice, in order to do this work we'll need a way to have multiple separate queues processed by the same sorted-sets logic. All initial indexing would then go into one queue so as not to delay the incremental updates.
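A rough sketch of what "multiple queues, same logic" could look like: a single cron pass drains up to a fixed number of items from each named queue, so a huge `initial` backlog can never starve the `incremental` queue. The queue names, the per-queue limit, and the class shape here are all illustrative assumptions, not the actual GitLab worker.

```ruby
# Sketch: one cron worker draining several named queues with identical
# logic. In production each queue would be a separate Redis sorted set;
# plain arrays stand in for them here.
class MultiQueueProcessor
  def initialize(queue_names)
    @queues = queue_names.to_h { |name| [name, []] }
  end

  def push(queue, ref)
    @queues.fetch(queue) << ref
  end

  # One cron tick: take up to `limit` items from *each* queue, so the
  # incremental queue keeps moving even while initial is very large.
  # Returns { queue_name => [refs popped this tick] }.
  def tick(limit: 2)
    @queues.transform_values { |q| q.shift(limit) }
  end
end
```

Because the per-tick limit applies per queue rather than globally, adding a second instance of the cron worker per queue later (for horizontal scaling) would not require changing the processing logic.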
This is still worth doing: for the sake of horizontal scaling, it may well end up being beneficial to have separate queues for incremental updates anyway, so figuring out how to have multiple instances of the cron worker processing different queues could benefit the long-term architecture.
@DylanGriffith, thanks for the elaboration. I will move it to the backlog and we can revisit it once we can separate initial import from incremental updates.