More gracefully handle when first indexing times out for large project
Problem
As we learnt from gitlab-com/gl-infra/production#1499 (comment 268403773), the initial indexing of associated resources for a project can fail with a Faraday::TimeoutError.
It doesn't seem possible to increase the timeout for a single request; it can only be changed globally, which was previously discussed in #6241 (closed). It probably doesn't make sense to change our global timeout configuration on GitLab.com, because then read requests may hang for too long, which can impact availability.
Proposal
One possible option is to catch these timeout errors and retry with a smaller batch size, or skip that batch. This is important because issue and merge request descriptions have no size limit, so we may always come across an issue or merge request that cannot be indexed. We don't want to fail the whole project and index none of it; we should be able to skip anything that cannot be indexed under the time limit.
I think we should reduce the batch size through 100, 10, and 1 until we stop seeing timeouts, and if we still get timeouts at a batch size of 1 we should skip that individual record that cannot be indexed. We will want to add careful logging to this whole process so it's easy to debug in future when certain things aren't being indexed.
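The shrinking-batch retry could look roughly like the sketch below. All names here are illustrative, not the real indexer API: `index_batch` stands in for whatever bulk request raises `Faraday::TimeoutError`, and the stub `Faraday` module only exists so the sketch runs without the gem.

```ruby
# Stand-in for Faraday::TimeoutError so this sketch runs without the faraday gem.
module Faraday
  TimeoutError = Class.new(StandardError)
end unless defined?(Faraday::TimeoutError)

# Batch sizes to step down through when a bulk index request times out.
BATCH_SIZES = [100, 10, 1].freeze

# Indexes `ids` via the given block, retrying timed-out batches at smaller
# sizes. Returns the ids that still time out at batch size 1, i.e. the records
# we skip (and would log) rather than failing the whole project.
def index_with_shrinking_batches(ids, batch_sizes = BATCH_SIZES, &index_batch)
  size, *smaller_sizes = batch_sizes
  skipped = []

  ids.each_slice(size) do |batch|
    index_batch.call(batch)
  rescue Faraday::TimeoutError
    if smaller_sizes.empty?
      # Already at batch size 1: skip this individual record. A real
      # implementation would log it here so missing documents are debuggable.
      skipped.concat(batch)
    else
      skipped.concat(index_with_shrinking_batches(batch, smaller_sizes, &index_batch))
    end
  end

  skipped
end
```

Note that a timed-out large batch only shrinks for the records inside it; batches that never time out are still indexed at the biggest size, so the happy path stays cheap.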
Alternative proposal
Rather than retrying with smaller batch sizes to make our way through the rest of the resources, we may wish to enqueue each batch as a separate job. Then those separate jobs can fail without impacting other resources. I believe we do something similar with background migrations, enqueuing ranges of IDs in batches.
That alone (as a first step) would already be valuable, since we could get past the large issues and start indexing the merge requests (for example). As a follow-up to that step, each of these batch jobs could then catch any timeout exceptions and requeue themselves with smaller batches, down to a single resource, at which point we just rely on the Sidekiq retry limit to kill off any individual resource that cannot be indexed.
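A minimal sketch of that batch-job shape, under heavy assumptions: the worker class name is hypothetical, `perform_async` here just pushes onto an in-memory array so the sketch runs without Sidekiq, and `perform` takes a block only so the indexing call is pluggable in this example (a real worker would `include Sidekiq::Worker` and call the indexer directly).

```ruby
# Stand-in for Faraday::TimeoutError so this sketch runs without the faraday gem.
module Faraday
  TimeoutError = Class.new(StandardError)
end unless defined?(Faraday::TimeoutError)

# Hypothetical worker: one job per batch of ids, so one pathological batch
# cannot block indexing of the rest of the project's resources.
class ElasticBatchIndexWorker
  QUEUE = [] # in-memory stand-in for the Sidekiq queue

  def self.perform_async(ids)
    QUEUE << ids
  end

  def perform(ids, &index_batch)
    index_batch.call(ids)
  rescue Faraday::TimeoutError
    if ids.size == 1
      # A single record that still times out: re-raise and let Sidekiq's
      # retry limit eventually kill the job for just this one resource.
      raise
    else
      # Requeue the failed batch as two smaller jobs, similar to how
      # background migrations walk ranges of IDs in batches.
      ids.each_slice((ids.size / 2.0).ceil) { |half| self.class.perform_async(half) }
    end
  end
end
```

The key property is that a timeout never fails anything beyond its own batch: the batch is either split and requeued, or (at size 1) left to Sidekiq's retry mechanism, while sibling jobs carry on indexing.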