Closed
Opened Jan 08, 2020 by Dylan Griffith (@DylanGriffith), Maintainer

More gracefully handle when first indexing times out for large project

Problem

As we learnt from gitlab-com/gl-infra/production#1499 (comment 268403773), the initial indexing of a project's associated resources can fail with a Faraday::TimeoutError.

It doesn't seem possible to increase the timeout for a single request; it can only be changed globally, as previously discussed in #6241 (closed). Changing our global timeout configuration on GitLab.com probably doesn't make sense, because read requests could then hang for too long, which can impact availability.

Proposal

One option is to catch these timeout errors and retry with a smaller batch size, or skip that batch. This matters because issue and merge request descriptions have no size limit, so we may always come across an issue or merge request that cannot be indexed, and we don't want to fail the whole project and index none of it. We should be able to skip anything that cannot be indexed within the time limit.

I think we should reduce the batch size to 100, 10, then 1 until we stop seeing timeouts, and if we still get timeouts at a batch size of 1 we should skip that individual record. We will want to add careful logging to this whole process so it's easy to debug in the future when certain things aren't being indexed.
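
A minimal sketch of that retry ladder, assuming a hypothetical `bulk_index` helper that sends a batch to Elasticsearch and a structured `logger`; the names are illustrative, not the actual indexer API:

```ruby
# Illustrative only: bulk_index and logger are assumed helpers, not the real
# GitLab indexer API. Retries with shrinking batch sizes and, at size 1,
# skips the record that still cannot be indexed within the timeout.
BATCH_SIZES = [1_000, 100, 10, 1].freeze

def index_with_fallback(records, batch_sizes = BATCH_SIZES)
  size, *smaller_sizes = batch_sizes

  records.each_slice(size) do |batch|
    bulk_index(batch)
  rescue Faraday::TimeoutError => e
    if smaller_sizes.any?
      logger.warn(message: 'bulk index timed out, retrying with smaller batch',
                  batch_size: size, next_batch_size: smaller_sizes.first)
      index_with_fallback(batch, smaller_sizes)
    else
      # Batch size is already 1: skip and log the record we cannot index in time.
      logger.error(message: 'skipping record that could not be indexed',
                   record_id: batch.first.id, error: e.message)
    end
  end
end
```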

Alternative proposal

Rather than retrying with smaller batch sizes to make our way through the rest of the resources, we may instead want to enqueue each batch as a separate job. Those separate jobs can then fail without impacting other resources. I believe we do something similar with background migrations, in terms of enqueuing a range of IDs in batches.

That alone (as a first step) would already be valuable, since we could get past the large issues and start indexing the merge requests (for example). As a follow-up to that step, each of these batch jobs could catch any timeout exceptions and requeue themselves with smaller batches, down to a single resource, at which point we simply rely on the Sidekiq retry limit to kill off any individual resource that cannot be indexed.
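
A rough sketch of that job-per-batch idea (the worker class and `bulk_index` helper are hypothetical names, not the actual GitLab workers): each job owns an ID range and splits itself on timeout.

```ruby
# Hypothetical worker sketch: ElasticBatchIndexWorker and bulk_index are
# illustrative names. Each job indexes one ID range; on a timeout it splits
# the range in two and requeues the halves, so one bad batch never blocks
# the rest of the project's resources.
class ElasticBatchIndexWorker
  include Sidekiq::Worker

  # Sidekiq's retry limit eventually kills off a single record that never succeeds.
  sidekiq_options retry: 3

  def perform(project_id, klass_name, start_id, end_id)
    records = klass_name.constantize
      .where(project_id: project_id, id: start_id..end_id)

    bulk_index(records)
  rescue Faraday::TimeoutError
    # Down to a single resource: re-raise and rely on Sidekiq retries.
    raise if start_id == end_id

    mid = (start_id + end_id) / 2
    self.class.perform_async(project_id, klass_name, start_id, mid)
    self.class.perform_async(project_id, klass_name, mid + 1, end_id)
  end
end
```

A caller would then enqueue one such job per ID range when indexing a project, much like background migrations schedule ranges of IDs in batches.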

Reference: gitlab-org/gitlab#195774