Rewrite the GitHub importer to perform work in parallel and greatly improve performance

Yorick Peterse requested to merge github-importer-refactor into master

What does this MR do?

This MR rewrites the GitHub importer from scratch so it performs work in parallel and is much faster. This MR is still a WIP; I'll update the description properly once we get closer to a final state.

Fixes https://gitlab.com/gitlab-org/gitlab-ce/issues/33135, fixes https://gitlab.com/gitlab-org/gitlab-ce/issues/38621, fixes https://gitlab.com/gitlab-org/gitlab-ce/issues/39361

TODO

  • Use a separate class for importing issue comments
  • Import issue comments using a worker (1 job per comment)
  • Issue and comment workers should reschedule themselves in the future if we hit a rate limit. The reschedule time will simply be the reset time of the rate limit (so if our rate limit resets in 10 seconds, we schedule the job for 10 seconds in the future); see the rescheduling sketch after this list
  • Make the parallel importer schedule a job to check for progress, instead of blocking a thread in a sleep call (see the progress check sketch after this list)
  • Make sure any stuck importer jobs don't mess with a running GitHub import
  • Import releases
  • Re-fetch if a PR was updated after the initial repository fetch
  • See if we can re-use GitHub's diffs from the API for diff notes, instead of using Git(aly)
  • Set a release's description to the tag name if none is provided (since a description is required)
  • Add tests for User.by_any_email
  • Add tests for the importer code
    • Base classes such as Client and UserFinder
    • Representation classes
    • Importer classes
    • Sidekiq workers (GithubImporter namespace)
    • Sidekiq worker concerns
    • Changes to Projects::ImportService to support async importers
    • Changes to RepositoryImportWorker
  • Some GitHub pull requests use the same source/target branch, but GitLab does not seem to support this. Example: https://github.com/rubinius/rubinius/pull/704. This may happen when the source repository/project (or user) no longer exists. See https://sentry.gitlap.com/gitlab/staginggitlabcom/issues/107832/ for more info.
  • Sometimes inserts into merge_requests produce duplicate key errors in the index index_merge_requests_on_target_project_id_and_iid. This is probably due to GitHub sometimes returning duplicate data. We can solve this by not scheduling PRs for which we have already scheduled an import run, using a Set of iid values (see the deduplication sketch after this list). See https://sentry.gitlap.com/gitlab/staginggitlabcom/issues/107834/ for more info.
    • It's also possible this is caused by a job failing, then being retried. Wrapping all work in a transaction could solve this.
  • Handle rescheduled jobs better when cloning (e.g. don't add the remotes if they already exist; see the remote setup sketch after this list). Right now a job terminated mid-run (e.g. by the memory killer) can prevent us from starting over or resuming
  • The logic used for refreshing repositories before importing PRs doesn't scale. Frequently updated PRs will lead to many refetches of the repository, even if the repository wasn't actually updated. Since we schedule the refs in a job we really only need to refetch if the refs can't be found. This will, however, require a lease (see the lease sketch after this list).
  • It seems that under certain conditions the page numbers of e.g. the pull request data aren't tracked properly. It seems to work fine locally, so perhaps something else botched this up. This needs some additional testing.
    • Possible theory: page 1 is never written to the cache because current_page defaults to 1 and set only stores pages where page > current_page. This means that if we were to request page 2 (after having processed page 1) and get a rate limit error, we'd process page 1 again. This could then lead to duplicate key errors if jobs from the same page run in parallel (see the page counter sketch after this list). This doesn't explain, however, why I didn't see the page number being cached at all.
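
Implementation Sketches

The sketches below illustrate a few of the TODO items; none of them are final implementations. First, the rate limit rescheduling: a minimal Sidekiq worker that re-enqueues itself for the rate limit's reset time. RateLimitError and its reset_in attribute are assumed names for whatever the client ends up raising, and the actual import logic is omitted.

```ruby
# Minimal sketch, not the final implementation. RateLimitError and
# reset_in are assumed names for the error raised by the HTTP client.
class ImportIssueWorker
  include Sidekiq::Worker

  def perform(project_id, issue_attributes)
    import(project_id, issue_attributes)
  rescue RateLimitError => error
    # Re-enqueue this exact job for the rate limit's reset time: if the
    # limit resets in 10 seconds, the job runs 10 seconds from now.
    self.class.perform_in(error.reset_in, project_id, issue_attributes)
  end

  def import(project_id, issue_attributes)
    # Actual issue importing logic goes here.
  end
end
```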
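
For the progress check, instead of blocking a thread in a sleep call we can run a worker that checks how many jobs of the current stage are still pending and either reschedules itself or advances to the next stage. A rough sketch, assuming each import job decrements a Redis counter when it finishes; the key and class names are illustrative:

```ruby
# Sketch of a non-blocking progress check for the parallel importer.
class AdvanceStageWorker
  include Sidekiq::Worker

  INTERVAL = 30 # seconds between progress checks

  def perform(project_id, counter_key, next_stage)
    if remaining_jobs(counter_key).zero?
      # All jobs of the current stage finished; start the next one.
      start_next_stage(project_id, next_stage)
    else
      # Check again later instead of sleep()-ing in a thread.
      self.class.perform_in(INTERVAL, project_id, counter_key, next_stage)
    end
  end

  def remaining_jobs(counter_key)
    Gitlab::Redis::SharedState.with { |redis| redis.get(counter_key).to_i }
  end

  def start_next_stage(project_id, next_stage)
    # Scheduling of the next stage's workers is omitted here.
  end
end
```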
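
The Set of iid values for the duplicate key problem could live in Redis, keyed per project, so it survives individual jobs. SADD conveniently returns whether the member was newly added, which doubles as the "should we schedule this PR?" check. The key and class names are again illustrative:

```ruby
# Sketch: schedule a PR import at most once per iid, guarding against
# GitHub returning the same pull request on multiple pages.
class AlreadyScheduledSet
  def initialize(project)
    @key = "github-import/already-scheduled/#{project.id}/pull_requests"
  end

  # Returns true the first time an iid is added, false for duplicates.
  def add?(iid)
    Gitlab::Redis::SharedState.with { |redis| redis.sadd(@key, iid) }
  end
end

# scheduled = AlreadyScheduledSet.new(project)
# ImportPullRequestWorker.perform_async(...) if scheduled.add?(pr.number)
```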
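
Making the remote setup idempotent is mostly a matter of checking for the remote before adding it, so a job that was killed mid-clone can safely run again. A sketch against the Rugged API:

```ruby
# Sketch: create or update the remote instead of blindly adding it, so
# a rescheduled job doesn't fail because the remote already exists.
def ensure_remote(rugged_repository, name, url)
  remote = rugged_repository.remotes[name]

  if remote.nil?
    rugged_repository.remotes.create(name, url)
  elsif remote.url != url
    rugged_repository.remotes.set_url(name, url)
  end
end
```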
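
The lease for repository refreshes could be taken out using Gitlab::ExclusiveLease, so only one job refetches at a time while the others reschedule themselves. In this sketch refs_exist? and RefsNotReady are placeholders for however we detect missing refs and signal "try again later":

```ruby
# Sketch: only refetch when a PR's refs are actually missing, and take
# an exclusive lease so concurrent jobs don't all fetch in parallel.
def ensure_refs(project, merge_request)
  return if refs_exist?(project, merge_request)

  lease = Gitlab::ExclusiveLease.new(
    "github-import/refetch-repository/#{project.id}",
    timeout: 5.minutes.to_i
  )

  if lease.try_obtain
    project.repository.fetch_remote('github', forced: false)
  else
    # Another job holds the lease; raise so the job is rescheduled
    # instead of blocking until the fetch completes.
    raise RefsNotReady
  end
end
```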
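
As for the page tracking theory: if set only stores pages greater than the current value and that value defaults to 1, page 1 is indeed never cached. Defaulting to 0 instead fixes this, since a missing Redis key coerces to 0 and page 1 then passes the check. A sketch with an illustrative key layout:

```ruby
# Sketch of a page counter that also records page 1, so that after a
# rate limit error we resume from page 2 instead of redoing page 1.
class PageCounter
  def initialize(project, collection)
    @key = "github-importer/page-counter/#{project.id}/#{collection}"
  end

  def set(page)
    Gitlab::Redis::SharedState.with do |redis|
      # A missing key returns nil and nil.to_i is 0, so page 1 is
      # greater than the default and gets written to the cache.
      redis.set(@key, page) if page > redis.get(@key).to_i
    end
  end

  def current
    Gitlab::Redis::SharedState.with { |redis| redis.get(@key).to_i }
  end
end
```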

Projects To Test

The following projects should be tested using this new importer on GitLab.com (one way or another):

Does this MR meet the acceptance criteria?
