Rewrite the GitHub importer to perform work in parallel and greatly improve performance
What does this MR do?
This MR rewrites the GitHub importer from scratch so it's much faster and performs work in parallel. This MR is still a WIP; I'll update the body properly once we get closer to a final state.
Fixes #33135 (closed), fixes #38621 (closed), fixes #39361 (closed)
TODO
- Use a separate class for importing issue comments
- Import issue comments using a worker (1 job per comment)
- Issue and comment workers should reschedule themselves in the future if we hit a rate limit; the reschedule time will simply be the reset time of the rate limit (so if our rate limit resets in 10 seconds, we schedule jobs 10 seconds in the future)
- Make the parallel importing schedule a job to check for progress, instead of blocking the thread in a `sleep` call
- Make sure any stuck importer jobs don't mess with a running GitHub import
- Import releases
- Re-fetch if a PR was updated after the initial repository fetch
- See if we can re-use GitHub's diffs from the API for diff notes, instead of using Git(aly)
- Set a release's description to the tag name if none is provided (since a description is required)
- Add tests for `User.by_any_email`
- Add tests for the importer code:
  - Base classes such as `Client` and `UserFinder`
  - Representation classes
  - Importer classes
  - Sidekiq workers (`GithubImporter` namespace)
  - Sidekiq worker concerns
  - Changes to `Projects::ImportService` to support async importers
  - Changes to `RepositoryImportWorker`
- Some GitHub pull requests use the same source/target branch, but GitLab does not seem to support this. Example: https://github.com/rubinius/rubinius/pull/704. This may happen when the source repository/project (or user) no longer exists. See https://sentry.gitlap.com/gitlab/staginggitlabcom/issues/107832/ for more info.
- Sometimes inserts into `merge_requests` produce duplicate key errors in the index `index_merge_requests_on_target_project_id_and_iid`. This is probably due to GitHub sometimes returning duplicate data. We can solve this by not scheduling PRs for which we have already scheduled an importing run (using a Set of iid values; see the sketch after this list). See https://sentry.gitlap.com/gitlab/staginggitlabcom/issues/107834/ for more info.
- It's also possible this is caused by a job failing, then being retried. Wrapping all work in a transaction could solve this.
- Handle rescheduled jobs better when cloning (e.g. don't add the remotes if they already exist). Right now a job getting terminated in the middle (e.g. by the memory killer) can prevent us from starting over/resuming
- The logic used for refreshing repositories before importing PRs doesn't scale. Frequently updated PRs will lead to many refetches of the repository, even if the repository wasn't actually updated. Since we schedule the refs in a job we really only need to refetch if the refs can't be found. This will require a lease however.
- It seems that under certain conditions the page numbers of e.g. the pull requests data aren't tracked properly. It seems to work fine locally, so perhaps something else botched this up. This needs some additional testing.
  - Possible theory: page 1 is never written to the cache because of `current_page` defaulting to 1 and `if page > current_page` in `set`. This means that if we were to request page 2 (after having processed page 1) and get a rate limit error, we'd process page 1 again. This could then lead to duplicate key errors if jobs from the same page run in parallel. This however doesn't explain why I didn't see the page number being cached at all.
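A minimal sketch of the Set-based scheduling guard mentioned in the list above, assuming Redis holds the set; the class name, worker name, and key layout are made up for illustration:

```ruby
# Hypothetical sketch: Redis' SADD returns true only when the member was
# newly added, so it doubles as an atomic "did we already schedule this
# PR?" check across parallel workers.
class PullRequestScheduler
  def initialize(project)
    @project = project
  end

  def schedule(pull_request)
    return unless mark_as_scheduled(pull_request[:iid])

    # ImportPullRequestWorker is a placeholder name for the per-PR job.
    ImportPullRequestWorker.perform_async(@project.id, pull_request)
  end

  private

  # Returns true if the iid was not in the set yet.
  def mark_as_scheduled(iid)
    Gitlab::Redis::Cache.with do |redis|
      redis.sadd("github-import/#{@project.id}/scheduled-prs", iid)
    end
  end
end
```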
Projects To Test
The following projects should be tested using this new importer on GitLab.com (one way or another):
- https://github.com/yorickpeterse/oga: 103 seconds on staging, 47 seconds on production
- https://github.com/kubernetes/kubernetes: 6.5 hours on production
- https://github.com/rubinius/rubinius: 22.3 minutes on staging, 20 minutes on production
Does this MR meet the acceptance criteria?
- Changelog entry added, if necessary
- Documentation created/updated
- API support added
- Tests added for this feature/bug
- Review
  - Has been reviewed by Backend
  - Has been reviewed by Database
- Conform by the merge request performance guides
- Conform by the style guides
- Squashed related commits together
- restored source branch `github-importer-refactor`
-
Master
For this MR my goal is to get https://github.com/kubernetes/kubernetes to import in roughly a minute or so, at least when using the parallel importer (the Rake task will still be sequential). This however requires a near total rewrite of the importer as I couldn't make this work in the old code.
-
Master
@yorickpeterse As far as I know `Gitlab::GithubImport::Importer` is only used for Gitea import; GitHub import now uses `Github::Import`.
-
Master
@rymai Yeah, I renamed that one to `LegacyGithubImport`. I didn't rename it to `GiteaImport` since it uses Octokit and some other GitHub-specific bits, but perhaps that's a better name.

I also just pushed a new version which adds the ability to import issues and whatnot properly. Pull requests and releases are not yet handled. Issues are also imported in parallel, which results in https://github.com/yorickpeterse/oga being imported in 15-20 seconds instead of 120 seconds (excluding PRs and releases).
-
-
Master
I spent a few more hours hacking on this and things are starting to come together. Labels and milestones are assigned properly, data for non-existing users is handled, etc.
Along the way I did get hit by the GitHub rate limit when trying to import https://github.com/kubernetes/kubernetes. This isn't entirely surprising because this project has 4626 issues and we need at least an equal amount of requests to get the comments for these issues. In the current setup this means we may (in the worst case) have to wait up to an hour to get the data.
My plan for dealing with this is to break importing comments up into a separate worker, as this will consume most requests. These jobs will then re-schedule themselves in the future when hitting the rate limit, based on when the rate limit will reset. This means importing Kubernetes can still take an hour or two because of this, but at least we won't have threads blocked in a `sleep` call.
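A minimal sketch of that rescheduling behaviour, assuming an Octokit-backed client; the worker name, job arguments, and `import_comment` body are hypothetical:

```ruby
require 'octokit'
require 'sidekiq'

# Sketch only: when the rate limit is exhausted, re-enqueue the job for the
# limit's reset time instead of blocking the thread in a sleep call.
class ImportIssueCommentWorker
  include Sidekiq::Worker

  def perform(project_id, comment_attributes)
    client = Octokit::Client.new(access_token: ENV['GITHUB_TOKEN'])

    if client.rate_limit.remaining.zero?
      # Octokit exposes the number of seconds until the rate limit resets,
      # which is exactly the delay we want for the retry.
      self.class.perform_in(client.rate_limit.resets_in, project_id, comment_attributes)
      return
    end

    import_comment(project_id, comment_attributes)
  end

  private

  def import_comment(project_id, comment_attributes)
    # Placeholder for the actual importing logic.
  end
end
```
-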
changed the description
-
Master
Potential option: there's an API endpoint for getting comments across a repository (e.g. https://api.github.com/repos/YorickPeterse/oga/issues/comments). While the returned comment payloads don't contain a direct association to their "parent" (issue or PR), we could parse this from the `issue_url` field. Assuming this were to work, it means we'd only need 4626 / 100 (the maximum number of objects per page) = 47 requests.
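A sketch of what parsing that field could look like; the regex and method name here are mine, not the MR's code. GitHub serves PR comments through the issues API too, so the URL ends in the issue/PR number either way:

```ruby
# Extract the parent issue/PR number from a comment's issue_url, e.g.
# https://api.github.com/repos/YorickPeterse/oga/issues/42 => 42
ISSUE_URL_REGEX = %r{/repos/[^/]+/[^/]+/issues/(?<iid>\d+)\z}

def parent_iid(comment)
  match = ISSUE_URL_REGEX.match(comment[:issue_url].to_s)
  match && match[:iid].to_i
end

parent_iid(issue_url: 'https://api.github.com/repos/YorickPeterse/oga/issues/42') # => 42
```
-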
Master
> Yeah, I renamed that one to `LegacyGithubImport`. I didn't rename it to `GiteaImport` since it uses Octokit and some other GitHub-specific bits, but perhaps that's a better name.

Oh, ok I see: you've rewritten `Gitlab::GithubImport::Importer` (the old importer) and kept the new `Github::Import` (which won't be used anymore after this refactor). That was what confused me! :P

> My plan for dealing with this is to break importing comments up into a separate worker, as this will consume most requests. These jobs will then re-schedule themselves in the future when hitting the rate limit, based on when the rate limit will reset. This means importing Kubernetes can still take an hour or two because of this, but at least we won't have threads blocked in a `sleep` call.

Sounds good!

> Potential option: there's an API endpoint for getting comments across a repository (e.g. https://api.github.com/repos/YorickPeterse/oga/issues/comments). While the returned comment payloads don't contain a direct association to their "parent" (issue or PR), we could parse this from the `issue_url` field. Assuming this were to work, it means we'd only need 4626 / 100 (the maximum number of objects per page) = 47 requests.

@yorickpeterse Yes, that's what I discovered and proposed too: #38200 :)
-
Master
@yorickpeterse Man that's fast, local benchmarks when importing `guard/guard` (531 issues):

- master: 1228 seconds (~20 minutes) 🐌
- this branch: 74 seconds (1 minute, 14 seconds) 🚀
-
marked the task Import issue comments using a worker (1 job per comment) as completed
-
marked the task Use a separate class for importing issue comments as completed
-
Master
Importing comments is now done using the "comments per repo" endpoint (https://developer.github.com/v3/issues/comments/#list-comments-in-a-repository), saving a lot of API calls.
Pull requests will be a bit tricky. In the old approach we would restore some branches and (as far as I can tell) would generate diffs using our Git clone. This means we could hit NFS pretty hard if we were to run e.g. 100 PR importer jobs in parallel. Unfortunately the GitHub API does not appear to provide a single API endpoint to get the diffs in a merge request. I'll see what we can do about this.
-
Master
> Pull requests will be a bit tricky. In the old approach we would restore some branches and (as far as I can tell) would generate diffs using our Git clone. This means we could hit NFS pretty hard if we were to run e.g. 100 PR importer jobs in parallel. Unfortunately the GitHub API does not appear to provide a single API endpoint to get the diffs in a merge request. I'll see what we can do about this.
@yorickpeterse If you take a look at the new importer `Github::Import`, we're not restoring branches anymore. :)
-
Master
@rymai Aha, that explains why I couldn't find any calls to `restore_branch!` and the like any more; I could have sworn they were there, but them being refactored away explains why I couldn't find them. What is the commit (or commits) that introduced this change? That might make it easier for me to understand how to best port that logic over to this parallel importer.
-
Master
Ah I found it, those calls were removed in 43b75b38. I'll take a look at that and the following commits.
-
Master
> Aha, that explains why I couldn't find any calls to `restore_branch!` and the like any more; I could have sworn they were there, but them being refactored away explains why I couldn't find them. What is the commit (or commits) that introduced this change? That might make it easier for me to understand how to best port that logic over to this parallel importer.

@yorickpeterse Yeah, the MR was !14445 (merged), the description of the MR and its commits can probably help you understand what was done and why! Also, feel free to check out https://gitlab.com/gitlab-org/gitlab-ce/issues/36284 for more background and work on improving importing and performance in general.
-
Master
I poked around with the GitHub GraphQL API and this looks quite promising. For example, it would allow us to get issue/PR authors/assignees including their Emails, removing the need for API calls to get that data. This is quite important as looking up e.g. 1000 users could in the worst case lead to 1000 GitHub API calls.
To give an example, here's a GraphQL query to get all issues of a repository along with the associated data we'd need:
```graphql
query($owner: String!, $repo: String!, $after: String) {
  repository(owner: $owner, name: $repo) {
    issues(states: [OPEN, CLOSED], first: 100, after: $after, orderBy: { field: CREATED_AT, direction: ASC }) {
      nodes {
        number
        title
        body
        author {
          ... on User {
            id: databaseId
            login
            email
          }
        }
        assignees(first: 10) {
          nodes {
            ... on User {
              id: databaseId
              login
              email
            }
          }
        }
        milestone_iid: milestone {
          number
        }
        label_names: labels(first: 10) {
          nodes {
            name
          }
        }
      }
      pageInfo {
        has_next_page: hasNextPage
        end_cursor: endCursor
      }
    }
  }
  rate_limit: rateLimit {
    remaining
    reset_at: resetAt
  }
}
```

This query includes some info necessary for pagination as well as rate limiting details. The output will be something along the lines of:
```json
{
  "data": {
    "repository": {
      "issues": {
        "nodes": [
          {
            "number": 1,
            "title": "Lexing doctypes",
            "body": "Both the lexer and the HTML parser should support the processing of doctypes. Currently both already have basic support for doctypes but this should be improved so that it's easier to extract information such as the doctype URL.\n",
            "author": {
              "id": 86065,
              "login": "YorickPeterse",
              "email": "yorickpeterse@gmail.com"
            },
            "assignees": { "nodes": [] },
            "milestone_iid": null,
            "label_names": {
              "nodes": [
                { "name": "HTML" },
                { "name": "Lexer" }
              ]
            }
          },
          ...
        ],
        "pageInfo": {
          "has_next_page": true,
          "end_cursor": "Y3Vyc29yOnYyOpK5MjAxNC0wMi0yOFQxMTozODo1NCswMTowMM4Bsr8v"
        }
      }
    },
    "rate_limit": {
      "remaining": 4966,
      "reset_at": "2017-10-10T15:09:30.000Z"
    }
  }
}
```

There's a Ruby client written by GitHub here: https://github.com/github/graphql-client. It seems to be able to send calls over HTTP out of the box, though the code seems to be more of a proof of concept: https://github.com/github/graphql-client/blob/master/lib/graphql/client/http.rb
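For reference, wiring up that client could look roughly like this; the token handling and the trimmed-down query are illustrative, not code from this MR:

```ruby
require 'graphql/client'
require 'graphql/client/http'

# The HTTP adapter from github/graphql-client; GITHUB_TOKEN is an assumption.
HTTP = GraphQL::Client::HTTP.new('https://api.github.com/graphql') do
  def headers(_context)
    { 'Authorization' => "Bearer #{ENV.fetch('GITHUB_TOKEN')}" }
  end
end

Schema = GraphQL::Client.load_schema(HTTP)
Client = GraphQL::Client.new(schema: Schema, execute: HTTP)

# graphql-client requires queries to be parsed once and assigned to constants.
IssuesQuery = Client.parse(<<~'GRAPHQL')
  query($owner: String!, $repo: String!, $after: String) {
    repository(owner: $owner, name: $repo) {
      issues(states: [OPEN, CLOSED], first: 100, after: $after) {
        nodes { number title }
        pageInfo { hasNextPage endCursor }
      }
    }
  }
GRAPHQL

result = Client.query(IssuesQuery, variables: { 'owner' => 'yorickpeterse', 'repo' => 'oga' })
puts result.data.repository.issues.nodes.map(&:title)
```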
-
Master
And here's a similar query for getting pull requests, including labels and whatnot:

```graphql
query($owner: String!, $repo: String!, $after: String) {
  repository(owner: $owner, name: $repo) {
    pullRequests(states: [OPEN, CLOSED, MERGED], first: 100, after: $after, orderBy: { field: CREATED_AT, direction: ASC }) {
      nodes {
        number
        title
        body
        author {
          ... on User {
            id: databaseId
            login
            email
          }
        }
        assignees(first: 10) {
          nodes {
            ... on User {
              id: databaseId
              login
              email
            }
          }
        }
        milestone_iid: milestone {
          number
        }
        label_names: labels(first: 10) {
          nodes {
            name
          }
        }
      }
      pageInfo {
        has_next_page: hasNextPage
        end_cursor: endCursor
      }
    }
  }
  rate_limit: rateLimit {
    remaining
    reset_at: resetAt
  }
}
```

One minor nuisance in the GraphQL API is that paginating nested objects (e.g. labels) will be annoying. However, I think we could just say we'll only import up to 10 assignees and labels (or something along those lines) per issue/pull request. There shouldn't be that many cases (if any) where an issue/PR has more than 10 labels.
-
Master
Annoyingly, the GraphQL API does not support a way of getting comments across an entire repository; instead you can only get them per issue/PR/etc. This means we'd have to embed the comments in the above queries and then paginate them, running many API calls per issue/PR to get the data, so it might be more efficient to just use the REST API.
-
Master
So pros/cons of GraphQL:
Pros:
- Fewer calls necessary to get data such as labels of issues and PRs
- No additional calls necessary to get the Email address of users
- Query language is decent
Cons:
- Need to implement pagination yourself using GraphQL cursors
- Need API calls to traverse comment pages for every issue/PR
- Very new
- I'd have to rewrite most of the code I wrote thus far
I'll stick with REST for now, unless I really start hating it.
-
added 2 commits
-
Master
@rymai @dbalexandre I noticed that creating notes will result in a call to `Repository#keep_around` to keep the commits. I'm wondering if this is necessary when importing? Getting rid of this would save us a lot of Git/NFS operations when importing large projects.
-
Master
To add to that: the number of Gitaly calls that I see flying past in my console when importing PRs and diff notes is quite worrying. I have doubts about our setup being able to handle large imports if we keep hammering Git for every MR and note.
-
marked the task Import releases as completed
-
marked the task Make the parallel importing schedule a job to check for progress, instead of blocking the thread in a `sleep` call as completed
-
marked the task Issue and comment workers should reschedule themselves in the future if we hit a rate limit, the reschedule time will simply be the reset time of the rate limit (so if our rate limit resets in 10 seconds that means we schedule jobs for 10 seconds in the future) as completed
-
marked the task Make sure any stuck importer jobs don't mess with a running GitHub import as completed
-
Master
I managed to break up the importer into a pipeline consisting of the following stages:
- `ImportRepository`: this will import base data such as the repository, wiki, labels, milestones, and releases (this data can be imported quickly enough in a single thread)
- `ImportPullRequests`: this will import all pull requests from GitHub
- `ImportIssuesAndDiffNotes`: this will create regular issues, enrich merge requests with labels, and create merge request diff notes
- `ImportNotes`: this creates regular notes for both issues and merge requests, thus we have to schedule this after those have been imported
- `FinishImport`: this will simply mark the job as finished and clear any caches
Between every stage we schedule a job for the `AdvanceStageWorker`. This worker will check if the work from the previous stage is done; if so, it will advance the pipeline. If this isn't the case it will reschedule itself and check again later. Using this process we only very briefly block threads in `sleep` calls (when checking if jobs are completed in `AdvanceStageWorker` we wait for a little bit to reduce the number of reschedules), instead of doing so for hours on end.

A downside of this approach is that there's a minimum import time of around 1 minute or so, though this depends largely on how fast jobs can be completed before the `AdvanceStageWorker` reschedules itself. This however is a trade-off that's worth it, as the alternative is blocking an entire thread (or multiple, depending on the number of import jobs running) for a long period of time.
Using this setup I can import github.com/yorickpeterse/oga in about 100 seconds on my laptop, whereas it would take the old importer around 4.5 minutes.
API-call-wise, importing Oga requires 91 API calls without user information being cached. With this data being cached we only need 11 API calls.

The code also takes care of refreshing the import JID so `StuckImportJobsWorker` doesn't mark the import process as failed after 15 minutes, though I still have to properly test this.
My next plan is to write automated tests for all this code (and probably fix a bunch of bugs in the process), then run some tests on larger projects such as Kubernetes.
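A simplified sketch of that stage-advancing flow; the pending-job counter, the Redis key, and the stage-to-worker mapping are illustrative, not the actual implementation:

```ruby
require 'sidekiq'

# Sketch only: advance the pipeline once all jobs of the previous stage are
# done, otherwise check again later instead of blocking in a sleep call.
class AdvanceStageWorker
  include Sidekiq::Worker

  INTERVAL = 30 # seconds to wait before checking the job count again

  # Maps a finished stage to the (placeholder) worker starting the next one.
  STAGES = {
    'pull_requests'         => 'ImportIssuesAndDiffNotesWorker',
    'issues_and_diff_notes' => 'ImportNotesWorker',
    'notes'                 => 'FinishImportWorker'
  }.freeze

  def perform(project_id, stage)
    if remaining_jobs(project_id, stage).zero?
      Object.const_get(STAGES.fetch(stage)).perform_async(project_id)
    else
      # Jobs are still running: re-enqueue ourselves and check again later.
      self.class.perform_in(INTERVAL, project_id, stage)
    end
  end

  private

  # Assumption: every job scheduled for a stage increments this counter and
  # decrements it upon completion.
  def remaining_jobs(project_id, stage)
    Gitlab::Redis::Cache.with do |redis|
      redis.get("github-import/#{project_id}/#{stage}/pending").to_i
    end
  end
end
```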
-
changed the description
-
-
added 493 commits
- 04e689fb...34fd07f6 - 491 commits from branch master
- 2f10fa3e - Refactor User.find_by_any_email
- 214b70a6 - Rewrite the GitHub importer from scratch
-
changed the description
-
marked the task Add tests for `User.by_any_email` as completed
-
resolved all discussions
-
added 2 commits
-
changed the description
-
Hi Yorick,
Thanks for this work. I've successfully migrated a GitHub repo, where the legacy importer failed to import the issues. Congrats for your work.
Btw, I've just "cherry-picked" the changes of the two commits (8b4fe973 and e7cf1c7d) and patched them on my omnibus installation 10.0.3 8895150 and that worked fine!
I've seen you've discussed the (re-)naming of the importer. I personally would suggest to leave the legacy importer un-renamed (and files not moved) and instead file the new code under e.g. `github_import_ng` (for next generation) or `_v2`.

Also, would it not be great to have all importers named the same, i.e. `Gitlab::GithubImportNg::Importer` rather than `Github::Import`?

I'm very new to this project and there might be good reasons for how you have chosen to name things.
Cheers, Martin.
-
One observation I'd like to add:
After importing a GitHub repo with the patched version of the CE, I exported the same repo again in order to import it on gitlab.com. However, that import failed because some releases' description fields were empty (`""`). Once I corrected the `project.json` in the export by replacing `$.releases[*].description` with `$.releases[*].tag` when `description === ''`, the import went through smoothly.

Now obviously, this is an issue of the `ImportExport` module, as it allows exporting empty descriptions but does not import them again. However, it might be a follow-up error caused by the fact that this patch here does import empty descriptions where otherwise this would not be allowed?
-
See also #18235 Refactor GitHub importer
-
Also see !13788 (closed)
-
One thing I have realised is that labels are correctly generated; however, they are not assigned to MRs or issues.
-
Master
> I've seen you've discussed the (re-)naming of the importer. I personally would suggest to leave the legacy importer un-renamed (and files not moved) and instead file the new code under e.g. `github_import_ng` (for next generation) or `_v2`.

Including versions/generations in namespaces is in my opinion a bad idea; I'd much rather just have a "legacy" importer vs "the current" importer. It's worth mentioning that I may just rename `LegacyGithubImporter` to `GiteaImporter` since it's only used for Gitea.

> Also, would it not be great to have all importers named the same, i.e. `Gitlab::GithubImportNg::Importer` rather than `Github::Import`?

This separation is the result of Sidekiq workers residing in `app/workers` and not `lib/gitlab/X`. Having said that, I think creating directories such as `app/workers/gitlab/github_import/` and placing the workers there should also work. I'll take a look at this.

> Now obviously, this is an issue of the `ImportExport` module, as it allows exporting empty descriptions but does not import them again. However, it might be a follow-up error caused by the fact that this patch here does import empty descriptions where otherwise this would not be allowed?

It seems the old importer would validate releases, skipping those that are invalid. This validation process requires a description to be present. Validations are currently bypassed since we insert releases in bulk, though we can quite easily validate data without having to initialize a Release object for every release.
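A sketch of what such bulk validation could look like; `rows` and the helper are hypothetical, with the tag-name fallback taken from the TODO item in the description:

```ruby
# Hypothetical helper: repair or reject release attribute hashes before the
# bulk insert, mirroring what Release validations would otherwise enforce.
def valid_release_rows(rows)
  rows.map do |row|
    # Fall back to the tag name when no description is provided, since a
    # description is required.
    description = row[:description].presence || row[:tag]
    next if description.blank?

    row.merge(description: description)
  end.compact
end

Gitlab::Database.bulk_insert('releases', valid_release_rows(rows))
```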
> One thing I have realised is that labels are correctly generated; however, they are not assigned to MRs or issues.

They should be; at least this was working last time I tried the code in this MR locally. Perhaps I broke it while refactoring the code in recent pushes. I'm currently in the process of writing tests for all the code, hopefully that will fix any remaining bugs once I'm done.
-
marked the task Representation classes as completed
-
Master
Regarding the milestone: I moved this to 10.2, at least for now. I initially hoped to be able to squeeze this into 10.1 behind a feature flag, but since the 22nd is this Sunday and the summit starts before that I don't think I'll make it.
Maybe I can make it, but the more likely alternative will be hot patching GitLab.com so we can test this, then ship it in 10.2.0.
-
Thanks for your response and all the explanations. Much appreciated.
Re release description: I would suggest not skipping the import of a given release when no description is given, but rather using the tag name as the description.

Re labels: I checked the code and I have not been able to identify the issue. However, they are not assigned (or not updated).
-
Master
@martin.rueegg That link points to the old importer, not the one being added in this MR. The new code is in the `LabelLinksImporter` class in this MR. The old importer indeed had issues with not assigning labels properly.
-
changed the description
-
changed the description
-
@yorickpeterse, oh! :-D ok.
Well, then I have not run your importer! And I don't know if I can patch my 10.0.3 omnibus installation to test it ...
-
marked the task Importer classes as completed
-
marked the task Set a release's description to the tag name if none is provided (since a description is required) as completed
-
Master
All code now resides in the `Gitlab::GithubImport` namespace, instead of the Sidekiq code residing in `GithubImporter`.
-
Master
> Well, then I have not run your importer! And I don't know if I can patch my 10.0.3 omnibus installation to test it ...
If you want to test things you'd have to apply the following patch using `patch -p1 -N < file.patch`: importer.patch

I haven't tested this with EE so it might not work there yet. You'll then have to muck around with the Rake task to get it to use the new importer, e.g. by applying this patch: rake.patch. You can then run the new importer using this:

```shell
env IMPORTER=parallel bundle exec rake import:github[token,gitlab-user,gitlab-user/gitlab-repo,github-user/github-repo]
```

For example:

```shell
env IMPORTER=parallel bundle exec rake import:github[hunter2,yorickpeterse,yorickpeterse/kubernetes-community,kubernetes/community]
```

For this to work you'll need to restart Sidekiq so it has workers available for the new queues. You can also use `IMPORTER=sequential` to use the new code in sequential mode, but this will be slower for large projects compared to `IMPORTER=parallel`.
-
marked the task Changes to `Projects::ImportService` to support async importers as completed
-
marked the task Changes to `RepositoryImportWorker` as completed
-
marked the task Sidekiq workers (`GithubImporter` namespace) as completed
-
marked the task Sidekiq worker concerns as completed
-
Master
@smcgivern @DouweM This MR is finally reaching a point where all code is tested automatically, meaning it's time to document the changes, write a decent commit message, and most importantly of all: test it on GitLab.com somehow. In particular I want to test the impact of importing projects such as the one the Edge team are working on, how long it takes, if we need more monitoring, etc.
There are two approaches to doing this:
- Include this code in 10.1 but hide it behind a feature flag (that's disabled by default) and don't announce it yet (or announce it as being in a beta state of sorts)
- Don't include this in 10.1 and aim for 10.2, then hotpatch GitLab.com until we're done with testing
The first approach has the benefit of being the easiest to test as we don't need any patching. It also means we can start importing larger projects (if desired) without having to wait another month. There are also two downsides of this approach:
- We're way past the merge window and the changes here are quite big (though a feature flag should prevent anything from breaking).
- It will require reviewing before this Sunday, which can be difficult with the summit.
Option 2 doesn't suffer from these problems but instead suffers from the following:
- Testing this code will be a pain because we'd have to hot-patch all Sidekiq workers, instantly affecting everybody that imports a GitHub project (unless we also hide the code behind a feature flag).
- It will take at least a month before we can ship this to GitLab.com in a state that doesn't require us to run patches after every deploy.
As such I'd like to know what you think would be best here.
-
Master
73632809 enables the new importer using a feature flag. It's not super pretty but it gives an idea of how we'd enable/test this until 10.2, if we were to include this in 10.1 that is.
-
unmarked as a Work In Progress
-
changed the description
-
Master
@yorickpeterse thanks for laying out those options. I don't like the idea of merging and picking by Sunday. However, if it's behind a feature flag anyway, could we not just include it in a patch to 10.1? It is stretching the definition of 'patch' quite a lot, but if it has no impact without the feature flag, that seems like a reasonable compromise?
-
Master
@yorickpeterse @smcgivern I like the compromise option!
-
added 321 commits
- 73632809...bd39b441 - 318 commits from branch master
- f3e238c2 - Refactor User.find_by_any_email
- 35538cd5 - Rewrite the GitHub importer from scratch
- eeb18a5b - Enable the new GitHub importer with a feature flag
-
Master
@smcgivern That seems like a better option, that way we don't have to rush things out now.
Unrelated to this, it seems the build is failing due to static analysis tools failing on unrelated files. Not sure what's going on here.
-
Master
@yorickpeterse hmm, that's weird. One of the errors seems slightly related, but the others are completely unrelated. Does `bundle exec rake flay` also fail locally?
-
Master
@smcgivern Locally it fails as well, but it seems to spit out similar output when I switch back to the `master` branch:

```
app/models/user.rb:41 :: parse error on value ":" (tCOLON)
skipping app/models/user.rb
app/models/concerns/awardable.rb:5 :: parse error on value nil (error)
skipping app/models/concerns/awardable.rb
lib/gitlab/git/storage/health.rb:29 :: parse error on value ["do", 29] (kDO_BLOCK)
skipping lib/gitlab/git/storage/health.rb
lib/gitlab/git/storage/forked_storage_check.rb:46 :: parse error on value "<<" (tLSHFT)
skipping lib/gitlab/git/storage/forked_storage_check.rb
lib/gitlab/background_migration/delete_conflicting_redirect_routes_range.rb:28 :: parse error on value "<<" (tLSHFT)
skipping lib/gitlab/background_migration/delete_conflicting_redirect_routes_range.rb
lib/gitlab/background_migration/create_fork_network_memberships_range.rb:28 :: parse error on value "FROM" (tCONSTANT)
skipping lib/gitlab/background_migration/create_fork_network_memberships_range.rb
lib/gitlab/background_migration/populate_fork_networks_range.rb:15 :: parse error on value "FROM" (tCONSTANT)
skipping lib/gitlab/background_migration/populate_fork_networks_range.rb
lib/gitlab/sql/glob.rb:9 :: parse error on value "<<" (tLSHFT)
skipping lib/gitlab/sql/glob.rb
lib/gitlab/exclusive_lease.rb:13 :: parse error on value "<<" (tLSHFT)
skipping lib/gitlab/exclusive_lease.rb
```
-
Master
@yorickpeterse ah, it looks like that exits with zero, though. Our rake task is this:

```ruby
desc 'Code duplication analyze via flay'
task :flay do
  output = `bundle exec flay --mass 35 app/ lib/gitlab/`

  if output.include? "Similar code found"
    puts output
    exit 1
  end
end
```

So I think that if we don't care about the similar code in the importers (haven't reviewed this), we could ignore that, or update the threshold. I have no idea why the untouched files also generate that message.
-
Master
@smcgivern Ah ok. The similar code issue is something I'm tackling now so that should solve the failure.
-
changed the description
-
marked the task See if we can re-use GitHub's diffs from the API for diff notes, instead of using Git(aly) as completed
-
Master
@rymai In the old importer we'd re-fetch the refs of a repository if a pull request was updated after the repository was cloned. In the parallel importer this is currently not done because:
- Pull requests are imported in separate threads/processes/hosts meaning we can't store a timestamp in a local variable and re-use it.
- Pull request jobs are not given any additional timestamps for this, and I prefer to keep it that way. Right now the Sidekiq interface is generic, making it easy to add more workers without having to duplicate a lot of code. Adding this fetch feature would require specialisation just for importing PRs.
- This would require some form of synchronisation (e.g. an `ExclusiveLease`) as otherwise many jobs could attempt to run the same `git fetch`. Adding this is something I prefer to avoid.
This leads me to this question: is this feature important enough that we want to include it?
-
Master
To reduce the number of queries and Gitaly RPC calls I made some changes to the Note, DiffNote, and Issue importers. These changes result in the importers bypassing the callbacks and validations of models by using `Gitlab::Database.bulk_insert` instead of the models. Bypassing the models (and thus any callbacks and validations) saves us a lot of SQL queries and a handful of Gitaly RPC calls.

To test this I imported https://github.com/yorickpeterse/oga into my local development environment. The before/after can be seen below.
Before bypassing callbacks/validation
Gitaly calls:
| RPC call | Amount |
|----------|--------|
| repository_exists | 36 |
| has_local_branches | 1 |
| fetch_remote | 1 |
| find_commit | 877 |
| commits_between | 70 |
| commit_diff | 70 |
| ref_exists | 490 |

- Total RPC calls: 1545
- Total SQL queries: 10806
- Total time: 165 seconds
After bypassing callbacks/validation
Gitaly calls:
| RPC call | Amount |
|----------|--------|
| repository_exists | 36 |
| has_local_branches | 1 |
| fetch_remote | 1 |
| find_commit | 770 |
| commits_between | 70 |
| commit_diff | 70 |
| ref_exists | 420 |

- Total RPC calls: 1368
- Total SQL queries: 4621
- Total time: 87 seconds
Conclusion
So by making these changes we save ourselves 177 Gitaly RPC calls, 6185 SQL queries, and about 78 seconds of execution time. Do keep in mind that this test was run over the summit's hotel wifi, thus the timings are likely to be influenced by how responsive the wifi was during the test.
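For illustration, the bulk insert approach boils down to building raw attribute hashes and writing them in a single statement; the surrounding variables (`comments`, `issue_ids`, `user_ids`) are hypothetical and the column list is trimmed:

```ruby
# Sketch only: one INSERT for all notes, skipping per-model callbacks and
# validations (and the keep_around/Gitaly work they would trigger).
rows = comments.map do |comment|
  {
    noteable_type: 'Issue',
    noteable_id: issue_ids.fetch(comment[:issue_iid]),
    project_id: project.id,
    author_id: user_ids.fetch(comment[:author_id]),
    note: comment[:body],
    created_at: comment[:created_at],
    updated_at: comment[:updated_at]
  }
end

Gitlab::Database.bulk_insert('notes', rows)
```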
-
Master
With some further caching (of labels and milestones) we can reduce the number of SQL queries to 4470 for Oga. I'll see if I can further reduce the queries, since 4470 is still quite a lot for such a small project.
-
Master
As it turns out the caching of user IDs/Emails wasn't entirely working as expected. Fixing this and some other issues I ran into reduces the number of SQL queries to 4409. This is still a lot, but we're slowly getting there.
-
changed the description
-
Master
With some further caching being added for issue/MR lookups (when importing notes, diff notes, and label links) the number of SQL queries can be reduced to 3590. About 1450 queries are for getting the feature flag names (see #39361 (closed) for more information), which isn't something I can really work around in this MR.
The remaining queries are mostly SELECT queries to build the cache, and a lot of INSERTs for the data we're importing.
-
changed the description
-
-
added 133 commits
- e3cfb168...e16add22 - 129 commits from branch master
- cb58a673 - Refactor User.find_by_any_email
- 87722696 - Add returning IDs to Gitlab::Database.bulk_insert
- 371d16e4 - Rewrite the GitHub importer from scratch
- 661489c0 - Enable the new GitHub importer with a feature flag
-
Master
> is this feature important enough that we want to include it?
We could probably fail a PR import if the ref isn't fetched and restart a new import, which would do a `fetch` first, so that PRs that weren't imported correctly would be imported fine since their refs would have been fetched this time... I still need to look at this massive diff to see if that'd be possible, but you probably see the idea.
-
Master
> We could probably fail a PR import if the ref isn't fetched and restart a new import, which would do a `fetch` first, so that PRs that weren't imported correctly would be imported fine since their refs would have been fetched this time... I still need to look at this massive diff to see if that'd be possible, but you probably see the idea.

All pull requests are imported in parallel, and I'd rather not have all of them also check the repository status using Git as this is likely to blow up the NFS shard.
So if we want this we need to store a timestamp somewhere that marks the time we finished cloning the repository. We then need synchronisation across processes/threads to make this work, otherwise e.g. 100 threads will run `git fetch` in parallel.
-
Master
> All pull requests are imported in parallel, and I'd rather not have all of them also check the repository status using Git as this is likely to blow up the NFS shard.
@yorickpeterse Sorry, it seems I wasn't clear in my answer above: I meant that if an MR diff generation fails (most probably because the ref isn't fetched), then we should retry the import after it's finished (in a failed state because of the MR diff that couldn't be generated). The import retry would start with a fresh `git fetch`, so the MR diff that couldn't be generated should now be generated properly...

Anyways, I will need to go through the diff to understand better how this will (or won't) work.
-
added 141 commits
- 7b1d922a...6919f636 - 136 commits from branch master
- dad72cdd - Refactor User.find_by_any_email
- 98d2aab8 - Add returning IDs to Gitlab::Database.bulk_insert
- 03f8f3d3 - Cache feature names in RequestStore
- 0cc94878 - Rewrite the GitHub importer from scratch
- 023e5ae2 - Enable the new GitHub importer with a feature flag
-
marked the task Re-fetch if a PR was updated after the initial repository fetch as completed
-
changed the description
-
resolved all discussions
-
Master
@DouweM While I continue running tests on staging/locally, could you start reviewing this? There's quite a lot to review, so it may be easier to look at the individual commits instead of the diff in its entirety. We can also review this via a video call if that's easier.
-
changed the description
-
changed the description
-
changed the description
-
marked the task Some GitHub pull requests use the same source/target branch, but GitLab does not seem to support this. Example: https://github.com/rubinius/rubinius/pull/704. This may happen when the source repository/project (or user) no longer exists. See https://sentry.gitlap.com/gitlab/staginggitlabcom/issues/107832/ for more info. as completed
-
-
Master
@godfat @jameslopez You two may want to have a look as well, since you've been working on the GitHub importer for https://gitlab.com/gitlab-org/gitlab-ce/issues/36284.
-
marked the task Sometimes inserts into `merge_requests` produce duplicate key errors in the index `index_merge_requests_on_target_project_id_and_iid`. This is probably due to GitHub sometimes returning duplicate data. We can solve this by not scheduling PRs for which we have already scheduled an importing run (using a Set of iid values). See https://sentry.gitlap.com/gitlab/staginggitlabcom/issues/107834/ for more info. as completed
-
added 115 commits
- 8189c5ea...01340794 - 110 commits from branch master
- 27c27955 - Refactor User.find_by_any_email
- ab8dfa8e - Add returning IDs to Gitlab::Database.bulk_insert
- 16166dfe - Cache feature names in RequestStore
- e0b7dad5 - Rewrite the GitHub importer from scratch
- 0d287d71 - Enable the new GitHub importer with a feature flag
-
changed the description
-
-
changed the description
-
changed the description
-
changed the description
-
changed the description
-
marked the task Handle rescheduled jobs better when cloning (e.g. don't add the remotes if they already exist). Right now a job getting terminated in the middle (e.g. by the memory killer) can prevent us from starting over/resuming as completed
-
-
resolved all discussions
-
-
changed the description
-
Master
For those seeing this and thinking "Gee, I want to try this out before 10.2.0 releases", here's how:
- SSH into the server(s) running GitLab
- Run `curl -s https://gitlab.com/snippets/1681949/raw -o /tmp/importer.patch` on every server
- Run `sudo patch -p1 -N -s -d /opt/gitlab/embedded/service/gitlab-rails < /tmp/importer.patch` on every server. This assumes GitLab is installed in `/opt/gitlab/embedded/service/gitlab-rails` (the omnibus default). You may not need `sudo` if it's installed in a directory your user can access.
- Restart GitLab on all servers (Unicorn, Sidekiq, etc.)
- Since the code is hidden behind a feature flag (this will be taken out before getting merged) you will need to enable it for the user you want to use for testing. You can do this by running the following on a server:

```shell
sudo gitlab-rails r "Feature.enable(:parallel_github_importer, User.find_by(username: 'username of the user here'))"
```

So for example if the user is "yorickpeterse" you'd run:

```shell
sudo gitlab-rails r "Feature.enable(:parallel_github_importer, User.find_by(username: 'yorickpeterse'))"
```

Then you can import the project as usual. If nothing is happening you should make sure that Sidekiq (our background processing code) processes the following two queues:
- `github_importer`
- `github_importer_advance_stage`

Omnibus installations will process these queues automatically.
The requirement for this is that you have to run at least EE 10.1.0. The supplied patch will not work on CE.
-
added 159 commits
- 8feffb40...64c9d780 - 154 commits from branch master
- 90fb0741 - Refactor User.find_by_any_email
- 2f2f2c2e - Add returning IDs to Gitlab::Database.bulk_insert
- cccb2fe3 - Cache feature names in RequestStore
- 11584bd8 - Rewrite the GitHub importer from scratch
- 161cb5c4 - Enable the new GitHub importer with a feature flag
-
Master
Awesome work @yorickpeterse! Would it be possible to split this MR into a few iterations? 🤔
-
Master
@grzesiek No. While there are a few separate commits that could be moved, those changes are so small they'd have no impact on the reviewing process. The core changes themselves can't be split up because they don't work on their own / depend on each other.
-
changed the description
-
changed the description
-
changed the description
-
added 59 commits
- 161cb5c4...4841c3cb - 54 commits from branch master
- 4479845c - Refactor User.find_by_any_email
- cdb671c4 - Add returning IDs to Gitlab::Database.bulk_insert
- 71a34494 - Cache feature names in RequestStore
- ada56efe - Rewrite the GitHub importer from scratch
- 2a528131 - Enable the new GitHub importer with a feature flag
-
marked the task The logic used for refreshing repositories before importing PRs doesn't scale. Frequently updated PRs will lead to many refetches of the repository, even if the repository wasn't actually updated. Since we schedule the refs in a job we really only need to refetch if the refs can't be found. This will require a lease however. as completed
-
marked the task It seems that under certain conditions the page numbers of e.g. the pull requests data isn't tracked properly. It seems to work fine locally, so perhaps something else botched this up. This needs some additional testing. as completed
-
-
added 247 commits
- 11bf14d1...aaee69c4 - 242 commits from branch master
- b9785387 - Refactor User.find_by_any_email
- 79d1d32f - Add returning IDs to Gitlab::Database.bulk_insert
- 47b39cc1 - Cache feature names in RequestStore
- 335f3742 - Rewrite the GitHub importer from scratch
- 287e54a1 - Enable the new GitHub importer with a feature flag
-
added 64 commits
- b68e63e2...21d8ec15 - 59 commits from branch master
- b433a499 - Refactor User.find_by_any_email
- 1f72403a - Add returning IDs to Gitlab::Database.bulk_insert
- b7216bfd - Cache feature names in RequestStore
- 0960651c - Rewrite the GitHub importer from scratch
- 2273d8a1 - Replace old GH importer with the parallel importer
-
changed the description
-
added 85 commits
- 2273d8a1...c71cf908 - 80 commits from branch master
- 1ed36936 - Refactor User.find_by_any_email
- 04c2b88d - Add returning IDs to Gitlab::Database.bulk_insert
- 08878777 - Cache feature names in RequestStore
- e7c215e6 - Rewrite the GitHub importer from scratch
- becb4499 - Replace old GH importer with the parallel importer
-
changed the description
-
marked the task API support added as completed
-
marked the task Tests added for this feature/bug as completed
-
-
Master
@DouweM Apart from the unrelated static-analysis/karma failures, this is ready for final review. Since the last review a few things have changed, such as:
- How we paginate data and resume where we left off (we track things in Redis in an atomic way)
- A bunch of bugs have been fixed (e.g. the code used for handling dead messages was broken)
- The HTTP client has been adjusted to better paginate data
- Feature flag code has been removed and the importer has been made the default; this also means the `Github::Import` code has been removed as it is no longer needed
Most of these, however, are fairly specific changes; the bigger picture still remains the same.
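A sketch of what the atomic page tracking mentioned above could look like; the key layout and the WATCH/MULTI compare-and-set are assumptions on my part:

```ruby
# Remember the highest page already processed so a resumed import starts
# where it left off; WATCH/MULTI stops a parallel worker from overwriting
# a newer page number with an older one.
def track_page(project_id, collection, page)
  Gitlab::Redis::Cache.with do |redis|
    key = "github-import/#{project_id}/#{collection}/page"

    redis.watch(key) do
      if redis.get(key).to_i < page
        redis.multi { |multi| multi.set(key, page) }
      else
        redis.unwatch
      end
    end
  end
end

# A resumed import would then fetch from the cached page onwards:
# redis.get("github-import/1/issues/page").to_i
```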
-
added 75 commits
- becb4499...666ab488 - 70 commits from branch master
- 2432261a - Refactor User.find_by_any_email
- c703a19e - Add returning IDs to Gitlab::Database.bulk_insert
- af6a1b38 - Cache feature names in RequestStore
- 59d618e5 - Rewrite the GitHub importer from scratch
- 57db4c8a - Replace old GH importer with the parallel importer
-
changed the description
-
added 54 commits
- 57db4c8a...83e96c7b - 51 commits from branch master
- dea6aa55 - Refactor User.find_by_any_email
- dec01b5c - Add returning IDs to Gitlab::Database.bulk_insert
- 528ea70a - Cache feature names in RequestStore
-
added 125 commits
- d04ad08f...673b6be1 - 120 commits from branch master
- 44be82dd - Refactor User.find_by_any_email
- bda30182 - Add returning IDs to Gitlab::Database.bulk_insert
- 90be53c5 - Cache feature names in RequestStore
- 91c34e0d - Rewrite the GitHub importer from scratch
- 6ce286f9 - Replace old GH importer with the parallel importer
-
resolved all discussions
-
-
approved this merge request
-
merged
-
-
Master
EE MR: gitlab-ee!3310 (merged)