Rewrite the GitHub importer to perform work in parallel and greatly improve performance
## What does this MR do?
This MR rewrites the GitHub importer from scratch so that it performs work in parallel and is much faster. This MR is still a WIP; I'll update the body properly once we get closer to a final state.
Fixes https://gitlab.com/gitlab-org/gitlab-ce/issues/33135, fixes https://gitlab.com/gitlab-org/gitlab-ce/issues/38621, fixes https://gitlab.com/gitlab-org/gitlab-ce/issues/39361
## TODO

- Use a separate class for importing issue comments
- Import issue comments using a worker (1 job per comment)
- Issue and comment workers should reschedule themselves in the future if we hit a rate limit. The reschedule time will simply be the reset time of the rate limit (so if our rate limit resets in 10 seconds, we schedule jobs for 10 seconds in the future)
- Make the parallel importing schedule a job to check for progress, instead of blocking the thread in a `sleep` call
- Make sure any stuck importer jobs don't mess with a running GitHub import
- Import releases
- Re-fetch if a PR was updated after the initial repository fetch
- See if we can re-use GitHub's diffs from the API for diff notes, instead of using Git(aly)
- Set a release's description to the tag name if none is provided (since a description is required)
- Add tests for `User.by_any_email`
- Add tests for the importer code:
  - Base classes such as `Client` and `UserFinder`
  - Representation classes
  - Importer classes
  - Sidekiq workers (`GithubImporter` namespace)
  - Sidekiq worker concerns
  - Changes to `Projects::ImportService` to support async importers
  - Changes to `RepositoryImportWorker`
- Some GitHub pull requests use the same source/target branch, but GitLab does not seem to support this. Example: https://github.com/rubinius/rubinius/pull/704. This may happen when the source repository/project (or user) no longer exists. See https://sentry.gitlap.com/gitlab/staginggitlabcom/issues/107832/ for more info.
- Sometimes inserts into `merge_requests` produce duplicate key errors in the index `index_merge_requests_on_target_project_id_and_iid`. This is probably due to GitHub sometimes returning duplicate data. We can solve this by not scheduling PRs for which we have already scheduled an importing run (using a Set of iid values). See https://sentry.gitlap.com/gitlab/staginggitlabcom/issues/107834/ for more info.
  - It's also possible this is caused by a job failing, then being retried. Wrapping all work in a transaction could solve this.
- Handle rescheduled jobs better when cloning (e.g. don't add the remotes if they already exist). Right now a job getting terminated in the middle (e.g. by the memory killer) can prevent us from starting over/resuming.
- The logic used for refreshing repositories before importing PRs doesn't scale. Frequently updated PRs will lead to many re-fetches of the repository, even if the repository wasn't actually updated. Since we schedule the refs in a job, we really only need to re-fetch if the refs can't be found. This will require a lease, however.
- It seems that under certain conditions the page numbers of e.g. the pull requests data aren't tracked properly. It seems to work fine locally, so perhaps something else botched this up. This needs some additional testing.
  - Possible theory: page 1 is never written to the cache because of `current_page` defaulting to 1 and `if page > current_page` in `set`. This means that if we were to request page 2 (after having processed page 1) and get a rate limit error, we'd process page 1 again. This could then lead to duplicate key errors if jobs from the same page run in parallel. This, however, doesn't explain why I didn't see the page number being cached at all.
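The "Set of iid values" idea from the duplicate-key item above could look roughly like this. This is only a sketch; `PullRequestScheduler` and its method names are hypothetical, not the classes this MR actually introduces.

```ruby
require 'set'

# Sketch of the proposed deduplication fix: remember which pull request
# iids have already been scheduled for import, and skip any duplicates
# that the GitHub API returns. All names here are illustrative.
class PullRequestScheduler
  def initialize
    @scheduled_iids = Set.new
  end

  # Takes an Array of pull request Hashes (each with an :iid key) and
  # returns the iids that were actually scheduled, skipping any iid that
  # was seen in a previous call.
  def schedule(pull_requests)
    pull_requests.each_with_object([]) do |pr, scheduled|
      iid = pr[:iid]

      # Set#add? returns nil when the value was already in the set, so
      # duplicates are silently dropped instead of being scheduled twice.
      scheduled << iid if @scheduled_iids.add?(iid)
    end
  end
end
```

With this guard in place, a page of API data containing a duplicate PR would only schedule one import job per iid, avoiding the duplicate insert into `merge_requests`.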
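The rate-limit rescheduling rule from the TODO list above can be sketched as follows. This is a minimal illustration under assumed names: `RateLimitAwareWorker`, `retry_delay`, and the return values are hypothetical; a real Sidekiq worker would re-enqueue itself with `perform_in` using the computed delay rather than returning it.

```ruby
# Sketch of rescheduling on a GitHub rate limit: instead of sleeping,
# compute how long until the limit resets and schedule a retry that far
# in the future. Names and return values are illustrative only.
class RateLimitAwareWorker
  # rate_limit_reset is the Unix timestamp at which GitHub resets the
  # limit (as reported by the API). Returns the number of seconds to
  # wait before retrying, clamped so it is never negative.
  def retry_delay(rate_limit_reset, now = Time.now.to_i)
    [rate_limit_reset - now, 0].max
  end

  # Simulates one unit of work. When no requests remain, report the
  # delay after which a real worker would call `self.class.perform_in`.
  def perform(remaining_requests, rate_limit_reset, now = Time.now.to_i)
    if remaining_requests.zero?
      { rescheduled_in: retry_delay(rate_limit_reset, now) }
    else
      { done: true }
    end
  end
end
```

So if the rate limit resets 10 seconds from now, the worker schedules its retry 10 seconds in the future, matching the rule described in the TODO item.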
## Projects To Test
The following projects should be tested using this new importer on GitLab.com (one way or another):
- https://github.com/yorickpeterse/oga: 103 seconds on staging, 47 seconds on production
- https://github.com/kubernetes/kubernetes: 6.5 hours on production
- https://github.com/rubinius/rubinius: 22.3 minutes on staging, 20 minutes on production
## Does this MR meet the acceptance criteria?
- Changelog entry added, if necessary
- Documentation created/updated
- API support added
- Tests added for this feature/bug
- Review
  - Has been reviewed by Backend
  - Has been reviewed by Database
- Conform by the merge request performance guides
- Conform by the style guides
- Squashed related commits together
Edited by Yorick Peterse