Rewrite the GitHub importer to perform work in parallel and greatly improve performance
What does this MR do?
This MR rewrites the GitHub importer from scratch so it's much faster and performs work in parallel. This MR is still a WIP; I'll update the body properly once we get closer to a final state.
Fixes #33135 (closed), fixes #38621 (closed), fixes #39361 (closed)
TODO
- Use a separate class for importing issue comments
- Import issue comments using a worker (1 job per comment)
- Issue and comment workers should reschedule themselves in the future if we hit a rate limit; the reschedule time will simply be the reset time of the rate limit (so if our rate limit resets in 10 seconds, we schedule jobs 10 seconds in the future)
- Make the parallel importing schedule a job to check for progress, instead of blocking the thread in a `sleep` call
- Make sure any stuck importer jobs don't mess with a running GitHub import
- Import releases
- Re-fetch if a PR was updated after the initial repository fetch
- See if we can re-use GitHub's diffs from the API for diff notes, instead of using Git(aly)
- Set a release's description to the tag name if none is provided (since a description is required)
- Add tests for `User.by_any_email`
- Add tests for the importer code:
  - Base classes such as `Client` and `UserFinder`
  - Representation classes
  - Importer classes
  - Sidekiq workers (`GithubImporter` namespace)
  - Sidekiq worker concerns
  - Changes to `Projects::ImportService` to support async importers
  - Changes to `RepositoryImportWorker`
- Some GitHub pull requests use the same source/target branch, but GitLab does not seem to support this. Example: https://github.com/rubinius/rubinius/pull/704. This may happen when the source repository/project (or user) no longer exists. See https://sentry.gitlap.com/gitlab/staginggitlabcom/issues/107832/ for more info.
- Sometimes inserts into `merge_requests` produce duplicate key errors in the index `index_merge_requests_on_target_project_id_and_iid`. This is probably due to GitHub sometimes returning duplicate data. We can solve this by not scheduling PRs for which we have already scheduled an importing run (using a Set of iid values; see the sketch after this list). See https://sentry.gitlap.com/gitlab/staginggitlabcom/issues/107834/ for more info.
- It's also possible this is caused by a job failing, then being retried. Wrapping all work in a transaction could solve this.
- Handle rescheduled jobs better when cloning (e.g. don't add the remotes if they already exist). Right now a job getting terminated in the middle (e.g. by the memory killer) can prevent us from starting over/resuming
- The logic used for refreshing repositories before importing PRs doesn't scale. Frequently updated PRs will lead to many refetches of the repository, even if the repository wasn't actually updated. Since we schedule the refs in a job we really only need to refetch if the refs can't be found. This will require a lease however.
- It seems that under certain conditions the page numbers of e.g. the pull requests data aren't tracked properly. It seems to work fine locally, so perhaps something else botched this up. This needs some additional testing.
  - Possible theory: page 1 is never written to the cache because of `current_page` defaulting to 1 and `if page > current_page` in `set`. This means that if we were to request page 2 (after having processed page 1) and get a rate limit error, we'd process page 1 again. This could then lead to duplicate key errors if jobs from the same page run in parallel. This however doesn't explain why I didn't see the page number being cached at all.
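A minimal sketch of the Set-based scheduling guard mentioned in the list above, assuming Redis holds the set; the class name, worker name, and key layout are made up for illustration:

```ruby
# Hypothetical sketch: Redis' SADD returns true only when the member was
# newly added, so it doubles as an atomic "did we already schedule this
# PR?" check across parallel workers.
class PullRequestScheduler
  def initialize(project)
    @project = project
  end

  def schedule(pull_request)
    return unless mark_as_scheduled(pull_request[:iid])

    # ImportPullRequestWorker is a placeholder name for the per-PR job.
    ImportPullRequestWorker.perform_async(@project.id, pull_request)
  end

  private

  # Returns true if the iid was not in the set yet.
  def mark_as_scheduled(iid)
    Gitlab::Redis::Cache.with do |redis|
      redis.sadd("github-import/#{@project.id}/scheduled-prs", iid)
    end
  end
end
```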
Projects To Test
The following projects should be tested using this new importer on GitLab.com (one way or another):
- https://github.com/yorickpeterse/oga: 103 seconds on staging, 47 seconds on production
- https://github.com/kubernetes/kubernetes: 6.5 hours on production
- https://github.com/rubinius/rubinius: 22.3 minutes on staging, 20 minutes on production
Does this MR meet the acceptance criteria?
- Changelog entry added, if necessary
- Documentation created/updated
- API support added
- Tests added for this feature/bug
- Review
  - Has been reviewed by Backend
  - Has been reviewed by Database
- Conform by the merge request performance guides
- Conform by the style guides
- Squashed related commits together
- restored source branch `github-importer-refactor`
-
Master
For this MR my goal is to get https://github.com/kubernetes/kubernetes to import in roughly a minute or so, at least when using the parallel importer (the Rake task will still be sequential). This however requires a near total rewrite of the importer as I couldn't make this work in the old code.
-
Master
@yorickpeterse As far as I know `Gitlab::GithubImport::Importer` is only used for Gitea import; GitHub import now uses `Github::Import`.
-
Master
@rymai Yeah, I renamed that one to `LegacyGithubImport`. I didn't rename it to `GiteaImport` since it uses Octokit and some other GitHub-specific bits, but perhaps that's a better name.

I also just pushed a new version which adds the ability to import issues and whatnot properly. Pull requests and releases are not yet handled. Issues are also imported in parallel, which results in https://github.com/yorickpeterse/oga being imported in 15-20 seconds instead of 120 seconds (excluding PRs and releases).
-
-
Master
I spent a few more hours hacking on this and things are starting to come together. Labels and milestones are assigned properly, data for non-existing users is handled, etc.
Along the way I did get hit by the GitHub rate limit when trying to import https://github.com/kubernetes/kubernetes. This isn't entirely surprising because this project has 4626 issues and we need at least an equal amount of requests to get the comments for these issues. In the current setup this means we may (in the worst case) have to wait up to an hour to get the data.
My plan for dealing with this is to break importing comments up into a separate worker, as this will consume most requests. These jobs will then re-schedule themselves in the future when hitting the rate limit, based on when the rate limit will reset. This means importing Kubernetes can still take an hour or two because of this, but at least we won't have threads blocked in a `sleep` call.
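A minimal sketch of that rescheduling behaviour, assuming an Octokit-backed client; the worker name, job arguments, and `import_comment` body are hypothetical:

```ruby
require 'octokit'
require 'sidekiq'

# Sketch only: when the rate limit is exhausted, re-enqueue the job for the
# limit's reset time instead of blocking the thread in a sleep call.
class ImportIssueCommentWorker
  include Sidekiq::Worker

  def perform(project_id, comment_attributes)
    client = Octokit::Client.new(access_token: ENV['GITHUB_TOKEN'])

    if client.rate_limit.remaining.zero?
      # Octokit exposes the number of seconds until the rate limit resets,
      # which is exactly the delay we want for the retry.
      self.class.perform_in(client.rate_limit.resets_in, project_id, comment_attributes)
      return
    end

    import_comment(project_id, comment_attributes)
  end

  private

  def import_comment(project_id, comment_attributes)
    # Placeholder for the actual importing logic.
  end
end
```
-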
changed the description
-
Master
Potential option: there's an API endpoint for getting comments across a repository (e.g. https://api.github.com/repos/YorickPeterse/oga/issues/comments). While the returned comment payloads don't contain a direct association to their "parent" (issue or PR), we could parse this from the `issue_url` field. Assuming this were to work, it means we'd only need 4626 / 100 (the maximum number of objects per page) = 47 requests.
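A sketch of what parsing that field could look like; the regex and method name here are mine, not the MR's code. GitHub serves PR comments through the issues API too, so the URL ends in the issue/PR number either way:

```ruby
# Extract the parent issue/PR number from a comment's issue_url, e.g.
# https://api.github.com/repos/YorickPeterse/oga/issues/42 => 42
ISSUE_URL_REGEX = %r{/repos/[^/]+/[^/]+/issues/(?<iid>\d+)\z}

def parent_iid(comment)
  match = ISSUE_URL_REGEX.match(comment[:issue_url].to_s)
  match && match[:iid].to_i
end

parent_iid(issue_url: 'https://api.github.com/repos/YorickPeterse/oga/issues/42') # => 42
```
-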
Master
> Yeah, I renamed that one to `LegacyGithubImport`. I didn't rename it to `GiteaImport` since it uses Octokit and some other GitHub-specific bits, but perhaps that's a better name.

Oh, ok I see: you've rewritten `Gitlab::GithubImport::Importer` (the old importer) and kept the new `Github::Import` (which won't be used anymore after this refactor). That was what confused me! :P

> My plan for dealing with this is to break importing comments up into a separate worker, as this will consume most requests. These jobs will then re-schedule themselves in the future when hitting the rate limit, based on when the rate limit will reset. This means importing Kubernetes can still take an hour or two because of this, but at least we won't have threads blocked in a `sleep` call.

Sounds good!

> Potential option: there's an API endpoint for getting comments across a repository (e.g. https://api.github.com/repos/YorickPeterse/oga/issues/comments). While the returned comment payloads don't contain a direct association to their "parent" (issue or PR), we could parse this from the `issue_url` field. Assuming this were to work, it means we'd only need 4626 / 100 (the maximum number of objects per page) = 47 requests.

@yorickpeterse Yes, that's what I discovered and proposed too: #38200 :)
-
Master
@yorickpeterse Man that's fast, local benchmarks when importing `guard/guard` (531 issues):

- master: 1228 seconds (~20 minutes) 🐌
- this branch: 74 seconds (1 minute, 14 seconds) 🚀
-
marked the task Import issue comments using a worker (1 job per comment) as completed
-
marked the task Use a separate class for importing issue comments as completed
-
Master
Importing comments is now done using the "comments per repo" endpoint (https://developer.github.com/v3/issues/comments/#list-comments-in-a-repository), saving a lot of API calls.
Pull requests will be a bit tricky. In the old approach we would restore some branches and (as far as I can tell) would generate diffs using our Git clone. This means we could hit NFS pretty hard if we were to run e.g. 100 PR importer jobs in parallel. Unfortunately the GitHub API does not appear to provide a single API endpoint to get the diffs in a merge request. I'll see what we can do about this.
-
Master
> Pull requests will be a bit tricky. In the old approach we would restore some branches and (as far as I can tell) would generate diffs using our Git clone. This means we could hit NFS pretty hard if we were to run e.g. 100 PR importer jobs in parallel. Unfortunately the GitHub API does not appear to provide a single API endpoint to get the diffs in a merge request. I'll see what we can do about this.
@yorickpeterse If you take a look at the new importer `Github::Import`, we're not restoring branches anymore. :)
-
Master
@rymai Aha, that explains why I couldn't find any calls to `restore_branch!` and the like any more; I could have sworn they were there, but them being refactored away explains why I couldn't find them. What is the commit (or commits) that introduced this change? That might make it easier for me to understand how to best port that logic over to this parallel importer.
-
Master
Ah I found it, those calls were removed in 43b75b38. I'll take a look at that and the following commits.
-
Master
> Aha, that explains why I couldn't find any calls to `restore_branch!` and the like any more; I could have sworn they were there, but them being refactored away explains why I couldn't find them. What is the commit (or commits) that introduced this change? That might make it easier for me to understand how to best port that logic over to this parallel importer.

@yorickpeterse Yeah, the MR was !14445 (merged), the description of the MR and its commits can probably help you understand what was done and why! Also, feel free to check out https://gitlab.com/gitlab-org/gitlab-ce/issues/36284 for more background and work on improving importing and performance in general.
-
Master
I poked around with the GitHub GraphQL API and this looks quite promising. For example, it would allow us to get issue/PR authors/assignees including their Emails, removing the need for API calls to get that data. This is quite important as looking up e.g. 1000 users could in the worst case lead to 1000 GitHub API calls.
To give an example, here's a GraphQL query to get all issues of a repository along with the associated data we'd need:
```graphql
query($owner: String!, $repo: String!, $after: String) {
  repository(owner: $owner, name: $repo) {
    issues(states: [OPEN, CLOSED], first: 100, after: $after, orderBy: { field: CREATED_AT, direction: ASC }) {
      nodes {
        number
        title
        body
        author {
          ... on User {
            id: databaseId
            login
            email
          }
        }
        assignees(first: 10) {
          nodes {
            ... on User {
              id: databaseId
              login
              email
            }
          }
        }
        milestone_iid: milestone {
          number
        }
        label_names: labels(first: 10) {
          nodes {
            name
          }
        }
      }
      pageInfo {
        has_next_page: hasNextPage
        end_cursor: endCursor
      }
    }
  }
  rate_limit: rateLimit {
    remaining
    reset_at: resetAt
  }
}
```

This query includes some info necessary for pagination as well as rate limiting details. The output will be something along the lines of:
```json
{
  "data": {
    "repository": {
      "issues": {
        "nodes": [
          {
            "number": 1,
            "title": "Lexing doctypes",
            "body": "Both the lexer and the HTML parser should support the processing of doctypes. Currently both already have basic support for doctypes but this should be improved so that it's easier to extract information such as the doctype URL.\n",
            "author": {
              "id": 86065,
              "login": "YorickPeterse",
              "email": "yorickpeterse@gmail.com"
            },
            "assignees": { "nodes": [] },
            "milestone_iid": null,
            "label_names": {
              "nodes": [
                { "name": "HTML" },
                { "name": "Lexer" }
              ]
            }
          },
          ...
        ],
        "pageInfo": {
          "has_next_page": true,
          "end_cursor": "Y3Vyc29yOnYyOpK5MjAxNC0wMi0yOFQxMTozODo1NCswMTowMM4Bsr8v"
        }
      }
    },
    "rate_limit": {
      "remaining": 4966,
      "reset_at": "2017-10-10T15:09:30.000Z"
    }
  }
}
```

There's a Ruby client written by GitHub here: https://github.com/github/graphql-client. It seems to be able to send calls over HTTP out of the box, though the code seems to be more of a proof of concept: https://github.com/github/graphql-client/blob/master/lib/graphql/client/http.rb
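For reference, wiring up that client could look roughly like this; the token handling and the trimmed-down query are illustrative, not code from this MR:

```ruby
require 'graphql/client'
require 'graphql/client/http'

# The HTTP adapter from github/graphql-client; GITHUB_TOKEN is an assumption.
HTTP = GraphQL::Client::HTTP.new('https://api.github.com/graphql') do
  def headers(_context)
    { 'Authorization' => "Bearer #{ENV.fetch('GITHUB_TOKEN')}" }
  end
end

Schema = GraphQL::Client.load_schema(HTTP)
Client = GraphQL::Client.new(schema: Schema, execute: HTTP)

# graphql-client requires queries to be parsed once and assigned to constants.
IssuesQuery = Client.parse(<<~'GRAPHQL')
  query($owner: String!, $repo: String!, $after: String) {
    repository(owner: $owner, name: $repo) {
      issues(states: [OPEN, CLOSED], first: 100, after: $after) {
        nodes { number title }
        pageInfo { hasNextPage endCursor }
      }
    }
  }
GRAPHQL

result = Client.query(IssuesQuery, variables: { 'owner' => 'yorickpeterse', 'repo' => 'oga' })
puts result.data.repository.issues.nodes.map(&:title)
```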
-
Master
And here's a similar query for getting pull requests, including labels and whatnot:

```graphql
query($owner: String!, $repo: String!, $after: String) {
  repository(owner: $owner, name: $repo) {
    pullRequests(states: [OPEN, CLOSED, MERGED], first: 100, after: $after, orderBy: { field: CREATED_AT, direction: ASC }) {
      nodes {
        number
        title
        body
        author {
          ... on User {
            id: databaseId
            login
            email
          }
        }
        assignees(first: 10) {
          nodes {
            ... on User {
              id: databaseId
              login
              email
            }
          }
        }
        milestone_iid: milestone {
          number
        }
        label_names: labels(first: 10) {
          nodes {
            name
          }
        }
      }
      pageInfo {
        has_next_page: hasNextPage
        end_cursor: endCursor
      }
    }
  }
  rate_limit: rateLimit {
    remaining
    reset_at: resetAt
  }
}
```

One minor nuisance in the GraphQL API is that paginating nested objects (e.g. labels) will be annoying. However, I think we could just say we'll only import up to 10 assignees and labels (or something along those lines) per issue/pull request. There shouldn't be that many cases (if any) where an issue/PR has more than 10 labels.
-
Master
Annoyingly, the GraphQL API does not support a way of getting comments across an entire repository; instead you can only get them per issue/PR/etc. This means we'd have to embed the comments in the above queries and then paginate them, running many API calls per issue/PR to get the data, so it might be more efficient to just use the REST API.
-
Master
So pros/cons of GraphQL:
Pros:
- Fewer calls necessary to get data such as labels of issues and PRs
- No additional calls necessary to get the Email address of users
- Query language is decent
Cons:
- Need to implement pagination yourself using GraphQL cursors
- Need API calls to traverse comment pages for every issue/PR
- Very new
- I'd have to rewrite most of the code I wrote thus far
I'll stick with REST for now, unless I really start hating it.
-
added 2 commits
-
Master
@rymai @dbalexandre I noticed that creating notes will result in a call to `Repository#keep_around` to keep the commits. I'm wondering if this is necessary when importing? Getting rid of this would save us a lot of Git/NFS operations when importing large projects.
-
Master
To add to that: the number of Gitaly calls that I see flying past in my console when importing PRs and diff notes is quite worrying. I have doubts about our setup being able to handle large imports if we keep hammering Git for every MR and note.
-
marked the task Import releases as completed
-
marked the task Make the parallel importing schedule a job to check for progress, instead of blocking the thread in a `sleep` call as completed
-
marked the task Issue and comment workers should reschedule themselves in the future if we hit a rate limit, the reschedule time will simply be the reset time of the rate limit (so if our rate limit resets in 10 seconds that means we schedule jobs for 10 seconds in the future) as completed
-
marked the task Make sure any stuck importer jobs don't mess with a running GitHub import as completed
-
Master
I managed to break up the importer into a pipeline consisting of the following stages:
- `ImportRepository`: this will import base data such as the repository, wiki, labels, milestones, and releases (this data can be imported quickly enough in a single thread)
- `ImportPullRequests`: this will import all pull requests from GitHub
- `ImportIssuesAndDiffNotes`: this will create regular issues, enrich merge requests with labels, and create merge request diff notes
- `ImportNotes`: this creates regular notes for both issues and merge requests, thus we have to schedule this after those have been imported
- `FinishImport`: this will simply mark the job as finished and clear any caches
Between every stage we schedule a job for the `AdvanceStageWorker`. This worker will check if the work from the previous stage is done; if so, it will advance the pipeline. If this isn't the case it will reschedule itself and check again later. Using this process we only very briefly block threads in `sleep` calls (when checking if jobs are completed in `AdvanceStageWorker` we wait for a little bit to reduce the number of reschedules), instead of doing so for hours on end.

A downside of this approach is that there's a minimum import time of around 1 minute or so, though this depends largely on how fast jobs can be completed before the `AdvanceStageWorker` reschedules itself. This however is a trade-off that's worth it, as the alternative is blocking an entire thread (or multiple, depending on the number of import jobs running) for a long period of time.
Using this setup I can import github.com/yorickpeterse/oga in about 100 seconds on my laptop, whereas it would take the old importer around 4.5 minutes.
API-call-wise, importing Oga requires 91 API calls without user information being cached. With this data being cached we only need 11 API calls.

The code also takes care of refreshing the import JID so `StuckImportJobsWorker` doesn't mark the import process as failed after 15 minutes, though I still have to properly test this.
My next plan is to write automated tests for all this code (and probably fix a bunch of bugs in the process), then run some tests on larger projects such as Kubernetes.
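A simplified sketch of that stage-advancing flow; the pending-job counter, the Redis key, and the stage-to-worker mapping are illustrative, not the actual implementation:

```ruby
require 'sidekiq'

# Sketch only: advance the pipeline once all jobs of the previous stage are
# done, otherwise check again later instead of blocking in a sleep call.
class AdvanceStageWorker
  include Sidekiq::Worker

  INTERVAL = 30 # seconds to wait before checking the job count again

  # Maps a finished stage to the (placeholder) worker starting the next one.
  STAGES = {
    'pull_requests'         => 'ImportIssuesAndDiffNotesWorker',
    'issues_and_diff_notes' => 'ImportNotesWorker',
    'notes'                 => 'FinishImportWorker'
  }.freeze

  def perform(project_id, stage)
    if remaining_jobs(project_id, stage).zero?
      Object.const_get(STAGES.fetch(stage)).perform_async(project_id)
    else
      # Jobs are still running: re-enqueue ourselves and check again later.
      self.class.perform_in(INTERVAL, project_id, stage)
    end
  end

  private

  # Assumption: every job scheduled for a stage increments this counter and
  # decrements it upon completion.
  def remaining_jobs(project_id, stage)
    Gitlab::Redis::Cache.with do |redis|
      redis.get("github-import/#{project_id}/#{stage}/pending").to_i
    end
  end
end
```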
-
changed the description
-
-
added 493 commits
- 04e689fb...34fd07f6 - 491 commits from branch master
- 2f10fa3e - Refactor User.find_by_any_email
- 214b70a6 - Rewrite the GitHub importer from scratch
-
changed the description
-
marked the task Add tests for `User.by_any_email` as completed
-
resolved all discussions
-
added 2 commits
-
changed the description
-
Hi Yorick,
Thanks for this work. I've successfully migrated a GitHub repo, where the legacy importer failed to import the issues. Congrats for your work.
Btw, I've just "cherry-picked" the changes of the two commits (8b4fe973 and e7cf1c7d) and patched them on my omnibus installation 10.0.3 8895150 and that worked fine!
I've seen you've discussed the (re-)naming of the importer. I personally would suggest to leave the legacy importer un-renamed (and files not moved) and instead file the new code under e.g. `github_import_ng` (for next generation) or `_v2`.

Also, would it not be great to have all importers named the same, i.e. `Gitlab::GithubImportNg::Importer` rather than `Github::Import`?

I'm very new to this project and there might be good reasons for how you have chosen to name things.
Cheers, Martin.
-
One observation I'd like to add:
After importing a GitHub repo with the patched version of the CE, I exported the same repo again in order to import it on gitlab.com. However, that import failed because some releases' description fields were empty (`""`). Once I corrected the `project.json` in the export by replacing `$.releases[*].description` with `$.releases[*].tag` when `description === ''`, the import went through smoothly.

Now obviously, this is an issue of the `ImportExport` module, as it allows exporting empty descriptions but does not import them again. However, it might be a follow-up error caused by the fact that this patch here does import empty descriptions where otherwise this would not be allowed?
-
See also #18235 Refactor GitHub importer
-
Also see !13788 (closed)
-
One thing I have realised is that labels are correctly generated; however, they are not assigned to MRs or issues.
-
Master
> I've seen you've discussed the (re-)naming of the importer. I personally would suggest to leave the legacy importer un-renamed (and files not moved) and instead file the new code under e.g. `github_import_ng` (for next generation) or `_v2`.

Including versions/generations in namespaces is in my opinion a bad idea; I'd much rather just have a "legacy" importer vs "the current" importer. It's worth mentioning that I may just rename `LegacyGithubImporter` to `GiteaImporter` since it's only used for Gitea.

> Also, would it not be great to have all importers named the same, i.e. `Gitlab::GithubImportNg::Importer` rather than `Github::Import`?

This separation is the result of Sidekiq workers residing in `app/workers` and not `lib/gitlab/X`. Having said that, I think creating directories such as `app/workers/gitlab/github_import/` and placing the workers there should also work. I'll take a look at this.

> Now obviously, this is an issue of the `ImportExport` module, as it allows exporting empty descriptions but does not import them again. However, it might be a follow-up error caused by the fact that this patch here does import empty descriptions where otherwise this would not be allowed?

It seems the old importer would validate releases, skipping those that are invalid. This validation process requires a description to be present. Validations are currently bypassed since we insert releases in bulk, though we can quite easily validate data without having to initialize a Release object for every release.
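A sketch of what such bulk validation could look like; `rows` and the helper are hypothetical, with the tag-name fallback taken from the TODO item in the description:

```ruby
# Hypothetical helper: repair or reject release attribute hashes before the
# bulk insert, mirroring what Release validations would otherwise enforce.
def valid_release_rows(rows)
  rows.map do |row|
    # Fall back to the tag name when no description is provided, since a
    # description is required.
    description = row[:description].presence || row[:tag]
    next if description.blank?

    row.merge(description: description)
  end.compact
end

Gitlab::Database.bulk_insert('releases', valid_release_rows(rows))
```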
> One thing I have realised is that labels are correctly generated; however, they are not assigned to MRs or issues.

They should be; at least this was working last time I tried the code in this MR locally. Perhaps I broke it while refactoring the code in recent pushes. I'm currently in the process of writing tests for all the code, hopefully that will fix any remaining bugs once I'm done.
-
marked the task Representation classes as completed
-
Master
Regarding the milestone: I moved this to 10.2, at least for now. I initially hoped to be able to squeeze this into 10.1 behind a feature flag, but since the 22nd is this Sunday and the summit starts before that I don't think I'll make it.
Maybe I can make it, but the more likely alternative will be hot patching GitLab.com so we can test this, then ship it in 10.2.0.
-
Thanks for your response and all the explanations. Much appreciated.
Re release description: I would suggest not skipping the import of a given release when no description is given, but rather using the tag name as the description.

Re labels: I checked the code and I have not been able to identify the issue. However, they are not assigned (or not updated).
-
Master
@martin.rueegg That link points to the old importer, not the one being added in this MR. The new code is in the `LabelLinksImporter` class in this MR. The old importer indeed had issues with not assigning labels properly.
-
changed the description
-
changed the description
-
@yorickpeterse, oh! :-D ok.
Well, then I have not run your importer! And I don't know if I can patch my 10.0.3 omnibus installation to test it ...
-
marked the task Importer classes as completed
-
marked the task Set a release's description to the tag name if none is provided (since a description is required) as completed
-
Master
All code now resides in the `Gitlab::GithubImport` namespace, instead of the Sidekiq code residing in `GithubImporter`.
-
Master
> Well, then I have not run your importer! And I don't know if I can patch my 10.0.3 omnibus installation to test it ...
If you want to test things you'd have to apply the following patch using `patch -p1 -N < file.patch`: importer.patch

I haven't tested this with EE so it might not work there yet. You'll then have to muck around with the Rake task to get it to use the new importer, e.g. by applying this patch: rake.patch. You can then run the new importer using this:

```shell
env IMPORTER=parallel bundle exec rake import:github[token,gitlab-user,gitlab-user/gitlab-repo,github-user/github-repo]
```

For example:

```shell
env IMPORTER=parallel bundle exec rake import:github[hunter2,yorickpeterse,yorickpeterse/kubernetes-community,kubernetes/community]
```

For this to work you'll need to restart Sidekiq so it has workers available for the new queues. You can also use `IMPORTER=sequential` to use the new code in sequential mode, but this will be slower for large projects compared to `IMPORTER=parallel`.
-
marked the task Changes to `Projects::ImportService` to support async importers as completed
-
marked the task Changes to `RepositoryImportWorker` as completed
-
marked the task Sidekiq workers (`GithubImporter` namespace) as completed
-
marked the task Sidekiq worker concerns as completed
-
Master
@smcgivern @DouweM This MR is finally reaching a point where all code is tested automatically, meaning it's time to document the changes, write a decent commit message, and most importantly of all: test it on GitLab.com somehow. In particular I want to test the impact of importing projects such as the one the Edge team are working on, how long it takes, if we need more monitoring, etc.
There are two approaches to doing this:
- Include this code in 10.1 but hide it behind a feature flag (that's disabled by default) and don't announce it yet (or announce it as being in a beta state of sorts)
- Don't include this in 10.1 and aim for 10.2, then hotpatch GitLab.com until we're done with testing
The first approach has the benefit of being the easiest to test as we don't need any patching. It also means we can start importing larger projects (if desired) without having to wait another month. There are also two downsides of this approach:
- We're way past the merge window and the changes here are quite big (though a feature flag should prevent anything from breaking).
- It will require reviewing before this Sunday, which can be difficult with the summit.
Option 2 doesn't suffer from these problems but instead suffers from the following:
- Testing this code will be a pain because we'd have to hot-patch all Sidekiq workers, instantly affecting everybody that imports a GitHub project (unless we also hide the code behind a feature flag).
- It will take at least a month before we can ship this to GitLab.com in a state that doesn't require us to run patches after every deploy.
As such I'd like to know what you think would be best here.
-
Master
73632809 enables the new importer using a feature flag. It's not super pretty but it gives an idea of how we'd enable/test this until 10.2, if we were to include this in 10.1 that is.
-
unmarked as a Work In Progress
-
changed the description
-
Master
@yorickpeterse thanks for laying out those options. I don't like the idea of merging and picking by Sunday. However, if it's behind a feature flag anyway, could we not just include it in a patch to 10.1? It is stretching the definition of 'patch' quite a lot, but if it has no impact without the feature flag, that seems like a reasonable compromise?
-
Master
@yorickpeterse @smcgivern I like the compromise option!
-
added 321 commits
- 73632809...bd39b441 - 318 commits from branch master
- f3e238c2 - Refactor User.find_by_any_email
- 35538cd5 - Rewrite the GitHub importer from scratch
- eeb18a5b - Enable the new GitHub importer with a feature flag
-
Master
@smcgivern That seems like a better option, that way we don't have to rush things out now.
Unrelated to this, it seems the build is failing due to static analysis tools failing on unrelated files. Not sure what's going on here.
-
Master
@yorickpeterse hmm, that's weird. One of the errors seems slightly related, but the others are completely unrelated. Does `bundle exec rake flay` also fail locally?
-
Master
@smcgivern Locally it fails as well, but it seems to spit out similar output when I switch back to the `master` branch:

```
app/models/user.rb:41 :: parse error on value ":" (tCOLON)
skipping app/models/user.rb
app/models/concerns/awardable.rb:5 :: parse error on value nil (error)
skipping app/models/concerns/awardable.rb
lib/gitlab/git/storage/health.rb:29 :: parse error on value ["do", 29] (kDO_BLOCK)
skipping lib/gitlab/git/storage/health.rb
lib/gitlab/git/storage/forked_storage_check.rb:46 :: parse error on value "<<" (tLSHFT)
skipping lib/gitlab/git/storage/forked_storage_check.rb
lib/gitlab/background_migration/delete_conflicting_redirect_routes_range.rb:28 :: parse error on value "<<" (tLSHFT)
skipping lib/gitlab/background_migration/delete_conflicting_redirect_routes_range.rb
lib/gitlab/background_migration/create_fork_network_memberships_range.rb:28 :: parse error on value "FROM" (tCONSTANT)
skipping lib/gitlab/background_migration/create_fork_network_memberships_range.rb
lib/gitlab/background_migration/populate_fork_networks_range.rb:15 :: parse error on value "FROM" (tCONSTANT)
skipping lib/gitlab/background_migration/populate_fork_networks_range.rb
lib/gitlab/sql/glob.rb:9 :: parse error on value "<<" (tLSHFT)
skipping lib/gitlab/sql/glob.rb
lib/gitlab/exclusive_lease.rb:13 :: parse error on value "<<" (tLSHFT)
skipping lib/gitlab/exclusive_lease.rb
```
-
Master
@yorickpeterse ah, it looks like that exits with zero, though. Our rake task is this:

```ruby
desc 'Code duplication analyze via flay'
task :flay do
  output = `bundle exec flay --mass 35 app/ lib/gitlab/`

  if output.include? "Similar code found"
    puts output
    exit 1
  end
end
```

So I think that if we don't care about the similar code in the importers (haven't reviewed this), we could ignore that, or update the threshold. I have no idea why the untouched files also generate that message.
-
Master
@smcgivern Ah ok. The similar code issue is something I'm tackling now so that should solve the failure.
-
changed the description
-
marked the task See if we can re-use GitHub's diffs from the API for diff notes, instead of using Git(aly) as completed
-
Master
@rymai In the old importer we'd re-fetch the refs of a repository if a pull request was updated after the repository was cloned. In the parallel importer this is currently not done because:
- Pull requests are imported in separate threads/processes/hosts meaning we can't store a timestamp in a local variable and re-use it.
- Pull request jobs are not given any additional timestamps for this, and I prefer to keep it that way. Right now the Sidekiq interface is generic, making it easy to add more workers without having to duplicate a lot of code. Adding this fetch feature would require specialisation just for importing PRs.
- This would require some form of synchronisation (e.g. an `ExclusiveLease`) as otherwise many jobs could attempt to run the same `git fetch`. Adding this is something I prefer to avoid.
This leads me to this question: is this feature important enough that we want to include it?
-
Master
To reduce the number of queries and Gitaly RPC calls I made some changes to the Note, DiffNote, and Issue importers. These changes result in the importers bypassing the callbacks and validations of models by using `Gitlab::Database.bulk_insert` instead of the models. Bypassing the models (and thus any callbacks and validations) saves us a lot of SQL queries and a handful of Gitaly RPC calls.

To test this I imported https://github.com/yorickpeterse/oga into my local development environment. The before/after can be seen below.
Before bypassing callbacks/validation
Gitaly calls:
| RPC call | Amount |
|----------|--------|
| repository_exists | 36 |
| has_local_branches | 1 |
| fetch_remote | 1 |
| find_commit | 877 |
| commits_between | 70 |
| commit_diff | 70 |
| ref_exists | 490 |

- Total RPC calls: 1545
- Total SQL queries: 10806
- Total time: 165 seconds
After bypassing callbacks/validation
Gitaly calls:
| RPC call | Amount |
|----------|--------|
| repository_exists | 36 |
| has_local_branches | 1 |
| fetch_remote | 1 |
| find_commit | 770 |
| commits_between | 70 |
| commit_diff | 70 |
| ref_exists | 420 |

- Total RPC calls: 1368
- Total SQL queries: 4621
- Total time: 87 seconds
Conclusion
So by making these changes we save ourselves 177 Gitaly RPC calls, 6185 SQL queries, and about 78 seconds of execution time. Do keep in mind that this test was run over the summit's hotel wifi, thus the timings are likely to be influenced by how responsive the wifi was during the test.
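For illustration, the bulk insert approach boils down to building raw attribute hashes and writing them in a single statement; the surrounding variables (`comments`, `issue_ids`, `user_ids`) are hypothetical and the column list is trimmed:

```ruby
# Sketch only: one INSERT for all notes, skipping per-model callbacks and
# validations (and the keep_around/Gitaly work they would trigger).
rows = comments.map do |comment|
  {
    noteable_type: 'Issue',
    noteable_id: issue_ids.fetch(comment[:issue_iid]),
    project_id: project.id,
    author_id: user_ids.fetch(comment[:author_id]),
    note: comment[:body],
    created_at: comment[:created_at],
    updated_at: comment[:updated_at]
  }
end

Gitlab::Database.bulk_insert('notes', rows)
```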
-
Master
With some further caching (of labels and milestones) we can reduce the number of SQL queries to 4470 for Oga. I'll see if I can further reduce the queries, since 4470 is still quite a lot for such a small project.
-
Master
As it turns out the caching of user IDs/Emails wasn't entirely working as expected. Fixing this and some other issues I ran into reduces the number of SQL queries to 4409. This is still a lot, but we're slowly getting there.
-
changed the description
-
Master
With some further caching being added for issue/MR lookups (when importing notes, diff notes, and label links) the number of SQL queries can be reduced to 3590. About 1450 queries are for getting the feature flag names (see #39361 (closed) for more information), which isn't something I can really work around in this MR.
The remaining queries are mostly SELECT queries to build the cache, and a lot of INSERTs for the data we're importing.
-
changed the description
-
-
added 133 commits
- e3cfb168...e16add22 - 129 commits from branch master
- cb58a673 - Refactor User.find_by_any_email
- 87722696 - Add returning IDs to Gitlab::Database.bulk_insert
- 371d16e4 - Rewrite the GitHub importer from scratch
- 661489c0 - Enable the new GitHub importer with a feature flag
-
Master
> is this feature important enough that we want to include it?
We could probably fail a PR import if the ref isn't fetched and restart a new import, which would do a `fetch` first, so that PRs that weren't imported correctly would be imported fine since their refs would have been fetched this time... I still need to look at this massive diff to see if that'd be possible, but you probably see the idea.
-
Master
> We could probably fail a PR import if the ref isn't fetched and restart a new import, which would do a `fetch` first, so that PRs that weren't imported correctly would be imported fine since their refs would have been fetched this time... I still need to look at this massive diff to see if that'd be possible, but you probably see the idea.

All pull requests are imported in parallel, and I'd rather not have all of them also check the repository status using Git as this is likely to blow up the NFS shard.
So if we want this we need to store a timestamp somewhere that marks the time we finished cloning the repository. We then need synchronisation across processes/threads to make this work, otherwise e.g. 100 threads will run `git fetch` in parallel.
-
Master
> All pull requests are imported in parallel, and I'd rather not have all of them also check the repository status using Git as this is likely to blow up the NFS shard.
@yorickpeterse Sorry, it seems I wasn't clear in my answer above: I meant that if an MR diff generation fails (most probably because the ref isn't fetched), then we should retry the import after it's finished (in a failed state because of the MR diff that couldn't be generated). The import retry would start with a fresh `git fetch`, so the MR diff that couldn't be generated should now be generated properly...

Anyways, I will need to go through the diff to understand better how this will (or won't) work.
-
added 141 commits
- 7b1d922a...6919f636 - 136 commits from branch master
- dad72cdd - Refactor User.find_by_any_email
- 98d2aab8 - Add returning IDs to Gitlab::Database.bulk_insert
- 03f8f3d3 - Cache feature names in RequestStore
- 0cc94878 - Rewrite the GitHub importer from scratch
- 023e5ae2 - Enable the new GitHub importer with a feature flag
-
marked the task Re-fetch if a PR was updated after the initial repository fetch as completed
-
changed the description
-
resolved all discussions
-
Master
@DouweM While I continue running tests on staging/locally, could you start reviewing this? There's quite a lot to review, so it may be easier to look at the individual commits instead of the diff in its entirety. We can also review this via a video call if that's easier.
-
changed the description
-
changed the description
-
changed the description
-
marked the task Some GitHub pull requests use the same source/target branch, but GitLab does not seem to support this. Example: https://github.com/rubinius/rubinius/pull/704. This may happen when the source repository/project (or user) no longer exists. See https://sentry.gitlap.com/gitlab/staginggitlabcom/issues/107832/ for more info. as completed
-
-
Master
@godfat @jameslopez You two may want to have a look as well, since you've been working on the GitHub importer for https://gitlab.com/gitlab-org/gitlab-ce/issues/36284.
-
marked the task Sometimes inserts into `merge_requests` produce duplicate key errors in the index `index_merge_requests_on_target_project_id_and_iid`. This is probably due to GitHub sometimes returning duplicate data. We can solve this by not scheduling PRs for which we have already scheduled an importing run (using a Set of iid values). See https://sentry.gitlap.com/gitlab/staginggitlabcom/issues/107834/ for more info. as completed
-
added 115 commits
- 8189c5ea...01340794 - 110 commits from branch master
- 27c27955 - Refactor User.find_by_any_email
- ab8dfa8e - Add returning IDs to Gitlab::Database.bulk_insert
- 16166dfe - Cache feature names in RequestStore
- e0b7dad5 - Rewrite the GitHub importer from scratch
- 0d287d71 - Enable the new GitHub importer with a feature flag
-
changed the description
-
-
changed the description
-
changed the description
-
changed the description
-
changed the description
-
marked the task Handle rescheduled jobs better when cloning (e.g. don't add the remotes if they already exist). Right now a job getting terminated in the middle (e.g. by the memory killer) can prevent us from starting over/resuming as completed
-
-
resolved all discussions
-
-
changed the description
-
Master
For those seeing this and thinking "Gee, I want to try this out before 10.2.0 releases", here's how:
- SSH into the server(s) running GitLab
- Run `curl -s https://gitlab.com/snippets/1681949/raw -o /tmp/importer.patch` on every server
- Run `sudo patch -p1 -N -s -d /opt/gitlab/embedded/service/gitlab-rails < /tmp/importer.patch` on every server. This assumes GitLab is installed in `/opt/gitlab/embedded/service/gitlab-rails` (the omnibus default). You may not need `sudo` if it's installed in a directory your user can access.
- Restart GitLab on all servers (Unicorn, Sidekiq, etc.)
- Since the code is hidden behind a feature flag (this will be taken out before getting merged) you will need to enable it for the user you want to use for testing. You can do this by running the following on a server:

```shell
sudo gitlab-rails r "Feature.enable(:parallel_github_importer, User.find_by(username: 'username of the user here'))"
```

So for example if the user is "yorickpeterse" you'd run:

```shell
sudo gitlab-rails r "Feature.enable(:parallel_github_importer, User.find_by(username: 'yorickpeterse'))"
```

Then you can import the project as usual. If nothing is happening you should make sure that Sidekiq (our background processing code) processes the following two queues:
- `github_importer`
- `github_importer_advance_stage`

Omnibus installations will process these queues automatically.
The requirement for this is that you have to run at least EE 10.1.0. The supplied patch will not work on CE.
-
added 159 commits
- 8feffb40...64c9d780 - 154 commits from branch master
- 90fb0741 - Refactor User.find_by_any_email
- 2f2f2c2e - Add returning IDs to Gitlab::Database.bulk_insert
- cccb2fe3 - Cache feature names in RequestStore
- 11584bd8 - Rewrite the GitHub importer from scratch
- 161cb5c4 - Enable the new GitHub importer with a feature flag
-
Master
Awesome work @yorickpeterse! Would it be possible to split this MR into a few iterations? 🤔
-
Master
@grzesiek No. While there are a few separate commits that could be moved, those changes are so small they'd have no impact on the reviewing process. The core changes themselves can't be split up because they don't work on their own / depend on each other.
-
changed the description
-
changed the description
-
changed the description
-
added 59 commits
- 161cb5c4...4841c3cb - 54 commits from branch master
- 4479845c - Refactor User.find_by_any_email
- cdb671c4 - Add returning IDs to Gitlab::Database.bulk_insert
- 71a34494 - Cache feature names in RequestStore
- ada56efe - Rewrite the GitHub importer from scratch
- 2a528131 - Enable the new GitHub importer with a feature flag
-
marked the task The logic used for refreshing repositories before importing PRs doesn't scale. Frequently updated PRs will lead to many refetches of the repository, even if the repository wasn't actually updated. Since we schedule the refs in a job we really only need to refetch if the refs can't be found. This will require a lease however. as completed
-
marked the task It seems that under certain conditions the page numbers of e.g. the pull requests data isn't tracked properly. It seems to work fine locally, so perhaps something else botched this up. This needs some additional testing. as completed
-
-
added 247 commits
- 11bf14d1...aaee69c4 - 242 commits from branch master
- b9785387 - Refactor User.find_by_any_email
- 79d1d32f - Add returning IDs to Gitlab::Database.bulk_insert
- 47b39cc1 - Cache feature names in RequestStore
- 335f3742 - Rewrite the GitHub importer from scratch
- 287e54a1 - Enable the new GitHub importer with a feature flag
-
added 64 commits
- b68e63e2...21d8ec15 - 59 commits from branch master
- b433a499 - Refactor User.find_by_any_email
- 1f72403a - Add returning IDs to Gitlab::Database.bulk_insert
- b7216bfd - Cache feature names in RequestStore
- 0960651c - Rewrite the GitHub importer from scratch
- 2273d8a1 - Replace old GH importer with the parallel importer
-
changed the description
-
added 85 commits
- 2273d8a1...c71cf908 - 80 commits from branch master
- 1ed36936 - Refactor User.find_by_any_email
- 04c2b88d - Add returning IDs to Gitlab::Database.bulk_insert
- 08878777 - Cache feature names in RequestStore
- e7c215e6 - Rewrite the GitHub importer from scratch
- becb4499 - Replace old GH importer with the parallel importer
-
changed the description
-
marked the task API support added as completed
-
marked the task Tests added for this feature/bug as completed
-
-
Master
@DouweM Apart from the unrelated static-analysis/karma failures, this is ready for final review. Since the last review a few things have changed, such as:
- How we paginate data and resume where we left off (we track things in Redis in an atomic way)
- A bunch of bugs have been fixed (e.g. the code used for handling dead messages was broken)
- The HTTP client has been adjusted to better paginate data
- Feature flag code has been removed and the importer has been made the default; this also means the `Github::Import` code has been removed as it is no longer needed
Most of these, however, are fairly specific changes; the bigger picture still remains the same.
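A sketch of what the atomic page tracking mentioned above could look like; the key layout and the WATCH/MULTI compare-and-set are assumptions on my part:

```ruby
# Remember the highest page already processed so a resumed import starts
# where it left off; WATCH/MULTI stops a parallel worker from overwriting
# a newer page number with an older one.
def track_page(project_id, collection, page)
  Gitlab::Redis::Cache.with do |redis|
    key = "github-import/#{project_id}/#{collection}/page"

    redis.watch(key) do
      if redis.get(key).to_i < page
        redis.multi { |multi| multi.set(key, page) }
      else
        redis.unwatch
      end
    end
  end
end

# A resumed import would then fetch from the cached page onwards:
# redis.get("github-import/1/issues/page").to_i
```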
-
added 75 commits
- becb4499...666ab488 - 70 commits from branch master
- 2432261a - Refactor User.find_by_any_email
- c703a19e - Add returning IDs to Gitlab::Database.bulk_insert
- af6a1b38 - Cache feature names in RequestStore
- 59d618e5 - Rewrite the GitHub importer from scratch
- 57db4c8a - Replace old GH importer with the parallel importer
-
changed the description
-
added 54 commits
- 57db4c8a...83e96c7b - 51 commits from branch master
- dea6aa55 - Refactor User.find_by_any_email
- dec01b5c - Add returning IDs to Gitlab::Database.bulk_insert
- 528ea70a - Cache feature names in RequestStore
-
added 125 commits
- d04ad08f...673b6be1 - 120 commits from branch master
- 44be82dd - Refactor User.find_by_any_email
- bda30182 - Add returning IDs to Gitlab::Database.bulk_insert
- 90be53c5 - Cache feature names in RequestStore
- 91c34e0d - Rewrite the GitHub importer from scratch
- 6ce286f9 - Replace old GH importer with the parallel importer
-
resolved all discussions
-
-
approved this merge request
-
merged
-
-
Master
EE MR: gitlab-ee!3310 (merged)