Fix GitHub Import stage pagination
What does this MR do and why?
This code changes how the GitHub import system handles pagination when fetching data from GitHub's API.
GitHub no longer supports manual pagination in all cases, so requests may fail when importing projects with a large number of issues or pull requests. In other words, we can no longer paginate the API results simply by incrementing the `page` query parameter. Instead, we should follow the links provided in the `Link` response header, which include a cursor.
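As a rough illustration of header-based pagination (not the code in this MR), the sketch below pulls the `rel="next"` URL out of a `Link` header value. The helper name `next_page_url` and the example URLs are hypothetical; in practice Octokit already exposes these relations, so this only shows the idea.

```ruby
# Hypothetical helper: returns the URL tagged rel="next" in a Link header
# value, or nil when there is no next page.
def next_page_url(link_header)
  return nil if link_header.nil?

  # Each entry looks like: <URL>; rel="name"
  link_header.scan(/<([^>]+)>\s*;\s*rel="([^"]+)"/).each do |url, rel|
    return url if rel.split.include?('next')
  end

  nil
end

# Made-up header value; the cursor parameter name is an assumption.
header = '<https://example.com/issues?per_page=100&after=abc123>; rel="next", ' \
         '<https://example.com/issues?per_page=100>; rel="first"'

next_url = next_page_url(header)
last_page = next_page_url('<https://example.com/issues?per_page=100>; rel="first"')
```

The point is that the next URL is treated as opaque: the importer follows whatever the server returns rather than computing a page number itself.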
This update changes `GithubImport::Client#each_page` to accept a resume URL. When one is provided, the method continues pagination from that URL. The stage workers that use this method now cache the last processed URL and pass it along when resuming.
Note that, as before, GitHub Import can still request the same page again after an interruption. If this happens, no object is imported twice, because GitHub Import also caches each processed object.
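The resume behaviour described above can be sketched with an in-memory stand-in for the API. Everything here is hypothetical and simplified (`FakeClient`, `run_stage`, the `cache` hash, the URLs); the real client and stage workers differ. It only shows how caching the last processed URL, together with a per-object cache, avoids duplicates when a page is fetched twice.

```ruby
require 'set'

START_URL = 'https://example.com/issues?per_page=2'

# Fake paginated API: each URL maps to its items and the next page URL.
PAGES = {
  START_URL => { items: [1, 2], next_url: "#{START_URL}&after=c1" },
  "#{START_URL}&after=c1" => { items: [3, 4], next_url: "#{START_URL}&after=c2" },
  "#{START_URL}&after=c2" => { items: [5], next_url: nil }
}.freeze

class FakeClient
  attr_reader :requested

  def initialize(pages)
    @pages = pages
    @requested = []
  end

  # Yields [url, items] per page, starting from resume_url when it is given.
  def each_page(start_url, resume_url: nil)
    url = resume_url || start_url
    while url
      @requested << url
      page = @pages.fetch(url)
      yield url, page[:items]
      url = page[:next_url]
    end
  end
end

# Imports every item once; stops early to simulate an interrupted worker.
def run_stage(client, cache, imported, interrupt_after: nil)
  pages_done = 0
  client.each_page(START_URL, resume_url: cache[:last_url]) do |url, items|
    items.each do |id|
      next if cache[:seen].include?(id) # already imported before the interruption

      imported << id
      cache[:seen] << id
    end
    cache[:last_url] = url # remember the last processed page
    pages_done += 1
    return :interrupted if interrupt_after && pages_done >= interrupt_after
  end
  :finished
end

cache = { last_url: nil, seen: Set.new }
imported = []
client = FakeClient.new(PAGES)

first_run = run_stage(client, cache, imported, interrupt_after: 2)
second_run = run_stage(client, cache, imported)
```

In this sketch the second run requests the second page again, because that is the last URL that was cached, but items 3 and 4 are skipped and only item 5 is added.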
References
Screenshots or screen recordings
| Before | After |
|---|---|
How to set up and validate locally
To test the change, interrupt the stage workers and verify that they resume from their previous position. To determine where they left off, log the URL of every GitHub import request and check that the import continues with the next page in the sequence.
To log all GitHub Import requests, apply the following diff:
diff --git a/lib/gitlab/octokit/url_validation.rb b/lib/gitlab/octokit/url_validation.rb
index d8322aa224a7..ce2aae748820 100644
--- a/lib/gitlab/octokit/url_validation.rb
+++ b/lib/gitlab/octokit/url_validation.rb
@@ -8,6 +8,8 @@ def initialize(app)
       end
 
       def call(env)
+        ::Import::Framework::Logger.info(url: env[:url], message: "Github Outgoing request")
+
         Gitlab::HTTP_V2::UrlBlocker.validate!(env[:url],
           schemes: %w[http https],
           allow_localhost: allow_local_requests?,
Then tail `importer.log` to monitor the fetched URLs:
tail -f log/importer.log | grep --line-buffered "Github Outgoing request" | jq .url
You also need to interrupt the stage jobs. One option is to stop Sidekiq, which prevents the job from being placed back in the queue, and then start Sidekiq again.
You can also lower `Gitlab::GithubImport::Client::DEFAULT_PER_PAGE` to increase the number of requests.
- Apply the diff above to log the requests.
- Start a GitHub import.
- Interrupt Sidekiq multiple times while the stage workers are running.
- Check that pagination resumes from the last logged URL.
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.