GitHub import: alternative to `single_endpoint_notes_import` for very large projects
## Background
I was just catching up on this interesting idea/issue about enabling a 2-step GitHub import: https://gitlab.com/gitlab-org/gitlab/-/issues/431603+
One thing that issue mentions is that collection endpoints (where we return paginated results across many records) are much faster than single endpoints (where we retrieve records for just one resource at a time). In some cases, we fall back to a single endpoint when certain options are passed. That is the case with the `single_endpoint_notes_import` option.
I dug into the reason why we have the `single_endpoint_notes_import` option. It exists because of the limitation described [here](https://docs.gitlab.com/ee/user/project/import/github.html#missing-comments):
> The GitHub API has a limit that prevents more than approximately 30,000 notes or diff notes from being imported. When this limit is reached, the GitHub API instead returns the following error:
> In order to keep the API fast for everyone, pagination is limited for this resource. Check the rel=last link relation in the Link response header to see how far back you can traverse.
I confirmed that this error still exists and is oddly not documented anywhere on the GitHub side. :upside_down:
## Suggestion
I think that we can support projects with over 30,000 issue notes *and* use the collection endpoints by using the `since` query param for the GitHub API.
The "Diff Notes" importer, when `single_endpoint_notes_import` is *not* set, uses the `Octokit::Client.pull_requests_comments` method (which `ParallelScheduling` uses to iterate through each page). The `pull_requests_comments` method uses [this GitHub API endpoint](https://docs.github.com/en/rest/pulls/comments?apiVersion=2022-11-28#list-comments-in-a-repository). The endpoint takes a `sort` query param, which determines which attribute we sort by, and a `since` query param, which only shows results last updated after the given time. [Octokit already supports passing these params](https://octokit.github.io/octokit.rb/Octokit/Client/PullRequests.html#pull_requests_comments-instance_method).
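As a rough sketch, the options we'd pass through Octokit could be built like this (the `comments_options` helper is hypothetical; the `sort`/`direction`/`per_page`/`page`/`since` keys are the endpoint's documented query params, which Octokit forwards to the API):

```ruby
require 'time'

# Hypothetical helper: builds the query params for the repository-level
# comments endpoints. GitHub expects `since` as an ISO 8601 UTC timestamp.
def comments_options(since: nil, page: 1, per_page: 100)
  opts = { sort: 'updated', direction: 'asc', per_page: per_page, page: page }
  opts[:since] = since.utc.iso8601 if since
  opts
end

# Usage (client is an Octokit::Client):
#   client.pull_requests_comments('owner/repo', comments_options(page: 3))
```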
If we use these query params to return records in order of when they were last updated, we can take the timestamp of the last record (on page 300) and use it to start paginating through the next collection of 30,000 records.
For example:
* First page of search for issue notes: `https://api.github.com/repositories/27193779/issues/comments?sort=updated&direction=asc&per_page=100`
* Last page of search for issue notes due to GitHub pagination limits: `https://api.github.com/repositories/27193779/issues/comments?sort=updated&direction=asc&per_page=100&page=300`
* When we go to the next page, we get the pagination error: `https://api.github.com/repositories/27193779/issues/comments?sort=updated&direction=asc&per_page=100&page=301`
* So, we take the `updated_at` timestamp of the last record on page 300 and use that to start a new search (the first result on this page is the last result on page 300 from our first search): `https://api.github.com/repositories/27193779/issues/comments?sort=updated&direction=asc&per_page=100&page=1&since=2015-09-28T09:45:18Z`
* When we get to page 300 of the search with `since=2015-09-28T09:45:18Z` in the params, we fetch the _new_ last `updated_at` timestamp for the final record in the batch and start going through the next batch of 30,000 records: `https://api.github.com/repositories/27193779/issues/comments?sort=updated&direction=asc&per_page=100&page=1&since=2016-04-04T21:49:36Z`
* ...and so on, until we've gone through all the records
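The loop above could be sketched roughly like this. Everything here is illustrative: `fetch_page` stands in for the Octokit call, `each_comment` is a made-up name, and the seen-ID bookkeeping would need to be persisted (see the caveats below), not held in memory:

```ruby
# Illustrative sketch of the restart-with-`since` loop. `fetch_page.call(page, since)`
# stands in for the Octokit call and must return comments sorted by updated_at
# ascending ([] once a page is past the end). `max_pages` models GitHub's
# pagination cap (~300 pages at per_page=100, i.e. ~30,000 records).
def each_comment(fetch_page, max_pages: 300)
  seen_ids = {} # the real importer would persist this, not hold it in memory
  since = nil

  loop do
    new_count = 0
    last_record = nil

    1.upto(max_pages) do |page|
      records = fetch_page.call(page, since)
      break if records.empty?

      records.each do |record|
        # The first record of a restarted search duplicates the last record
        # of the previous batch, so skip anything already yielded.
        next if seen_ids[record[:id]]

        seen_ids[record[:id]] = true
        new_count += 1
        yield record
      end

      last_record = records.last
    end

    # Stop once a full pass yields nothing new; otherwise restart the
    # search from the last record's updated_at timestamp.
    break if new_count.zero? || last_record.nil?

    since = last_record[:updated_at]
  end
end
```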
## Caveats
If we go with this method for issue notes imports, we need to make sure that our issue notes importer knows to skip any already-imported issue comments. Currently, we paginate by the default sort, `created_at`, so there is never a risk of re-importing a comment. But `updated_at` can change, so it is always possible that we import an issue comment and then try to re-import it in a later job because it was updated.
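A minimal sketch of that dedup check (the class and method names are hypothetical, and the backing store is an in-memory stand-in for whatever the importer would actually use):

```ruby
require 'set'

# Hypothetical sketch: remember which GitHub comment IDs have been imported
# so a comment that reappears in a later `since` window (because it was
# updated after we imported it) is skipped instead of imported twice.
class ImportedCommentTracker
  def initialize
    # In the real importer this set would live in Redis or the database,
    # not in process memory.
    @imported_ids = Set.new
  end

  def mark_imported(github_comment_id)
    @imported_ids.add(github_comment_id)
  end

  def already_imported?(github_comment_id)
    @imported_ids.include?(github_comment_id)
  end
end
```

The importer would call `already_imported?` before scheduling each comment and `mark_imported` after a successful import.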