
GitHub import: alternative to single_endpoint_notes_import for very large projects


Background

I was just catching up on this interesting idea/issue about enabling a 2-step GitHub import: GitHub Import - Execute migration in two phases (#431603 - closed)

One thing that issue mentions is that collection endpoints (where we return paginated results of records) are much faster than single endpoints (where we retrieve just one record, or a list of records for a single resource). In some cases, we use a single endpoint when certain options are passed. That is the case with the single_endpoint_notes_import option.

I dug into the reason why we have the single_endpoint_notes_import option. It exists because of the limitation described here:

The GitHub API has a limit that prevents more than approximately 30,000 notes or diff notes from being imported. When this limit is reached, the GitHub API instead returns the following error:

In order to keep the API fast for everyone, pagination is limited for this resource. Check the rel=last link relation in the Link response header to see how far back you can traverse.

I confirmed that this error still exists and is oddly not documented anywhere on the GitHub side. 🙃

Suggestion

I think we can support projects with over 30,000 issue notes while still using the collection endpoints, by using the since query param for the GitHub API.

The "Diff Notes" importer, when single_endpoint_notes_import is not set, uses the Octokit::Client.pull_requests_comments method (which ParallelScheduling uses to iterate though each page). The pull_requests_comments method uses this GitHub API endpoint. The endpoint takes a sort query param, which determines which attribute we sort by, and a since query, which only show results that were last updated after the given time. Octokit already supports passing these params.

If we use these query params to return records in order of when they were last updated, we can take the timestamp of the last record (on page 300) and use that to start paginating through the next collection of 30,000 records. A rough sketch of this loop follows the example list below.

For example:

  • First page of search for issue notes: https://api.github.com/repositories/27193779/issues/comments?sort=updated&direction=asc&per_page=100
  • Last page of search for issue notes due to GitHub pagination limits: https://api.github.com/repositories/27193779/issues/comments?sort=updated&direction=asc&per_page=100&page=300
  • When we go to next page, we get the pagination error: https://api.github.com/repositories/27193779/issues/comments?sort=updated&direction=asc&per_page=100&page=301
  • So, we take the updated_at timestamp of the last record on page 300 and use that to start a new search (the first result on this page is the last result on page 300 from our first search): https://api.github.com/repositories/27193779/issues/comments?sort=updated&direction=asc&per_page=100&page=1&since=2015-09-28T09:45:18Z
  • When we get to page 300 of the search with since=2015-09-28T09:45:18Z in the params, we fetch the updated_at timestamp of the final record in that batch and use it to start going through the next batch of 30,000 records: https://api.github.com/repositories/27193779/issues/comments?sort=updated&direction=asc&per_page=100&page=1&since=2016-04-04T21:49:36Z
  • and so on until we've gone through all of the records
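Put together, the paginate-then-restart loop could look something like this (a rough sketch only, not the actual importer code; the repo name, the per-page/page-limit constants, and the inline import placeholder are illustrative assumptions, and I'm using Octokit's issues_comments for the issue notes case):

```ruby
require 'octokit'

PER_PAGE  = 100
MAX_PAGES = 300 # roughly where GitHub's pagination limit (~30,000 records) kicks in

client = Octokit::Client.new(access_token: ENV['GITHUB_TOKEN'])
since  = nil

loop do
  last_comment   = nil
  hit_page_limit = true

  1.upto(MAX_PAGES) do |page|
    options = { sort: 'updated', direction: 'asc', per_page: PER_PAGE, page: page }
    options[:since] = since if since

    comments = client.issues_comments('octocat/Hello-World', options) # placeholder repo

    comments.each do |comment|
      # hand the comment off to the importer here (skipping already-imported
      # records, see Caveats below)
      last_comment = comment
    end

    if comments.size < PER_PAGE
      # This window ran out of records before hitting the pagination limit,
      # so there is no further batch to fetch.
      hit_page_limit = false
      break
    end
  end

  break unless hit_page_limit && last_comment

  # Restart pagination from the last record's updated_at. The first result of
  # the next window will be the last result of this one, so it gets seen twice.
  since = last_comment.updated_at.iso8601
end
```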

Caveats

If we go with this method for issue notes imports, we need to make sure that our issue notes importer knows to skip any issue comments that have already been imported. Currently, we paginate by the default sort order, created_at, so there is never a risk of re-importing a comment. But updated_at can change, so it is always possible that we import an issue comment and then try to re-import it in a later job because it was updated in the meantime.
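One way to handle that (a rough sketch only; the in-memory set and helper below are illustrative, not the actual importer internals, which would need a store shared across workers, e.g. a Redis set keyed by project):

```ruby
require 'set'

# GitHub comment IDs are stable even when updated_at changes, so they're a
# safe deduplication key for "have we imported this note already?".
IMPORTED_COMMENT_IDS = Set.new

def import_comment(comment)
  return if IMPORTED_COMMENT_IDS.include?(comment.id)

  IMPORTED_COMMENT_IDS << comment.id
  # ... create the corresponding note in GitLab here ...
end
```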
