Import open MR and issues first to reduce migration downtime - GitHub Import
This issue is related to the idea I commented on in #428170 (comment 1649032555). Similar to GitHub Import - Execute migration in two phases (#431603 - closed), it proposes a different way to migrate GitHub projects.
Context
Some GitHub Import stages have to make one request per issue or MR to migrate a piece of information. In our codebase, we call them single endpoints. In total, 6 stages use single endpoints, as mentioned in #431603 (closed).
As mentioned in #432679 (comment 1674316712), executing these 6 stages on a project that has 50K issues + 50K MRs takes around 5 days due to the GitHub API rate limit.
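The 5-day figure can be roughly reproduced with a back-of-the-envelope calculation. The sketch below is only an estimate under stated assumptions: that every one of the 6 stages makes one request per record, that the rate limit is 5,000 requests per hour (the standard authenticated GitHub REST limit), and that a collection endpoint returns up to 100 records per page. The function names are hypothetical and are not part of the importer.

```python
import math


def single_endpoint_hours(records: int, stages: int, requests_per_hour: int) -> float:
    """Hours needed when every stage issues one request per record."""
    return records * stages / requests_per_hour


def collection_endpoint_requests(records: int, per_page: int = 100) -> int:
    """Requests needed when one request returns a full page of records."""
    return math.ceil(records / per_page)


# 50K issues + 50K MRs, as in the example above.
records = 50_000 + 50_000

# Assumption: 6 single-endpoint stages, 5,000 requests per hour.
hours = single_endpoint_hours(records, stages=6, requests_per_hour=5_000)
days = hours / 24

# For comparison: listing the same records through a collection endpoint.
pages = collection_endpoint_requests(records)
```

Under these assumptions the single-endpoint stages need 600,000 requests, which is 120 hours or 5 days, while one pass over a collection endpoint needs 1,000 requests. The real number depends on which stages apply to issues, to MRs, or to both.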
GitHub Import stages using single and collection endpoints
It's important to note that during the GitHub Import process, certain stages may take longer than others. This is because an API request is needed for each Merge Request (MR) or Issue. The following table illustrates the workers involved in the process and whether they use a single endpoint (which is around 100 times slower), or a collection endpoint (which is faster).
| Stage | Single or collection endpoint | Worker description |
|---|---|---|
| ImportRepositoryWorker | | This worker imports the repository and wiki, scheduling the next stage when done. |
| ImportBaseDataWorker | | This worker imports base data such as labels, milestones, and releases. |
| ImportPullRequestsWorker | Collection endpoint | This worker imports all pull requests. |
| ImportCollaboratorsWorker | Collection endpoint | This worker imports only direct repository collaborators who are not outside collaborators. |
| ImportPullRequestsMergedByWorker | Single endpoint | This worker imports the pull requests' merged-by user information. |
| ImportPullRequestsReviewRequestsWorker | Single endpoint | This worker imports assigned reviewers of pull requests. |
| ImportPullRequestsReviewsWorker | Single endpoint | This worker imports reviews of pull requests. |
| ImportIssuesAndDiffNotesWorker | Collection endpoint for issues; single endpoint for DiffNotes (single_endpoint_notes_import is true) | This worker imports all issues and pull request diff notes. |
| ImportIssueEventsWorker | Single endpoint | This worker imports all issue and pull request events. |
| ImportNotesWorker | Single endpoint (single_endpoint_notes_import is true) | This worker imports regular comments for issues and pull requests. |
| ImportAttachmentsWorker | Collection endpoint | This worker imports note attachments linked inside Markdown. |
| ImportProtectedBranchesWorker | Single endpoint, but not relevant | This worker imports protected branch rules. |
Problem
It is not ideal for the customer to have to wait for 5 days before they can start using the project. Ideally, the customer should be able to start using the project in 2 days, allowing for a migration to take place on a Friday evening and the use of the migrated project on Monday morning.
Proposed idea
To reduce the required downtime, GitHub Import could migrate only the essential information in a first phase, then allow customers to start using the project while the remaining information is migrated in the background.
GitHub Import could be divided into 3 phases as described below. After phase 1, users would be allowed to use the project, for example, create issues, merge requests, set up pipelines, etc.
Phase 1:
- Repository
- Labels
- Releases
- Releases attachments
- Milestones
- Protected branches
- Collaborators
- LFS objects
- Reserve merge request IID
- Reserve issues IID
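Reserving IIDs in phase 1 presumably means keeping the numbers used on GitHub free, so that issues and merge requests created by users after phase 1 do not take an IID that a later phase still needs. A minimal sketch of that idea follows. The `IidCounter` class is a hypothetical in-memory stand-in, not the actual internal ID mechanism.

```python
class IidCounter:
    """Hypothetical per-project counter handing out IIDs."""

    def __init__(self) -> None:
        self.last_used = 0

    def reserve_up_to(self, highest_imported: int) -> None:
        # Never move the counter backwards: keep the larger value.
        self.last_used = max(self.last_used, highest_imported)

    def next_iid(self) -> int:
        self.last_used += 1
        return self.last_used


counter = IidCounter()
# Phase 1: the highest issue number seen on GitHub is 50,000 (example value).
counter.reserve_up_to(50_000)
# A user creates an issue before phases 2 and 3 have run.
first_user_iid = counter.next_iid()
```

With the range reserved, the first user-created issue gets an IID above every number that phases 2 and 3 will import.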
Phase 2:
- Migrate open issues
  - Issue attachments
  - Notes
  - Note attachments
  - Events
- Migrate open merge requests
  - MR attachments
  - Notes
  - Note attachments
  - DiffNotes
  - Merged by
  - Pull request reviews
  - Pull request reviewers
  - Events
Phase 3:
- Migrate remaining issues
  - Issue attachments
  - Notes
  - Note attachments
  - Events
- Migrate remaining merge requests
  - MR attachments
  - Notes
  - Note attachments
  - DiffNotes
  - Merged by
  - Pull request reviews
  - Pull request reviewers
  - Events
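The split between phase 2 and phase 3 comes down to the state of each record: open issues and merge requests go first, everything else follows. A minimal sketch of that partitioning, assuming each record carries the `state` value reported by GitHub (`"open"` or `"closed"`); the `Record` type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    number: int
    state: str  # "open" or "closed", as reported by GitHub


def phase_for(record: Record) -> int:
    """Open records are migrated in phase 2, all remaining ones in phase 3."""
    return 2 if record.state == "open" else 3


def split_by_phase(records: list[Record]) -> dict[int, list[int]]:
    """Group record numbers by the phase that should migrate them."""
    phases: dict[int, list[int]] = {2: [], 3: []}
    for record in records:
        phases[phase_for(record)].append(record.number)
    return phases


example = [Record(1, "closed"), Record(2, "open"), Record(3, "open")]
plan = split_by_phase(example)
```

Here `plan[2]` holds the open records and `plan[3]` the rest; the related notes, events, and reviews would then be fetched per record within the same phase.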
Challenges
Since users will be able to use the project after phase 1, GitHub Import must handle any modifications that users make during the migration process. For instance, users might delete labels or milestones, or change their public email addresses, so the import must be resilient enough to handle such scenarios.
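One way to be resilient to deleted labels is to look them up at the moment a record is migrated and skip those that no longer exist, instead of failing the whole record. The sketch below illustrates this under the assumption that labels are matched by name; `resolve_labels` and its arguments are hypothetical.

```python
def resolve_labels(
    wanted: list[str], existing: dict[str, int]
) -> tuple[list[int], list[str]]:
    """Map label names from GitHub to local label IDs.

    Returns the IDs that could be resolved and the names that are missing,
    for example because a user deleted the label after phase 1.
    """
    found: list[int] = []
    missing: list[str] = []
    for name in wanted:
        if name in existing:
            found.append(existing[name])
        else:
            missing.append(name)
    return found, missing


# "bug" was imported in phase 1; "wontfix" was deleted by a user afterwards.
label_ids, skipped = resolve_labels(["bug", "wontfix"], {"bug": 7})
```

The missing names could be logged or recreated, depending on which behavior is preferred.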
Also, since the migration of merge requests and issues occurs in phases (for example, the merge request is created first, and reviewers, reviews, events, and merged-by information are migrated later), users could perform actions that conflict with the import. For example, a user could change the merge request reviewers before that information is migrated, and GitHub Import would later change the reviewers again. Perhaps we would have to introduce some kind of lock to prevent changes while a record is still migrating.
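The lock could take many forms. As a minimal illustration only, the sketch below keeps an in-memory set of record IDs that are still migrating and lets callers check it before accepting a user change. `MigrationLocks` is hypothetical, and a real implementation would need shared, persistent storage rather than process memory.

```python
from contextlib import contextmanager


class MigrationLocks:
    """Hypothetical registry of records that are still being migrated."""

    def __init__(self) -> None:
        self._locked: set[int] = set()

    @contextmanager
    def migrating(self, record_id: int):
        # Mark the record as locked for the duration of the block.
        self._locked.add(record_id)
        try:
            yield
        finally:
            # Always release, even if the migration step raises.
            self._locked.discard(record_id)

    def is_locked(self, record_id: int) -> bool:
        return record_id in self._locked


locks = MigrationLocks()
with locks.migrating(42):
    # While reviewers, reviews, and events are imported, edits are refused.
    locked_during = locks.is_locked(42)
locked_after = locks.is_locked(42)
```

The trade-off is that users cannot edit a locked record until its remaining data has arrived, which for closed records could be as late as phase 3.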