Add alternative way of importing comments with GitHub Importer (!67150) · Merge requests · GitLab.org / GitLab

George Koltsov requested to merge georgekoltsov/update-github-notes-importer into master Jul 29, 2021

What does this MR do?

This MR adds an alternative way of importing comments and diff comments to GitHub Importer.

Why?

When GitHub Importer runs on extremely large projects (e.g. nodejs/node repository) not all comments and diff comments are being imported. After investigating what could be wrong, I found that GitHub API endpoints that we use to fetch diff and regular notes (pull_requests_comments & issues_comments endpoints) are limited by GitHub and not all pages can be fetched.

At some point for an import of a large project the GitHub API will start following the following error: "In order to keep the API fast for everyone, pagination is limited for this resource. Check the rel=last link relation in the Link response header to see how far back you can traverse." which means we can't fetch all of the available comments via the endpoints above.

Example: https://api.github.com/repositories/27193779/issues/comments?page=401&per_page=100

To double check, I went through all 400 pages one by one (with a script) looking for this specific comment https://github.com/nodejs/node/pull/21128#issuecomment-394515601 and indeed it wasn't there. This means that we can only import 40000 regular comments that are shared between issues and MRs (for this particular repo).

Different repositories might have different limits. It's usually returned in rels[:last] response from GitHub and I don't think this is a global limitation. I'd imagine different repos have different limits, depending on how big they are.

Because of that, I am introducing an alternative approach for large imports that is behind a few feature flags:

`github_importer_single_endpoint_notes_import`

Mind that in our codebase comments are notes - therefore name of the FF.

Updates diff notes (diff comments) & notes (comments) stages to use a similar approach to reviews & merged by stages, where instead of using issues_comments & pull_requests_comments endpoints (that return ALL comments per repository, paginated), we fetch 1 MR's notes at a time, since individual endpoints have all comments present. This will allow us to carry over missing comments, however because we have to do 1 request per MR to fetch comments, this will increase the number of network requests required to perform the import, which means its execution is going to be even longer.

A lot of the changes introduced are very similar to Gitlab::GithubImport::Importer::PullRequestsReviewsImporter.

Additionally, extremely large repos import can run for extreme amounts of time. For example, nodejs import with proposed changes runs for 21hours, mostly because the importer has to wait around for rate limit to be lifted. GitHub Importer caches issues/mrs ids in redis for 24 hours (https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/cache/import/caching.rb#L8), but for even larger repos if the keys expire then import will not fully complete, as if the import runs for longer than 24 hours, the keys will expire and notes wont be imported. This feature flag also changes default key timeout from 24 to 78 hours.

`github_importer_lower_per_page_limit`

For extremely large repos even pull_request_comments endpoint returning 100 items can return 500/502 bit GitHub API (Example: https://api.github.com/repos/nodejs/node/pulls/1159/comments?page=1&per_page=100) which can fail the import entirely. Because of that, introduced feature flag reduces page size from 100 to 50 in order to try and mitigate this issue.

Both feature flags are scope to project's group and should not be enabled globally (especially on .com as all imports are going to be affected)
Both feature flags are marked as ops as per suggestions in https://docs.gitlab.com/ee/development/feature_flags/#ops-type long-lived, altering behaviour
Both feature flags main purpose is to support Professional Services large customer migrations, where default GitHub Importer behaviour is not enough

Mentions #332630 (closed)

DB queries & execution plans

project.merge_requests.where.not(iid: [...]).each_batch(of: 100, column: :iid)

explain SELECT "merge_requests"."iid" FROM "merge_requests" WHERE "merge_requests"."target_project_id" = 278964 AND "merge_requests"."iid" NOT IN (1, 2, 3) ORDER BY "merge_requests"."iid" ASC LIMIT 1

https://gitlab.slack.com/archives/CLJMDRD8C/p1628592868072600

 Limit  (cost=0.57..0.60 rows=1 width=4) (actual time=8.663..8.664 rows=1 loops=1)
   Buffers: shared read=6 dirtied=1
   I/O Timings: read=8.550 write=0.000
   ->  Index Only Scan using index_merge_requests_on_target_project_id_and_iid on public.merge_requests  (cost=0.57..1837.71 rows=56842 width=4) (actual time=8.660..8.661 rows=1 loops=1)
         Index Cond: (merge_requests.target_project_id = 278964)
         Heap Fetches: 1
         Filter: (merge_requests.iid <> ALL ('{1,2,3}'::integer[]))
         Rows Removed by Filter: 3
         Buffers: shared read=6 dirtied=1
         I/O Timings: read=8.550 write=0.000

Time: 12.648 ms
  - planning: 3.929 ms
  - execution: 8.719 ms
    - I/O read: 8.550 ms
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 0 from the buffer pool
  - reads: 6 (~48.00 KiB) from the OS file cache, including disk I/O
  - dirtied: 1 (~8.00 KiB)
  - writes: 0

explain SELECT "merge_requests"."iid" FROM "merge_requests" WHERE "merge_requests"."target_project_id" = 278964 AND "merge_requests"."iid" NOT IN (1, 2, 3) AND "merge_requests"."iid" >= 4 ORDER BY "merge_requests"."iid" ASC LIMIT 1 OFFSET 100

https://gitlab.slack.com/archives/CLJMDRD8C/p1628592954075100

 Limit  (cost=4.07..4.10 rows=1 width=4) (actual time=13.572..13.574 rows=1 loops=1)
   Buffers: shared hit=41 read=66 dirtied=3
   I/O Timings: read=13.178 write=0.000
   ->  Index Only Scan using index_merge_requests_on_target_project_id_and_iid on public.merge_requests  (cost=0.57..1865.57 rows=53297 width=4) (actual time=0.048..13.557 rows=101 loops=1)
         Index Cond: ((merge_requests.target_project_id = 278964) AND (merge_requests.iid >= 4))
         Heap Fetches: 8
         Buffers: shared hit=41 read=66 dirtied=3
         I/O Timings: read=13.178 write=0.000

Time: 14.206 ms
  - planning: 0.600 ms
  - execution: 13.606 ms
    - I/O read: 13.178 ms
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 41 (~328.00 KiB) from the buffer pool
  - reads: 66 (~528.00 KiB) from the OS file cache, including disk I/O
  - dirtied: 3 (~24.00 KiB)
  - writes: 0

explain SELECT "merge_requests".* FROM "merge_requests" WHERE "merge_requests"."target_project_id" = 278964 AND "merge_requests"."iid" NOT IN (1, 2, 3) AND "merge_requests"."iid" >= 4 AND "merge_requests"."iid" < 104

https://gitlab.slack.com/archives/CLJMDRD8C/p1628593004077600

 Index Scan using index_merge_requests_on_target_project_id_and_iid on public.merge_requests  (cost=0.57..28421.69 rows=21448 width=767) (actual time=0.017..239.980 rows=100 loops=1)
   Index Cond: ((merge_requests.target_project_id = 278964) AND (merge_requests.iid >= 4) AND (merge_requests.iid < 104))
   Buffers: shared hit=13 read=92
   I/O Timings: read=238.681 write=0.000

Time: 242.543 ms
  - planning: 2.409 ms
  - execution: 240.134 ms
    - I/O read: 238.681 ms
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 13 (~104.00 KiB) from the buffer pool
  - reads: 92 (~736.00 KiB) from the OS file cache, including disk I/O
  - dirtied: 0
  - writes: 0

project.issues.where.not(iid: [...]).each_batch(of: 100, column: :iid)

explain SELECT "issues"."iid" FROM "issues" WHERE "issues"."project_id" = 278964 AND "issues"."iid" NOT IN (1, 2, 3) ORDER BY "issues"."iid" ASC LIMIT 1

https://gitlab.slack.com/archives/CLJMDRD8C/p1628593079080100

 Limit  (cost=0.57..0.62 rows=1 width=4) (actual time=8.956..8.958 rows=1 loops=1)
   Buffers: shared hit=1 read=9 dirtied=1
   I/O Timings: read=7.820 write=0.000
   ->  Index Only Scan using index_issues_on_project_id_and_iid on public.issues  (cost=0.57..4441.59 rows=85529 width=4) (actual time=8.953..8.954 rows=1 loops=1)
         Index Cond: (issues.project_id = 278964)
         Heap Fetches: 2
         Filter: (issues.iid <> ALL ('{1,2,3}'::integer[]))
         Rows Removed by Filter: 3
         Buffers: shared hit=1 read=9 dirtied=1
         I/O Timings: read=7.820 write=0.000

Time: 11.617 ms
  - planning: 2.631 ms
  - execution: 8.986 ms
    - I/O read: 7.820 ms
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 1 (~8.00 KiB) from the buffer pool
  - reads: 9 (~72.00 KiB) from the OS file cache, including disk I/O
  - dirtied: 1 (~8.00 KiB)
  - writes: 0

explain SELECT "issues"."iid" FROM "issues" WHERE "issues"."project_id" = 278964 AND "issues"."iid" NOT IN (1, 2, 3) AND "issues"."iid" >= 4 ORDER BY "issues"."iid" ASC LIMIT 1 OFFSET 100

https://gitlab.slack.com/archives/CLJMDRD8C/p1628593114082600

 Limit  (cost=6.02..6.08 rows=1 width=4) (actual time=37.196..37.198 rows=1 loops=1)
   Buffers: shared hit=40 read=53 dirtied=1
   I/O Timings: read=36.717 write=0.000
   ->  Index Only Scan using index_issues_on_project_id_and_iid on public.issues  (cost=0.57..4428.70 rows=81159 width=4) (actual time=0.058..37.180 rows=101 loops=1)
         Index Cond: ((issues.project_id = 278964) AND (issues.iid >= 4))
         Heap Fetches: 4
         Buffers: shared hit=40 read=53 dirtied=1
         I/O Timings: read=36.717 write=0.000

Time: 37.900 ms
  - planning: 0.664 ms
  - execution: 37.236 ms
    - I/O read: 36.717 ms
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 40 (~320.00 KiB) from the buffer pool
  - reads: 53 (~424.00 KiB) from the OS file cache, including disk I/O
  - dirtied: 1 (~8.00 KiB)
  - writes: 0

explain SELECT "issues".* FROM "issues" WHERE "issues"."project_id" = 278964 AND "issues"."iid" NOT IN (1, 2, 3) AND "issues"."iid" >= 4 AND "issues"."iid" < 104

https://gitlab.slack.com/archives/CLJMDRD8C/p1628593150085100

 Index Scan using index_issues_on_project_id_and_iid on public.issues  (cost=0.57..31997.32 rows=22931 width=1327) (actual time=1.244..198.333 rows=100 loops=1)
   Index Cond: ((issues.project_id = 278964) AND (issues.iid >= 4) AND (issues.iid < 104))
   Buffers: shared hit=10 read=93
   I/O Timings: read=197.090 write=0.000

Time: 201.032 ms
  - planning: 2.546 ms
  - execution: 198.486 ms
    - I/O read: 197.090 ms
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 10 (~80.00 KiB) from the buffer pool
  - reads: 93 (~744.00 KiB) from the OS file cache, including disk I/O
  - dirtied: 0
  - writes: 0

Screenshots or Screencasts (strongly suggested)

How to setup and validate locally (strongly suggested)

Create a new group locally for project to be imported into
Enable feature flags

rails console

Feature.enable(:github_importer_lower_per_page_limit, group)
Feature.enable(:github_importer_single_endpoint_notes_import, group)

Start a new GitHub Import via UI or API of a project that contains issue/mr notes & diff notes
Verify that imported project contains all issue/mr notes & diff notes

Does this MR meet the acceptance criteria?

Conformity

I have included changelog trailers, or none are needed. (Does this MR need a changelog?)
I have added/updated documentation, or it's not needed. (Is documentation required?)
I have properly separated EE content from FOSS, or this MR is FOSS only. (Where should EE code go?)
I have added information for database reviewers in the MR description, or it's not needed. (Does this MR have database related changes?)
I have self-reviewed this MR per code review guidelines.
This MR does not harm performance, or I have asked a reviewer to help assess the performance impact. (Merge request performance guidelines)
I have followed the style guides.
This change is backwards compatible across updates, or this does not apply.

Availability and Testing

I have added/updated tests following the Testing Guide, or it's not needed. (Consider all test levels. See the Test Planning Process.)
I have tested this MR in all supported browsers, or it's not needed.
I have informed the Infrastructure department of a default or new setting change per definition of done, or it's not needed.

Security

Does this MR contain changes to processing or storing of credentials or tokens, authorization and authentication methods or other items described in the security review guidelines? If not, then delete this Security section.

Label as security and @ mention @gitlab-com/gl-security/appsec
The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
Security reports checked/validated by a reviewer from the AppSec team

Edited Oct 10, 2022 by Magdalena Frankiewicz

Add alternative way of importing comments with GitHub Importer