
Draft: Migrate mentions for issues to DB table

What does this MR do?

This MR contains the migrations for issue mentions.

Estimated counts of the records we might be looking at come from the #database-lab Slack channel.

Database benchmarks

Number of rows affected

Issues with mentions in description or title: ~1796152

-- all issues that possibly have a mention in description or title: ~1796152
explain select * from issues where (description like '%@%' OR title like '%@%')
 Seq Scan on public.issues  (cost=0.00..7968605.04 rows=1401267 width=4) (actual time=35.232..672139.124 rows=1796152 loops=1)
   Filter: ((issues.description ~~ '%@%'::text) OR ((issues.title)::text ~~ '%@%'::text))
   Rows Removed by Filter: 20650033
   Buffers: shared hit=1453859 read=2533248 dirtied=28321 written=22499
   I/O Timings: read=611404.724 write=706.509

Migrate issue mentions

Query plans for the SQL queries used to migrate issue mentions, based on database-lab data.

Query 1: Get the start ID for the issues to be migrated. planning: 6.423 ms, execution: 48.928 ms
EXPLAIN SELECT  issues.id FROM issues
LEFT JOIN issue_user_mentions on issues.id = issue_user_mentions.issue_id
WHERE ((description like '%@%' OR title like '%@%') AND issue_user_mentions.issue_id is null)
ORDER BY issues.id ASC LIMIT 1

Click to see the plan.

Limit  (cost=0.73..6.15 rows=1 width=4) (actual time=48.881..48.883 rows=1 loops=1)
   Buffers: shared hit=13 read=27 dirtied=1
   I/O Timings: read=48.477
   ->  Merge Join  (cost=0.73..8122712.11 rows=1498460 width=4) (actual time=48.880..48.880 rows=1 loops=1)
         Buffers: shared hit=13 read=27 dirtied=1
         I/O Timings: read=48.477
         ->  Index Scan using issues_pkey on public.issues  (cost=0.44..8115258.90 rows=1501072 width=4) (actual time=48.773..48.773 rows=1 loops=1)
               Filter: ((issues.description ~~ '%@%'::text) OR ((issues.title)::text ~~ '%@%'::text))
               Rows Removed by Filter: 46
               Buffers: shared hit=10 read=27 dirtied=1
               I/O Timings: read=48.477
         ->  Index Only Scan using issue_user_mentions_on_issue_id_and_note_id_index on public.issue_user_mentions  (cost=0.29..3529.09 rows=70149 width=4) (actual time=0.059..0.060 rows=1 loops=1)
               Heap Fetches: 0
               Buffers: shared hit=3
Time: 55.351 ms

planning: 6.423 ms
execution: 48.928 ms

I/O read: 48.477 ms
I/O write: 0.000 ms



Shared buffers:

hits: 13 (~104.00 KiB) from the buffer pool
reads: 27 (~216.00 KiB) from the OS file cache, including disk I/O
dirtied: 1 (~8.00 KiB)
writes: 0

Query 2: Get the max ID for a 10K-row batch of issues to be migrated. COLD RUN: planning: 6.320 ms, execution: 1.394 min. SECOND RUN: planning: 6.327 ms, execution: 409.221 ms
EXPLAIN SELECT  issues.id FROM issues
LEFT JOIN issue_user_mentions on issues.id = issue_user_mentions.issue_id
WHERE ((description like '%@%' OR title like '%@%') AND issue_user_mentions.issue_id is null)
AND issues.id >= 1 ORDER BY issues.id ASC LIMIT 1 OFFSET 10000

Click to see the plan.

Limit  (cost=54592.08..54597.54 rows=1 width=4) (actual time=83660.833..83660.836 rows=1 loops=1)
   Buffers: shared hit=30439 read=82166 dirtied=359
   I/O Timings: read=82475.659
   ->  Merge Join  (cost=0.73..8180296.49 rows=1498460 width=4) (actual time=0.127..83657.799 rows=10001 loops=1)
         Buffers: shared hit=30439 read=82166 dirtied=359
         I/O Timings: read=82475.659
         ->  Index Scan using issues_pkey on public.issues  (cost=0.44..8172843.28 rows=1501072 width=4) (actual time=0.079..83642.766 rows=10001 loops=1)
               Index Cond: (issues.id >= 1)
               Filter: ((issues.description ~~ '%@%'::text) OR ((issues.title)::text ~~ '%@%'::text))
               Rows Removed by Filter: 130518
               Buffers: shared hit=30436 read=82166 dirtied=359
               I/O Timings: read=82475.659
         ->  Index Only Scan using issue_user_mentions_on_issue_id_and_note_id_index on public.issue_user_mentions  (cost=0.29..3529.09 rows=70149 width=4) (actual time=0.046..0.050 rows=2 loops=1)
               Heap Fetches: 0
               Buffers: shared hit=3
COLD RUN:
Time: 1.394 min

planning: 6.320 ms
execution: 1.394 min

I/O read: 1.375 min
I/O write: 0.000 ms



Shared buffers:

hits: 30439 (~237.80 MiB) from the buffer pool
reads: 82166 (~641.90 MiB) from the OS file cache, including disk I/O
dirtied: 359 (~2.80 MiB)
writes: 0

SECOND RUN:
Time: 415.548 ms

planning: 6.327 ms
execution: 409.221 ms

I/O read: 0.000 ms
I/O write: 0.000 ms



Shared buffers:

hits: 112605 (~879.70 MiB) from the buffer pool
reads: 0 from the OS file cache, including disk I/O
dirtied: 0
writes: 0
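Queries 1 and 2 together drive the batch walk: Query 1 finds the first unmigrated issue with a mention, and Query 2 finds the upper bound of the next 10K batch via `OFFSET`. A minimal in-memory sketch of that walk (hypothetical data, batch size shrunk to 3 for readability; not the actual migration code):

```ruby
# Sketch of how Query 1 (start id) and Query 2 (next batch boundary via
# ORDER BY id LIMIT 1 OFFSET batch_size) split candidates into ranges.
BATCH_SIZE = 3

# ids of issues whose title/description contain '@' and that have no
# row in issue_user_mentions yet (the WHERE clause of both queries)
candidate_ids = [4, 9, 10, 15, 22, 30, 31]

batches = []
start_id = candidate_ids.min # Query 1: first candidate id
while start_id
  in_range = candidate_ids.select { |id| id >= start_id }
  # Query 2: the id at OFFSET BATCH_SIZE becomes the next batch's start
  next_start = in_range.sort[BATCH_SIZE]
  batch = next_start ? in_range.select { |id| id < next_start } : in_range
  batches << (batch.min..batch.max)
  start_id = next_start
end

p batches # => [4..10, 15..30, 31..31]
```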

Query 3: Get the min and max IDs for issue mentions to be migrated
EXPLAIN SELECT  min(issues.id),  max(issues.id) FROM issues
LEFT JOIN issue_user_mentions on issues.id = issue_user_mentions.issue_id
WHERE ((description like '%@%' OR title like '%@%') AND issue_user_mentions.issue_id is null)
AND issues.id >= 400000 AND issues.id < 500000

Click to see the plan before adding index.

Aggregate  (cost=76846.94..76846.95 rows=1 width=8) (actual time=33763.582..33763.583 rows=1 loops=1)
   Buffers: shared hit=50382 read=28609 dirtied=135
   I/O Timings: read=32746.547
   ->  Nested Loop  (cost=0.73..76825.48 rows=4293 width=4) (actual time=5.283..33750.906 rows=9860 loops=1)
         Buffers: shared hit=50382 read=28609 dirtied=135
         I/O Timings: read=32746.547
         ->  Index Scan using issues_pkey on public.issues  (cost=0.44..74787.44 rows=4300 width=4) (actual time=5.262..33639.708 rows=9860 loops=1)
               Index Cond: ((issues.id >= 400000) AND (issues.id < 500000))
               Filter: ((issues.description ~~ '%@%'::text) OR ((issues.title)::text ~~ '%@%'::text))
               Rows Removed by Filter: 52847
               Buffers: shared hit=30662 read=28609 dirtied=135
               I/O Timings: read=32746.547
         ->  Index Only Scan using issue_user_mentions_on_issue_id_and_note_id_index on public.issue_user_mentions  (cost=0.29..0.66 rows=2 width=4) (actual time=0.007..0.007 rows=0 loops=9860)
               Index Cond: (issue_user_mentions.issue_id = issues.id)
               Heap Fetches: 0
               Buffers: shared hit=19720
Time: 33.774 s

planning: 9.843 ms
execution: 33.764 s

I/O read: 32.747 s
I/O write: 0.000 ms



Shared buffers:

hits: 50382 (~393.60 MiB) from the buffer pool
reads: 28609 (~223.50 MiB) from the OS file cache, including disk I/O
dirtied: 135 (~1.10 MiB)
writes: 0

Click to see the plan after adding index.

Aggregate  (cost=2360.68..2360.69 rows=1 width=8) (actual time=176.751..176.751 rows=1 loops=1)
   Buffers: shared hit=22931 read=125
   I/O Timings: read=153.980
   ->  Nested Loop  (cost=0.72..2339.21 rows=4295 width=4) (actual time=0.320..174.979 rows=9860 loops=1)
         Buffers: shared hit=22931 read=125
         I/O Timings: read=153.980
         ->  Index Only Scan using issue_mentions_temp_index on public.issues  (cost=0.43..299.55 rows=4302 width=4) (actual time=0.310..158.941 rows=9860 loops=1)
               Index Cond: ((issues.id >= 400000) AND (issues.id < 500000))
               Heap Fetches: 119
               Buffers: shared hit=3211 read=125
               I/O Timings: read=153.980
         ->  Index Only Scan using issue_user_mentions_on_issue_id_and_note_id_index on public.issue_user_mentions  (cost=0.29..0.66 rows=2 width=4) (actual time=0.001..0.001 rows=0 loops=9860)
               Index Cond: (issue_user_mentions.issue_id = issues.id)
               Heap Fetches: 0
               Buffers: shared hit=19720
Time: 189.614 ms

planning: 12.790 ms
execution: 176.824 ms

I/O read: 153.980 ms
I/O write: 0.000 ms



Shared buffers:

hits: 22931 (~179.10 MiB) from the buffer pool
reads: 125 (~1000.00 KiB) from the OS file cache, including disk I/O
dirtied: 0
writes: 0
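Query 3's job is to tighten each scheduled id window to the actual candidate rows inside it, so each background job gets precise bounds. An in-memory sketch of that step (hypothetical data, not GitLab code):

```ruby
# Sketch of Query 3: within a fixed id window, find the min and max ids
# of issues that have a mention and no issue_user_mentions row yet.
window = (400..500)
candidate_ids = [350, 402, 417, 466, 499, 512] # unmigrated issues with mentions

in_window = candidate_ids.select { |id| window.cover?(id) }
bounds = [in_window.min, in_window.max]
p bounds # => [402, 499]
```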

Index creation

exec CREATE INDEX issue_mentions_temp_index ON issues (id) WHERE title ~~ '%@%'::text OR description ~~ '%@%'::text

The query has been executed. Duration: 21.199 min

NOTE: concurrent index creation takes 2-3x more time.

Migration Runtime estimates:

Issue title and description mentions migration: ~600 - ~990 mins, i.e. ~10h - ~16.5h
  1. Concurrent Index creation: ~60-90 mins
  2. ~1796152 records in 10K batches results in ~180 background migration jobs. Scheduling a single job should take ~1s, so ~3 mins to schedule all jobs. I was only able to run jobs locally, which showed a runtime for each job between 3-5 mins with a 2-min delay between jobs. That gives an estimated running time of 180 jobs * 3 mins = 540 mins up to 180 jobs * 5 mins = 900 mins.
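A quick sanity check of the arithmetic in the estimate above, using the MR's own numbers:

```ruby
# Back-of-the-envelope check of the runtime estimate.
total_rows = 1_796_152                      # issues with mentions (see benchmarks)
batch_size = 10_000
jobs = (total_rows.to_f / batch_size).ceil  # number of background migration jobs

per_job_min, per_job_max = 3, 5             # minutes per job, observed locally
low  = jobs * per_job_min                   # best-case total job runtime, mins
high = jobs * per_job_max                   # worst-case total job runtime, mins

puts "#{jobs} jobs, #{low}-#{high} mins"    # => 180 jobs, 540-900 mins
```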

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • Label as security and @ mention @gitlab-com/gl-security/appsec
  • The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • Security reports checked/validated by a reviewer from the AppSec team

re #198328

Edited by Alexandru Croitor
