Draft: Migrate mentions for issue notes to DB table

What does this MR do?

This MR contains the migrations for issue notes mentions.

All queries and estimations are based on data from the database-lab channel.

Database benchmarks

Number of rows affected

Issue notes with mentions: ~14720459

-- issue notes that have at least one mention ~14720459
explain select * from notes where noteable_type = 'Issue' and note like '%\@%'

 Seq Scan on public.notes  (cost=0.00..81296854.08 rows=8761665 width=4) (actual time=3.154..1747987.120 rows=14720459 loops=1)
   Filter: ((notes.note ~~ '%\@%'::text) AND ((notes.noteable_type)::text = 'Issue'::text))
   Rows Removed by Filter: 214948219
   Buffers: shared hit=1109341 read=19928387 dirtied=175572 written=133906
   I/O Timings: read=1572330.794 write=5860.768
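
For reference, a direct count over the same filter would look like the statement below (a sketch; the ~14720459 figure above is taken from the actual row count in the EXPLAIN output, and running this count would require a similarly expensive sequential scan):

-- Direct count of issue notes containing at least one mention; without the
-- temporary index this is a full sequential scan of notes.
SELECT count(*) FROM notes WHERE noteable_type = 'Issue' AND note LIKE '%\@%';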

Migrate issue notes mentions

Query plans for the SQL queries used to migrate mentions for issue notes, based on database-lab data.

Any query run on the notes table is quite slow, which is why we need to add an index.

exec CREATE INDEX issue_notes_mentions_temp_index ON notes (id)
WHERE note ~~ '%@%'::text AND notes.noteable_type = 'Issue'

The query has been executed. Duration: 45.198 min

NOTE: In the migration the index will be created concurrently. CONCURRENT index creation takes 2-3x more time, but it is non-blocking.

exec CREATE INDEX CONCURRENTLY issue_notes_mentions_temp_index ON notes (id)
WHERE note ~~ '%@%'::text AND notes.noteable_type = 'Issue'
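
The index only exists to support this migration, so a follow-up cleanup migration can drop it once all jobs have finished. A minimal sketch, assuming the cleanup migration also runs outside a transaction block (required for a concurrent drop):

-- Drop the temporary index after the background migration completes;
-- DROP INDEX CONCURRENTLY avoids blocking writes on the notes table.
DROP INDEX CONCURRENTLY IF EXISTS issue_notes_mentions_temp_index;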
Query 1: Get the start ID for the issue notes to be migrated.
explain SELECT notes.id FROM notes
LEFT JOIN issue_user_mentions ON notes.id = issue_user_mentions.note_id
WHERE (note LIKE '%@%' AND issue_user_mentions.issue_id IS NULL
AND notes.noteable_type = 'Issue') ORDER BY notes.id LIMIT 1

Click to see the plan WITHOUT index: ~50 mins.

Limit  (cost=85243925.59..85243925.60 rows=1 width=4) (actual time=3053481.600..3053481.645 rows=1 loops=1)
   Buffers: shared hit=1153431 read=20885874 dirtied=122562 written=116598
   I/O Timings: read=2822280.286 write=7061.589
   ->  Sort  (cost=85243925.59..85243925.60 rows=1 width=4) (actual time=3053481.586..3053481.586 rows=1 loops=1)
         Sort Key: notes.id
         Sort Method: top-N heapsort  Memory: 25kB
         Buffers: shared hit=1153431 read=20885874 dirtied=122562 written=116598
         I/O Timings: read=2822280.286 write=7061.589
         ->  Hash Left Join  (cost=4218.35..85243925.58 rows=1 width=4) (actual time=185.842..3045654.497 rows=15183109 loops=1)
               Hash Cond: (notes.id = issue_user_mentions.note_id)
               Filter: (issue_user_mentions.issue_id IS NULL)
               Rows Removed by Filter: 51514
               Buffers: shared hit=1153428 read=20885874 dirtied=122562 written=116598
               I/O Timings: read=2822280.286 write=7061.589
               ->  Seq Scan on public.notes  (cost=0.00..85203371.14 rows=9683645 width=4) (actual time=6.132..3028100.060 rows=15234623 loops=1)
                     Filter: ((notes.note ~~ '%@%'::text) AND ((notes.noteable_type)::text = 'Issue'::text))
                     Rows Removed by Filter: 224508348
                     Buffers: shared hit=1153427 read=20885215 dirtied=122445 written=116598
                     I/O Timings: read=2822139.438 write=7061.589
               ->  Hash  (cost=3341.49..3341.49 rows=70149 width=8) (actual time=179.017..179.017 rows=55724 loops=1)
                     Buckets: 131072  Batches: 1  Memory Usage: 3201kB
                     Buffers: shared hit=1 read=659 dirtied=117
                     I/O Timings: read=140.848
                     ->  Seq Scan on public.issue_user_mentions  (cost=0.00..3341.49 rows=70149 width=8) (actual time=2.552..158.818 rows=70348 loops=1)
                           Buffers: shared hit=1 read=659 dirtied=117
                           I/O Timings: read=140.848
Time: 50.891 min

planning: 5.572 ms
execution: 50.891 min

I/O read: 47.038 min
I/O write: 7.062 s



Shared buffers:

hits: 1153431 (~8.80 GiB) from the buffer pool
reads: 20885874 (~159.30 GiB) from the OS file cache, including disk I/O
dirtied: 122562 (~957.50 MiB)
writes: 116598 (~910.90 MiB)

Click to see the plan WITH index: ~20 ms.

Limit  (cost=0.72..233260.63 rows=1 width=4) (actual time=13.054..13.057 rows=1 loops=1)
   Buffers: shared read=7
   I/O Timings: read=12.828
   ->  Merge Join  (cost=0.72..233260.63 rows=1 width=4) (actual time=13.054..13.054 rows=1 loops=1)
         Filter: (issue_user_mentions.issue_id IS NULL)
         Rows Removed by Filter: 0
         Buffers: shared read=7
         I/O Timings: read=12.828
         ->  Index Only Scan using issue_notes_mentions_temp_index on public.notes  (cost=0.43..198211.72 rows=9676352 width=4) (actual time=6.977..6.977 rows=1 loops=1)
               Heap Fetches: 0
               Buffers: shared read=4
               I/O Timings: read=6.821
         ->  Index Scan using index_issue_user_mentions_on_note_id on public.issue_user_mentions  (cost=0.29..14865.42 rows=70149 width=8) (actual time=6.057..6.057 rows=1 loops=1)
               Buffers: shared read=3
               I/O Timings: read=6.007

Time: 20.940 ms

  • planning: 7.789 ms
  • execution: 13.151 ms
    • I/O read: 12.828 ms
    • I/O write: 0.000 ms

Shared buffers:

  • hits: 0 from the buffer pool
  • reads: 7 (~56.00 KiB) from the OS file cache, including disk I/O
  • dirtied: 0
  • writes: 0

Query 2: Get the max ID for a batch of 10K issue notes to be migrated.
explain SELECT notes.id FROM notes
LEFT JOIN issue_user_mentions ON notes.id = issue_user_mentions.note_id
WHERE (note LIKE '%@%' AND issue_user_mentions.issue_id IS NULL
AND notes.noteable_type = 'Issue') AND notes.id >= 5450000 ORDER BY notes.id ASC LIMIT 1 OFFSET 10000

Click to see the plan: Time: 236.930 ms

Limit  (cost=253244.25..506487.77 rows=1 width=4) (actual time=229.346..229.348 rows=1 loops=1)
   Buffers: shared hit=5115 read=225
   I/O Timings: read=218.214
   ->  Merge Join  (cost=0.72..253244.25 rows=1 width=4) (actual time=0.307..228.786 rows=10001 loops=1)
         Filter: (issue_user_mentions.issue_id IS NULL)
         Rows Removed by Filter: 0
         Buffers: shared hit=5115 read=225
         I/O Timings: read=218.214
         ->  Index Only Scan using issue_notes_mentions_temp_index on public.notes  (cost=0.43..218608.58 rows=9511208 width=4) (actual time=0.282..226.530 rows=10001 loops=1)
               Index Cond: (notes.id >= 5450000)
               Heap Fetches: 0
               Buffers: shared hit=5112 read=225
               I/O Timings: read=218.214
         ->  Index Scan using index_issue_user_mentions_on_note_id on public.issue_user_mentions  (cost=0.29..14865.42 rows=70149 width=8) (actual time=0.021..0.021 rows=1 loops=1)
               Buffers: shared hit=3
Time: 236.930 ms

planning: 7.497 ms
execution: 229.433 ms

I/O read: 218.214 ms
I/O write: 0.000 ms



Shared buffers:

hits: 5115 (~40.00 MiB) from the buffer pool
reads: 225 (~1.80 MiB) from the OS file cache, including disk I/O
dirtied: 0
writes: 0

Query 3: Get the min and max IDs for the issue mentions to be migrated
explain SELECT min(notes.id), max(notes.id) FROM notes
LEFT JOIN issue_user_mentions ON notes.id = issue_user_mentions.note_id
WHERE (note LIKE '%@%' AND issue_user_mentions.issue_id IS NULL
AND notes.noteable_type = 'Issue') AND notes.id >= 5450000 AND notes.id < 15450000

Click to see the plan: Time: 193.688 ms.

Aggregate  (cost=7911.40..7911.41 rows=1 width=8) (actual time=183.495..183.496 rows=1 loops=1)
   Buffers: shared hit=131342
   ->  Hash Left Join  (cost=4218.79..7911.40 rows=1 width=4) (actual time=24.989..157.330 rows=223363 loops=1)
         Hash Cond: (notes.id = issue_user_mentions.note_id)
         Filter: (issue_user_mentions.issue_id IS NULL)
         Rows Removed by Filter: 0
         Buffers: shared hit=131342
         ->  Index Only Scan using issue_notes_mentions_temp_index on public.notes  (cost=0.43..3219.34 rows=126245 width=4) (actual time=0.097..67.669 rows=223363 loops=1)
               Index Cond: ((notes.id >= 5450000) AND (notes.id < 15450000))
               Heap Fetches: 0
               Buffers: shared hit=130679
         ->  Hash  (cost=3341.49..3341.49 rows=70149 width=8) (actual time=24.079..24.079 rows=55724 loops=1)
               Buckets: 131072  Batches: 1  Memory Usage: 3201kB
               Buffers: shared hit=660
               ->  Seq Scan on public.issue_user_mentions  (cost=0.00..3341.49 rows=70149 width=8) (actual time=0.026..12.027 rows=70348 loops=1)
                     Buffers: shared hit=660
Time: 193.688 ms

planning: 9.987 ms
execution: 183.701 ms

I/O read: 0.000 ms
I/O write: 0.000 ms



Shared buffers:

hits: 131342 (~1.00 GiB) from the buffer pool
reads: 0 from the OS file cache, including disk I/O
dirtied: 0
writes: 0
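
Queries 1-3 together drive the batch scheduling. Below is a minimal plpgsql sketch of that loop; the real scheduler is application code, the 10K batch size and table/column names come from the queries above, and the RAISE NOTICE stands in for enqueueing a background job:

DO $$
DECLARE
  current_id bigint;
  batch_end  bigint;
BEGIN
  -- Query 1: start ID of the first note still to be migrated.
  SELECT notes.id INTO current_id
  FROM notes
  LEFT JOIN issue_user_mentions ON notes.id = issue_user_mentions.note_id
  WHERE note LIKE '%@%'
    AND issue_user_mentions.issue_id IS NULL
    AND notes.noteable_type = 'Issue'
  ORDER BY notes.id
  LIMIT 1;

  WHILE current_id IS NOT NULL LOOP
    -- Query 2: the ID just past the current 10K batch
    -- (NULL once fewer than 10K notes remain).
    SELECT notes.id INTO batch_end
    FROM notes
    LEFT JOIN issue_user_mentions ON notes.id = issue_user_mentions.note_id
    WHERE note LIKE '%@%'
      AND issue_user_mentions.issue_id IS NULL
      AND notes.noteable_type = 'Issue'
      AND notes.id >= current_id
    ORDER BY notes.id ASC
    LIMIT 1 OFFSET 10000;

    -- The real code would enqueue a background job for [current_id, batch_end);
    -- this sketch only logs the range.
    RAISE NOTICE 'scheduling batch % .. %', current_id, batch_end;
    current_id := batch_end;
  END LOOP;
END $$;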

Query 4: Pull the actual notes data for a given range of records
explain SELECT notes.* FROM notes
LEFT JOIN issue_user_mentions ON notes.id = issue_user_mentions.note_id
WHERE (note LIKE '%@%' AND issue_user_mentions.issue_id IS NULL
AND notes.noteable_type = 'Issue') AND notes.id >= 5450000 AND notes.id < 11730000

Click to see the plan: Time: 3.089 s.

Hash Join  (cost=7943.53..11548.10 rows=1 width=2460) (actual time=3071.085..3077.807 rows=10753 loops=1)
   Hash Cond: (issue_user_mentions.note_id = notes.id)
   Filter: (issue_user_mentions.issue_id IS NULL)
   Rows Removed by Filter: 0
   Buffers: shared hit=1981 read=6229
   I/O Timings: read=2968.413
   ->  Seq Scan on public.issue_user_mentions  (cost=0.00..3341.49 rows=70149 width=8) (actual time=0.030..12.875 rows=70348 loops=1)
         Buffers: shared hit=660
   ->  Hash  (cost=7821.66..7821.66 rows=9749 width=2460) (actual time=3035.438..3035.438 rows=10753 loops=1)
         Buckets: 16384  Batches: 1  Memory Usage: 3628kB
         Buffers: shared hit=1318 read=6229
         I/O Timings: read=2968.413
         ->  Index Scan using issue_notes_mentions_temp_index on public.notes  (cost=0.43..7821.66 rows=9749 width=2460) (actual time=2.487..3016.317 rows=10753 loops=1)
               Index Cond: ((notes.id >= 5450000) AND (notes.id < 11730000))
               Buffers: shared hit=1318 read=6229
               I/O Timings: read=2968.413
Time: 3.089 s

planning: 9.873 ms
execution: 3.079 s

I/O read: 2.968 s
I/O write: 0.000 ms



Shared buffers:

hits: 1981 (~15.50 MiB) from the buffer pool
reads: 6229 (~48.70 MiB) from the OS file cache, including disk I/O
dirtied: 0
writes: 0

Query 5: Pull notes data for a given range of records and filter out notes that point to non-existent issues
EXPLAIN SELECT notes.id FROM notes
INNER JOIN issues ON issues.id = notes.noteable_id WHERE notes.id IN (SELECT notes.id FROM notes
LEFT JOIN issue_user_mentions ON notes.id = issue_user_mentions.note_id
WHERE (note LIKE '%@%' AND issue_user_mentions.issue_id IS NULL
AND notes.noteable_type = 'Issue') AND notes.id >= 5450000 AND notes.id < 11730000)

Click to see the plan: Time: 1.630 s.

Nested Loop  (cost=3976.86..3982.83 rows=1 width=4) (actual time=61.879..1611.350 rows=10753 loops=1)
   Buffers: shared hit=101640 read=1248 dirtied=29
   I/O Timings: read=1388.652
   ->  Nested Loop  (cost=3976.42..3980.44 rows=1 width=8) (actual time=56.610..589.711 rows=10753 loops=1)
         Buffers: shared hit=59789 read=409
         I/O Timings: read=431.345
         ->  HashAggregate  (cost=3975.85..3975.86 rows=1 width=4) (actual time=53.358..56.886 rows=10753 loops=1)
               Group Key: notes_1.id
               Buffers: shared hit=6378
               ->  Hash Join  (cost=371.28..3975.85 rows=1 width=4) (actual time=45.325..48.211 rows=10753 loops=1)
                     Hash Cond: (issue_user_mentions.note_id = notes_1.id)
                     Filter: (issue_user_mentions.issue_id IS NULL)
                     Rows Removed by Filter: 0
                     Buffers: shared hit=6378
                     ->  Seq Scan on public.issue_user_mentions  (cost=0.00..3341.49 rows=70149 width=8) (actual time=0.093..13.385 rows=70348 loops=1)
                           Buffers: shared hit=660
                     ->  Hash  (cost=249.42..249.42 rows=9749 width=4) (actual time=11.846..11.847 rows=10753 loops=1)
                           Buckets: 16384  Batches: 1  Memory Usage: 507kB
                           Buffers: shared hit=5718
                           ->  Index Only Scan using issue_notes_mentions_temp_index on public.notes notes_1  (cost=0.43..249.42 rows=9749 width=4) (actual time=0.173..8.321 rows=10753 loops=1)
                                 Index Cond: ((notes_1.id >= 5450000) AND (notes_1.id < 11730000))
                                 Heap Fetches: 0
                                 Buffers: shared hit=5718
         ->  Index Scan using notes_pkey on public.notes  (cost=0.57..4.57 rows=1 width=8) (actual time=0.049..0.049 rows=1 loops=10753)
               Index Cond: (notes.id = notes_1.id)
               Buffers: shared hit=53411 read=409
               I/O Timings: read=431.345
   ->  Index Only Scan using issues_pkey on public.issues  (cost=0.44..2.38 rows=1 width=4) (actual time=0.094..0.094 rows=1 loops=10753)
         Index Cond: (issues.id = notes.noteable_id)
         Heap Fetches: 143
         Buffers: shared hit=41851 read=839 dirtied=29
         I/O Timings: read=957.307
Time: 1.630 s

planning: 16.775 ms
execution: 1.613 s

I/O read: 1.389 s
I/O write: 0.000 ms



Shared buffers:

hits: 101640 (~794.10 MiB) from the buffer pool
reads: 1248 (~9.80 MiB) from the OS file cache, including disk I/O
dirtied: 29 (~232.00 KiB)
writes: 0
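
The per-batch write itself is not shown above: each background job extracts the mentioned user IDs from the note text in application code and then inserts one issue_user_mentions row per note. A hypothetical single-row sketch (the mentioned_users_ids column name and the values are illustrative assumptions):

-- Hypothetical write for one migrated note; the real jobs insert these
-- rows in bulk after parsing the mentions out of each note.
INSERT INTO issue_user_mentions (issue_id, note_id, mentioned_users_ids)
VALUES (1, 5450123, '{42, 77}');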

Migration runtime estimates:

Issue notes mentions migration
  1. Concurrent index creation: ~45 <-> 140 mins? On the database-lab channel the index creation took ~45 mins, but it was not run concurrently; on production it will be considerably slower because we have to create the index concurrently.
  2. ~1796152 records in 10K batches results in ~180 background migration jobs. Scheduling a single job should take ~1s, so ~3-5 mins to schedule all jobs. The next estimate is a very rough approximation, as I was only able to run the jobs locally (for only ~500K user mentions from 1.5M notes), which showed a runtime of 3-5 mins per job with a 2 min delay between jobs, and running times between 2 and 4 mins towards the end of the migration. For production, given ~15M records, we should expect a considerable increase in job time towards the end of the migration.
  3. 180 jobs * 3 mins = 540 mins
  4. 180 jobs * 5 mins = 900 mins
  5. 180 jobs * 10 mins = 1800 mins

It would be great if we could run this on a comparably sized DB.

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • Label as security and @ mention @gitlab-com/gl-security/appsec
  • The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • Security reports checked/validated by a reviewer from the AppSec team

re #198337
