Migrate mentions for commit notes to DB table

What does this MR do?

  1. Cleans up `commit_user_mentions` records that have no actual mentions. Based on #database-lab data there are currently 0 such records, so this should be fast.

    explain DELETE FROM "commit_user_mentions"
    WHERE "commit_user_mentions"."mentioned_users_ids" IS NULL
    AND "commit_user_mentions"."mentioned_groups_ids" IS NULL
    AND "commit_user_mentions"."mentioned_projects_ids" IS NULL
    AND "commit_user_mentions"."id" >= 1 

    Click to see the plan.
    
      ModifyTable on public.commit_user_mentions  (cost=0.00..396.38 rows=85 width=6) (actual time=58.317..58.317 rows=0 loops=1)
         Buffers: shared hit=1 read=80 dirtied=14
         I/O Timings: read=18.096
         ->  Seq Scan on public.commit_user_mentions  (cost=0.00..396.38 rows=85 width=6) (actual time=58.288..58.288 rows=0 loops=1)
               Filter: ((commit_user_mentions.mentioned_users_ids IS NULL) AND (commit_user_mentions.mentioned_groups_ids IS NULL) AND (commit_user_mentions.mentioned_projects_ids IS NULL) AND (commit_user_mentions.id >= 1))
               Rows Removed by Filter: 5830
               Buffers: shared hit=1 read=80 dirtied=14
               I/O Timings: read=18.096
    Time: 58.535 ms
    - planning: 0.170 ms
    - execution: 58.365 ms
    - I/O read: 18.096 ms
    - I/O write: 0.000 ms
    Shared buffers:
    - hits: 1 (~8.00 KiB) from the buffer pool
    - reads: 80 (~640.00 KiB) from the OS file cache, including disk I/O
    - dirtied: 14 (~112.00 KiB)
    - writes: 0

  2. This MR contains migrations for commit notes mentions.

    These migrations are to be run after !19088 (merged) is confirmed to be working fine on production.
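For illustration, the cleanup in step 1 could also be generated as a series of batched DELETEs; the following is a plain-Ruby sketch only (the method name, batch size, and raw-SQL approach are hypothetical — the actual migration uses GitLab's migration helpers, and with 0 matching records a single unbatched DELETE is fast anyway):

```ruby
# Hypothetical sketch: build batched DELETE statements for the cleanup in
# step 1. Batch size and id range are illustrative only.
def cleanup_batches(min_id, max_id, batch_size = 10_000)
  (min_id..max_id).step(batch_size).map do |start_id|
    end_id = [start_id + batch_size - 1, max_id].min
    <<~SQL.gsub(/\s+/, " ").strip
      DELETE FROM commit_user_mentions
      WHERE mentioned_users_ids IS NULL
      AND mentioned_groups_ids IS NULL
      AND mentioned_projects_ids IS NULL
      AND id BETWEEN #{start_id} AND #{end_id}
    SQL
  end
end
```

Keeping each statement bounded to a small id range avoids holding locks for long on larger tables.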

UPDATE: the migration is on production and appears to be working.

This MR is based on the initial migration that adds the common code under `CreateResourceUserMention`, to be used by all mentions migrations for other models, including `Commit`.

I've pulled some estimated counts of the records we might be looking at from the Slack #database-lab channel.

Database benchmarks

Number of rows affected

  • Commit notes with mentions ~177788

    explain SELECT id FROM notes WHERE noteable_type = 'Commit' AND note LIKE '%@%'
    
     Index Only Scan using commit_mentions_temp_index on public.notes  (cost=0.42..31898.56 rows=986504 width=4) (actual time=0.203..5144.721 rows=177788 loops=1)
       Heap Fetches: 4133
       Buffers: shared hit=73401 read=4987 dirtied=216
       I/O Timings: read=5011.186

Migrate commit notes mentions

Query plans for the SQL queries used to migrate mentions for commit notes, based on #database-lab data.

  • The temporary index is now created as a single index for all noteable_types, rather than one per noteable_type.

    CREATE INDEX CONCURRENTLY mentions_in_notes_temp_index ON notes (id, noteable_type) WHERE note ~~ '%@%'::text

    \di+ mentions_in_notes_temp_index

     List of relations
      Schema |             Name             | Type  | Owner  | Table |  Size  | Description 
     --------+------------------------------+-------+--------+-------+--------+-------------
      public | mentions_in_notes_temp_index | index | gitlab | notes | 813 MB | 
     (1 row)
  • Query 1: Get the start ID for the commit notes to be migrated. Time: 3.836 ms.
    explain SELECT  notes.id FROM notes
    LEFT JOIN commit_user_mentions ON notes.id = commit_user_mentions.note_id
    WHERE (note LIKE '%@%' AND commit_user_mentions.commit_id IS NULL
    AND notes.noteable_type = 'Commit') ORDER BY notes.id LIMIT 1

    Click to see the plan WITH index. Time: 3.718 ms https://explain.depesz.com/s/iCWw
    
    Limit  (cost=0.70..24943.43 rows=1 width=4) (actual time=0.374..0.375 rows=1 loops=1)
       Buffers: shared hit=3 read=4
       I/O Timings: read=0.297
       ->  Merge Join  (cost=0.70..24943.43 rows=1 width=4) (actual time=0.373..0.373 rows=1 loops=1)
             Filter: (commit_user_mentions.commit_id IS NULL)
             Rows Removed by Filter: 0
             Buffers: shared hit=3 read=4
             I/O Timings: read=0.297
             ->  Index Only Scan using commit_mentions_temp_index on public.notes  (cost=0.42..21960.30 rows=959901 width=4) (actual time=0.361..0.361 rows=1 loops=1)
                   Heap Fetches: 0
                   Buffers: shared read=4
                   I/O Timings: read=0.297
             ->  Index Scan using index_commit_user_mentions_on_note_id on public.commit_user_mentions  (cost=0.28..600.06 rows=5335 width=45) (actual time=0.008..0.008 rows=1 loops=1)
                   Buffers: shared hit=3
    Time: 3.718 ms
    - planning: 3.319 ms
    - execution: 0.399 ms
    - I/O read: 0.297 ms
    - I/O write: 0.000 ms
    Shared buffers:
    - hits: 3 (~24.00 KiB) from the buffer pool
    - reads: 4 (~32.00 KiB) from the OS file cache, including disk I/O
    - dirtied: 0
    - writes: 0
    

  • Query 2: Get the max ID for the 10K-row batch of commit notes to be migrated.
    explain SELECT notes.id FROM notes
    LEFT JOIN commit_user_mentions ON notes.id = commit_user_mentions.note_id
    WHERE (note LIKE '%@%' AND commit_user_mentions.commit_id IS NULL
    AND notes.noteable_type = 'Commit') AND notes.id >= 5450000 ORDER BY notes.id ASC LIMIT 1 OFFSET 10000

    Click to see the plan. Time: 42.215 ms. https://explain.depesz.com/s/pnLr
    
    Limit  (cost=34861.24..69721.78 rows=1 width=4) (actual time=38.042..38.045 rows=1 loops=1)
       Buffers: shared hit=4604 read=639
       I/O Timings: read=28.204
       ->  Merge Join  (cost=0.70..34861.24 rows=1 width=4) (actual time=0.264..37.223 rows=10001 loops=1)
             Filter: (commit_user_mentions.commit_id IS NULL)
             Rows Removed by Filter: 0
             Buffers: shared hit=4604 read=639
             I/O Timings: read=28.204
             ->  Index Only Scan using commit_mentions_temp_index on public.notes  (cost=0.42..31864.66 rows=970639 width=4) (actual time=0.252..34.289 rows=10001 loops=1)
                   Index Cond: (notes.id >= 5450000)
                   Heap Fetches: 114
                   Buffers: shared hit=4601 read=639
                   I/O Timings: read=28.204
             ->  Index Scan using index_commit_user_mentions_on_note_id on public.commit_user_mentions  (cost=0.28..560.34 rows=3796 width=45) (actual time=0.008..0.008 rows=1 loops=1)
                   Buffers: shared hit=3
    Time: 42.215 ms
    - planning: 4.118 ms
    - execution: 38.097 ms
    - I/O read: 28.204 ms
    - I/O write: 0.000 ms
    Shared buffers:
    - hits: 4604 (~36.00 MiB) from the buffer pool
    - reads: 639 (~5.00 MiB) from the OS file cache, including disk I/O
    - dirtied: 0
    - writes: 0
    

  • Query 3: Get the min and max IDs for the commit notes mentions to be migrated
    explain SELECT min(notes.id), max(notes.id) FROM notes
    LEFT JOIN commit_user_mentions ON notes.id = commit_user_mentions.note_id
    WHERE (note LIKE '%@%' AND commit_user_mentions.commit_id IS NULL
    AND notes.noteable_type = 'Commit') AND notes.id >= 5450000 AND notes.id < 15450000

    Click to see the plan. Time: 168.444 ms. https://explain.depesz.com/s/YHW7
    
        Aggregate  (cost=8339.26..8339.27 rows=1 width=8) (actual time=165.263..165.263 rows=1 loops=1)
         Buffers: shared hit=2768 read=1714 dirtied=1
         I/O Timings: read=110.949
         ->  Hash Left Join  (cost=493.46..8339.26 rows=1 width=4) (actual time=3.737..163.955 rows=4269 loops=1)
               Hash Cond: (notes.id = commit_user_mentions.note_id)
               Filter: (commit_user_mentions.commit_id IS NULL)
               Rows Removed by Filter: 0
               Buffers: shared hit=2768 read=1714 dirtied=1
               I/O Timings: read=110.949
               ->  Index Only Scan using commit_mentions_temp_index on public.notes  (cost=0.56..7790.70 rows=14839 width=4) (actual time=1.193..158.082 rows=4269 loops=1)
                     Index Cond: ((notes.id >= 26730000) AND (notes.id < 31730000) AND (notes.noteable_type = 'Commit'::text))
                     Heap Fetches: 3
                     Buffers: shared hit=2680 read=1714 dirtied=1
                     I/O Timings: read=110.949
               ->  Hash  (cost=414.62..414.62 rows=6262 width=45) (actual time=2.531..2.531 rows=6299 loops=1)
                     Buckets: 8192  Batches: 1  Memory Usage: 538kB
                     Buffers: shared hit=88
                     ->  Seq Scan on public.commit_user_mentions  (cost=0.00..414.62 rows=6262 width=45) (actual time=0.010..1.299 rows=6299 loops=1)
                           Buffers: shared hit=88
    Time: 168.444 ms
    - planning: 3.128 ms
    - execution: 165.316 ms
    - I/O read: 110.949 ms
    - I/O write: 0.000 ms
    Shared buffers:
    - hits: 2768 (~21.60 MiB) from the buffer pool
    - reads: 1714 (~13.40 MiB) from the OS file cache, including disk I/O
    - dirtied: 1 (~8.00 KiB)
    - writes: 0

  • Query 4: Pull the actual notes data for a given range of records
    explain SELECT notes.* FROM "notes"
    LEFT JOIN commit_user_mentions ON notes.id = commit_user_mentions.note_id
    WHERE (note LIKE '%@%' AND commit_user_mentions.commit_id IS NULL
    AND notes.noteable_type = 'Commit') AND notes.id >= 5450000 AND notes.id < 21730000

    Click to see the plan. **Time: 12.889 s**. https://explain.depesz.com/s/0QJ
    
      Hash Join  (cost=2172.54..2610.66 rows=1 width=2467) (actual time=260.147..260.261 rows=157 loops=1)
         Hash Cond: (commit_user_mentions.note_id = notes.id)
         Filter: (commit_user_mentions.commit_id IS NULL)
         Rows Removed by Filter: 0
         Buffers: shared hit=5 read=290 dirtied=10
         I/O Timings: read=252.525
         ->  Seq Scan on public.commit_user_mentions  (cost=0.00..414.62 rows=6262 width=45) (actual time=0.015..23.113 rows=6299 loops=1)
               Buffers: shared hit=1 read=87 dirtied=10
               I/O Timings: read=20.928
         ->  Hash  (cost=2157.74..2157.74 rows=1184 width=2467) (actual time=236.084..236.084 rows=157 loops=1)
               Buckets: 2048  Batches: 1  Memory Usage: 146kB
               Buffers: shared hit=1 read=203
               I/O Timings: read=231.597
               ->  Index Scan using commit_mentions_temp_index on public.notes  (cost=0.56..2157.74 rows=1184 width=2467) (actual time=3.757..235.239 rows=157 loops=1)
                     Index Cond: ((notes.id >= 5450000) AND (notes.id < 11730000) AND ((notes.noteable_type)::text = 'Commit'::text))
                     Buffers: shared hit=1 read=203
                     I/O Timings: read=231.597
    Time: 12.889 s
    - planning: 3.111 ms
    - execution: 12.886 s
    - I/O read: 12.653 s
    - I/O write: 0.000 ms
    Shared buffers:
    - hits: 495 (~3.90 MiB) from the buffer pool
    - reads: 12663 (~98.90 MiB) from the OS file cache, including disk I/O
    - dirtied: 53 (~424.00 KiB)
    - writes: 0

explain SELECT notes.* FROM "notes"
  LEFT JOIN commit_user_mentions ON notes.id = commit_user_mentions.note_id
  WHERE (note LIKE '%@%' AND commit_user_mentions.commit_id IS NULL
  AND notes.noteable_type = 'Commit') AND notes.id >= 11800000 AND notes.id < 13000000

Click to see the plan for 1355 rows cold cache. **Time: 1.584 s**. https://explain.depesz.com/s/r9x8

Hash Left Join  (cost=601.12..7672.16 rows=1 width=2471) (actual time=6.620..1579.179 rows=1355 loops=1)
   Hash Cond: (notes.id = commit_user_mentions.note_id)
   Filter: (commit_user_mentions.commit_id IS NULL)
   Rows Removed by Filter: 0
   Buffers: shared hit=565 read=1247 dirtied=9
   I/O Timings: read=1542.118
   ->  Index Scan using mentions_in_notes_temp_index on public.notes  (cost=0.56..7057.16 rows=3849 width=2471) (actual time=2.391..1571.064 rows=1355 loops=1)
         Index Cond: ((notes.id >= 11800000) AND (notes.id < 13000000) AND ((notes.noteable_type)::text = 'Commit'::text))
         Buffers: shared hit=459 read=1247 dirtied=9
         I/O Timings: read=1542.118
   ->  Hash  (cost=502.47..502.47 rows=7847 width=45) (actual time=4.206..4.207 rows=7903 loops=1)
         Buckets: 8192  Batches: 1  Memory Usage: 659kB
         Buffers: shared hit=106
         ->  Seq Scan on public.commit_user_mentions  (cost=0.00..502.47 rows=7847 width=45) (actual time=0.013..1.935 rows=7903 loops=1)
               Buffers: shared hit=106
Time: 1.584 s
- planning: 3.810 ms
- execution: 1.580 s
- I/O read: 1.542 s
- I/O write: 0.000 ms
Shared buffers:
- hits: 565 (~4.40 MiB) from the buffer pool
- reads: 1247 (~9.70 MiB) from the OS file cache, including disk I/O
- dirtied: 9 (~72.00 KiB)
- writes: 0
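Taken together, queries 1-3 implement a standard batched-scheduling loop: find the first unmigrated note ID, stride forward in fixed-size steps, and schedule one job per (min_id, max_id) range. A plain-Ruby sketch of that loop over an in-memory, sorted list of matching note IDs (names and batch size are illustrative, not the actual migration code):

```ruby
# Illustrative sketch of the scheduling loop behind queries 1-3, run over an
# in-memory id list instead of the notes table.
def schedule_batches(note_ids, batch_size)
  note_ids.sort.each_slice(batch_size).map do |slice|
    # Queries 1/2 walk the ids via ORDER BY + LIMIT/OFFSET; query 3 then
    # takes the min and max id inside each stride to define the job's range.
    [slice.min, slice.max]
  end
end
```

For example, `schedule_batches([5, 1, 9, 3, 7, 11], 2)` yields `[[1, 3], [5, 7], [9, 11]]` — one background migration job per range.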

Migration Runtime estimates

[OLD] Commit notes mentions migration: 10K batches + 3-10 min jobs + index
  1. Concurrent index creation: ~55-120 mins? On the #database-lab channel the index creation took 55 mins, but it was not run concurrently; on production it will likely be slower because we have to create the index concurrently.
  2. ~177788 records in 10K batches results in ~18 background migration jobs. Since all the queries run well below 100 ms, scheduling a single job should take ~1 s, so ~20-30 secs to schedule all 18 jobs.

This next estimate is a very rough approximation, as I was only able to run the jobs locally. There, each job ran for 3-5 mins with a 2-min delay between jobs (for only ~500K user mentions from 1.5M notes), and running times stayed between 2 and 4 mins through to the end of the migration. Applying the same estimates to 18 jobs gives:

  1. 18 jobs * 3 mins = 54 mins
  2. 18 jobs * 5 mins = 90 mins
  3. 18 jobs * 10 mins = 180 mins
[Current] Commit notes mentions migration: 1K batches + 3-10 min jobs, no index
  1. the index will be pre-created in !26307 (merged) -> 0 mins
  2. ~177788 records in 1K batches results in ~180 jobs, scheduled to run 3 mins apart. Jobs should still all be scheduled in under 1 min.
  3. running jobs estimates
    1. 180 jobs * 3 mins = 540 mins -> probably the most plausible now that we have lowered the batch size and increased the scheduling delay
    2. 180 jobs * 5 mins = 900 mins -> less plausible; maybe possible for some jobs towards the end of the migration? 🤔
    3. 180 jobs * 10 mins = 1800 mins -> unlikely
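The job-count and elapsed-time arithmetic above can be double-checked in a few lines (all figures are the estimates quoted in this MR):

```ruby
# Recheck the scheduling estimates: ~177788 matching notes, 1K-row batches,
# jobs spaced 3 minutes apart.
RECORDS    = 177_788
BATCH_SIZE = 1_000
DELAY_MINS = 3

jobs        = (RECORDS.to_f / BATCH_SIZE).ceil  # 178, i.e. the "~180 jobs" above
elapsed_min = jobs * DELAY_MINS                 # 534 mins, if each job fits its 3-min slot
```

534 mins lines up with the ~540 mins "most plausible" estimate; jobs that overrun their 3-min slot push the total towards the 900-min case.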

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • Label as security and @ mention @gitlab-com/gl-security/appsec
  • The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • Security reports checked/validated by a reviewer from the AppSec team

re #198326

Edited by Coung Ngo
