charlie ablett requested to merge 30390-duplicate-group-label into master Jul 16, 2020

What does this MR do?

Similar to !21384 (merged) but for group labels. The logic is basically identical but now we do it for group labels!

Strategy is in this thread: !21384 (comment 259835409) but for group labels. 🙂

Remove all but one of identical group labels (that is, same everything, including colour, and with no relationships)
For any remaining collections of duplicate title + group, rename the ones that are duplicates with the label ID appended, and customers can rename them again according to their own needs.

Migrations' SQL:

UP migration:

> m = RemoveDuplicateLabelsFromGroup.new
=> #<RemoveDuplicateLabelsFromGroup:0x000055e3017f7b98 @connection=nil, @name="RemoveDuplicateLabelsFromGroup", @version=nil>
[5] pry(main)> m.up
  RemoveDuplicateLabelsFromGroup::Group Load (1.2ms)  SELECT "namespaces"."id" FROM "namespaces" WHERE "namespaces"."type" = 'Group' ORDER BY "namespaces"."id" ASC LIMIT 1
  RemoveDuplicateLabelsFromGroup::Group Load (0.4ms)  SELECT "namespaces"."id" FROM "namespaces" WHERE "namespaces"."type" = 'Group' AND "namespaces"."id" >= 22 ORDER BY "namespaces"."id" ASC LIMIT 1 OFFSET 10000
   (0.5ms)  SELECT MIN(id), MAX(id) FROM "namespaces" WHERE "namespaces"."type" = 'Group' AND "namespaces"."id" >= 22
-- transaction()
   (0.2ms)  BEGIN
   (2.0ms)  WITH data AS ( SELECT labels.*, row_number() OVER (PARTITION BY labels.group_id, labels.title, labels.template, labels.description, labels.type, labels.color ORDER BY labels.id) AS row_number, 1 AS restore_action FROM labels WHERE labels.group_id BETWEEN 22 AND 52 AND NOT EXISTS (SELECT * FROM board_labels WHERE board_labels.label_id = labels.id) AND NOT EXISTS (SELECT * FROM label_links WHERE label_links.label_id = labels.id) AND NOT EXISTS (SELECT * FROM label_priorities WHERE label_priorities.label_id = labels.id) AND NOT EXISTS (SELECT * FROM lists WHERE lists.label_id = labels.id) AND NOT EXISTS (SELECT * FROM resource_label_events WHERE resource_label_events.label_id = labels.id) ) SELECT * FROM data WHERE row_number > 1;
   (0.2ms)  COMMIT
   -> 0.0029s
-- transaction()
   (0.1ms)  BEGIN
   (0.5ms)  WITH data AS ( SELECT *, substring(title from 1 for 245 - length(id::text)) || '_duplicate' || id::text as new_title, 2 AS restore_action, row_number() OVER (PARTITION BY group_id, title ORDER BY id) AS row_number FROM labels WHERE group_id BETWEEN 22 AND 52 ) SELECT * FROM data WHERE row_number > 1;
   (0.2ms)  COMMIT
   -> 0.0014s
=> nil

Down migration SQL

> m.down
  RemoveDuplicateLabelsFromGroup::BackupLabel Load (0.9ms)  SELECT "backup_labels"."id" FROM "backup_labels" WHERE (project_id IS NULL AND group_id IS NOT NULL) ORDER BY "backup_labels"."id" ASC LIMIT 1
=> nil

Migrations' logs:

== 20200716234259 RemoveDuplicateLabelsFromGroup: migrating ===================
-- transaction()
   -> 0.0181s
-- transaction()
   -> 0.0009s
== 20200716234259 RemoveDuplicateLabelsFromGroup: migrated (0.0740s) ==========

== 20200716234518 AddUniquenessIndexToLabelTitleAndGroup: migrating ===========
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:labels, [:group_id, :title], {:where=>"labels.project_id IS NULL", :unique=>true, :name=>"index_labels_on_group_id_and_title_unique", :algorithm=>:concurrently})
   -> 0.0044s
-- add_index(:labels, [:group_id, :title], {:where=>"labels.project_id IS NULL", :unique=>true, :name=>"index_labels_on_group_id_and_title_unique", :algorithm=>:concurrently})
   -> 0.0033s
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:labels, [:group_id, :title], {:name=>"index_labels_on_group_id_and_title", :algorithm=>:concurrently})
   -> 0.0029s
-- remove_index(:labels, {:name=>"index_labels_on_group_id_and_title", :algorithm=>:concurrently, :column=>[:group_id, :title]})
   -> 0.0046s
== 20200716234518 AddUniquenessIndexToLabelTitleAndGroup: migrated (0.0158s) ==

EXPLAIN plans

UP migrations

finding full duplicates (batched by 10000)

SQL

WITH data
AS (SELECT
  labels.*,
  ROW_NUMBER() OVER (PARTITION BY labels.group_id, labels.title, labels.template, labels.description, labels.type, labels.color ORDER BY labels.id) AS row_number,
  1 AS restore_action
FROM labels
WHERE labels.group_id BETWEEN 25000 AND 35000
AND NOT EXISTS (SELECT
  *
FROM board_labels
WHERE board_labels.label_id = labels.id)
AND NOT EXISTS (SELECT
  *
FROM label_links
WHERE label_links.label_id = labels.id)
AND NOT EXISTS (SELECT
  *
FROM label_priorities
WHERE label_priorities.label_id = labels.id)
AND NOT EXISTS (SELECT
  *
FROM lists
WHERE lists.label_id = labels.id)
AND NOT EXISTS (SELECT
  *
FROM resource_label_events
WHERE resource_label_events.label_id = labels.id))
SELECT
  *
FROM data
WHERE row_number > 1;

EXPLAIN plan https://postgres.ai/console/gitlab/gitlab-production-tunnel/sessions/1651/commands/5526

 CTE Scan on data  (cost=739.62..741.85 rows=33 width=1173) (actual time=468.569..468.576 rows=0 loops=1)

   Filter: (data.row_number > 1)

   Rows Removed by Filter: 17

   Buffers: shared hit=1026 read=349 dirtied=8

   I/O Timings: read=458.749

   CTE data

     ->  WindowAgg  (cost=736.41..739.62 rows=99 width=99) (actual time=468.510..468.549 rows=17 loops=1)

           Buffers: shared hit=1026 read=349 dirtied=8

           I/O Timings: read=458.749

           ->  Sort  (cost=736.41..736.65 rows=99 width=87) (actual time=468.487..468.495 rows=17 loops=1)

                 Sort Key: labels.group_id, labels.title, labels.template, labels.description, labels.type, labels.color, labels.id

                 Sort Method: quicksort  Memory: 27kB

                 Buffers: shared hit=1026 read=349 dirtied=8

                 I/O Timings: read=458.749

                 ->  Nested Loop Anti Join  (cost=2.83..733.13 rows=99 width=87) (actual time=0.146..468.406 rows=17 loops=1)

                       Buffers: shared hit=1026 read=349 dirtied=8

                       I/O Timings: read=458.749

                       ->  Nested Loop Anti Join  (cost=2.54..558.58 rows=99 width=87) (actual time=0.129..465.746 rows=17 loops=1)

                             Buffers: shared hit=994 read=347 dirtied=8

                             I/O Timings: read=456.219

                             ->  Nested Loop Anti Join  (cost=2.13..380.28 rows=100 width=87) (actual time=0.108..462.132 rows=17 loops=1)

                                   Buffers: shared hit=946 read=344 dirtied=8

                                   I/O Timings: read=452.779

                                   ->  Nested Loop Anti Join  (cost=1.56..313.96 rows=102 width=87) (actual time=0.082..431.576 rows=30 loops=1)

                                         Buffers: shared hit=838 read=323 dirtied=8

                                         I/O Timings: read=422.641

                                         ->  Nested Loop Anti Join  (cost=0.99..251.96 rows=104 width=87) (actual time=0.058..224.633 rows=119 loops=1)

                                               Buffers: shared hit=416 read=182 dirtied=3

                                               I/O Timings: read=221.206

                                               ->  Index Scan using index_labels_on_group_id_and_project_id_and_title on public.labels  (cost=0.56..146.09 rows=109 width=87) (actual time=0.023..147.187 rows=149 loops=1)

                                                     Index Cond: ((labels.group_id >= 25000) AND (labels.group_id <= 35000))

                                                     Buffers: shared hit=28 read=113 dirtied=2

                                                     I/O Timings: read=146.115

                                               ->  Index Only Scan using index_lists_on_label_id on public.lists  (cost=0.43..2.00 rows=3 width=4) (actual time=0.515..0.515 rows=0 loops=149)

                                                     Index Cond: (lists.label_id = labels.id)

                                                     Heap Fetches: 3

                                                     Buffers: shared hit=388 read=69 dirtied=1

                                                     I/O Timings: read=75.091

                                         ->  Index Only Scan using index_resource_label_events_on_label_id_and_action on public.resource_label_events  (cost=0.57..8.88 rows=300 width=4) (actual time=1.736..1.736 rows=1 loops=119)

                                               Index Cond: (resource_label_events.label_id = labels.id)

                                               Heap Fetches: 5

                                               Buffers: shared hit=422 read=141 dirtied=3

                                               I/O Timings: read=201.435

                                   ->  Index Only Scan using index_label_links_on_label_id_and_target_type on public.label_links  (cost=0.57..36.69 rows=442 width=4) (actual time=1.016..1.016 rows=0 loops=30)

                                         Index Cond: (label_links.label_id = labels.id)

                                         Heap Fetches: 0

                                         Buffers: shared hit=108 read=21

                                         I/O Timings: read=30.137

                             ->  Index Only Scan using index_label_priorities_on_label_id on public.label_priorities  (cost=0.42..1.78 rows=1 width=4) (actual time=0.210..0.210 rows=0 loops=17)

                                   Index Cond: (label_priorities.label_id = labels.id)

                                   Heap Fetches: 0

                                   Buffers: shared hit=48 read=3

                                   I/O Timings: read=3.440

                       ->  Index Only Scan using index_board_labels_on_label_id on public.board_labels  (cost=0.29..1.75 rows=1 width=4) (actual time=0.154..0.154 rows=0 loops=17)

                             Index Cond: (board_labels.label_id = labels.id)

                             Heap Fetches: 0

                             Buffers: shared hit=32 read=2

                             I/O Timings: read=2.530

finding partial duplicates (batched by 10000)

SQL

WITH data AS (
SELECT *,
substring(title from 1 for 245 - length(id::text)) || '_duplicate' || id::text as new_title,
2 AS restore_action,
row_number() OVER (PARTITION BY group_id, title ORDER BY id) AS row_number
FROM labels
WHERE group_id BETWEEN 25000 AND 35000
AND project_id IS NULL
) SELECT * FROM data WHERE row_number > 1;

EXPLAIN plan https://postgres.ai/console/gitlab/gitlab-production-tunnel/sessions/1651/commands/5527

 CTE Scan on data  (cost=6.23..6.30 rows=1 width=1205) (actual time=6.923..6.925 rows=0 loops=1)

   Filter: (data.row_number > 1)

   Rows Removed by Filter: 149

   Buffers: shared hit=136 read=4

   I/O Timings: read=5.906

   CTE data

     ->  WindowAgg  (cost=6.10..6.23 rows=3 width=131) (actual time=6.478..6.800 rows=149 loops=1)

           Buffers: shared hit=136 read=4

           I/O Timings: read=5.906

           ->  Sort  (cost=6.10..6.10 rows=3 width=87) (actual time=6.462..6.477 rows=149 loops=1)

                 Sort Key: labels.group_id, labels.title, labels.id

                 Sort Method: quicksort  Memory: 54kB

                 Buffers: shared hit=136 read=4

                 I/O Timings: read=5.906

                 ->  Index Scan using index_labels_on_group_id_and_title_with_null_project_id on public.labels  (cost=0.42..6.07 rows=3 width=87) (actual time=4.874..6.344 rows=149 loops=1)

                       Index Cond: ((labels.group_id >= 25000) AND (labels.group_id <= 35000))

                       Buffers: shared hit=136 read=4

                       I/O Timings: read=5.906

DOWN migrations

restore full duplicates (batch of 1000)

SQL

INSERT INTO labels
SELECT id, title, color, group_id, created_at, updated_at, template, description, description_html, type, cached_markdown_version FROM backup_labels
  WHERE backup_labels.id BETWEEN 25000 AND 25500
  AND backup_labels.group_id IS NOT NULL
  AND backup_labels.restore_action = 1

EXPLAIN plan https://postgres.ai/console/gitlab/gitlab-production-tunnel/sessions/1648/commands/5513

 ModifyTable on public.labels  (cost=0.27..3.29 rows=1 width=103) (actual time=0.006..0.006 rows=0 loops=1)

   Buffers: shared hit=2

   ->  Index Scan using backup_labels_pkey on public.backup_labels  (cost=0.27..3.29 rows=1 width=103) (actual time=0.005..0.005 rows=0 loops=1)

         Index Cond: ((backup_labels.id >= 25000) AND (backup_labels.id <= 25500))

         Filter: ((backup_labels.group_id IS NOT NULL) AND (backup_labels.restore_action = 1))

         Rows Removed by Filter: 0

         Buffers: shared hit=2

restoring partial duplicates (batch of 1000)

SQL

WITH backups AS (
  SELECT id, title
  FROM backup_labels
  WHERE id BETWEEN 25000 AND 25500
  AND group_id IS NOT NULL
  AND restore_action = 2
) UPDATE labels SET title = backups.title
FROM backups
WHERE labels.id = backups.id;

EXPLAIN plan https://postgres.ai/console/gitlab/gitlab-production-tunnel/sessions/1648/commands/5514

 ModifyTable on public.labels  (cost=3.73..6.77 rows=1 width=1144) (actual time=0.021..0.023 rows=0 loops=1)

   Buffers: shared hit=2

   CTE backups

     ->  Index Scan using backup_labels_pkey on public.backup_labels  (cost=0.27..3.29 rows=1 width=13) (actual time=0.019..0.019 rows=0 loops=1)

           Index Cond: ((backup_labels.id >= 25000) AND (backup_labels.id <= 25500))

           Filter: ((backup_labels.group_id IS NOT NULL) AND (backup_labels.restore_action = 2))

           Rows Removed by Filter: 0

           Buffers: shared hit=2

   ->  Nested Loop  (cost=0.43..3.47 rows=1 width=1144) (actual time=0.021..0.021 rows=0 loops=1)

         Buffers: shared hit=2

         ->  CTE Scan on backups  (cost=0.00..0.02 rows=1 width=1064) (actual time=0.020..0.021 rows=0 loops=1)

               Buffers: shared hit=2

         ->  Index Scan using labels_pkey on public.labels  (cost=0.43..3.45 rows=1 width=84) (actual time=0.000..0.000 rows=0 loops=0)

               Index Cond: (labels.id = backups.id)

Current situation (counts, etc)

As of 30 December 2020 (NZDT):

Finding full duplicates

Query string:

explain WITH data AS (
  SELECT labels.*,
  row_number() OVER (PARTITION BY labels.group_id, labels.title, labels.template, labels.description, labels.type, labels.color ORDER BY labels.id) AS row_number,
  1 AS restore_action
  FROM labels
  WHERE labels.group_id IS NOT NULL
  AND NOT EXISTS (SELECT * FROM board_labels WHERE board_labels.label_id = labels.id)
  AND NOT EXISTS (SELECT * FROM label_links WHERE label_links.label_id = labels.id)
  AND NOT EXISTS (SELECT * FROM label_priorities WHERE label_priorities.label_id = labels.id)
  AND NOT EXISTS (SELECT * FROM lists WHERE lists.label_id = labels.id)
  AND NOT EXISTS (SELECT * FROM resource_label_events WHERE resource_label_events.label_id = labels.id)
) SELECT * FROM data WHERE row_number > 1;

EXPLAIN (20 rows) - https://postgres.ai/console/gitlab/gitlab-production-tunnel/sessions/1648/commands/5515

 CTE Scan on data  (cost=524089.61..531661.67 rows=112179 width=1173) (actual time=61086.659..61245.647 rows=20 loops=1)

   Filter: (data.row_number > 1)

   Rows Removed by Filter: 67513

   Buffers: shared hit=2005049 read=206782 dirtied=3954

   I/O Timings: read=170287.331

   CTE data

     ->  WindowAgg  (cost=513152.19..524089.61 rows=336536 width=99) (actual time=61061.073..61183.455 rows=67533 loops=1)

           Buffers: shared hit=2005049 read=206782 dirtied=3954

           I/O Timings: read=170287.331

           ->  Sort  (cost=513152.19..513993.53 rows=336536 width=87) (actual time=61061.034..61073.107 rows=67533 loops=1)

                 Sort Key: labels.group_id, labels.title, labels.template, labels.description, labels.type, labels.color, labels.id

                 Sort Method: quicksort  Memory: 14977kB

                 Buffers: shared hit=2005049 read=206782 dirtied=3954

                 I/O Timings: read=170287.331

                 ->  Gather  (cost=241175.13..482257.51 rows=336536 width=87) (actual time=29608.865..60838.740 rows=67533 loops=1)

                       Workers Planned: 2

                       Workers Launched: 2

                       Buffers: shared hit=2005044 read=206782 dirtied=3954

                       I/O Timings: read=170287.331

                       ->  Nested Loop Anti Join  (cost=240175.13..447603.91 rows=140223 width=87) (actual time=29597.588..60838.929 rows=22511 loops=3)

                             Buffers: shared hit=2005044 read=206782 dirtied=3954

                             I/O Timings: read=170287.331

                             ->  Nested Loop Anti Join  (cost=240174.56..363885.78 rows=142173 width=87) (actual time=29586.309..52939.279 rows=28353 loops=3)

                                   Buffers: shared hit=1683166 read=181475 dirtied=3650

                                   I/O Timings: read=147851.112

                                   ->  Merge Anti Join  (cost=240174.00..278646.45 rows=144652 width=87) (actual time=29584.412..32949.205 rows=71699 loops=3)

                                         Merge Cond: (labels.id = board_labels.label_id)

                                         Buffers: shared hit=777121 read=116496 dirtied=2978

                                         I/O Timings: read=91446.576

                                         ->  Merge Anti Join  (cost=240173.71..277727.27 rows=144805 width=87) (actual time=29581.415..32822.881 rows=72536 loops=3)

                                               Merge Cond: (labels.id = label_priorities.label_id)

                                               Buffers: shared hit=758924 read=116370 dirtied=2917

                                               I/O Timings: read=91360.773

                                               ->  Merge Anti Join  (cost=240173.29..274496.79 rows=146065 width=87) (actual time=29555.502..32605.304 rows=76710 loops=3)

                                                     Merge Cond: (labels.id = lists.label_id)

                                                     Buffers: shared hit=730482 read=115822 dirtied=2786

                                                     I/O Timings: read=91147.726

                                                     ->  Sort  (cost=240172.86..240554.75 rows=152758 width=87) (actual time=29441.243..29487.040 rows=123657 loops=3)

                                                           Sort Key: labels.id

                                                           Sort Method: quicksort  Memory: 26157kB

                                                           Buffers: shared hit=225841 read=107043 dirtied=1139

                                                           I/O Timings: read=86089.752

                                                           ->  Parallel Index Scan using index_labels_on_group_id_and_project_id_and_title on public.labels  (cost=0.56..227019.72 rows=152758 width=87) (actual time=3.458..29285.461 rows=123657 loops=3)

                                                                 Index Cond: (labels.group_id IS NOT NULL)

                                                                 Buffers: shared hit=225829 read=107043 dirtied=1139

                                                                 I/O Timings: read=86089.752

                                                     ->  Index Only Scan using index_lists_on_label_id on public.lists  (cost=0.43..98460.01 rows=4956465 width=4) (actual time=0.063..2872.978 rows=1497815 loops=3)

                                                           Heap Fetches: 391012

                                                           Buffers: shared hit=504641 read=8779 dirtied=1647

                                                           I/O Timings: read=5057.974

                                               ->  Index Only Scan using index_label_priorities_on_label_id on public.label_priorities  (cost=0.42..2538.81 rows=125026 width=4) (actual time=0.059..168.848 rows=125994 loops=3)

                                                     Heap Fetches: 41291

                                                     Buffers: shared hit=28442 read=548 dirtied=131

                                                     I/O Timings: read=213.047

                                         ->  Index Only Scan using index_board_labels_on_label_id on public.board_labels  (cost=0.29..512.43 rows=17319 width=4) (actual time=0.044..85.116 rows=17348 loops=3)

                                               Heap Fetches: 26085

                                               Buffers: shared hit=18197 read=126 dirtied=61

                                               I/O Timings: read=85.803

                                   ->  Index Only Scan using index_resource_label_events_on_label_id_and_action on public.resource_label_events  (cost=0.57..6.89 rows=300 width=4) (actual time=0.278..0.278 rows=1 loops=215097)

                                         Index Cond: (resource_label_events.label_id = labels.id)

                                         Heap Fetches: 936

                                         Buffers: shared hit=906045 read=64979 dirtied=668

                                         I/O Timings: read=56404.537

                             ->  Index Only Scan using index_label_links_on_label_id_and_target_type on public.label_links  (cost=0.57..9.76 rows=442 width=4) (actual time=0.277..0.277 rows=0 loops=85060)

                                   Index Cond: (label_links.label_id = labels.id)

                                   Heap Fetches: 675

                                   Buffers: shared hit=321878 read=25307 dirtied=304

                                   I/O Timings: read=22436.218

Finding partial duplicates

Query string

explain WITH data AS (
SELECT *,
substring(title from 1 for 245 - length(id::text)) || '_duplicate' || id::text as new_title,
2 AS restore_action,
row_number() OVER (PARTITION BY group_id, title ORDER BY id) AS row_number
FROM labels
WHERE group_id IS NOT NULL
) SELECT * FROM data WHERE row_number > 1;

EXPLAIN (689 rows - partial + full) https://postgres.ai/console/gitlab/gitlab-production-tunnel/sessions/1648/commands/5516

 CTE Scan on data  (cost=279538.97..287787.89 rows=122206 width=1205) (actual time=897.513..2178.448 rows=689 loops=1)

   Filter: (data.row_number > 1)

   Rows Removed by Filter: 370281

   Buffers: shared hit=332148

   CTE data

     ->  WindowAgg  (cost=263041.11..279538.97 rows=366619 width=131) (actual time=884.745..1791.442 rows=370970 loops=1)

           Buffers: shared hit=332148

           ->  Sort  (cost=263041.11..263957.66 rows=366619 width=87) (actual time=884.721..928.598 rows=370970 loops=1)

                 Sort Key: labels.group_id, labels.title, labels.id

                 Sort Method: quicksort  Memory: 78220kB

                 Buffers: shared hit=332148

                 ->  Index Scan using index_labels_on_group_id_and_project_id_and_title on public.labels  (cost=0.56..229158.33 rows=366619 width=87) (actual time=0.026..463.853 rows=370970 loops=1)

                       Index Cond: (labels.group_id IS NOT NULL)

                       Buffers: shared hit=332148

Screenshots

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
Tested in all supported browsers
Informed Infrastructure department of a default or new setting change, if applicable per definition of done

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

Label as security and @ mention @gitlab-com/gl-security/appsec
The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
Security reports checked/validated by a reviewer from the AppSec team

Related to #30390 (closed) and #227752 (closed)

Edited Dec 30, 2020 by charlie ablett

Resolve labels with duplicate title and group ID