Resolve labels with duplicate title and group ID
What does this MR do?
Similar to !21384 (merged) but for group labels. The logic is basically identical but now we do it for group labels!
Strategy is in this thread: !21384 (comment 259835409) but for group labels.
- Remove all but one of identical group labels (that is, same everything, including colour, and with no relationships)
- For any remaining collections of duplicate title + group, rename the ones that are duplicates with the label ID appended, and customers can rename them again according to their own needs.
Migrations' SQL:
UP migration:
> m = RemoveDuplicateLabelsFromGroup.new
=> #<RemoveDuplicateLabelsFromGroup:0x000055e3017f7b98 @connection=nil, @name="RemoveDuplicateLabelsFromGroup", @version=nil>
[5] pry(main)> m.up
RemoveDuplicateLabelsFromGroup::Group Load (1.2ms) SELECT "namespaces"."id" FROM "namespaces" WHERE "namespaces"."type" = 'Group' ORDER BY "namespaces"."id" ASC LIMIT 1
RemoveDuplicateLabelsFromGroup::Group Load (0.4ms) SELECT "namespaces"."id" FROM "namespaces" WHERE "namespaces"."type" = 'Group' AND "namespaces"."id" >= 22 ORDER BY "namespaces"."id" ASC LIMIT 1 OFFSET 10000
(0.5ms) SELECT MIN(id), MAX(id) FROM "namespaces" WHERE "namespaces"."type" = 'Group' AND "namespaces"."id" >= 22
-- transaction()
(0.2ms) BEGIN
(2.0ms) WITH data AS ( SELECT labels.*, row_number() OVER (PARTITION BY labels.group_id, labels.title, labels.template, labels.description, labels.type, labels.color ORDER BY labels.id) AS row_number, 1 AS restore_action FROM labels WHERE labels.group_id BETWEEN 22 AND 52 AND NOT EXISTS (SELECT * FROM board_labels WHERE board_labels.label_id = labels.id) AND NOT EXISTS (SELECT * FROM label_links WHERE label_links.label_id = labels.id) AND NOT EXISTS (SELECT * FROM label_priorities WHERE label_priorities.label_id = labels.id) AND NOT EXISTS (SELECT * FROM lists WHERE lists.label_id = labels.id) AND NOT EXISTS (SELECT * FROM resource_label_events WHERE resource_label_events.label_id = labels.id) ) SELECT * FROM data WHERE row_number > 1;
(0.2ms) COMMIT
-> 0.0029s
-- transaction()
(0.1ms) BEGIN
(0.5ms) WITH data AS ( SELECT *, substring(title from 1 for 245 - length(id::text)) || '_duplicate' || id::text as new_title, 2 AS restore_action, row_number() OVER (PARTITION BY group_id, title ORDER BY id) AS row_number FROM labels WHERE group_id BETWEEN 22 AND 52 ) SELECT * FROM data WHERE row_number > 1;
(0.2ms) COMMIT
-> 0.0014s
=> nil
Down migration SQL
> m.down
RemoveDuplicateLabelsFromGroup::BackupLabel Load (0.9ms) SELECT "backup_labels"."id" FROM "backup_labels" WHERE (project_id IS NULL AND group_id IS NOT NULL) ORDER BY "backup_labels"."id" ASC LIMIT 1
=> nil
Migrations' logs:
== 20200716234259 RemoveDuplicateLabelsFromGroup: migrating ===================
-- transaction()
-> 0.0181s
-- transaction()
-> 0.0009s
== 20200716234259 RemoveDuplicateLabelsFromGroup: migrated (0.0740s) ==========
== 20200716234518 AddUniquenessIndexToLabelTitleAndGroup: migrating ===========
-- transaction_open?()
-> 0.0000s
-- index_exists?(:labels, [:group_id, :title], {:where=>"labels.project_id IS NULL", :unique=>true, :name=>"index_labels_on_group_id_and_title_unique", :algorithm=>:concurrently})
-> 0.0044s
-- add_index(:labels, [:group_id, :title], {:where=>"labels.project_id IS NULL", :unique=>true, :name=>"index_labels_on_group_id_and_title_unique", :algorithm=>:concurrently})
-> 0.0033s
-- transaction_open?()
-> 0.0000s
-- index_exists?(:labels, [:group_id, :title], {:name=>"index_labels_on_group_id_and_title", :algorithm=>:concurrently})
-> 0.0029s
-- remove_index(:labels, {:name=>"index_labels_on_group_id_and_title", :algorithm=>:concurrently, :column=>[:group_id, :title]})
-> 0.0046s
== 20200716234518 AddUniquenessIndexToLabelTitleAndGroup: migrated (0.0158s) ==
EXPLAIN plans
UP migrations
finding full duplicates (batched by 10000)
SQL
WITH data
AS (SELECT
labels.*,
ROW_NUMBER() OVER (PARTITION BY labels.group_id, labels.title, labels.template, labels.description, labels.type, labels.color ORDER BY labels.id) AS row_number,
1 AS restore_action
FROM labels
WHERE labels.group_id BETWEEN 25000 AND 35000
AND NOT EXISTS (SELECT
*
FROM board_labels
WHERE board_labels.label_id = labels.id)
AND NOT EXISTS (SELECT
*
FROM label_links
WHERE label_links.label_id = labels.id)
AND NOT EXISTS (SELECT
*
FROM label_priorities
WHERE label_priorities.label_id = labels.id)
AND NOT EXISTS (SELECT
*
FROM lists
WHERE lists.label_id = labels.id)
AND NOT EXISTS (SELECT
*
FROM resource_label_events
WHERE resource_label_events.label_id = labels.id))
SELECT
*
FROM data
WHERE row_number > 1;
EXPLAIN plan https://postgres.ai/console/gitlab/gitlab-production-tunnel/sessions/1651/commands/5526
CTE Scan on data (cost=739.62..741.85 rows=33 width=1173) (actual time=468.569..468.576 rows=0 loops=1)
Filter: (data.row_number > 1)
Rows Removed by Filter: 17
Buffers: shared hit=1026 read=349 dirtied=8
I/O Timings: read=458.749
CTE data
-> WindowAgg (cost=736.41..739.62 rows=99 width=99) (actual time=468.510..468.549 rows=17 loops=1)
Buffers: shared hit=1026 read=349 dirtied=8
I/O Timings: read=458.749
-> Sort (cost=736.41..736.65 rows=99 width=87) (actual time=468.487..468.495 rows=17 loops=1)
Sort Key: labels.group_id, labels.title, labels.template, labels.description, labels.type, labels.color, labels.id
Sort Method: quicksort Memory: 27kB
Buffers: shared hit=1026 read=349 dirtied=8
I/O Timings: read=458.749
-> Nested Loop Anti Join (cost=2.83..733.13 rows=99 width=87) (actual time=0.146..468.406 rows=17 loops=1)
Buffers: shared hit=1026 read=349 dirtied=8
I/O Timings: read=458.749
-> Nested Loop Anti Join (cost=2.54..558.58 rows=99 width=87) (actual time=0.129..465.746 rows=17 loops=1)
Buffers: shared hit=994 read=347 dirtied=8
I/O Timings: read=456.219
-> Nested Loop Anti Join (cost=2.13..380.28 rows=100 width=87) (actual time=0.108..462.132 rows=17 loops=1)
Buffers: shared hit=946 read=344 dirtied=8
I/O Timings: read=452.779
-> Nested Loop Anti Join (cost=1.56..313.96 rows=102 width=87) (actual time=0.082..431.576 rows=30 loops=1)
Buffers: shared hit=838 read=323 dirtied=8
I/O Timings: read=422.641
-> Nested Loop Anti Join (cost=0.99..251.96 rows=104 width=87) (actual time=0.058..224.633 rows=119 loops=1)
Buffers: shared hit=416 read=182 dirtied=3
I/O Timings: read=221.206
-> Index Scan using index_labels_on_group_id_and_project_id_and_title on public.labels (cost=0.56..146.09 rows=109 width=87) (actual time=0.023..147.187 rows=149 loops=1)
Index Cond: ((labels.group_id >= 25000) AND (labels.group_id <= 35000))
Buffers: shared hit=28 read=113 dirtied=2
I/O Timings: read=146.115
-> Index Only Scan using index_lists_on_label_id on public.lists (cost=0.43..2.00 rows=3 width=4) (actual time=0.515..0.515 rows=0 loops=149)
Index Cond: (lists.label_id = labels.id)
Heap Fetches: 3
Buffers: shared hit=388 read=69 dirtied=1
I/O Timings: read=75.091
-> Index Only Scan using index_resource_label_events_on_label_id_and_action on public.resource_label_events (cost=0.57..8.88 rows=300 width=4) (actual time=1.736..1.736 rows=1 loops=119)
Index Cond: (resource_label_events.label_id = labels.id)
Heap Fetches: 5
Buffers: shared hit=422 read=141 dirtied=3
I/O Timings: read=201.435
-> Index Only Scan using index_label_links_on_label_id_and_target_type on public.label_links (cost=0.57..36.69 rows=442 width=4) (actual time=1.016..1.016 rows=0 loops=30)
Index Cond: (label_links.label_id = labels.id)
Heap Fetches: 0
Buffers: shared hit=108 read=21
I/O Timings: read=30.137
-> Index Only Scan using index_label_priorities_on_label_id on public.label_priorities (cost=0.42..1.78 rows=1 width=4) (actual time=0.210..0.210 rows=0 loops=17)
Index Cond: (label_priorities.label_id = labels.id)
Heap Fetches: 0
Buffers: shared hit=48 read=3
I/O Timings: read=3.440
-> Index Only Scan using index_board_labels_on_label_id on public.board_labels (cost=0.29..1.75 rows=1 width=4) (actual time=0.154..0.154 rows=0 loops=17)
Index Cond: (board_labels.label_id = labels.id)
Heap Fetches: 0
Buffers: shared hit=32 read=2
I/O Timings: read=2.530
finding partial duplicates (batched by 10000)
SQL
WITH data AS (
SELECT *,
substring(title from 1 for 245 - length(id::text)) || '_duplicate' || id::text as new_title,
2 AS restore_action,
row_number() OVER (PARTITION BY group_id, title ORDER BY id) AS row_number
FROM labels
WHERE group_id BETWEEN 25000 AND 35000
AND project_id IS NULL
) SELECT * FROM data WHERE row_number > 1;
EXPLAIN plan https://postgres.ai/console/gitlab/gitlab-production-tunnel/sessions/1651/commands/5527
CTE Scan on data (cost=6.23..6.30 rows=1 width=1205) (actual time=6.923..6.925 rows=0 loops=1)
Filter: (data.row_number > 1)
Rows Removed by Filter: 149
Buffers: shared hit=136 read=4
I/O Timings: read=5.906
CTE data
-> WindowAgg (cost=6.10..6.23 rows=3 width=131) (actual time=6.478..6.800 rows=149 loops=1)
Buffers: shared hit=136 read=4
I/O Timings: read=5.906
-> Sort (cost=6.10..6.10 rows=3 width=87) (actual time=6.462..6.477 rows=149 loops=1)
Sort Key: labels.group_id, labels.title, labels.id
Sort Method: quicksort Memory: 54kB
Buffers: shared hit=136 read=4
I/O Timings: read=5.906
-> Index Scan using index_labels_on_group_id_and_title_with_null_project_id on public.labels (cost=0.42..6.07 rows=3 width=87) (actual time=4.874..6.344 rows=149 loops=1)
Index Cond: ((labels.group_id >= 25000) AND (labels.group_id <= 35000))
Buffers: shared hit=136 read=4
I/O Timings: read=5.906
DOWN migrations
restore full duplicates (batch of 1000)
SQL
INSERT INTO labels
SELECT id, title, color, group_id, created_at, updated_at, template, description, description_html, type, cached_markdown_version FROM backup_labels
WHERE backup_labels.id BETWEEN 25000 AND 25500
AND backup_labels.group_id IS NOT NULL
AND backup_labels.restore_action = 1
EXPLAIN plan https://postgres.ai/console/gitlab/gitlab-production-tunnel/sessions/1648/commands/5513
ModifyTable on public.labels (cost=0.27..3.29 rows=1 width=103) (actual time=0.006..0.006 rows=0 loops=1)
Buffers: shared hit=2
-> Index Scan using backup_labels_pkey on public.backup_labels (cost=0.27..3.29 rows=1 width=103) (actual time=0.005..0.005 rows=0 loops=1)
Index Cond: ((backup_labels.id >= 25000) AND (backup_labels.id <= 25500))
Filter: ((backup_labels.group_id IS NOT NULL) AND (backup_labels.restore_action = 1))
Rows Removed by Filter: 0
Buffers: shared hit=2
restoring partial duplicates (batch of 1000)
SQL
WITH backups AS (
SELECT id, title
FROM backup_labels
WHERE id BETWEEN 25000 AND 25500
AND group_id IS NOT NULL
AND restore_action = 2
) UPDATE labels SET title = backups.title
FROM backups
WHERE labels.id = backups.id;
EXPLAIN plan https://postgres.ai/console/gitlab/gitlab-production-tunnel/sessions/1648/commands/5514
ModifyTable on public.labels (cost=3.73..6.77 rows=1 width=1144) (actual time=0.021..0.023 rows=0 loops=1)
Buffers: shared hit=2
CTE backups
-> Index Scan using backup_labels_pkey on public.backup_labels (cost=0.27..3.29 rows=1 width=13) (actual time=0.019..0.019 rows=0 loops=1)
Index Cond: ((backup_labels.id >= 25000) AND (backup_labels.id <= 25500))
Filter: ((backup_labels.group_id IS NOT NULL) AND (backup_labels.restore_action = 2))
Rows Removed by Filter: 0
Buffers: shared hit=2
-> Nested Loop (cost=0.43..3.47 rows=1 width=1144) (actual time=0.021..0.021 rows=0 loops=1)
Buffers: shared hit=2
-> CTE Scan on backups (cost=0.00..0.02 rows=1 width=1064) (actual time=0.020..0.021 rows=0 loops=1)
Buffers: shared hit=2
-> Index Scan using labels_pkey on public.labels (cost=0.43..3.45 rows=1 width=84) (actual time=0.000..0.000 rows=0 loops=0)
Index Cond: (labels.id = backups.id)
Current situation (counts, etc)
As of 30 December 2020 (NZDT):
Finding full duplicates
Query string:
explain WITH data AS (
SELECT labels.*,
row_number() OVER (PARTITION BY labels.group_id, labels.title, labels.template, labels.description, labels.type, labels.color ORDER BY labels.id) AS row_number,
1 AS restore_action
FROM labels
WHERE labels.group_id IS NOT NULL
AND NOT EXISTS (SELECT * FROM board_labels WHERE board_labels.label_id = labels.id)
AND NOT EXISTS (SELECT * FROM label_links WHERE label_links.label_id = labels.id)
AND NOT EXISTS (SELECT * FROM label_priorities WHERE label_priorities.label_id = labels.id)
AND NOT EXISTS (SELECT * FROM lists WHERE lists.label_id = labels.id)
AND NOT EXISTS (SELECT * FROM resource_label_events WHERE resource_label_events.label_id = labels.id)
) SELECT * FROM data WHERE row_number > 1;
EXPLAIN (20 rows) - https://postgres.ai/console/gitlab/gitlab-production-tunnel/sessions/1648/commands/5515
CTE Scan on data (cost=524089.61..531661.67 rows=112179 width=1173) (actual time=61086.659..61245.647 rows=20 loops=1)
Filter: (data.row_number > 1)
Rows Removed by Filter: 67513
Buffers: shared hit=2005049 read=206782 dirtied=3954
I/O Timings: read=170287.331
CTE data
-> WindowAgg (cost=513152.19..524089.61 rows=336536 width=99) (actual time=61061.073..61183.455 rows=67533 loops=1)
Buffers: shared hit=2005049 read=206782 dirtied=3954
I/O Timings: read=170287.331
-> Sort (cost=513152.19..513993.53 rows=336536 width=87) (actual time=61061.034..61073.107 rows=67533 loops=1)
Sort Key: labels.group_id, labels.title, labels.template, labels.description, labels.type, labels.color, labels.id
Sort Method: quicksort Memory: 14977kB
Buffers: shared hit=2005049 read=206782 dirtied=3954
I/O Timings: read=170287.331
-> Gather (cost=241175.13..482257.51 rows=336536 width=87) (actual time=29608.865..60838.740 rows=67533 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=2005044 read=206782 dirtied=3954
I/O Timings: read=170287.331
-> Nested Loop Anti Join (cost=240175.13..447603.91 rows=140223 width=87) (actual time=29597.588..60838.929 rows=22511 loops=3)
Buffers: shared hit=2005044 read=206782 dirtied=3954
I/O Timings: read=170287.331
-> Nested Loop Anti Join (cost=240174.56..363885.78 rows=142173 width=87) (actual time=29586.309..52939.279 rows=28353 loops=3)
Buffers: shared hit=1683166 read=181475 dirtied=3650
I/O Timings: read=147851.112
-> Merge Anti Join (cost=240174.00..278646.45 rows=144652 width=87) (actual time=29584.412..32949.205 rows=71699 loops=3)
Merge Cond: (labels.id = board_labels.label_id)
Buffers: shared hit=777121 read=116496 dirtied=2978
I/O Timings: read=91446.576
-> Merge Anti Join (cost=240173.71..277727.27 rows=144805 width=87) (actual time=29581.415..32822.881 rows=72536 loops=3)
Merge Cond: (labels.id = label_priorities.label_id)
Buffers: shared hit=758924 read=116370 dirtied=2917
I/O Timings: read=91360.773
-> Merge Anti Join (cost=240173.29..274496.79 rows=146065 width=87) (actual time=29555.502..32605.304 rows=76710 loops=3)
Merge Cond: (labels.id = lists.label_id)
Buffers: shared hit=730482 read=115822 dirtied=2786
I/O Timings: read=91147.726
-> Sort (cost=240172.86..240554.75 rows=152758 width=87) (actual time=29441.243..29487.040 rows=123657 loops=3)
Sort Key: labels.id
Sort Method: quicksort Memory: 26157kB
Buffers: shared hit=225841 read=107043 dirtied=1139
I/O Timings: read=86089.752
-> Parallel Index Scan using index_labels_on_group_id_and_project_id_and_title on public.labels (cost=0.56..227019.72 rows=152758 width=87) (actual time=3.458..29285.461 rows=123657 loops=3)
Index Cond: (labels.group_id IS NOT NULL)
Buffers: shared hit=225829 read=107043 dirtied=1139
I/O Timings: read=86089.752
-> Index Only Scan using index_lists_on_label_id on public.lists (cost=0.43..98460.01 rows=4956465 width=4) (actual time=0.063..2872.978 rows=1497815 loops=3)
Heap Fetches: 391012
Buffers: shared hit=504641 read=8779 dirtied=1647
I/O Timings: read=5057.974
-> Index Only Scan using index_label_priorities_on_label_id on public.label_priorities (cost=0.42..2538.81 rows=125026 width=4) (actual time=0.059..168.848 rows=125994 loops=3)
Heap Fetches: 41291
Buffers: shared hit=28442 read=548 dirtied=131
I/O Timings: read=213.047
-> Index Only Scan using index_board_labels_on_label_id on public.board_labels (cost=0.29..512.43 rows=17319 width=4) (actual time=0.044..85.116 rows=17348 loops=3)
Heap Fetches: 26085
Buffers: shared hit=18197 read=126 dirtied=61
I/O Timings: read=85.803
-> Index Only Scan using index_resource_label_events_on_label_id_and_action on public.resource_label_events (cost=0.57..6.89 rows=300 width=4) (actual time=0.278..0.278 rows=1 loops=215097)
Index Cond: (resource_label_events.label_id = labels.id)
Heap Fetches: 936
Buffers: shared hit=906045 read=64979 dirtied=668
I/O Timings: read=56404.537
-> Index Only Scan using index_label_links_on_label_id_and_target_type on public.label_links (cost=0.57..9.76 rows=442 width=4) (actual time=0.277..0.277 rows=0 loops=85060)
Index Cond: (label_links.label_id = labels.id)
Heap Fetches: 675
Buffers: shared hit=321878 read=25307 dirtied=304
I/O Timings: read=22436.218
Finding partial duplicates
Query string
explain WITH data AS (
SELECT *,
substring(title from 1 for 245 - length(id::text)) || '_duplicate' || id::text as new_title,
2 AS restore_action,
row_number() OVER (PARTITION BY group_id, title ORDER BY id) AS row_number
FROM labels
WHERE group_id IS NOT NULL
) SELECT * FROM data WHERE row_number > 1;
EXPLAIN (689 rows - partial + full) https://postgres.ai/console/gitlab/gitlab-production-tunnel/sessions/1648/commands/5516
CTE Scan on data (cost=279538.97..287787.89 rows=122206 width=1205) (actual time=897.513..2178.448 rows=689 loops=1)
Filter: (data.row_number > 1)
Rows Removed by Filter: 370281
Buffers: shared hit=332148
CTE data
-> WindowAgg (cost=263041.11..279538.97 rows=366619 width=131) (actual time=884.745..1791.442 rows=370970 loops=1)
Buffers: shared hit=332148
-> Sort (cost=263041.11..263957.66 rows=366619 width=87) (actual time=884.721..928.598 rows=370970 loops=1)
Sort Key: labels.group_id, labels.title, labels.id
Sort Method: quicksort Memory: 78220kB
Buffers: shared hit=332148
-> Index Scan using index_labels_on_group_id_and_project_id_and_title on public.labels (cost=0.56..229158.33 rows=366619 width=87) (actual time=0.026..463.853 rows=370970 loops=1)
Index Cond: (labels.group_id IS NOT NULL)
Buffers: shared hit=332148
Screenshots
Does this MR meet the acceptance criteria?
Conformity
-
Changelog entry -
Documentation (if required) -
Code review guidelines -
Merge request performance guidelines -
Style guides -
Database guides -
Separation of EE specific content
Availability and Testing
-
Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process. -
Tested in all supported browsers -
Informed Infrastructure department of a default or new setting change, if applicable per definition of done
Security
If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:
-
Label as security and @ mention @gitlab-com/gl-security/appsec
-
The MR includes necessary changes to maintain consistency between UI, API, email, or other methods -
Security reports checked/validated by a reviewer from the AppSec team
Related to #30390 (closed) and #227752 (closed)
Edited by charlie ablett