Skip to content

Add two partial nonunique indexes

What does this MR do?

Add two partial non-unique indexes on labels for [project_id, title] and [group_id, title].

This is the first part of the strategy for #30390 (comment 254445189). By adding the indexes, we can then run the query on production to identify how many duplicate rows have been created due to this race condition.

Using #database-lab to run (only the [title project_id] index):

explain SELECT L1.project_id, L1.title 
FROM labels L1 
WHERE (SELECT COUNT(*) 
  FROM labels L2 
  WHERE L1.project_id=L2.project_id AND L1.title=L2.title)>1;
Query plan without index
Seq Scan on labels l1  (cost=0.00..76555885.70 rows=3308220 width=13)
  Filter: ((SubPlan 1) > 1)
  SubPlan 1
    ->  Aggregate  (cost=7.64..7.65 rows=1 width=8)
          ->  Index Scan using index_labels_on_project_id on labels l2  (cost=0.43..7.64 rows=1 width=0)
                Index Cond: (l1.project_id = project_id)
                Filter: ((l1.title)::text = (title)::text)
Execution summary without index (3.965 min)
Time: 3.965 min
  - planning: 1.421 ms
  - execution: 3.965 min
    - I/O read: 2.451 min
    - I/O write: 2.447 s

Shared buffers:
  - hits: 83687304 (~638.50 GiB) from the buffer pool
  - reads: 156844 (~1.20 GiB) from the OS file cache, including disk I/O
  - dirtied: 4148 (~32.40 MiB)
  - writes: 26587 (~207.70 MiB)

Having created the index:

CREATE INDEX CONCURRENTLY index_labels_on_project_id_and_title 
ON labels(project_id, title) 
WHERE labels.group_id = null;
Query plan with index
Seq Scan on public.labels l1  (cost=0.00..76597123.74 rows=3310013 width=13) (actual time=208.064..79341.877 rows=906 loops=1)
   Filter: ((SubPlan 1) > 1)
   Rows Removed by Filter: 9929134
   Buffers: shared hit=83843340 read=93
   I/O Timings: read=7.654
   SubPlan 1
     ->  Aggregate  (cost=7.64..7.65 rows=1 width=8) (actual time=0.007..0.007 rows=1 loops=9930040)
           Buffers: shared hit=83719214
           ->  Index Scan using index_labels_on_project_id on public.labels l2  (cost=0.43..7.64 rows=1 width=0) (actual time=0.005..0.007 rows=1 loops=9930040)
                 Index Cond: (l1.project_id = l2.project_id)
                 Filter: ((l1.title)::text = (l2.title)::text)
                 Rows Removed by Filter: 11
                 Buffers: shared hit=83719214
Execution summary with index (1.322 min)
Time: 1.322 min
  - planning: 1.044 ms
  - execution: 1.322 min
    - I/O read: 7.654 ms
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 83843340 (~639.70 GiB) from the buffer pool
  - reads: 93 (~744.00 KiB) from the OS file cache, including disk I/O
  - dirtied: 0
  - writes: 0

Still has a table scan and is using 639GB 😱 but since we are making a specific query that won't really be indexed again (rows with identical values in two columns) that might be unavoidable? 🤔

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Related to #30390 (closed)

Edited by Coung Ngo

Merge request reports