Add an index on merge_trains that filters on status before target_branch

Summary

As noted in !128756 (comment 1504757814) and !123853 (comment 1450438093), the existing index on merge trains performs poorly for certain queries.

This is the index we have today:

CREATE INDEX index_for_status_per_branch_per_project ON merge_trains USING btree (target_project_id, target_branch, status);

And this is the query for which it performs poorly:

SELECT DISTINCT ON (target_branch) * 
FROM "merge_trains" 
WHERE 
  ("merge_trains"."status" IN (0,2,3)) 
  AND "merge_trains"."target_project_id" = ? 
ORDER BY 
  "merge_trains"."target_branch" ASC, 
  "merge_trains"."id" ASC

The above query picks out the first active train car across all target_branch for a given target_project_id. The problem is that there are a large amount of inactive cars in each project, but only a small amount of active cars, and with the existing index we cannot effectively eliminate inactive cars early.

Expand to see an example query plan demonstrating the problem.

 Unique  (cost=3808.12..3808.13 rows=2 width=61) (actual time=47.384..47.387 rows=0 loops=1)
   Buffers: shared hit=870 read=431 dirtied=2
   I/O Timings: read=19.526 write=0.000
   ->  Sort  (cost=3808.12..3808.13 rows=2 width=61) (actual time=47.383..47.385 rows=0 loops=1)
         Sort Key: merge_trains.target_branch, merge_trains.id
         Sort Method: quicksort  Memory: 25kB
         Buffers: shared hit=870 read=431 dirtied=2
         I/O Timings: read=19.526 write=0.000
         ->  Index Scan using index_for_status_per_branch_per_project on public.merge_trains  (cost=0.43..3808.11 rows=2 width=61) (actual time=47.339..47.340 rows=0 loops=1)
               Index Cond: ((merge_trains.target_project_id = 7764) AND (merge_trains.status = ANY ('{0,2,3}'::integer[])))
               Buffers: shared hit=864 read=431 dirtied=2
               I/O Timings: read=19.526 write=0.000

with stats:

Time: 49.708 ms
  - planning: 2.179 ms
  - execution: 47.529 ms
    - I/O read: 19.526 ms
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 870 (~6.80 MiB) from the buffer pool
  - reads: 431 (~3.40 MiB) from the OS file cache, including disk I/O
  - dirtied: 2 (~16.00 KiB)
  - writes: 0

Improvements

A couple of possibilities (not mutually exclusive, we might want both):

Option 1: Add a new index that has status before target_branch:

CREATE INDEX index_merge_trains_on_target_project_id_and_status_and_target_branch ON merge_trains USING btree (target_project_id, status, target_branch);

Option 2: Create a partial index just for active train cars, since that's what we query for most of the time (and it will be much smaller in size):

CREATE INDEX index_for_active_merge_trains_in_projects_by_branch ON merge_trains USING btree (target_project_id, target_branch) WHERE (status IN (0,2,3));

Risks

Adding indexes can stress the database, but merge_trains is probably fine because it's a premium feature and a very specific workflow that not everyone uses.

Add an index on merge_trains that filters on status before target_branch

Summary

Improvements

Risks

Involved components

Optional: Intended side effects

Optional: Missing test coverage