Add an index on merge_trains that filters on status before target_branch
Summary
As noted in !128756 (comment 1504757814) and !123853 (comment 1450438093), the existing index on merge trains performs poorly for certain queries.
This is the index we have today:
CREATE INDEX index_for_status_per_branch_per_project ON merge_trains USING btree (target_project_id, target_branch, status);
And this is the query for which it performs poorly:
SELECT DISTINCT ON (target_branch) *
FROM "merge_trains"
WHERE
("merge_trains"."status" IN (0,2,3))
AND "merge_trains"."target_project_id" = ?
ORDER BY
"merge_trains"."target_branch" ASC,
"merge_trains"."id" ASC
The above query picks out the first active train car across all target_branch
for a given target_project_id
. The problem is that there are a large amount of inactive cars in each project, but only a small amount of active cars, and with the existing index we cannot effectively eliminate inactive cars early.
Expand to see an example query plan demonstrating the problem.
Unique (cost=3808.12..3808.13 rows=2 width=61) (actual time=47.384..47.387 rows=0 loops=1)
Buffers: shared hit=870 read=431 dirtied=2
I/O Timings: read=19.526 write=0.000
-> Sort (cost=3808.12..3808.13 rows=2 width=61) (actual time=47.383..47.385 rows=0 loops=1)
Sort Key: merge_trains.target_branch, merge_trains.id
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=870 read=431 dirtied=2
I/O Timings: read=19.526 write=0.000
-> Index Scan using index_for_status_per_branch_per_project on public.merge_trains (cost=0.43..3808.11 rows=2 width=61) (actual time=47.339..47.340 rows=0 loops=1)
Index Cond: ((merge_trains.target_project_id = 7764) AND (merge_trains.status = ANY ('{0,2,3}'::integer[])))
Buffers: shared hit=864 read=431 dirtied=2
I/O Timings: read=19.526 write=0.000
with stats:
Time: 49.708 ms
- planning: 2.179 ms
- execution: 47.529 ms
- I/O read: 19.526 ms
- I/O write: 0.000 ms
Shared buffers:
- hits: 870 (~6.80 MiB) from the buffer pool
- reads: 431 (~3.40 MiB) from the OS file cache, including disk I/O
- dirtied: 2 (~16.00 KiB)
- writes: 0
Improvements
A couple of possibilities (not mutually exclusive, we might want both):
Option 1: Add a new index that has status
before target_branch
:
CREATE INDEX index_merge_trains_on_target_project_id_and_status_and_target_branch ON merge_trains USING btree (target_project_id, status, target_branch);
Option 2: Create a partial index just for active train cars, since that's what we query for most of the time (and it will be much smaller in size):
CREATE INDEX index_for_active_merge_trains_in_projects_by_branch ON merge_trains USING btree (target_project_id, target_branch) WHERE (status IN (0,2,3));
Risks
Adding indexes can stress the database, but merge_trains
is probably fine because it's a premium feature and a very specific workflow that not everyone uses.