Skip to content

Optimize top-bound lineage search to use index-only scan

Brian Williams requested to merge bwill/optimize-top-bound-lineage-search into master

What does this MR do and why?

During backend pair programming (timestamp 15:54), I demonstrated a method of reading a namespace hierarchy in a linear manner by using >= and < operators on traversal_ids arrays. It was noted that this is different from the current query which uses @>, and I kinda glossed over it saying that they do the same thing. While both methods do have the same outcome, they actually achieve them in different ways and I was curious which one was better. So, I tested both methods in DB lab and compared them. >= and < can work with b-tree indices, while @> must use a gin index. When we use a b-tree index, we can perform an index-only scan which saves us a few MB of memory.

How the current method (@>) works

The current method uses the clause WHERE traversal_ids @> '{namespace_id}'. This means "find all traversal_ids that contain namespace_id". It's equivalent to the following ruby code:

namespaces.find { |namespace| namespace.traversal_ids.include?(id) }

Given a hierarchy like the following:

 id  | traversal_ids 
-----+---------------
 108 | {108}
 109 | {108,109}
 110 | {108,109,110}
 111 | {108,111}
 112 | {108,111,112}
 113 | {108,113}
 114 | {108,113,114}

To find all namespaces under 108, we look for arrays containing 108. To find namespaces under 109, we look for arrays containing 109, and so on.

This works because each namespace can only have one parent and so we don't need to worry about something like {999, 109} appearing in the table. However, it means that we need to search the whole traveral_ids array for each row.

How the new method (>= AND <) works

The new method uses the clause WHERE traversal_ids >= '{traversal_ids}' AND traversal_ids < {traversal_ids_with_rightmost_number_incremented}. Postgres knows how to compare arrays and will do so by comparing each item from left-to-right. So, when we consider a table like this:

 id  | traversal_ids 
-----+---------------
 81  | {81}
 83  | {81,83}
 85  | {81,83,85}
 87  | {81,83,84,87}
 88  | {81,83,85,88}
 84  | {81,83,84}
 108 | {108}
 109 | {108,109}
 110 | {108,109,110}
 111 | {108,111}
 112 | {108,111,112}
 113 | {108,113}
 114 | {108,113,114}

If we SELECT * FROM namespaces WHERE traversal_ids >= '{108}' AND traversal_ids < '{109}'; Postgres will check the first element of every array in the traversal_ids columns and read it if it is 108 and skip it if it is not.

If we SELECT * FROM namespaces WHERE traversal_ids >= '{108, 109}' AND traversal_ids < '{108, 110}';, then Postgres will do the following:

  1. Check if the first element is 108, skip the row if not
  2. Check if the second element is 109, skip the row if not
  3. Read the row

This is faster because when checking row 88, we can skip that row after checking the first element, while @> will need to check the entire array.

Query comparisons

Description Query Postgres.ai link Execution time Buffer Reads
gitlab-org before SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id FROM namespaces WHERE namespaces.type = 'Group' AND traversal_ids @> '{9970}'; https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/26298/commands/82646 702.698 ms (684.318 ms on I/O reads) ~7.00 MiB
gitlab-org after SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id FROM namespaces WHERE namespaces.type = 'Group' AND traversal_ids >= '{9970}' AND traversal_ids < '{9971}'; https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/26298/commands/82648 27.121 ms (23.910 ms on I/O reads) ~1.50 MiB
gitlab-org/govern before SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id FROM namespaces WHERE namespaces.type = 'Group' AND traversal_ids @> '{11787569}'; https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/26298/commands/82663 1.044 s (1.022 s on I/O reads) ~2.20 MiB
gitlab-org/govern after SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id FROM namespaces WHERE namespaces.type = 'Group' AND traversal_ids >= '{9970, 11787569}' AND traversal_ids < '{9970, 11787570}'; https://postgres.ai/console/gitlab/gitlab-production-main/sessions/26319/commands/82666 108.029 ms (102.110 ms on I/O reads) ~496.00 KiB
4249178 (customer with deep hierarchy) before SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id FROM namespaces WHERE namespaces.type = 'Group' AND traversal_ids @> '{4249178}'; https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/26298/commands/82664 15.316 s (14.797 s on I/O reads) ~61.50 MiB
4249178 (customer with deep hierarchy) after SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id FROM namespaces WHERE namespaces.type = 'Group' AND traversal_ids >= '{4249178}' AND traversal_ids < '{4249179}'; https://postgres.ai/console/gitlab/gitlab-production-main/sessions/26319/commands/82667 3.883 s (3.750 s on I/O reads) ~11.90 MiB
5778500 (customer with many projects) before SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id FROM namespaces WHERE namespaces.type = 'Group' AND traversal_ids @> '{5778500}'; https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/26298/commands/82665 2.123 s (2.064 s on I/O reads) ~7.50 MiB
5778500 (customer with many projects) after SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id FROM namespaces WHERE namespaces.type = 'Group' AND traversal_ids >= '{5778500}' AND traversal_ids < '{5778501}'; https://postgres.ai/console/gitlab/gitlab-production-main/sessions/26319/commands/82669 487.244 ms (476.877 ms on I/O reads) ~1.70 MiB

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Brian Williams

Merge request reports