Research traversal_ids for issues

During our research we found that the biggest bottleneck is to query all issues within a large hierarchy.

The query consists of two parts:

  1. Fetching all namespaces the user has access to within a hierarchy
  2. Fetching all issues within these namespaces

Fetching all issues within lots of namespaces is slow

Fetching all accessible namespaces is not a very performant query, but it got optimized over time and is not the reason why looking up the issues is so slow.

For example, when querying all issues within gitlab-org and using a static list of the 6000 namespaces: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/37962/commands/116127 the query is still slow and uses a large buffer.

Optimizing retrieval of issues

Adding indexes

Currently we only have namespace_id_and_iid as an index on issues. And we miss most index combinations on the issues table that we have on project_id. Adding these indexes will definitely help to get the buffer size down. But it might not be enough.

Denormalizing namespace hierarchies

We know that

  1. ElasticSearch is way faster and denormalized the traversal_ids on the issues table
  2. We found that vulnerability_reads has denormalized traversal_ids and saw huge performance improvements &12372 (closed)

This brings us to this issue where we want to research if it makes sense to denormalize the traversal_ids into issues.

Research

Script

Script from @dgruzd to create traversal_ids on our replica:

-- First, add the traversal_ids column if it doesn't exist
ALTER TABLE issues ADD COLUMN traversal_ids bigint[] DEFAULT '{}'::bigint[] NOT NULL;

-- Then update all issues with traversal_ids
UPDATE issues
SET traversal_ids = namespaces.traversal_ids
FROM namespaces
WHERE issues.namespace_id = namespaces.id;

-- Create a btree index on the new column to improve query performance
CREATE INDEX idx_issues_on_traversal_ids ON issues USING btree (traversal_ids);

SELECT issues.*
FROM "issues"
WHERE traversal_ids[1] = 9970
ORDER BY created_at DESC, id DESC
LIMIT 21;

-- Test with another index
CREATE INDEX idx_issues_on_traversal_ids_and_created_at_and_id_and_state_id ON issues USING btree (traversal_ids, created_at, id, state_id);

-- Test it with a gin index
CREATE INDEX idx_issues_on_traversal_ids ON issues USING gin (traversal_ids);

Outcome on the full replica (@dgruzd)

TODO - still running

Outcome on only gitlab-org (@nicolasdular)

TODO - still running

UPDATE issues
SET traversal_ids = namespaces.traversal_ids
FROM namespaces
WHERE issues.namespace_id = namespaces.id
AND namespaces.traversal_ids @> ('{9970}');

Outcome on a smaller table (@nicolasdular)

TODO

Edited by Max Orefice