Research traversal_ids for issues
During our research we found that the biggest bottleneck is to query all issues within a large hierarchy.
The query consists of two parts:
- Fetching all namespaces the user has access to within a hierarchy
- Fetching all issues within these namespaces
Fetching all issues within lots of namespaces is slow
Fetching all accessible namespaces is not a very performant query, but it got optimized over time and is not the reason why looking up the issues is so slow.
For example, when querying all issues within gitlab-org and using a static list of the 6000 namespaces: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/37962/commands/116127 the query is still slow and uses a large buffer.
Optimizing retrieval of issues
Adding indexes
Currently we only have namespace_id_and_iid as an index on issues. And we miss most index combinations on the issues table that we have on project_id. Adding these indexes will definitely help to get the buffer size down. But it might not be enough.
Denormalizing namespace hierarchies
We know that
- ElasticSearch is way faster and denormalized the
traversal_idson theissuestable - We found that
vulnerability_readshas denormalizedtraversal_idsand saw huge performance improvements &12372 (closed)
This brings us to this issue where we want to research if it makes sense to denormalize the traversal_ids into issues.
Research
Script
Script from @dgruzd to create traversal_ids on our replica:
-- First, add the traversal_ids column if it doesn't exist
ALTER TABLE issues ADD COLUMN traversal_ids bigint[] DEFAULT '{}'::bigint[] NOT NULL;
-- Then update all issues with traversal_ids
UPDATE issues
SET traversal_ids = namespaces.traversal_ids
FROM namespaces
WHERE issues.namespace_id = namespaces.id;
-- Create a btree index on the new column to improve query performance
CREATE INDEX idx_issues_on_traversal_ids ON issues USING btree (traversal_ids);
SELECT issues.*
FROM "issues"
WHERE traversal_ids[1] = 9970
ORDER BY created_at DESC, id DESC
LIMIT 21;
-- Test with another index
CREATE INDEX idx_issues_on_traversal_ids_and_created_at_and_id_and_state_id ON issues USING btree (traversal_ids, created_at, id, state_id);
-- Test it with a gin index
CREATE INDEX idx_issues_on_traversal_ids ON issues USING gin (traversal_ids);
Outcome on the full replica (@dgruzd)
TODO - still running
Outcome on only gitlab-org (@nicolasdular)
TODO - still running
UPDATE issues
SET traversal_ids = namespaces.traversal_ids
FROM namespaces
WHERE issues.namespace_id = namespaces.id
AND namespaces.traversal_ids @> ('{9970}');
Outcome on a smaller table (@nicolasdular)
TODO