# Efficient group-level queries
## Problem statement
The current database structure of GitLab does not support efficient group-level queries: query performance often depends on the number of records indirectly associated with the group, and adding hierarchical group queries into the mix (include subgroups) makes performance even worse. Note: the `traversal_ids` improvement will help a bit, but it does not address the main problem.
We have database tables that grow significantly over time, which makes group-level queries slower and slower.
Examples (group-level features):
- Issues page
- Merge requests page
- Boards / Issue rebalancing
- Packages
- Analytics
Note: the Epics feature performs well at the group level because we have a relatively small number of epic records.
## What's the common problem?
Most domain models (e.g. `Issue`, `MergeRequest`) are associated with a project, the project is associated with a group, and a group can have a parent group.
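As a minimal sketch, the hierarchy looks roughly like this (the real tables have many more columns; the definitions below are reduced for illustration):

```sql
-- Simplified shape of the hierarchy, for illustration only.
CREATE TABLE namespaces (
  id bigint PRIMARY KEY,
  parent_id bigint REFERENCES namespaces (id) -- NULL for top-level groups
);

CREATE TABLE projects (
  id bigint PRIMARY KEY,
  namespace_id bigint NOT NULL REFERENCES namespaces (id) -- the group
);

CREATE TABLE issues (
  id bigint PRIMARY KEY,
  project_id bigint NOT NULL REFERENCES projects (id),
  created_at timestamptz NOT NULL
);
```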
As an example, when looking for the most recent issues (at most 20 records) within a group, the database needs to do the following:
- Collect descendant groups (current group and descendants). Loads N + 1 group records.
- Collect projects within the loaded groups. For each group (N + 1 index lookups), load the projects (M).
- For each project, find all issues (M index lookups).
- Sort the records by `created_at` in memory.
- Return the first 20 records.
We have an index defined on the `project_id` and `created_at` columns, which works well for the project-level views. At the group level, these queries will hardly benefit from the sorted property of the B-tree index because the filter on the `project_id` column is an array (`IN`). The database needs to do a nested loop to collect the relevant records and re-sort the items in memory.
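The project-level index mentioned above looks roughly like this (the index name is illustrative):

```sql
CREATE INDEX index_issues_on_project_id_and_created_at
  ON issues (project_id, created_at);
```

For a single `project_id` value the index returns rows already sorted by `created_at`; with an `IN` list, each project's slice of the index is visited separately and the combined rows must be re-sorted.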
Example query:
```sql
SELECT *
FROM issues
WHERE
  project_id IN (all_projects_in_namespace subquery)
ORDER BY created_at DESC
LIMIT 20
```
## Ideas
### De-normalization
Adding `traversal_ids`, `namespace_id`, or `root_namespace_id` to the `issues` and `merge_requests` tables would help improve query performance for top-level group queries in some cases. Query performance in subgroups would stay the same.
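A minimal sketch of the `root_namespace_id` variant (the column, the index name, and the backfill are assumptions, not an agreed design):

```sql
-- Hypothetical de-normalization: store the top-level group id on each issue.
ALTER TABLE issues ADD COLUMN root_namespace_id bigint;

CREATE INDEX index_issues_on_root_namespace_id_and_created_at
  ON issues (root_namespace_id, created_at);

-- A top-level group query no longer needs the project IN subquery:
SELECT *
FROM issues
WHERE root_namespace_id = 9970
ORDER BY created_at DESC
LIMIT 20;
```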
### Separate table
Solving the problem for all groups is costly: it requires storing the same data multiple times and maintaining consistency over time.
#### Implementation
- Determine all columns in the `issues` table that are queried in group-level features.
  - I'd omit the `title` and `description` columns for now since full-text search has its own problems in PostgreSQL.
- Create a new table `group_issues` using the collected columns and add `group_id` to it (consider hash partitioning); see the sketch below.
- Set up synchronization to keep the `group_issues` table up to date (the `issues` table is the source of truth).
- Use a background migration to populate the new table.
Note: the table would store the same issue record multiple times, with different `group_id` values. If a group (G1) has two ancestors, then the issues in a project under G1 would be stored three times (`group_id` values: G1 and the two ancestor groups).
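A minimal sketch of the table (the column list is illustrative; only columns needed by group-level features would be copied):

```sql
-- Hypothetical group_issues table, hash partitioned by group_id.
CREATE TABLE group_issues (
  group_id bigint NOT NULL,
  issue_id bigint NOT NULL,
  state_id smallint NOT NULL,
  created_at timestamptz NOT NULL,
  updated_at timestamptz NOT NULL,
  PRIMARY KEY (group_id, issue_id)
) PARTITION BY HASH (group_id);

-- Partitions are created separately, for example:
CREATE TABLE group_issues_00 PARTITION OF group_issues
  FOR VALUES WITH (MODULUS 8, REMAINDER 0);
```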
The previous query could be rewritten like this:
```sql
SELECT *
FROM group_issues
WHERE
  group_id = 9970
ORDER BY created_at DESC
LIMIT 20
```
With an index on `group_id` and `created_at`, we'd get an optimal execution plan.
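For example (the index name is illustrative):

```sql
CREATE INDEX index_group_issues_on_group_id_and_created_at
  ON group_issues (group_id, created_at);
```

The ascending index also serves the `ORDER BY created_at DESC` query, since PostgreSQL can scan a B-tree index backward.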
Note: additional `JOIN`s to the `issues` table might be needed to make all the columns available. This has no significant performance impact.
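A sketch of such a query, reusing the hypothetical `issue_id` column from the table sketch above; the join adds at most 20 primary-key lookups:

```sql
SELECT issues.*
FROM group_issues
JOIN issues ON issues.id = group_issues.issue_id
WHERE group_issues.group_id = 9970
ORDER BY group_issues.created_at DESC
LIMIT 20;
```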
#### Design
To keep the `issues` and `group_issues` tables in sync, we can use database triggers. When an issue is updated, the trigger would update N rows in the `group_issues` table. This could have a serious impact on the performance of the issues endpoints (expensive updates).
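A minimal PL/pgSQL sketch of the synchronous variant, covering only the update path and the hypothetical columns from the table sketch above:

```sql
-- Fan out issue updates to every copy of the record in group_issues.
CREATE OR REPLACE FUNCTION sync_group_issues() RETURNS trigger AS $$
BEGIN
  UPDATE group_issues
  SET state_id = NEW.state_id,
      updated_at = NEW.updated_at
  WHERE issue_id = NEW.id; -- touches N rows, one per ancestor group
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trigger_sync_group_issues
AFTER UPDATE ON issues
FOR EACH ROW EXECUTE FUNCTION sync_group_issues();
```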
To address this problem, we can defer syncing the changes to a recurring background job. Because of the background job, record changes (like adding a label to an issue) will not be visible in the group-level features right away; there might be a delay of a few seconds until all the records are updated.
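One possible shape for the deferred variant, assuming a hypothetical queue table that the recurring job drains in batches:

```sql
-- The trigger would only record the changed issue id here (a cheap single-row insert).
CREATE TABLE group_issues_sync_events (
  id bigserial PRIMARY KEY,
  issue_id bigint NOT NULL
);

-- The recurring background job takes a batch of events and re-syncs the
-- affected issues into group_issues:
DELETE FROM group_issues_sync_events
WHERE id IN (
  SELECT id
  FROM group_issues_sync_events
  ORDER BY id
  LIMIT 1000
)
RETURNING issue_id;
```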
