# Efficient group-level queries
## Problem statement
The current database structure of GitLab does not support efficient group-level queries: query performance often depends on the number of records indirectly associated with the group, and adding hierarchical group queries into the mix (include subgroups) makes performance even worse. Note: the `traversal_ids` improvement will help a bit, but it does not address the main problem.
We have database tables that grow significantly over time, which makes group-level queries slower and slower.
Examples (group-level features):
- Issues page
- Merge requests page
- Boards / Issue rebalancing
- Packages
- Analytics
Note: the Epics feature performs well at the group level because we have a relatively small number of epic records.
## What's the common problem?
Most domain models (e.g. `Issue`, `MergeRequest`) are associated with a project, the project is associated with a group, and a group can have a parent group.
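As a minimal sketch, the hierarchy looks roughly like this (the real tables have many more columns; the definitions below are reduced for illustration):

```sql
-- Simplified shape of the hierarchy, for illustration only.
CREATE TABLE namespaces (
  id bigint PRIMARY KEY,
  parent_id bigint REFERENCES namespaces (id) -- NULL for top-level groups
);

CREATE TABLE projects (
  id bigint PRIMARY KEY,
  namespace_id bigint NOT NULL REFERENCES namespaces (id) -- the group
);

CREATE TABLE issues (
  id bigint PRIMARY KEY,
  project_id bigint NOT NULL REFERENCES projects (id),
  created_at timestamptz NOT NULL
);
```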
As an example, when looking for the most recent issues (at most 20 records) within a group, the database needs to do the following:
- Collect descendant groups (current group and descendants). Loads N + 1 group records.
- Collect projects within the loaded groups. For each group (N + 1 index lookups), load the projects (M).
- For each project, find all issues (M index lookups).
- Sort the records by `created_at` in memory.
- Return the first 20 records.
We have an index defined on the `project_id` and `created_at` columns, which works well for the project-level views. At the group level, these queries will hardly benefit from the sorted property of the B-tree index because the filter on the `project_id` column is an array (`IN`). The database needs to do a nested loop to collect the relevant records and re-sort the items in memory.
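The project-level index mentioned above looks roughly like this (the index name is illustrative):

```sql
CREATE INDEX index_issues_on_project_id_and_created_at
  ON issues (project_id, created_at);
```

For a single `project_id` value the index returns rows already sorted by `created_at`; with an `IN` list, each project's slice of the index is visited separately and the combined rows must be re-sorted.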
Example query:
```sql
SELECT *
FROM issues
WHERE
  project_id IN (all_projects_in_namespace subquery)
ORDER BY created_at DESC
LIMIT 20
```
## Ideas
### De-normalization
Adding `traversal_ids`, `namespace_id`, or `root_namespace_id` to the `issues` and `merge_requests` tables would help improve query performance for top-level group queries in some cases. Query performance in subgroups would stay the same.
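A minimal sketch of the `root_namespace_id` variant (the column, the index name, and the backfill are assumptions, not an agreed design):

```sql
-- Hypothetical de-normalization: store the top-level group id on each issue.
ALTER TABLE issues ADD COLUMN root_namespace_id bigint;

CREATE INDEX index_issues_on_root_namespace_id_and_created_at
  ON issues (root_namespace_id, created_at);

-- A top-level group query no longer needs the project IN subquery:
SELECT *
FROM issues
WHERE root_namespace_id = 9970
ORDER BY created_at DESC
LIMIT 20;
```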
### Separate table
Solving the problem for all groups is costly: it requires storing the same data multiple times and maintaining consistency over time.
#### Implementation
- Determine all columns in the `issues` table that are queried in group-level features.
  - I'd omit the `title` and `description` columns for now since full-text search has its own problems in PostgreSQL.
- Create a new table `group_issues` using the collected columns and add `group_id` to it (consider hash partitioning); see the sketch below.
- Set up synchronization to keep the `group_issues` table up to date (the `issues` table is the source of truth).
- Use a background migration to populate the new table.
Note: the table would store the same issue record multiple times, with different `group_id` values. If a group (G1) has two ancestors, then the issues in a project under G1 would be stored three times (`group_id` values: G1 and the two ancestor groups).
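A minimal sketch of the table (the column list is illustrative; only columns needed by group-level features would be copied):

```sql
-- Hypothetical group_issues table, hash partitioned by group_id.
CREATE TABLE group_issues (
  group_id bigint NOT NULL,
  issue_id bigint NOT NULL,
  state_id smallint NOT NULL,
  created_at timestamptz NOT NULL,
  updated_at timestamptz NOT NULL,
  PRIMARY KEY (group_id, issue_id)
) PARTITION BY HASH (group_id);

-- Partitions are created separately, for example:
CREATE TABLE group_issues_00 PARTITION OF group_issues
  FOR VALUES WITH (MODULUS 8, REMAINDER 0);
```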
The previous query could be rewritten like this:
```sql
SELECT *
FROM group_issues
WHERE
  group_id = 9970
ORDER BY created_at DESC
LIMIT 20
```
With an index on `group_id` and `created_at`, we'd get an optimal execution plan.
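For example (the index name is illustrative):

```sql
CREATE INDEX index_group_issues_on_group_id_and_created_at
  ON group_issues (group_id, created_at);
```

The ascending index also serves the `ORDER BY created_at DESC` query, since PostgreSQL can scan a B-tree index backward.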
Note: additional `JOIN`s to the `issues` table might be needed to make all the columns available. This has no significant performance impact.
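A sketch of such a query, reusing the hypothetical `issue_id` column from the table sketch above; the join adds at most 20 primary-key lookups:

```sql
SELECT issues.*
FROM group_issues
JOIN issues ON issues.id = group_issues.issue_id
WHERE group_issues.group_id = 9970
ORDER BY group_issues.created_at DESC
LIMIT 20;
```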
#### Design
To keep the `issues` and `group_issues` tables in sync, we can use database triggers. When an issue is updated, the trigger would update N rows in the `group_issues` table. This could have a serious impact on the performance of the issues endpoints (expensive updates).
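A minimal PL/pgSQL sketch of the synchronous variant, covering only the update path and the hypothetical columns from the table sketch above:

```sql
-- Fan out issue updates to every copy of the record in group_issues.
CREATE OR REPLACE FUNCTION sync_group_issues() RETURNS trigger AS $$
BEGIN
  UPDATE group_issues
  SET state_id = NEW.state_id,
      updated_at = NEW.updated_at
  WHERE issue_id = NEW.id; -- touches N rows, one per ancestor group
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trigger_sync_group_issues
AFTER UPDATE ON issues
FOR EACH ROW EXECUTE FUNCTION sync_group_issues();
```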
To address this problem, we can defer syncing the changes to a recurring background job. Because of the background job, record changes (like adding a label to an issue) will not be visible in the group-level features right away; there might be a delay of a few seconds until all the records are updated.
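One possible shape for the deferred variant, assuming a hypothetical queue table that the recurring job drains in batches:

```sql
-- The trigger would only record the changed issue id here (a cheap single-row insert).
CREATE TABLE group_issues_sync_events (
  id bigserial PRIMARY KEY,
  issue_id bigint NOT NULL
);

-- The recurring background job takes a batch of events and re-syncs the
-- affected issues into group_issues:
DELETE FROM group_issues_sync_events
WHERE id IN (
  SELECT id
  FROM group_issues_sync_events
  ORDER BY id
  LIMIT 1000
)
RETURNING issue_id;
```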
