Optimize group hierarchy retrieval in group-level queries
## Release note
We adding a new cross-stage database optimization for group-level features, that will improve the performance of the background aggregation for VSA and VSD.
## Problem statement
When querying items (issues, merge requests, epics, etc.) within a group hierarchy, the query always need to read the whole hierarchy from the `namespaces` table. For a fairly large hierarchy, this means several thousands of database rows. Some of these rows might not even contribute to the final results because some groups or projects have no issue records.
This epic aims to implement a generic solution that can speed up group-level queries by de-normalizing the `namespaces` table in a way that the cost of reading the hierarchy would be negligible.
## Current queries
**Group hierarchy query for `gitlab-org`:**
```sql
SELECT
namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
FROM
"namespaces"
WHERE
"namespaces"."type" = 'Group'
AND (traversal_ids @> ('{9970}'))
```
Reads about 7 megabytes of data, 730 rows.
**All projects in `gitlab-org`:**
```sql
SELECT
"projects"."id"
FROM
"projects"
WHERE
"projects"."namespace_id" IN (
SELECT
namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
FROM
"namespaces"
WHERE
"namespaces"."type" = 'Group'
AND (traversal_ids @> ('{9970}')))
```
Reads about 45 megabytes of data, 4000 rows.
## Where would this optimization help?
The de-normalization should improve the database queries that are looking up the group hierarchy or all projects in the hierarchy. Depending on the query, the improvements might not be noticeable.
A few examples where we would see significant speed up:
- Queries using the in-operator optimization.
- Group projects api, groups API (sorted by id).
- Background jobs, queries which needs to fetch all projects or groups in the hierarchy.
A few examples where we would see some speed up:
- Counting record in the hierarchy (not groups or projects), for example: `issues`
- Here majority of the I/O is spent reading all issues. Reading the group hierarchy and the projects is just a fraction of the total I/O.
- Group-level sorted and paginated query.
- When queries are not using the [in-operator optimization](https://docs.gitlab.com/ee/development/database/efficient_in_operator_queries.html)
- Same reason as above (counting), paginating group-level queries will likely read all rows and sort the results in memory.
Better query performance is not the only benefit of the de-normalized approach. We should see less traffic on the `projects` and `namespaces` table.
## Proposed high-level solution
- Collect descendant namespace ids, project ids in an array column per namespace.
- When the hierarchy changes, a low-latency background job re-calculates the array.
- Implement a specialized query that checks if the cache is stale or not. Depending on the status execute the optimized query.
- Periodically check for inconsistencies and fix them automatically.
- Implement a new module in `app/models/namespaces/traversal/` to use the de-normalized solution.
### Database table
De-normalized hierarchy table: `namespace_descendants`:
|column|type|description|
|-|-|-|
|`namespace_id`|bigint|ID of the namespace, primary key. We might not need FK here if we periodically clean up the inconsistent data.|
|`outdated`|boolean|One of the array column is out of date. Re-calculation is needed.|
|`traversal_ids`|boolean|Cache column for quickly accessing the ancestors. I'm not 100% sure we need it.|
|`descendant_group_ids`|bigint[]|All subgroup ids for the given namespace/group|
|`descendant_project_ids`|bigint[]|All project ids for the given group|
SQL schema:
```sql
CREATE TABLE namespace_descendants (
namespace_id bigint PRIMARY KEY NOT NULL,
outdated boolean NOT NULL DEFAULT true,
traversal_ids bigint[] NOT NULL DEFAULT ARRAY[]::bigint[],
descendant_group_ids bigint[] NOT NULL,
descendant_project_ids bigint[] NOT NULL,
descendant_project_pending_delete_ids bigint[] NOT NULL -- this could be an optimization for getting projects not pending delete: descendant_projects_ids - descendant_projects_pending_delete_ids
-- more arrays where ids are collected based on a different rule
);
CREATE INDEX index_on_namespace_id_outdated on namespace_descendants (namespace_id, outdated);
```
See the most recent schema here: https://gitlab.com/gitlab-org/gitlab/-/issues/428464
Preparing the data for all subgroups and projects from the `gitlab-org` group:
```sql
INSERT INTO namespace_descendants (namespace_id, outdated, descendant_group_ids, descendant_project_ids)
SELECT 9970,
FALSE,
(
-- subgroups
SELECT array_agg(namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] ORDER BY 1 ASC)
FROM "namespaces"
WHERE "namespaces"."type" = 'Group'
AND (traversal_ids @> ('{9970}'))
),
(
-- projects
SELECT array_agg(projects.id ORDER BY 1 ASC)
FROM "projects"
WHERE "projects"."namespace_id" IN
(SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
FROM "namespaces"
WHERE "namespaces"."type" = 'Group'
AND (traversal_ids @> ('{9970}')))
)
```
#### `self_and_descendant_ids` query
```sql
SELECT UNNEST(descendant_group_ids) AS id FROM namespace_descendants WHERE namespace_id = 9970 AND outdated is false
```
In case we want to paginate the query, we can easily do that with `LATERAL` select. The `ids` are already sorted.
```sql
SELECT namespaces.id, namespaces.name
FROM (
SELECT UNNEST(descendant_group_ids) AS id
FROM namespace_descendants
WHERE namespace_id = 9970
LIMIT 25
) namespace_ids,
LATERAL (
SELECT namespaces.*
FROM namespaces
WHERE namespaces.id = namespace_ids.id
LIMIT 1
) namespaces;
```
|query|buffers old|buffers de-normalized|buffer difference|
|-|-|-|-|
|Pluck ids|[778](https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/22251/commands/72022)|[35](https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/22251/commands/72024)|{+ -22x +}|
|Paginate by id|[781](https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/22251/commands/72026)|[163](https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/22251/commands/72028)|{+ -4.7x +}|
|Count|[365](https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/22251/commands/72031)|[35](https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/22251/commands/72032)|{+ -10.4x +}|
#### `all_projects` query
```sql
SELECT UNNEST(descendant_project_ids) AS id FROM namespace_descendants WHERE namespace_id = 9970 AND outdated IS NULL
```
|query|buffers old|buffers de-normalized|buffer difference|
|-|-|-|-|
|Pluck ids|[7002](https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/22251/commands/72037)|[37](https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/22251/commands/72039)|{+ -200x +}|
|Count|[5311](https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/22251/commands/72041)|[37](https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/22251/commands/72042)|{+ -151x +}|
|In-operator paginated issues|[37773](https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/22251/commands/72046)|[14643](https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/22251/commands/72047)|{+ -2.5x +}|
Note: Buffers and the actual query runtime correlates but assuming a 100x reduction in buffer count won't necessarily mean 100x faster query execution time. With the de-normalized approach, the DB will need significantly lower I/O usage which should improve the uncached execution time significantly.
#### Further optimizations
- Issues or other resources might not be used in every project thus we could keep track of project ids with at least one issue. This can cause significant reduction of projects thus reduced I/O when looking for issues ([related research](https://gitlab.com/gitlab-org/gitlab/-/issues/419572#note_1526625247)).
- Keeping track of non-archived project ids can eliminate accessing the `projects` table. A lot's of our queries joins `projects` just to add the `archived=false` condition.
#### Addressing table growth
The `namespace_descendants` table does not fit the time-decay pattern but we can take some steps to keep the table small:
- Selectively enable the optimization for only a handful of namespaces.
- Reasoning: de-normalization might not be beneficial for 80% of our group hierarchies on GitLab.com.
- The downside of this approach is cognitive load and code complexity to decide which traversal method to use.
- Hash partition the table by `namespace_id`.
- This effectively creates N tables, with a large enough partition numbers we can easily accommodate 10x growth.
### Background worker and change detection
Change detection is a challenging task. When a new subgroup or project is added to the hierarchy, several rows needs to be updated. Collecting fresh data and updating the rows can take some time, it's unlikely that we can do this synchronously.
Currently changes to the hierarchy (`traversal_ids`) are queued in the `namespaces_sync_events` table via the `trigger_namespaces_traversal_ids_on_update` trigger. For consistent change detection, we could hook into the worker that processes this table.
**Scenario: A new subgroup was added**
Technique 1: Append
For each ancestor (maximum 20), append a new element to the `descendant_ids` array (when we're collecting namespaces).
Technique 2: Re-calculation
For each ancestor, iterate over the hierarchy and rewrite the whole `descendant_ids` columns. For iterating over the hierarchy efficiently, we can use the [depth-first iterator](https://docs.gitlab.com/ee/development/database/poc_tree_iterator.html).
#### Locking
Parallel processing of a namespace hierarchy must be prevented, thus we'll need to acquire a lock while processing the rows.
```ruby
new_namespace = Namespace.find(1)
NamespaceDescendant.transaction do
NamespaceDescendant.where(id: new_namespace.traversal_ids[0]).lock! # lock the top-level row
new_namespace.each_parent do |parent|
# calculate and update the NamespaceDescendant rows here
end
end
```
For large hierarchies, the looping within a transaction might take several seconds. Keeping the transaction open for so long is undesirable however, to achieve consistency, it's a must. To minimize the impact:
- Avoid adding foreign key to the `NamespaceDescendant` table so other records/tables are not affected by the long running transaction.
- Add mechanism to temporarily disable the background worker while the migration is running.
#### Background worker
The background worker that keeps the de-normalized `NamespaceDescendant` database table up to date have to run with very high priority and low latency. Parallel execution can be achieved if there is a way to split the changed/new namespaces in a reliable way. A technique that uses module (`%`) is described [here](https://gitlab.com/gitlab-org/gitlab/-/issues/421200#note_1558011412)
#### "Always" consistent
There is a way to be always "fully" consistent, when the group hierarchy changes via expiration. This can be done via a trigger function that marks all `namespace_descendants` records (maximum 20) outdated when something changes in the hierarchy.
Algorithm:
1. A new group was added to the `gitlab-org` group.
2. The DB trigger detects the change (`INSERT`).
3. The trigger locates the `traversal_ids` of the newly added group and for each id, marks the `namespace_descendants` records outdated.
4. Now, the `namespace_descendants` table acts as a queue, we can process records with the `outdated = true` filter.
By marking the de-normalized records outdated, we can write a new `self_and_descendant_ids` query where we first check if the de-normalized record is up to date. If not, fall back to the standard `traversal_ids` based lookup.
Example query:
```sql
WITH "denormalized_ids" AS MATERIALIZED
(SELECT unnest(descendant_group_ids) AS id -- look up descendants via the cache table
FROM "namespace_descendants"
WHERE "namespace_descendants"."outdated" = FALSE
AND "namespace_descendants"."namespace_id" = 95),
"consistent_ids" AS MATERIALIZED -- original traversal_ids based lookup
(SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
FROM "namespaces"
WHERE "namespaces"."type" = 'Group'
AND (traversal_ids @> ('{95}'))
AND (NOT EXISTS (TABLE denormalized_ids))) -- only execute it if the cache table is empty (outdated)
SELECT "namespaces"."id"
FROM (TABLE denormalized_ids
UNION ALL TABLE consistent_ids) AS namespaces(id)
```
The worst case occurs when a change happen to a large hierarchy where re-calculation of the cache can take several seconds via the BG job. During this period the queries using `self_and_descendant_ids` will automatically fall back to use the `traversal_ids` based finder.
epic