Skip to content

Automatically enable group hierarchy optimization

What does this MR do and why?

This MR implements a periodical CRON worker that enables the group hierarchy retrieval optimization for large group hierarchies automatically. The worker is behind a feature flag: #510967 (closed)

You can read about the optimization and the cache logic here: &11469 (closed)

How does it work:

  1. Iterate over all groups until time limit is reached.
  2. For each group (or subgroup), count the descendants (project or subgroup, it doesn't matter) until threshold is reached.
  3. If we found a big group, enable the optimization by creating/upserting a Namespaces::Descendants record (this record will be later processed by another worker for populating the cache).
  4. When it processed the whole namespaces table, start it again the scanning from beginning.

Enabling the optimization will not have immediate effect (the worker is safe to run). Altering the group hierarchy lookup queries will need to be enabled separately using the group_hierarchy_optimization feature flag which will be gradually.

  • Note 1: The thresholds were calculated using the PRD data, we plan to enable the optimization for about 3K groups.
  • Note 2: The worker does not disable already enabled groups, simply because the number of groups in a hierarchy rarely decreases.

I used this query on PG.ai to determine the count threshold:

select id, t.total
from namespaces o,
lateral (
select count(*) as total from namespaces where (traversal_ids @> ARRAY[o.id])
) t
WHERE o.type = 'Group'
order by 2 desc
limit 3000;

Database

Batching query:

SELECT "namespaces"."id" FROM "namespaces" WHERE "namespaces"."type" = 'Group' AND (id > 29691) AND "namespaces"."id" >= 29692 AND "namespaces"."id" < 82362

https://console.postgres.ai/gitlab/gitlab-production-main/sessions/34785/commands/107629

Counter query (uses https://docs.gitlab.com/ee/development/database/batching_best_practices.html#batch-from-any-node-from-the-group-hierarchy):

https://console.postgres.ai/gitlab/gitlab-production-main/sessions/34785/commands/107630

UPSERT query:

It's very unlikely that we'll insert more than a few records per worker run.

 INSERT INTO "namespace_descendants" ("namespace_id") VALUES (4249178) ON CONFLICT  DO NOTHING RETURNING "namespace_id"

https://postgres.ai/console/gitlab/gitlab-production-main/sessions/34785/commands/107632

References

Please include cross links to any resources that are relevant to this MR. This will give reviewers and future readers helpful context to give an efficient review of the changes introduced.

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Screenshots or screen recordings

Screenshots are required for UI changes, and strongly recommended for all other merge requests.

Before After

How to set up and validate locally

Numbered steps to set up and validate the change are strongly suggested.

Related to #509554 (closed)

Edited by Adam Hegyi

Merge request reports

Loading