Optimise Query that sorts columns of ProjectStatistics table for all projects of namespace
For adding feature for sorting all project_statistic display columns, multiple queries were proposed to be introduced as part of this MR. During MR review the Query were deemed non-optimal, and approaches to optimise the query via indexes did not yield results.
Query
SELECT
"projects".*,
"project_statistics"."storage_size" AS project_statistics_storage_size
FROM
"projects"
INNER JOIN
"project_statistics"
ON "project_statistics"."project_id" = "projects"."id"
WHERE
"projects"."namespace_id" IN (
SELECT
namespaces.traversal_ids[array_length(namespaces.traversal_ids,
1)] AS id
FROM
"namespaces"
WHERE
"namespaces"."type" = 'Group'
AND (
traversal_ids @> (
'{4249178}'
)
)
)
ORDER BY
"project_statistics"."storage_size" DESC NULLS FIRST,
"projects"."id" DESC
Query Plan:
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/18920/commands/62776
Summary of Analysis
Actual comment by @Quintasan here
- the index is not being used because it doesn't match order parameters 1:1 which turned out not be the case: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/18927/commands/62809
- we tried moving the join column to the index and changing the ordering to
project_statistics.idcolumn but it resulted in a mild improvement: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/18927/commands/62812
The group id is 4249178 which we have used for testing corresponds to one of our customers with a big number of projects. Since there's a 1:1 relation between projects and project_statistics we need to load a huge chunk of data and then sort it which will time out. You can use the Explain FlameGraph to see which parts of the query cost the most and it's the Nested Loop and Sort nodes so fetching all the sub-projects is not the most expensive thing here.
Based on the designs provided in the parent epic I'm assuming the Storage project table can be accessed from a sub-group or a top-level group. We tested the query timings for the top-level group so it should perform reasonably better for sub-groups with a smaller number of projects. That said we can do two types of de-normalizations to speed this up:
- Add a top-level-group-id column which will help the
accessing Storage project table view from the top level groupscenario. This, however, will not work very well if there's a one subgroup which has majority of the projects - Add traversal ids (array) column which should help queries on all levels