Optimise Query that sorts columns of ProjectStatistics table for all projects of namespace

For adding feature for sorting all project_statistic display columns, multiple queries were proposed to be introduced as part of this MR. During MR review the Query were deemed non-optimal, and approaches to optimise the query via indexes did not yield results.

Query

    SELECT
        "projects".*,
        "project_statistics"."storage_size" AS project_statistics_storage_size 
    FROM
        "projects" 
    INNER JOIN
        "project_statistics" 
            ON "project_statistics"."project_id" = "projects"."id" 
    WHERE
        "projects"."namespace_id" IN (
            SELECT
                namespaces.traversal_ids[array_length(namespaces.traversal_ids,
                1)] AS id 
            FROM
                "namespaces" 
            WHERE
                "namespaces"."type" = 'Group' 
                AND (
                    traversal_ids @> (
                        '{4249178}'
                    )
                )
        ) 
    ORDER BY
        "project_statistics"."storage_size" DESC NULLS FIRST,
        "projects"."id" DESC

Query Plan:

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/18920/commands/62776

Summary of Analysis

Actual comment by @Quintasan here

The group id is 4249178 which we have used for testing corresponds to one of our customers with a big number of projects. Since there's a 1:1 relation between projects and project_statistics we need to load a huge chunk of data and then sort it which will time out. You can use the Explain FlameGraph to see which parts of the query cost the most and it's the Nested Loop and Sort nodes so fetching all the sub-projects is not the most expensive thing here.


Based on the designs provided in the parent epic I'm assuming the Storage project table can be accessed from a sub-group or a top-level group. We tested the query timings for the top-level group so it should perform reasonably better for sub-groups with a smaller number of projects. That said we can do two types of de-normalizations to speed this up:

  1. Add a top-level-group-id column which will help the accessing Storage project table view from the top level group scenario. This, however, will not work very well if there's a one subgroup which has majority of the projects
  2. Add traversal ids (array) column which should help queries on all levels

Full MR discussion thread with context:

!117524 (comment 1387523693)