Skip to content

Remove duplicate members before joining with users table

Reuben Pereira requested to merge 331837-modified-user-member-query into master

What does this MR do?

The members table can contain the same user_id multiple times if the user_id is part of multiple groups/projects. Currently, the SQL query generated by Group#billed_project_members joins the members and users tables. The duplicate user_ids in the members table can result in unnecessary memory usage.

This MR modifies Group#billed_project_members to filter out duplicate members rows first in a subquery, and the subquery is used as an IN condition to query the users table.

See #329936 (comment 581777031).

Original query

SELECT DISTINCT
    "members"."user_id"
FROM
    "members"
    LEFT OUTER JOIN "users" ON "members"."user_id" = "users"."id"
WHERE
    "members"."type" = 'ProjectMember'
    AND "members"."source_type" = 'Project'
    AND "users"."state" = 'active'
    AND "members"."requested_at" IS NULL
    AND "members"."invite_token" IS NULL
    AND (members.access_level > 5)
    AND ("users"."user_type" IS NULL
        OR "users"."user_type" != 6)
    AND "members"."source_id" IN (
        SELECT
            "projects"."id"
        FROM
            "projects"
            INNER JOIN "namespaces" ON "namespaces"."type" = 'Group'
                AND "namespaces"."id" = "projects"."namespace_id"
                AND "namespaces"."type" = 'Group'
        WHERE
            "projects"."namespace_id" IN ( WITH RECURSIVE "base_and_descendants" AS ((
                        SELECT
                            "namespaces".*
                        FROM
                            "namespaces"
                        WHERE
                            "namespaces"."type" = 'Group'
                            AND "namespaces"."id" = $id)
                    UNION (
                        SELECT
                            "namespaces".*
                        FROM
                            "namespaces",
                            "base_and_descendants"
                        WHERE
                            "namespaces"."type" = 'Group'
                            AND "namespaces"."parent_id" = "base_and_descendants"."id"))
                    SELECT
                        id
                    FROM
                        "base_and_descendants" AS "namespaces"))

https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/4336/commands/15054

Time: 6.720 min  
  - planning: 21.091 ms  
  - execution: 6.719 min  
    - I/O read: 5.711 min  
    - I/O write: 0.000 ms  
  
Shared buffers:  
  - hits: 14367071 (~109.60 GiB) from the buffer pool  
  - reads: 124161 (~970.00 MiB) from the OS file cache, including disk I/O  
  - dirtied: 28491 (~222.60 MiB)  
  - writes: 0  
  
Temp buffers:  
  - reads: 4980 (~38.90 MiB)  
  - writes: 4982 (~38.90 MiB)  

Modified query

SELECT DISTINCT
    "users"."id"
FROM
    "users"
WHERE ("users"."state" IN ('active'))
    AND ("users"."user_type" IS NULL
        OR "users"."user_type" != 6)
    AND "users"."id" IN ( SELECT DISTINCT
            "members"."user_id"
        FROM
            "members"
        WHERE
            "members"."type" = 'ProjectMember'
            AND "members"."source_type" = 'Project'
            AND "members"."requested_at" IS NULL
            AND "members"."invite_token" IS NULL
            AND (members.access_level > 5)
            AND "members"."source_id" IN (
                SELECT
                    "projects"."id"
                FROM
                    "projects"
                    INNER JOIN "namespaces" ON "namespaces"."type" = 'Group'
                        AND "namespaces"."id" = "projects"."namespace_id"
                        AND "namespaces"."type" = 'Group'
                WHERE
                    "projects"."namespace_id" IN ( WITH RECURSIVE "base_and_descendants" AS ((
                                SELECT
                                    "namespaces".*
                                FROM
                                    "namespaces"
                                WHERE
                                    "namespaces"."type" = 'Group'
                                    AND "namespaces"."id" = $id)
                            UNION (
                                SELECT
                                    "namespaces".*
                                FROM
                                    "namespaces",
                                    "base_and_descendants"
                                WHERE
                                    "namespaces"."type" = 'Group'
                                    AND "namespaces"."parent_id" = "base_and_descendants"."id"))
                            SELECT
                                id
                            FROM
                                "base_and_descendants" AS "namespaces")))

https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/4336/commands/15056

Time: 19.561 s
  - planning: 11.955 ms
  - execution: 19.549 s
    - I/O read: 11.309 s
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 2740836 (~20.90 GiB) from the buffer pool
  - reads: 124164 (~970.00 MiB) from the OS file cache, including disk I/O
  - dirtied: 28491 (~222.60 MiB)
  - writes: 0

Temp buffers:
  - reads: 4981 (~38.90 MiB)
  - writes: 4983 (~38.90 MiB)

Modified query with non_guests_only

SELECT DISTINCT
    "users"."id"
FROM
    "users"
WHERE ("users"."state" IN ('active'))
    AND ("users"."user_type" IS NULL
        OR "users"."user_type" != 6)
    AND "users"."id" IN ( SELECT DISTINCT
            "members"."user_id"
        FROM
            "members"
        WHERE
            "members"."type" = 'ProjectMember'
            AND "members"."source_type" = 'Project'
            AND "members"."requested_at" IS NULL
            AND "members"."invite_token" IS NULL
            AND (members.access_level > 5)
            AND (members.access_level > 10)
            AND "members"."source_id" IN (
                SELECT
                    "projects"."id"
                FROM
                    "projects"
                    INNER JOIN "namespaces" ON "namespaces"."type" = 'Group'
                        AND "namespaces"."id" = "projects"."namespace_id"
                        AND "namespaces"."type" = 'Group'
                WHERE
                    "projects"."namespace_id" IN ( WITH RECURSIVE "base_and_descendants" AS ((
                                SELECT
                                    "namespaces".*
                                FROM
                                    "namespaces"
                                WHERE
                                    "namespaces"."type" = 'Group'
                                    AND "namespaces"."id" = $id)
                            UNION (
                                SELECT
                                    "namespaces".*
                                FROM
                                    "namespaces",
                                    "base_and_descendants"
                                WHERE
                                    "namespaces"."type" = 'Group'
                                    AND "namespaces"."parent_id" = "base_and_descendants"."id"))
                            SELECT
                                id
                            FROM
                                "base_and_descendants" AS "namespaces")))

https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/4348/commands/15075

Time: 1.523 min  
  - planning: 9.039 ms  
  - execution: 1.523 min  
    - I/O read: 1.421 min  
    - I/O write: 0.000 ms  
  
Shared buffers:  
  - hits: 2740834 (~20.90 GiB) from the buffer pool  
  - reads: 124162 (~970.00 MiB) from the OS file cache, including disk I/O  
  - dirtied: 28491 (~222.60 MiB)  
  - writes: 0  
  
Temp buffers:  
  - reads: 4970 (~38.80 MiB)  
  - writes: 4972 (~38.80 MiB) 

Original query using traversal_ids

The linear traversal_ids query doesn't help much here: https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/4336/commands/15061

Time: 3.393 min  
  - planning: 11.414 ms  
  - execution: 3.393 min  
    - I/O read: 2.674 min  
    - I/O write: 0.000 ms  
  
Shared buffers:  
  - hits: 14362631 (~109.60 GiB) from the buffer pool  
  - reads: 123927 (~968.20 MiB) from the OS file cache, including disk I/O  
  - dirtied: 28491 (~222.60 MiB)  
  - writes: 0  
  
Temp buffers:  
  - reads: 4980 (~38.90 MiB)  
  - writes: 4982 (~38.90 MiB)  

Screenshots (strongly suggested)

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

Does this MR contain changes to processing or storing of credentials or tokens, authorization and authentication methods or other items described in the security review guidelines? If not, then delete this Security section.

  • Label as security and @ mention @gitlab-com/gl-security/appsec
  • The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • Security reports checked/validated by a reviewer from the AppSec team

Related to #331837, #329936 (closed), #328464 (closed)

Edited by Reuben Pereira

Merge request reports