Remove duplicate members before joining with users table
What does this MR do?
The members
table can contain the same user_id
multiple times if the user_id
is part of multiple groups/projects. Currently, the SQL query generated by Group#billed_project_members
joins the members
and users
tables. The duplicate user_ids in the members table can result in unnecessary memory usage.
This MR modifies Group#billed_project_members
to filter out duplicate members
rows first in a subquery, and the subquery is used as an IN
condition to query the users table.
See #329936 (comment 581777031).
Original query
SELECT DISTINCT
"members"."user_id"
FROM
"members"
LEFT OUTER JOIN "users" ON "members"."user_id" = "users"."id"
WHERE
"members"."type" = 'ProjectMember'
AND "members"."source_type" = 'Project'
AND "users"."state" = 'active'
AND "members"."requested_at" IS NULL
AND "members"."invite_token" IS NULL
AND (members.access_level > 5)
AND ("users"."user_type" IS NULL
OR "users"."user_type" != 6)
AND "members"."source_id" IN (
SELECT
"projects"."id"
FROM
"projects"
INNER JOIN "namespaces" ON "namespaces"."type" = 'Group'
AND "namespaces"."id" = "projects"."namespace_id"
AND "namespaces"."type" = 'Group'
WHERE
"projects"."namespace_id" IN ( WITH RECURSIVE "base_and_descendants" AS ((
SELECT
"namespaces".*
FROM
"namespaces"
WHERE
"namespaces"."type" = 'Group'
AND "namespaces"."id" = $id)
UNION (
SELECT
"namespaces".*
FROM
"namespaces",
"base_and_descendants"
WHERE
"namespaces"."type" = 'Group'
AND "namespaces"."parent_id" = "base_and_descendants"."id"))
SELECT
id
FROM
"base_and_descendants" AS "namespaces"))
https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/4336/commands/15054
Time: 6.720 min
- planning: 21.091 ms
- execution: 6.719 min
- I/O read: 5.711 min
- I/O write: 0.000 ms
Shared buffers:
- hits: 14367071 (~109.60 GiB) from the buffer pool
- reads: 124161 (~970.00 MiB) from the OS file cache, including disk I/O
- dirtied: 28491 (~222.60 MiB)
- writes: 0
Temp buffers:
- reads: 4980 (~38.90 MiB)
- writes: 4982 (~38.90 MiB)
Modified query
SELECT DISTINCT
"users"."id"
FROM
"users"
WHERE ("users"."state" IN ('active'))
AND ("users"."user_type" IS NULL
OR "users"."user_type" != 6)
AND "users"."id" IN ( SELECT DISTINCT
"members"."user_id"
FROM
"members"
WHERE
"members"."type" = 'ProjectMember'
AND "members"."source_type" = 'Project'
AND "members"."requested_at" IS NULL
AND "members"."invite_token" IS NULL
AND (members.access_level > 5)
AND "members"."source_id" IN (
SELECT
"projects"."id"
FROM
"projects"
INNER JOIN "namespaces" ON "namespaces"."type" = 'Group'
AND "namespaces"."id" = "projects"."namespace_id"
AND "namespaces"."type" = 'Group'
WHERE
"projects"."namespace_id" IN ( WITH RECURSIVE "base_and_descendants" AS ((
SELECT
"namespaces".*
FROM
"namespaces"
WHERE
"namespaces"."type" = 'Group'
AND "namespaces"."id" = $id)
UNION (
SELECT
"namespaces".*
FROM
"namespaces",
"base_and_descendants"
WHERE
"namespaces"."type" = 'Group'
AND "namespaces"."parent_id" = "base_and_descendants"."id"))
SELECT
id
FROM
"base_and_descendants" AS "namespaces")))
https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/4336/commands/15056
Time: 19.561 s
- planning: 11.955 ms
- execution: 19.549 s
- I/O read: 11.309 s
- I/O write: 0.000 ms
Shared buffers:
- hits: 2740836 (~20.90 GiB) from the buffer pool
- reads: 124164 (~970.00 MiB) from the OS file cache, including disk I/O
- dirtied: 28491 (~222.60 MiB)
- writes: 0
Temp buffers:
- reads: 4981 (~38.90 MiB)
- writes: 4983 (~38.90 MiB)
Modified query with non_guests_only
SELECT DISTINCT
"users"."id"
FROM
"users"
WHERE ("users"."state" IN ('active'))
AND ("users"."user_type" IS NULL
OR "users"."user_type" != 6)
AND "users"."id" IN ( SELECT DISTINCT
"members"."user_id"
FROM
"members"
WHERE
"members"."type" = 'ProjectMember'
AND "members"."source_type" = 'Project'
AND "members"."requested_at" IS NULL
AND "members"."invite_token" IS NULL
AND (members.access_level > 5)
AND (members.access_level > 10)
AND "members"."source_id" IN (
SELECT
"projects"."id"
FROM
"projects"
INNER JOIN "namespaces" ON "namespaces"."type" = 'Group'
AND "namespaces"."id" = "projects"."namespace_id"
AND "namespaces"."type" = 'Group'
WHERE
"projects"."namespace_id" IN ( WITH RECURSIVE "base_and_descendants" AS ((
SELECT
"namespaces".*
FROM
"namespaces"
WHERE
"namespaces"."type" = 'Group'
AND "namespaces"."id" = $id)
UNION (
SELECT
"namespaces".*
FROM
"namespaces",
"base_and_descendants"
WHERE
"namespaces"."type" = 'Group'
AND "namespaces"."parent_id" = "base_and_descendants"."id"))
SELECT
id
FROM
"base_and_descendants" AS "namespaces")))
https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/4348/commands/15075
Time: 1.523 min
- planning: 9.039 ms
- execution: 1.523 min
- I/O read: 1.421 min
- I/O write: 0.000 ms
Shared buffers:
- hits: 2740834 (~20.90 GiB) from the buffer pool
- reads: 124162 (~970.00 MiB) from the OS file cache, including disk I/O
- dirtied: 28491 (~222.60 MiB)
- writes: 0
Temp buffers:
- reads: 4970 (~38.80 MiB)
- writes: 4972 (~38.80 MiB)
Original query using traversal_ids
The linear traversal_ids query doesn't help much here: https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/4336/commands/15061
Time: 3.393 min
- planning: 11.414 ms
- execution: 3.393 min
- I/O read: 2.674 min
- I/O write: 0.000 ms
Shared buffers:
- hits: 14362631 (~109.60 GiB) from the buffer pool
- reads: 123927 (~968.20 MiB) from the OS file cache, including disk I/O
- dirtied: 28491 (~222.60 MiB)
- writes: 0
Temp buffers:
- reads: 4980 (~38.90 MiB)
- writes: 4982 (~38.90 MiB)
Screenshots (strongly suggested)
Does this MR meet the acceptance criteria?
Conformity
-
I have included a changelog entry, or it's not needed. (Does this MR need a changelog?) -
I have added/updated documentation, or it's not needed. (Is documentation required?) -
I have properly separated EE content from FOSS, or this MR is FOSS only. (Where should EE code go?) -
I have added information for database reviewers in the MR description, or it's not needed. (Does this MR have database related changes?) -
I have self-reviewed this MR per code review guidelines. -
This MR does not harm performance, or I have asked a reviewer to help assess the performance impact. (Merge request performance guidelines) -
I have followed the style guides.
Availability and Testing
-
I have added/updated tests following the Testing Guide, or it's not needed. (Consider all test levels. See the Test Planning Process.) - I have tested this MR in all supported browsers, or it's not needed.
- I have informed the Infrastructure department of a default or new setting change per definition of done, or it's not needed.
Security
Does this MR contain changes to processing or storing of credentials or tokens, authorization and authentication methods or other items described in the security review guidelines? If not, then delete this Security section.
- Label as security and @ mention
@gitlab-com/gl-security/appsec
- The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
- Security reports checked/validated by a reviewer from the AppSec team
Related to #331837, #329936 (closed), #328464 (closed)