Efficient group level queries
What does this MR do?
This MR demonstrates a faster way to get ordered results with large IN
queries from the DB. (the code is a bit hacky ATM)
An example query where this optimization could be used:
SELECT issues.*
FROM issues
WHERE
project_id IN (SELECT ...projects in groups query)
ORDER BY id DESC
LIMIT 20
Examples:
- Paginating over records on the group level.
- Getting the min or max value of a column on the group level.
Examples where this method won't work:
- Count records within the group.
How does it work?
The new query can be split into 3 parts. I'm going to use the query mentioned above to describe the algorithm.
Part 1: CTE query to load all projects from the IN query. (array query)
SELECT "projects"."id" FROM "projects" WHERE "projects"."namespace_id" IN (SELECT "namespaces"."id" FROM "namespaces" WHERE (traversal_ids @> ('{9970}')));
This query loads all project ids from the group hierarchy. Ideally, we want to reduce the number of projects as much as possible to help the performance. All project-related access check can happen in this step.
Part 2: Initialize the data structures (initializer query)
We're going to use a recursive query to set up the row "structure". The first part of the recursive query is constructed the following way:
- Convert the result of the previous CTE query to an array: project_ids (array)
- For each project id, find the maximum issue id value and store them in an array: max_issue_ids (array)
- Additionally, we return an "empty" row. NULL::issues
The query would look something like this:
SELECT NULL::issues AS records, project_ids, max_issue_ids FROM (query to do the array transformation)
Note: the queries are quite complex to grasp at first. You can see full examples in the depesz execution plan below.
Part 3: Find next item and replace the row in the array (collector query)
This part is responsible for collecting the rows for the result. Algorithm:
- Find the maximum value and the position index from the previously constructed
max_issue_ids
array. - Using the array value (max_issue_ids[position]), find the next record (
id < max_issue_ids[position] LIMIT 1
). - Construct a new SELECT that replaces
max_issue_ids[position]
with `next_record.id - Repeat until reaching the LIMIT
SELECT (SELECT issues FROM issues WHERE id = max_issue_ids[position]), project_ids, max_issue_id[position - 1] || next_record.id || max_issue_id[position + 1]
Note: the first value of the SELECT is a "row", this column will contain the sorted records.
- Part 1: Load N project ids (number of projects) - (index only scan)
- Part 2: Load N issue ids - (index only scan)
- Part 3: Load M (LIMIT) issue ids - (index only scan) + Load M rows - (index scan)
Costs for 1000 projects, loading 20 issues in order: 2020 "rows" with index only scans and 20 "rows" index scans.
The original query would probably load all issues in the group hierarchy:
- 1000 index only scans for project ids.
- If we have 10K issues then it's 10K "rows" with index scan and an in-memory sort.
List Issues within a Group and its subgroups
- The query is ordered by the following columns:
ORDER BY "issues"."updated_at" DESC, "issues"."id" DESC
- The original query is similar to the query on the group issues page (no permission check and state filter).
Results:
- New query (cached, from PG.ai): 19ms, https://explain.depesz.com/s/eEIs
- New query (uncached, from PRD replica): 840ms, https://explain.depesz.com/s/JiC2
- Old query (cached, from PG.ai): 536ms, https://explain.depesz.com/s/O3TW
- Old query (uncached, from PG.ai): 119s, https://explain.depesz.com/s/S6Tc
- Old query (uncached, from PRD replica): 3.2s, https://explain.depesz.com/s/R87V
Determining the minimum or maximum relative position in a Group
- The query is used in issue rebalancing (issue placement).
Results:
- New query (cached, from PG.ai): 14ms, https://explain.depesz.com/s/WihL
- Old query (cached, from PG.ai): 165ms, https://explain.depesz.com/s/uuS1X
Example snippet
Iterating over issues in gitlab-org
ordered by relative position and id. Special keyset Order
object is needed because we use nulls last ordering.
# covers the "idx_issues_on_project_id_and_rel_asc_and_id" btree (project_id, relative_position, id) index
order = Gitlab::Pagination::Keyset::Order.build([
Gitlab::Pagination::Keyset::ColumnOrderDefinition.new(
attribute_name: 'relative_position',
column_expression: Issue.arel_table[:relative_position],
order_expression: Gitlab::Database.nulls_last_order('relative_position', 'ASC'),
reversed_order_expression: Gitlab::Database.nulls_first_order('relative_position', 'DESC'),
order_direction: :asc,
nullable: :nulls_last,
distinct: false,
),
Gitlab::Pagination::Keyset::ColumnOrderDefinition.new(
attribute_name: 'id',
order_expression: Issue.arel_table[:id].asc
)
])
opts = {
use_recursive_union_with_multi_index_scan: {
array_scope: Project.where(namespace_id: Group.find(9970).self_and_descendants.select("id")).select(:id),
array_mapping_scope: -> (id_expression) { Issue.where(Issue.arel_table[:project_id].eq(id_expression)) },
finder_query: -> (_relative_position_expression, id_expression) { Issue.where(Issue.arel_table[:id].eq(id_expression)) }
}
}
scope = Issue.order(order).limit(20)
Gitlab::Pagination::Keyset::Iterator.new(scope: scope, **opts).each_batch(of: 20) do |records|
puts records.map { |r| [r.relative_position, r.id].inspect }
end
We request project ids in every iteration, we can avoid that if we build an "in-memory" table. This boosts the performance a bit:
order = Gitlab::Pagination::Keyset::Order.build([
Gitlab::Pagination::Keyset::ColumnOrderDefinition.new(
attribute_name: 'relative_position',
column_expression: Issue.arel_table[:relative_position],
order_expression: Gitlab::Database.nulls_last_order('relative_position', 'ASC'),
reversed_order_expression: Gitlab::Database.nulls_first_order('relative_position', 'DESC'),
order_direction: :asc,
nullable: :nulls_last,
distinct: false,
),
Gitlab::Pagination::Keyset::ColumnOrderDefinition.new(
attribute_name: 'id',
order_expression: Issue.arel_table[:id].asc
)
])
project_ids = Project.where(namespace_id: Group.find(9970).self_and_descendants.select("id")).pluck(:id).map { |id| "(#{id}) "}.join(',')
array_scope = Project.from("(VALUES #{project_ids}) ids (id)").select(:id)
array_mapping_scope = -> (id_expression) { Issue.where(Issue.arel_table[:project_id].eq(id_expression)) }
finder_query = -> (__, id_expression) { Issue.where(Issue.arel_table[:id].eq(id_expression)) }
opts = {
use_recursive_union_with_multi_index_scan: {
array_scope: array_scope,
array_mapping_scope: array_mapping_scope,
finder_query: finder_query
}
}
scope = Issue.order(order).limit(20)
Gitlab::Pagination::Keyset::Iterator.new(scope: scope, **opts).each_batch(of: 20) do |records|
puts records.map { |r| [r.id].inspect }
end
How to construct the options
Consider the following group level query:
Issue.where(project_id: Project.where(namespace_id: Group.find(9970).self_and_descendants.select("id")).select(:id)).order(:id)
- Extract the group level query, make sure only the
id
column is selected.
array_scope = Project.where(namespace_id: Group.find(9970).self_and_descendants.select("id")).select(:id)
- We need a way to "map" the
projects.id
value to theissues
query. For this, the code will call a lambda where the number of arguments will be the same as the selected columns in thearray_scope
(id
only in this case). This scope will be merged with another query by theRecursiveUnionWithMultiIndexScan
class.
# id_expression is a safe SQL fragment so we need to use Arel to build an `eq` query
array_mapping_scope = -> (id_expression) { Issue.where(Issue.arel_table[:project_id].eq(id_expression)) }
- We need to find the row based on the
ORDER BY
columns. Define a lambda that will receive SQL fragments referencingORDER BY
column values. We only order byid
, so we have one parameter.
finder_query = -> (id_expression) { Issue.where(Issue.arel_table[:id].eq(id_expression)) }
Final code:
array_scope = Project.where(namespace_id: Group.find(9970).self_and_descendants.select("id")).select(:id)
array_mapping_scope = -> (id_expression) { Issue.where(Issue.arel_table[:project_id].eq(id_expression)) }
finder_query = -> (id_expression) { Issue.where(Issue.arel_table[:id].eq(id_expression)) }
opts = {
use_recursive_union_with_multi_index_scan: {
array_scope: array_scope,
array_mapping_scope: array_mapping_scope,
finder_query: finder_query
}
}
# needs project_id, id index
scope = Issue.order(:id).limit(20)
# iterate over the rows, optionally we could just paginate...
Gitlab::Pagination::Keyset::Iterator.new(scope: scope, **opts).each_batch(of: 20) do |records|
puts records.map { |r| [r.id].inspect }
end
Does this MR meet the acceptance criteria?
Conformity
-
I have included a changelog entry, or it's not needed. (Does this MR need a changelog?) -
I have added/updated documentation, or it's not needed. (Is documentation required?) -
I have properly separated EE content from FOSS, or this MR is FOSS only. (Where should EE code go?) -
I have added information for database reviewers in the MR description, or it's not needed. (Does this MR have database related changes?) -
I have self-reviewed this MR per code review guidelines. -
This MR does not harm performance, or I have asked a reviewer to help assess the performance impact. (Merge request performance guidelines) -
I have followed the style guides.
Availability and Testing
-
I have added/updated tests following the Testing Guide, or it's not needed. (Consider all test levels. See the Test Planning Process.) -
I have tested this MR in all supported browsers, or it's not needed. -
I have informed the Infrastructure department of a default or new setting change per definition of done, or it's not needed.
Security
Does this MR contain changes to processing or storing of credentials or tokens, authorization and authentication methods or other items described in the security review guidelines? If not, then delete this Security section.
-
Label as security and @ mention @gitlab-com/gl-security/appsec
-
The MR includes necessary changes to maintain consistency between UI, API, email, or other methods -
Security reports checked/validated by a reviewer from the AppSec team
Related to #330977 (closed)