
Efficient group level queries

Adam Hegyi requested to merge 330977-efficient-group-level-queries into master

What does this MR do?

This MR demonstrates a faster way to get ordered results for large IN queries from the DB (the code is a bit hacky ATM).

An example query where this optimization could be used:

SELECT issues.* 
FROM issues
WHERE
project_id IN (SELECT ...projects in groups query)
ORDER BY id DESC
LIMIT 20

Examples:

  • Paginating over records on the group level.
  • Getting the min or max value of a column on the group level.

Examples where this method won't work:

  • Count records within the group.

How does it work?

The new query can be split into 3 parts. I'm going to use the query mentioned above to describe the algorithm.

Part 1: CTE query to load all projects from the IN query. (array query)

SELECT "projects"."id" FROM "projects" WHERE "projects"."namespace_id" IN (SELECT "namespaces"."id" FROM "namespaces" WHERE (traversal_ids @> ('{9970}')));

This query loads all project ids from the group hierarchy. Ideally, we want to reduce the number of projects as much as possible to help performance. All project-related access checks can happen in this step.
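For example (illustrative only; the real permission checks in GitLab are more involved than this), a simple visibility filter could be pushed into the array query so that the later parts only see permitted projects:

-- illustrative only: restrict the Part 1 (array) query to public projects
SELECT "projects"."id"
FROM "projects"
WHERE "projects"."namespace_id" IN (SELECT "namespaces"."id" FROM "namespaces" WHERE (traversal_ids @> ('{9970}')))
AND "projects"."visibility_level" = 20; -- 20 = public visibility (assumed filter for this example)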

Part 2: Initialize the data structures (initializer query)

We're going to use a recursive query to set up the row "structure". The first part of the recursive query is constructed in the following way:

  • Convert the result of the previous CTE query to an array: project_ids (array)
  • For each project id, find the maximum issue id value and store these values in an array: max_issue_ids (array)
  • Additionally, we return an "empty" row: NULL::issues

The query would look something like this:

SELECT NULL::issues AS records, project_ids, max_issue_ids FROM (query to do the array transformation)

Note: the queries are quite complex to grasp at first. You can see full examples in the depesz execution plan below.
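To give a rough idea of the shape, here is a minimal, illustrative version of the initializer, assuming the Part 1 query is available as an array_cte CTE (the query this MR actually generates is more involved):

WITH array_cte AS (
  -- Part 1: project ids in the group hierarchy
  SELECT "projects"."id"
  FROM "projects"
  WHERE "projects"."namespace_id" IN (SELECT "namespaces"."id" FROM "namespaces" WHERE (traversal_ids @> ('{9970}')))
)
SELECT
  NULL::issues AS records, -- the "empty" row
  project_ids,             -- all project ids as an array
  ARRAY(                   -- the maximum issue id per project
    SELECT (SELECT max("issues"."id") FROM "issues" WHERE "issues"."project_id" = p.id)
    FROM unnest(project_ids) AS p(id)
  ) AS max_issue_ids
FROM (SELECT ARRAY(SELECT id FROM array_cte) AS project_ids) AS array_query;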

Part 3: Find next item and replace the row in the array (collector query)

This part is responsible for collecting the rows for the result. Algorithm:

  • Find the maximum value and its position index in the previously constructed max_issue_ids array (see the illustrative snippet below).
  • Using the array value (max_issue_ids[position]), find the next record (id < max_issue_ids[position] LIMIT 1).
  • Construct a new SELECT that replaces max_issue_ids[position] with next_record.id.
  • Repeat until reaching the LIMIT.

SELECT (SELECT issues FROM issues WHERE id = max_issue_ids[position]), project_ids, max_issue_ids[:position - 1] || next_record.id || max_issue_ids[position + 1:]

Note: the first value of the SELECT is a "row"; this column will contain the sorted records.
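For illustration, the "find the maximum value and its position" step can be expressed with unnest and array_position. This is a standalone sketch with a literal array standing in for max_issue_ids, not the exact SQL generated by this MR:

-- illustrative only: locate the current maximum and its slot in the array
SELECT
  (SELECT max(v) FROM unnest(max_issue_ids) AS v) AS max_value,
  array_position(
    max_issue_ids,
    (SELECT max(v) FROM unnest(max_issue_ids) AS v)
  ) AS position
FROM (SELECT ARRAY[10, 42, 7]::bigint[] AS max_issue_ids) AS t;
-- => max_value: 42, position: 2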

Expected cost of the new query:

  • Part 1: Load N project ids (number of projects) - (index only scan)
  • Part 2: Load N issue ids (the maximum issue id per project) - (index only scan)
  • Part 3: Load M (LIMIT) issue ids - (index only scan) + load M rows - (index scan)

Costs for 1000 projects, loading 20 issues in order: 1000 + 1000 + 20 = 2020 "rows" with index only scans and 20 "rows" with index scans.

The original query would probably load all issues in the group hierarchy:

  • 1000 "rows" with index only scans for the project ids.
  • If the group has 10K issues, that's 10K "rows" with index scans and an in-memory sort.

List Issues within a Group and its subgroups

  • The query is ordered by the following columns: ORDER BY "issues"."updated_at" DESC, "issues"."id" DESC
  • The original query is similar to the query on the group issues page (without the permission check and state filter).

Results:

Determining the minimum or maximum relative position in a Group

  • The query is used in issue rebalancing (issue placement).

Results:

Example snippet

Iterating over issues in gitlab-org ordered by relative position and id. A special keyset Order object is needed because we use NULLS LAST ordering.

# covers the "idx_issues_on_project_id_and_rel_asc_and_id" btree (project_id, relative_position, id) index
order = Gitlab::Pagination::Keyset::Order.build([
  Gitlab::Pagination::Keyset::ColumnOrderDefinition.new(
    attribute_name: 'relative_position',
    column_expression: Issue.arel_table[:relative_position],
    order_expression: Gitlab::Database.nulls_last_order('relative_position', 'ASC'),
    reversed_order_expression: Gitlab::Database.nulls_first_order('relative_position', 'DESC'),
    order_direction: :asc,
    nullable: :nulls_last,
    distinct: false,
  ),
  Gitlab::Pagination::Keyset::ColumnOrderDefinition.new(
    attribute_name: 'id',
    order_expression: Issue.arel_table[:id].asc
  )
])

opts = {
  use_recursive_union_with_multi_index_scan: {
    array_scope: Project.where(namespace_id: Group.find(9970).self_and_descendants.select("id")).select(:id),
    array_mapping_scope: -> (id_expression) { Issue.where(Issue.arel_table[:project_id].eq(id_expression)) },
    finder_query: -> (_relative_position_expression, id_expression) { Issue.where(Issue.arel_table[:id].eq(id_expression)) }
  }
}

scope = Issue.order(order).limit(20)
Gitlab::Pagination::Keyset::Iterator.new(scope: scope, **opts).each_batch(of: 20) do |records|
  puts records.map { |r| [r.relative_position, r.id].inspect }
end

We request project ids in every iteration; we can avoid that by building an "in-memory" table. This boosts performance a bit:

order = Gitlab::Pagination::Keyset::Order.build([
  Gitlab::Pagination::Keyset::ColumnOrderDefinition.new(
    attribute_name: 'relative_position',
    column_expression: Issue.arel_table[:relative_position],
    order_expression: Gitlab::Database.nulls_last_order('relative_position', 'ASC'),
    reversed_order_expression: Gitlab::Database.nulls_first_order('relative_position', 'DESC'),
    order_direction: :asc,
    nullable: :nulls_last,
    distinct: false,
  ),
  Gitlab::Pagination::Keyset::ColumnOrderDefinition.new(
    attribute_name: 'id',
    order_expression: Issue.arel_table[:id].asc
  )
])

project_ids = Project.where(namespace_id: Group.find(9970).self_and_descendants.select("id")).pluck(:id).map { |id| "(#{id})" }.join(',')
array_scope = Project.from("(VALUES #{project_ids}) ids (id)").select(:id)
array_mapping_scope = -> (id_expression) { Issue.where(Issue.arel_table[:project_id].eq(id_expression)) }
finder_query = -> (__, id_expression) { Issue.where(Issue.arel_table[:id].eq(id_expression)) }

opts = {
  use_recursive_union_with_multi_index_scan: {
    array_scope: array_scope,
    array_mapping_scope: array_mapping_scope,
    finder_query: finder_query
  }
}

scope = Issue.order(order).limit(20)
Gitlab::Pagination::Keyset::Iterator.new(scope: scope, **opts).each_batch(of: 20) do |records|
  puts records.map { |r| [r.id].inspect }
end

How to construct the options

Consider the following group level query:

Issue.where(project_id: Project.where(namespace_id: Group.find(9970).self_and_descendants.select("id")).select(:id)).order(:id)

  1. Extract the group level query and make sure only the id column is selected.

array_scope = Project.where(namespace_id: Group.find(9970).self_and_descendants.select("id")).select(:id)

  2. We need a way to "map" the projects.id value to the issues query. For this, the code calls a lambda whose number of arguments matches the number of columns selected in the array_scope (only id in this case). This scope will be merged with another query by the RecursiveUnionWithMultiIndexScan class.

# id_expression is a SQL fragment (not a bind value), so we use Arel to build the `eq` condition
array_mapping_scope = -> (id_expression) { Issue.where(Issue.arel_table[:project_id].eq(id_expression)) }

  3. We need to find the row based on the ORDER BY columns. Define a lambda that will receive SQL fragments referencing the ORDER BY column values. We only order by id, so we have one parameter.

finder_query = -> (id_expression) { Issue.where(Issue.arel_table[:id].eq(id_expression)) }

Final code:

array_scope = Project.where(namespace_id: Group.find(9970).self_and_descendants.select("id")).select(:id)
array_mapping_scope = -> (id_expression) { Issue.where(Issue.arel_table[:project_id].eq(id_expression)) }
finder_query = -> (id_expression) { Issue.where(Issue.arel_table[:id].eq(id_expression)) }

opts = {
  use_recursive_union_with_multi_index_scan: {
    array_scope: array_scope,
    array_mapping_scope: array_mapping_scope,
    finder_query: finder_query
  }
}

# needs an index on (project_id, id)
scope = Issue.order(:id).limit(20)

# iterate over the rows, optionally we could just paginate...
Gitlab::Pagination::Keyset::Iterator.new(scope: scope, **opts).each_batch(of: 20) do |records|
  puts records.map { |r| [r.id].inspect }
end

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

Does this MR contain changes to processing or storing of credentials or tokens, authorization and authentication methods or other items described in the security review guidelines? If not, then delete this Security section.

  • Label as security and @ mention @gitlab-com/gl-security/appsec
  • The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • Security reports checked/validated by a reviewer from the AppSec team

Related to #330977 (closed)

