
Efficient group level queries

Adam Hegyi requested to merge 330977-efficient-group-level-queries into master

What does this MR do?

This MR demonstrates a faster way to get ordered results for large IN queries from the DB (the code is a bit hacky ATM).

An example query where this optimization could be used:

SELECT issues.* 
FROM issues
WHERE
project_id IN (SELECT ...projects in groups query)
ORDER BY id DESC
LIMIT 20

Examples:

  • Paginating over records on the group level.
  • Getting the min or max value of a column on the group level.

Examples where this method won't work:

  • Count records within the group.

How does it work?

The new query can be split into 3 parts. I'm going to use the query mentioned above to describe the algorithm.

Part 1: CTE query to load all projects from the IN query. (array query)

SELECT "projects"."id" FROM "projects" WHERE "projects"."namespace_id" IN (SELECT "namespaces"."id" FROM "namespaces" WHERE (traversal_ids @> ('{9970}')));

This query loads all project ids from the group hierarchy. Ideally, we want to reduce the number of projects as much as possible to help performance. All project-related access checks can happen in this step.
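For example (illustrative only; the real permission checks in GitLab are more involved than this), a simple visibility filter could be pushed into the array query so that the later parts only see permitted projects:

-- illustrative only: restrict the Part 1 (array) query to public projects
SELECT "projects"."id"
FROM "projects"
WHERE "projects"."namespace_id" IN (SELECT "namespaces"."id" FROM "namespaces" WHERE (traversal_ids @> ('{9970}')))
AND "projects"."visibility_level" = 20; -- 20 = public visibility (assumed filter for this example)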

Part 2: Initialize the data structures (initializer query)

We're going to use a recursive query to set up the row "structure". The first part of the recursive query is constructed in the following way:

  • Convert the result of the previous CTE query to an array: project_ids (array)
  • For each project id, find the maximum issue id value and store these values in an array: max_issue_ids (array)
  • Additionally, we return an "empty" row: NULL::issues

The query would look something like this:

SELECT NULL::issues AS records, project_ids, max_issue_ids FROM (query to do the array transformation)

Note: the queries are quite complex to grasp at first. You can see full examples in the depesz execution plan below.
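To give a rough idea of the shape, here is a minimal, illustrative version of the initializer, assuming the Part 1 query is available as an array_cte CTE (the query this MR actually generates is more involved):

WITH array_cte AS (
  -- Part 1: project ids in the group hierarchy
  SELECT "projects"."id"
  FROM "projects"
  WHERE "projects"."namespace_id" IN (SELECT "namespaces"."id" FROM "namespaces" WHERE (traversal_ids @> ('{9970}')))
)
SELECT
  NULL::issues AS records, -- the "empty" row
  project_ids,             -- all project ids as an array
  ARRAY(                   -- the maximum issue id per project
    SELECT (SELECT max("issues"."id") FROM "issues" WHERE "issues"."project_id" = p.id)
    FROM unnest(project_ids) AS p(id)
  ) AS max_issue_ids
FROM (SELECT ARRAY(SELECT id FROM array_cte) AS project_ids) AS array_query;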

Part 3: Find next item and replace the row in the array (collector query)

This part is responsible for collecting the rows for the result. Algorithm:

  • Find the maximum value and its position index in the previously constructed max_issue_ids array (see the illustrative snippet below).
  • Using the array value (max_issue_ids[position]), find the next record (id < max_issue_ids[position] LIMIT 1).
  • Construct a new SELECT that replaces max_issue_ids[position] with next_record.id.
  • Repeat until reaching the LIMIT.

SELECT (SELECT issues FROM issues WHERE id = max_issue_ids[position]), project_ids, max_issue_ids[:position - 1] || next_record.id || max_issue_ids[position + 1:]

Note: the first value of the SELECT is a "row"; this column will contain the sorted records.
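For illustration, the "find the maximum value and its position" step can be expressed with unnest and array_position. This is a standalone sketch with a literal array standing in for max_issue_ids, not the exact SQL generated by this MR:

-- illustrative only: locate the current maximum and its slot in the array
SELECT
  (SELECT max(v) FROM unnest(max_issue_ids) AS v) AS max_value,
  array_position(
    max_issue_ids,
    (SELECT max(v) FROM unnest(max_issue_ids) AS v)
  ) AS position
FROM (SELECT ARRAY[10, 42, 7]::bigint[] AS max_issue_ids) AS t;
-- => max_value: 42, position: 2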

Expected cost of the new query:

  • Part 1: Load N project ids (number of projects) - (index only scan)
  • Part 2: Load N issue ids (the maximum issue id per project) - (index only scan)
  • Part 3: Load M (LIMIT) issue ids - (index only scan) + load M rows - (index scan)

Costs for 1000 projects, loading 20 issues in order: 1000 + 1000 + 20 = 2020 "rows" with index only scans and 20 "rows" with index scans.

The original query would probably load all issues in the group hierarchy:

  • 1000 "rows" with index only scans for the project ids.
  • If the group has 10K issues, that's 10K "rows" with index scans and an in-memory sort.

List Issues within a Group and its subgroups

  • The query is ordered by the following columns: ORDER BY "issues"."updated_at" DESC, "issues"."id" DESC
  • The original query is similar to the query on the group issues page (without the permission check and state filter).

Results:

Determining the minimum or maximum relative position in a Group

  • The query is used in issue rebalancing (issue placement).

Results:

Example snippet

Iterating over issues in gitlab-org ordered by relative position and id. A special keyset Order object is needed because we use NULLS LAST ordering.

# covers the "idx_issues_on_project_id_and_rel_asc_and_id" btree (project_id, relative_position, id) index
order = Gitlab::Pagination::Keyset::Order.build([
  Gitlab::Pagination::Keyset::ColumnOrderDefinition.new(
    attribute_name: 'relative_position',
    column_expression: Issue.arel_table[:relative_position],
    order_expression: Gitlab::Database.nulls_last_order('relative_position', 'ASC'),
    reversed_order_expression: Gitlab::Database.nulls_first_order('relative_position', 'DESC'),
    order_direction: :asc,
    nullable: :nulls_last,
    distinct: false,
  ),
  Gitlab::Pagination::Keyset::ColumnOrderDefinition.new(
    attribute_name: 'id',
    order_expression: Issue.arel_table[:id].asc
  )
])

opts = {
  use_recursive_union_with_multi_index_scan: {
    array_scope: Project.where(namespace_id: Group.find(9970).self_and_descendants.select("id")).select(:id),
    array_mapping_scope: -> (id_expression) { Issue.where(Issue.arel_table[:project_id].eq(id_expression)) },
    finder_query: -> (_relative_position_expression, id_expression) { Issue.where(Issue.arel_table[:id].eq(id_expression)) }
  }
}

scope = Issue.order(order).limit(20)
Gitlab::Pagination::Keyset::Iterator.new(scope: scope, **opts).each_batch(of: 20) do |records|
  puts records.map { |r| [r.relative_position, r.id].inspect }
end

We request project ids in every iteration; we can avoid that by building an "in-memory" table. This boosts performance a bit:

order = Gitlab::Pagination::Keyset::Order.build([
  Gitlab::Pagination::Keyset::ColumnOrderDefinition.new(
    attribute_name: 'relative_position',
    column_expression: Issue.arel_table[:relative_position],
    order_expression: Gitlab::Database.nulls_last_order('relative_position', 'ASC'),
    reversed_order_expression: Gitlab::Database.nulls_first_order('relative_position', 'DESC'),
    order_direction: :asc,
    nullable: :nulls_last,
    distinct: false,
  ),
  Gitlab::Pagination::Keyset::ColumnOrderDefinition.new(
    attribute_name: 'id',
    order_expression: Issue.arel_table[:id].asc
  )
])

project_ids = Project.where(namespace_id: Group.find(9970).self_and_descendants.select("id")).pluck(:id).map { |id| "(#{id})" }.join(',')
array_scope = Project.from("(VALUES #{project_ids}) ids (id)").select(:id)
array_mapping_scope = -> (id_expression) { Issue.where(Issue.arel_table[:project_id].eq(id_expression)) }
finder_query = -> (__, id_expression) { Issue.where(Issue.arel_table[:id].eq(id_expression)) }

opts = {
  use_recursive_union_with_multi_index_scan: {
    array_scope: array_scope,
    array_mapping_scope: array_mapping_scope,
    finder_query: finder_query
  }
}

scope = Issue.order(order).limit(20)
Gitlab::Pagination::Keyset::Iterator.new(scope: scope, **opts).each_batch(of: 20) do |records|
  puts records.map { |r| [r.id].inspect }
end

How to construct the options

Consider the following group level query:

Issue.where(project_id: Project.where(namespace_id: Group.find(9970).self_and_descendants.select("id")).select(:id)).order(:id)

  1. Extract the group level query and make sure only the id column is selected.

array_scope = Project.where(namespace_id: Group.find(9970).self_and_descendants.select("id")).select(:id)

  2. We need a way to "map" the projects.id value to the issues query. For this, the code calls a lambda whose number of arguments matches the number of columns selected in the array_scope (only id in this case). This scope will be merged with another query by the RecursiveUnionWithMultiIndexScan class.

# id_expression is a SQL fragment (not a bind value), so we use Arel to build the `eq` condition
array_mapping_scope = -> (id_expression) { Issue.where(Issue.arel_table[:project_id].eq(id_expression)) }

  3. We need to find the row based on the ORDER BY columns. Define a lambda that will receive SQL fragments referencing the ORDER BY column values. We only order by id, so we have one parameter.

finder_query = -> (id_expression) { Issue.where(Issue.arel_table[:id].eq(id_expression)) }

Final code:

array_scope = Project.where(namespace_id: Group.find(9970).self_and_descendants.select("id")).select(:id)
array_mapping_scope = -> (id_expression) { Issue.where(Issue.arel_table[:project_id].eq(id_expression)) }
finder_query = -> (id_expression) { Issue.where(Issue.arel_table[:id].eq(id_expression)) }

opts = {
  use_recursive_union_with_multi_index_scan: {
    array_scope: array_scope,
    array_mapping_scope: array_mapping_scope,
    finder_query: finder_query
  }
}

# needs an index on (project_id, id)
scope = Issue.order(:id).limit(20)

# iterate over the rows, optionally we could just paginate...
Gitlab::Pagination::Keyset::Iterator.new(scope: scope, **opts).each_batch(of: 20) do |records|
  puts records.map { |r| [r.id].inspect }
end

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

Does this MR contain changes to processing or storing of credentials or tokens, authorization and authentication methods or other items described in the security review guidelines? If not, then delete this Security section.

  • Label as security and @ mention @gitlab-com/gl-security/appsec
  • The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • Security reports checked/validated by a reviewer from the AppSec team

Related to #330977 (closed)

