Lazy array resolver

Alex Kalderimis requested to merge ajk-gql-lazy-collections into master

What does this MR do?

This adds a batch loader capable of loading collections of values, rather than just single items. This comes at the cost of returning arrays rather than relations, which means we cannot use it in connections, with the pagination they provide; but by batching requests we can get significant performance benefits.
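To make the batching idea concrete, here is a toy batch loader in plain Ruby. This is an illustrative sketch only, not GitLab's actual BatchLoader implementation, and every name in it is hypothetical:

```ruby
# Toy batch loader (hypothetical, for illustration only): callers register
# keys and receive a thunk; forcing any thunk resolves every pending key
# with a single batched call.
class ToyBatchLoader
  def initialize(&batch_fn)
    @batch_fn = batch_fn # receives all pending keys, returns { key => values }
    @pending = []
    @results = {}
  end

  # Registers a key and returns a lazy handle to its result.
  def load(key)
    @pending << key
    -> { sync; @results[key] }
  end

  private

  # Resolves all pending keys in one call, e.g. one UNION query.
  def sync
    return if @pending.empty?
    @results.merge!(@batch_fn.call(@pending.uniq))
    @pending.clear
  end
end

loader = ToyBatchLoader.new do |project_ids|
  # In the real resolver this would be one UNION query over all keys.
  project_ids.to_h { |id| [id, ["issue-for-project-#{id}"]] }
end

a = loader.load(1)
b = loader.load(2)
a.call # forces the batch: both keys are resolved together
```

Both handles are registered before either is forced, so the batch function runs once for both keys instead of once per key.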

Why do we need this?

If we run the following query:

query {
  a: project(fullPath: "x/a") {
    issues {
      nodes { reference }
    }
  }
  b: project(fullPath: "x/b") {
    issues {
      nodes { reference }
    }
  }
}

we need to run a query against the issues table for each project: here twice, once with project_id = X and once with project_id = Y. But we can run these queries at the same time in a union, which reduces the number of queries we need to run. That saves us one DB request in this query, and potentially many more as this scales up.

A more real-world example, as proposed in !44703 (merged), is the following:

query {
  currentUser {
    authoredMergeRequests(state: open) {
      headPipeline {
        jobs {
          name
          status
          duration
          trace {
            sections { name }
          }
        }
      }
    }
  }
}

Here we fetch the latest pipeline jobs for all open MRs of the current user. If we implement Pipeline.jobs using a caching array resolver, this entire query will only issue 4 (logical) queries: one for the MRs of the current user, one for the pipelines (assuming effective batch-loading), one for the jobs, and one for the trace sections.

If implemented naively, the number of queries would be linear in the number of merge requests, and then linear in the number of jobs for each MR (one trace-sections query per job).

If we returned 30 merge requests, each of which has 100 jobs, and each job has 5 sections (realistic and conservative numbers), the query counts would be:

with array-resolver: 4
without: 1 (MRs) + 1 (pipelines) + 30 (jobs) + 100 (sections) = 132
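As a sanity check on that arithmetic, a trivial sketch with the numbers above:

```ruby
# Query counts for the example above: 30 open MRs, 100 jobs per pipeline.
mrs  = 30
jobs = 100

with_array_resolver = 4           # MRs + pipelines + jobs + sections, each batched
without = 1 + 1 + mrs + jobs      # one jobs query per MR, one sections query per job

without # => 132
```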

What are the alternatives?

We already have some tools for reducing N+1 performance issues, namely preloading using lookahead. The limitations of preloading are:

  • it only works for collections nested within collections (e.g. issues -> assignees), but not when the nodes are disjoint in the query (as with project A -> issues and project B -> issues in the example above);
  • it only works for associations that Rails knows about, and the benefit disappears as soon as we start accepting arguments to filter the list or use finders. There are some workarounds for the problem of filtering, but they are somewhat ugly and we don't have general support for this pattern. We have no way at all of using preloading if we need to use finders.
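For contrast, lookahead-based preloading looks roughly like the following. This is a hedged sketch: Lookahead here is a stand-in for graphql-ruby's lookahead object, and the field and helper names are hypothetical:

```ruby
# Hypothetical sketch of lookahead-driven preloading: we only preload an
# association when the incoming query actually selects the matching field.
Lookahead = Struct.new(:selected_fields) do
  def selects?(field)
    selected_fields.include?(field)
  end
end

def preloads_for(lookahead)
  preloads = []
  preloads << :assignees if lookahead.selects?(:assignees)
  preloads << :labels    if lookahead.selects?(:labels)
  preloads # a real resolver would pass these to relation.preload(*preloads)
end

preloads_for(Lookahead.new([:assignees])) # => [:assignees]
```

This avoids N+1 queries for nested associations, but as the bullets above note, it cannot batch across disjoint parts of the query the way the union approach can.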

What are the limitations?

This approach is not perfect - it avoids running multiple queries but:

  • it returns arrays, not relations, which means pagination is less efficient, as we always have to load the full result set. The approach is therefore suitable only for small result sets. This is fine for things like pipeline.jobs or trace.sections, where the cardinality is low and the client will typically want all results rather than pages.
  • the queries could be more efficient. If we do a union of three queries, and an item is in two of them, we will load a logically identical item twice from two distinct rows. It may be possible to improve on this.
  • each branch of the union is the same query as if we were running each query separately. We get batching, and thus less network overhead, and less DB connection contention, but the raw speedup is likely to be modest after taking that overhead into account.

SQL

If we use this new abstraction to batch issue lookups as follows:

class X < Resolvers::CachingArrayResolver
  # The model whose table every batched query selects from.
  def model_class
    ::Issue
  end

  # Maps this resolver's arguments to the batch key.
  def query_input(project_id:)
    project_id
  end

  # Builds the relation for a single batch key; each of these
  # becomes one branch of the UNION.
  def query_for(project_id)
    model_class.where(project_id: project_id)
  end
end

Use of this resolver produces queries like:

SELECT "issues".* FROM ((SELECT "issues".*, 0 AS union_member_idx FROM "issues" WHERE "issues"."project_id" = 1 LIMIT 100)
UNION
(SELECT "issues".*, 1 AS union_member_idx FROM "issues" WHERE "issues"."project_id" = 2 LIMIT 100)
UNION
(SELECT "issues".*, 2 AS union_member_idx FROM "issues" WHERE "issues"."project_id" = 3 LIMIT 100)) issues

Rather than three separate queries for:

SELECT "issues".* FROM "issues" WHERE "issues"."project_id" = 1

and

SELECT "issues".* FROM "issues" WHERE "issues"."project_id" = 2

and

SELECT "issues".* FROM "issues" WHERE "issues"."project_id" = 3

An example query plan for this is as follows:

 HashAggregate  (cost=403.20..406.20 rows=300 width=1284) (actual time=14.265..14.269 rows=0 loops=1)

   Group Key: issues.id, issues.title, issues.author_id, issues.project_id, issues.created_at, issues.updated_at, issues.description, issues.milestone_id, issues.iid, issues.updated_by_id, issues.weight, issues.confidential, issues.moved_to_id, issues.due_date, issues.lock_version, issues.title_html, issues.description_html, issues.time_estimate, issues.relative_position, issues.service_desk_reply_to, issues.cached_markdown_version, issues.last_edited_at, issues.last_edited_by_id, issues.discussion_locked, issues.closed_at, issues.closed_by_id, issues.state_id, issues.duplicated_to_id, issues.promoted_to_epic_id, issues.health_status, issues.external_key, issues.sprint_id, issues.issue_type, issues.blocking_issues_count, (0)

   Buffers: shared hit=8 read=4

   I/O Timings: read=14.103
   ->  Append  (cost=0.56..376.95 rows=300 width=1284) (actual time=14.260..14.262 rows=0 loops=1)
         Buffers: shared hit=8 read=4
         I/O Timings: read=14.103
         ->  Limit  (cost=0.56..124.15 rows=100 width=1271) (actual time=14.201..14.202 rows=0 loops=1)
               Buffers: shared read=4
               I/O Timings: read=14.103
               ->  Index Scan using index_issues_on_project_id_and_iid on public.issues  (cost=0.56..299.64 rows=242 width=1271) (actual time=14.199..14.199 rows=0 loops=1)
                     Index Cond: (issues.project_id = 1)
                     Buffers: shared read=4
                     I/O Timings: read=14.103
         ->  Limit  (cost=0.56..124.15 rows=100 width=1271) (actual time=0.045..0.046 rows=0 loops=1)
               Buffers: shared hit=4
               ->  Index Scan using index_issues_on_project_id_and_iid on public.issues issues_1  (cost=0.56..299.64 rows=242 width=1271) (actual time=0.043..0.043 rows=0 loops=1)
                     Index Cond: (issues_1.project_id = 2)
                     Buffers: shared hit=4
         ->  Limit  (cost=0.56..124.15 rows=100 width=1271) (actual time=0.010..0.010 rows=0 loops=1)
               Buffers: shared hit=4
               ->  Index Scan using index_issues_on_project_id_and_iid on public.issues issues_2  (cost=0.56..299.64 rows=242 width=1271) (actual time=0.009..0.009 rows=0 loops=1)
                     Index Cond: (issues_2.project_id = 3)
                     Buffers: shared hit=4

Summary:

Time: 15.476 ms
  - planning: 0.936 ms
  - execution: 14.540 ms
    - I/O read: 14.103 ms
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 8 (~64.00 KiB) from the buffer pool
  - reads: 4 (~32.00 KiB) from the OS file cache, including disk I/O
  - dirtied: 0
  - writes: 0
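Once the union returns, the union_member_idx column added to each branch lets the loader route rows back to the caller that requested them. A minimal sketch of that routing in plain Ruby, with hypothetical row data:

```ruby
# Rows as returned by a UNION like the one above; union_member_idx records
# which branch (and therefore which batched key) each row came from.
rows = [
  { id: 10, project_id: 1, union_member_idx: 0 },
  { id: 11, project_id: 2, union_member_idx: 1 },
  { id: 12, project_id: 2, union_member_idx: 1 },
]
keys = [1, 2, 3] # the batched project IDs, in the order they were enqueued

by_branch = rows.group_by { |row| row[:union_member_idx] }
results   = keys.each_index.map { |i| by_branch.fetch(i, []) }

results.map(&:length) # => [1, 2, 0]  (project 3 had no issues)
```

Each caller receives exactly the rows from its own branch, including an empty array when its branch matched nothing.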

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

This is an opt-in enhancement, and includes tests for its guarantees. Since use of this abstraction will result in larger queries (that is the whole point), it is possible that we might see query timeouts. A database review is necessary.

Edited by Alex Kalderimis
