Skip to content

GraphQL: Limit the amount of ids loaded when using package build infos

🏞 Context

When a package is pushed to the GitLab Package Registry from a CI job, the Package Registry will record a "link" between the package model and the CI pipeline object. This is for audit reasons (to be able questions like: how was this package pushed to the registry?).

Here is the (simplified) tables schema:

flowchart TD
    packages_build_infos --> packages_packages
    packages_build_infos --> ci_pipelines

The link is modeled quite simply. We have a build infos (table packages_build_infos) object that has two belongs_to statements that will reference a package (table packages_packages) and a pipeline (table ci_pipelines).

On the package model, to ease the object navigation, we have these statements:

has_many :build_infos, inverse_of: :package
has_many :pipelines, through: :build_infos, disable_joins: true

Notice the has_many through for :pipelines. During the database decomposition, we had to remove JOIN queries to ci_* tables. That is why we used the disable_joins: true option. This option will simply split the JOIN query in two:

  1. get the pipeline_ids from the all the build info objects of a given package
  2. select the pipelines with those pipeline_ids.

During our analysis, we noticed that some GraphQL queries could lead to a ~performance issue with the changes for the database decomposition. Here is the interesting excerpt:

GraphQL package -> pipelines

On `master`
Packages::Package Load (0.4ms)  SELECT "packages_packages".* FROM "packages_packages" WHERE "packages_packages"."id" = 619 
   lib/gitlab/graphql/loaders/batch_model_loader.rb:20:in `block in find'

[...]

Ci::Pipeline Load (0.6ms)  SELECT "ci_pipelines".* FROM "ci_pipelines" INNER JOIN "packages_build_infos" ON "ci_pipelines"."id" = "packages_build_infos"."pipeline_id" WHERE "packages_build_infos"."package_id" = 619 ORDER BY "ci_pipelines"."id" DESC LIMIT 100
With the changes above
Packages::Package Load (1.9ms)  SELECT "packages_packages".* FROM "packages_packages" WHERE "packages_packages"."id" = 619
   lib/gitlab/graphql/loaders/batch_model_loader.rb:20:in `block in find'

[...]

(0.8ms)  SELECT "packages_build_infos"."pipeline_id" FROM "packages_build_infos" WHERE "packages_build_infos"."package_id" = 619
  ↳ lib/gem_extensions/active_record/disable_joins/associations/association_scope.rb:32:in `block in last_scope_chain'
  Ci::Pipeline Load (2.7ms)  SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (135, 207, 106) ORDER BY "ci_pipelines"."id" DESC LIMIT 100 
  ↳ lib/gitlab/graphql/pagination/keyset/connection.rb:108:in `nodes'
  • No finders. Direct load of the package.
  • Impact: we have a cross join query

So basically, because of the disabled_joins: true, the query that load pipeline_id from the build objects load all the pipeline_id at once. As you can see above, we have some packages filled with several thousands of build info objects = several thousands pipeline_ids.

We opened #342881 (closed) for this issue and this MR tries to limit the number of pipeline_ids.

Digging deeper

At the SQL level, here is what we have:

SELECT "packages_build_infos"."pipeline_id" FROM "packages_build_infos" WHERE "packages_build_infos"."package_id" = 619

Ci::Pipeline Load (2.7ms)  SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (135, 207, 106) ORDER BY "ci_pipelines"."id" DESC LIMIT 100 

We can see that

  • the first query loads all the existing ids. <- 😱
  • the second query loads a subset (there is a LIMIT option)

This leads us to: why would you load all the pipeline_ids in the first query if you load only 100 pipelines in the second query.

If we could apply the pagination parameters on that query on build objects, we would perfectly limit the size of the results returned.

Refining the solution

Right from the beginning, we knew that we had to create a "pipelines" custom resolver for the Types::Packages::PackageDetailsType. That resolver will not simply call package.pipelines, instead we need to get the pagination parameters and apply them on package.build_infos. From there, we can .pluck the pipeline_ids and load the Ci::Pipeline objects by id.

Accessing the pagination parameters from a resolver class is not directly possible. To workaround this limitation, we had to create a new connection extension, that will not remove the pagination parameters. Instead they are going to be passed to the resolver.

In the resolver, it is then "just" a question to read those pagination parameters and build the proper build_infos query.

🔮 Additional points

  • We have several GraphQL queries that can load packages but here we solely deal with the query package -> pipelines, eg. a single package is loaded and its pipelines are loaded.
    • In particular, we don't change what happens in multi packages queries. See #346076.
  • There is on ongoing discussion about completely removing those pipelines fields from the package GraphQL types. See #342881 (comment 747991561).
    • In short, if the plan is accepted, those pipelines fields in package related GraphQL types are going to be sunset and follow the deprecation process we have in place.

What does this MR do and why?

  • Add a new connection extension that don't remove the pagination parameters: Gitlab::Graphql::Extensions::ConnectionWithPaginationParameters
  • Add a new resolver for the pipelines field on the Types::Packages::PackageDetailsType: Resolvers::PackagePipelinesResolver
  • Update/add the related specs
    • In particular, improve the request specs for the single package query

Screenshots or screen recordings

I created a package with 10 pipelines.

Here are the results

No pagination params

GraphQL
query {
  package(id: "gid://gitlab/Packages::Package/788") {
    id
    pipelines {
      nodes {
        id
      }
      pageInfo {
        endCursor
        hasNextPage
        hasPreviousPage
        startCursor
      }
    }
  }
}

On master

SELECT "packages_build_infos"."pipeline_id" FROM "packages_build_infos" WHERE "packages_build_infos"."package_id" = 788

SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (356, 357, 358, 359, 360, 361, 362, 363, 364, 365) ORDER BY "ci_pipelines"."id" DESC LIMIT 100

With this MR

SELECT "packages_build_infos"."pipeline_id" FROM "packages_build_infos" WHERE "packages_build_infos"."package_id" = 788 AND "packages_build_infos"."pipeline_id" IS NOT NULL ORDER BY "packages_build_infos"."pipeline_id" DESC LIMIT 101

SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (365, 364, 363, 362, 361, 360, 359, 358, 357, 356) ORDER BY "ci_pipelines"."id" DESC LIMIT 100
  • On this MR, the build infos query is properly limited

First: 2

GraphQL
query {
  package(id: "gid://gitlab/Packages::Package/788") {
    id
    pipelines(first: 2) {
      nodes {
        id
      }
      pageInfo {
        endCursor
        hasNextPage
        hasPreviousPage
        startCursor
      }
    }
  }
}

On master

SELECT "packages_build_infos"."pipeline_id" FROM "packages_build_infos" WHERE "packages_build_infos"."package_id" = 788

SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (356, 357, 358, 359, 360, 361, 362, 363, 364, 365) ORDER BY "ci_pipelines"."id" DESC LIMIT 2

With this MR

SELECT "packages_build_infos"."pipeline_id" FROM "packages_build_infos" WHERE "packages_build_infos"."package_id" = 788 AND "packages_build_infos"."pipeline_id" IS NOT NULL ORDER BY "packages_build_infos"."pipeline_id" DESC LIMIT 3

SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (365, 364, 363) ORDER BY "ci_pipelines"."id" DESC LIMIT 2
  • On this MR, the build infos query is properly limited

First: 2, After: x

GraphQL

query {
  package(id: "gid://gitlab/Packages::Package/788") {
    id
    pipelines(first: 2, after: "eyJpZCI6IjM2NCJ9") {
      nodes {
        id
      }
      pageInfo {
        endCursor
        hasNextPage
        hasPreviousPage
        startCursor
      }
    }
  }
}

On master

SELECT "packages_build_infos"."pipeline_id" FROM "packages_build_infos" WHERE "packages_build_infos"."package_id" = 788

SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (356, 357, 358, 359, 360, 361, 362, 363, 364, 365) AND ("ci_pipelines"."id" < 364) ORDER BY "ci_pipelines"."id" DESC LIMIT 2

With this MR

SELECT "packages_build_infos"."pipeline_id" FROM "packages_build_infos" WHERE "packages_build_infos"."package_id" = 788 AND "packages_build_infos"."pipeline_id" IS NOT NULL AND (pipeline_id < 364) ORDER BY "packages_build_infos"."pipeline_id" DESC LIMIT 3

SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (363, 362, 361) AND ("ci_pipelines"."id" < 364) ORDER BY "ci_pipelines"."id" DESC LIMIT 2
  • On this MR, the build infos query is properly limited

Last: 2

GraphQL
query {
  package(id: "gid://gitlab/Packages::Package/788") {
    id
    pipelines(last: 2) {
      nodes {
        id
      }
      pageInfo {
        endCursor
        hasNextPage
        hasPreviousPage
        startCursor
      }
    }
  }
}

On master

SELECT "packages_build_infos"."pipeline_id" FROM "packages_build_infos" WHERE "packages_build_infos"."package_id" = 788

SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (356, 357, 358, 359, 360, 361, 362, 363, 364, 365) ORDER BY "ci_pipelines"."id" ASC LIMIT 3

With this MR

SELECT "packages_build_infos"."pipeline_id" FROM "packages_build_infos" WHERE "packages_build_infos"."package_id" = 788 AND "packages_build_infos"."pipeline_id" IS NOT NULL ORDER BY "packages_build_infos"."pipeline_id" ASC LIMIT 3

SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (356, 357, 358) ORDER BY "ci_pipelines"."id" ASC LIMIT 3
  • On this MR, the build infos query is properly limited

Last: 2, Before: x

GraphQL
query {
  package(id: "gid://gitlab/Packages::Package/788") {
    id
    pipelines(last: 2, before: "eyJpZCI6IjM1NyJ9") {
      nodes {
        id
      }
      pageInfo {
        endCursor
        hasNextPage
        hasPreviousPage
        startCursor
      }
    }
  }
}

On master

SELECT "packages_build_infos"."pipeline_id" FROM "packages_build_infos" WHERE "packages_build_infos"."package_id" = 788

SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (356, 357, 358, 359, 360, 361, 362, 363, 364, 365) AND ("ci_pipelines"."id" > 357) ORDER BY "ci_pipelines"."id" ASC LIMIT 3

With this MR

SELECT "packages_build_infos"."pipeline_id" FROM "packages_build_infos" WHERE "packages_build_infos"."package_id" = 788 AND "packages_build_infos"."pipeline_id" IS NOT NULL AND (pipeline_id > 357) ORDER BY "packages_build_infos"."pipeline_id" ASC LIMIT 3

SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (358, 359, 360) AND ("ci_pipelines"."id" > 357) ORDER BY "ci_pipelines"."id" ASC LIMIT 3
  • On this MR, the build infos query is properly limited

How to set up and validate locally

  1. Let's create a package with 10 pipelines. In a rails console:
    # creates a package
    def fixture_file_upload(*args, **kwargs)
      Rack::Test::UploadedFile.new(*args, **kwargs)
    end
    
    # note the package id here
    pkg = FactoryBot.create(:npm_package, project: Project.first)
    
    # creates 10 pipelines
    Packages::BuildInfo.create!(package_id: pkg.id, pipeline_id: FactoryBot.create(:ci_pipeline).id)
    Packages::BuildInfo.create!(package_id: pkg.id, pipeline_id: FactoryBot.create(:ci_pipeline).id)
    Packages::BuildInfo.create!(package_id: pkg.id, pipeline_id: FactoryBot.create(:ci_pipeline).id)
    Packages::BuildInfo.create!(package_id: pkg.id, pipeline_id: FactoryBot.create(:ci_pipeline).id)
    Packages::BuildInfo.create!(package_id: pkg.id, pipeline_id: FactoryBot.create(:ci_pipeline).id)
    Packages::BuildInfo.create!(package_id: pkg.id, pipeline_id: FactoryBot.create(:ci_pipeline).id)
    Packages::BuildInfo.create!(package_id: pkg.id, pipeline_id: FactoryBot.create(:ci_pipeline).id)
    Packages::BuildInfo.create!(package_id: pkg.id, pipeline_id: FactoryBot.create(:ci_pipeline).id)
    Packages::BuildInfo.create!(package_id: pkg.id, pipeline_id: FactoryBot.create(:ci_pipeline).id)
    Packages::BuildInfo.create!(package_id: pkg.id, pipeline_id: FactoryBot.create(:ci_pipeline).id)
  2. You can now play around with all the cases listed in the Screenshots section.

💿 Database review

🆙 Migration up

$ rails db:migrate

== 20211202135508 AddIndexOnPackagesBuildInfosPackageIdPipelineId: migrating ==
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:packages_build_infos, [:package_id, :pipeline_id], {:name=>"index_packages_build_infos_package_id_pipeline_id", :algorithm=>:concurrently})
   -> 0.0035s
-- execute("SET statement_timeout TO 0")
   -> 0.0007s
-- add_index(:packages_build_infos, [:package_id, :pipeline_id], {:name=>"index_packages_build_infos_package_id_pipeline_id", :algorithm=>:concurrently})
   -> 0.0151s
-- execute("RESET statement_timeout")
   -> 0.0006s
-- transaction_open?()
   -> 0.0000s
-- indexes(:packages_build_infos)
   -> 0.0016s
-- remove_index(:packages_build_infos, {:algorithm=>:concurrently, :name=>"idx_packages_build_infos_on_package_id"})
   -> 0.0078s
== 20211202135508 AddIndexOnPackagesBuildInfosPackageIdPipelineId: migrated (0.0327s) 

Migration down

$ rails db:rollback

== 20211202135508 AddIndexOnPackagesBuildInfosPackageIdPipelineId: reverting ==
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:packages_build_infos, :package_id, {:name=>"idx_packages_build_infos_on_package_id", :algorithm=>:concurrently})
   -> 0.0027s
-- execute("SET statement_timeout TO 0")
   -> 0.0005s
-- add_index(:packages_build_infos, :package_id, {:name=>"idx_packages_build_infos_on_package_id", :algorithm=>:concurrently})
   -> 0.0050s
-- execute("RESET statement_timeout")
   -> 0.0006s
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:packages_build_infos, [:package_id, :pipeline_id], {:name=>"index_packages_build_infos_package_id_pipeline_id", :algorithm=>:concurrently})
   -> 0.0024s
-- remove_index(:packages_build_infos, {:name=>"index_packages_build_infos_package_id_pipeline_id", :algorithm=>:concurrently, :column=>[:package_id, :pipeline_id]})
   -> 0.0057s
== 20211202135508 AddIndexOnPackagesBuildInfosPackageIdPipelineId: reverted (0.0203s) 

🔬 Explain plans

🛃 MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by David Fernandez

Merge request reports