GraphQL: Limit the amount of ids loaded when using package build infos
## 🏞 Context
When a package is pushed to the GitLab Package Registry from a CI job, the Package Registry will record a "link" between the package model and the CI pipeline object. This is for audit reasons (to be able to answer questions like: how was this package pushed to the registry?).
Here is the (simplified) tables schema:

```mermaid
flowchart TD
  packages_build_infos --> packages_packages
  packages_build_infos --> ci_pipelines
```
The link is modeled quite simply. We have a build info object (table `packages_build_infos`) with two `belongs_to` statements that reference a package (table `packages_packages`) and a pipeline (table `ci_pipelines`).

On the package model, to ease object navigation, we have these statements:

```ruby
has_many :build_infos, inverse_of: :package
has_many :pipelines, through: :build_infos, disable_joins: true
```
Notice the `has_many through` for `:pipelines`. During the database decomposition, we had to remove `JOIN` queries to `ci_*` tables. That is why we used the `disable_joins: true` option. This option simply splits the `JOIN` query in two:

1. get the `pipeline_id`s from all the build info objects of a given package
2. select the pipelines with those `pipeline_id`s
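As a plain-Ruby sketch (fake in-memory data and made-up constant names, not GitLab code), the split amounts to:

```ruby
# Fake rows standing in for packages_build_infos and ci_pipelines.
BUILD_INFO_ROWS = [
  { package_id: 619, pipeline_id: 135 },
  { package_id: 619, pipeline_id: 207 },
  { package_id: 619, pipeline_id: 106 },
]
PIPELINE_ROWS = { 135 => 'pipeline 135', 207 => 'pipeline 207', 106 => 'pipeline 106' }

# Step 1: get the pipeline_ids from all the build info objects of the package.
pipeline_ids = BUILD_INFO_ROWS.select { |row| row[:package_id] == 619 }
                              .map { |row| row[:pipeline_id] }

# Step 2: select the pipelines with those pipeline_ids (no JOIN involved).
pipelines = PIPELINE_ROWS.values_at(*pipeline_ids)
```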
During our analysis, we noticed that some GraphQL queries could lead to a performance issue with the changes for the database decomposition. Here is the interesting excerpt:
**GraphQL `package -> pipelines`**

On `master`:

```sql
Packages::Package Load (0.4ms) SELECT "packages_packages".* FROM "packages_packages" WHERE "packages_packages"."id" = 619
↳ lib/gitlab/graphql/loaders/batch_model_loader.rb:20:in `block in find'
[...]
Ci::Pipeline Load (0.6ms) SELECT "ci_pipelines".* FROM "ci_pipelines" INNER JOIN "packages_build_infos" ON "ci_pipelines"."id" = "packages_build_infos"."pipeline_id" WHERE "packages_build_infos"."package_id" = 619 ORDER BY "ci_pipelines"."id" DESC LIMIT 100
```

With the changes above:

```sql
Packages::Package Load (1.9ms) SELECT "packages_packages".* FROM "packages_packages" WHERE "packages_packages"."id" = 619
↳ lib/gitlab/graphql/loaders/batch_model_loader.rb:20:in `block in find'
[...]
(0.8ms) SELECT "packages_build_infos"."pipeline_id" FROM "packages_build_infos" WHERE "packages_build_infos"."package_id" = 619
↳ lib/gem_extensions/active_record/disable_joins/associations/association_scope.rb:32:in `block in last_scope_chain'
Ci::Pipeline Load (2.7ms) SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (135, 207, 106) ORDER BY "ci_pipelines"."id" DESC LIMIT 100
↳ lib/gitlab/graphql/pagination/keyset/connection.rb:108:in `nodes'
```
- No finders. Direct load of the package.
- ⚠ Impact: we have a cross join query
  - Used `disable_joins: true` to solve it
- 🐛 Can load an unbounded amount of pipeline ids in memory
  - We have around ~2000 requests of this type per day on this GraphQL query.
  - The highest count of build infos we have for a single package on gitlab.com is ~26K.
So basically, because of the `disable_joins: true` option, the query that loads the `pipeline_id`s from the build info objects loads all the `pipeline_id`s at once. As you can see above, we have some packages filled with several thousand build info objects = several thousand `pipeline_id`s.

We opened #342881 (closed) for this issue and this MR tries to limit the number of `pipeline_id`s loaded.
## ⛏ Digging deeper
At the SQL level, here is what we have:

```sql
SELECT "packages_build_infos"."pipeline_id" FROM "packages_build_infos" WHERE "packages_build_infos"."package_id" = 619

Ci::Pipeline Load (2.7ms) SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (135, 207, 106) ORDER BY "ci_pipelines"."id" DESC LIMIT 100
```
We can see that:

- the first query loads all the existing ids 😱
- the second query loads a subset (there is a `LIMIT` clause)

This leads us to: why load all the `pipeline_id`s in the first query if only 100 pipelines are loaded in the second query?
If we could apply the pagination parameters to that first query on build info objects, we could properly bound the size of the results returned.
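For reference, keyset pagination typically bounds a page with a `LIMIT` of page size plus one, so that `hasNextPage` can be computed without an extra count query. Here is a stdlib-only Ruby sketch of that trick (the `paginate` helper name is made up):

```ruby
# Ask for one row more than the page size: if we get it, a next page exists.
# The extra row is dropped from the returned nodes.
def paginate(ids, first:)
  window = ids.first(first + 1)
  { nodes: window.first(first), has_next_page: window.size > first }
end

paginate([365, 364, 363, 362], first: 2)
# => {:nodes=>[365, 364], :has_next_page=>true}
```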
## ⚗ Refining the solution
Right from the beginning, we knew that we had to create a custom `pipelines` resolver for the `Types::Packages::PackageDetailsType`. That resolver will not simply call `package.pipelines`; instead, we need to get the pagination parameters and apply them to `package.build_infos`. From there, we can `.pluck` the `pipeline_id`s and load the `Ci::Pipeline` objects by id.

Accessing the pagination parameters from a resolver class is not directly possible. To work around this limitation, we had to create a new connection extension that does not remove the pagination parameters; instead, they are passed on to the resolver.

In the resolver, it is then "just" a matter of reading those pagination parameters and building the proper `build_infos` query.
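To make the idea concrete, here is a plain-Ruby simulation of the resolver's core logic (fake ids, arrays instead of ActiveRecord relations; the real code lives in `Resolvers::PackagePipelinesResolver`, and the names below are illustrative):

```ruby
# The package's build infos reference these pipeline ids (fake data).
PIPELINE_IDS = [356, 357, 358, 359, 360, 361, 362, 363, 364, 365]

# Apply the GraphQL pagination parameters to the pipeline_id list itself,
# mirroring the bounded build_infos query this MR generates.
def resolve_pipeline_ids(ids, first:, after: nil)
  ids = ids.compact.sort.reverse                # pipeline_id IS NOT NULL, ORDER BY ... DESC
  ids = ids.select { |id| id < after } if after # keyset condition decoded from the cursor
  ids.first(first + 1)                          # LIMIT first + 1
end

resolve_pipeline_ids(PIPELINE_IDS, first: 2)
# => [365, 364, 363]
resolve_pipeline_ids(PIPELINE_IDS, first: 2, after: 364)
# => [363, 362, 361]
```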
## 🔮 Additional points
- We have several GraphQL queries that can load packages but here we solely deal with the query `package -> pipelines`, i.e. a single package is loaded and its pipelines are loaded.
  - In particular, we don't change what happens in multi-package queries. See #346076.
- There is an ongoing discussion about completely removing those `pipelines` fields from the package GraphQL types. See #342881 (comment 747991561).
  - In short, if the plan is accepted, those `pipelines` fields in package-related GraphQL types are going to be sunset and follow the deprecation process we have in place.
## What does this MR do and why?
- Add a new connection extension that doesn't remove the pagination parameters: `Gitlab::Graphql::Extensions::ConnectionWithPaginationParameters`
- Add a new resolver for the `pipelines` field on the `Types::Packages::PackageDetailsType`: `Resolvers::PackagePipelinesResolver`
- Update/add the related specs
  - In particular, improve the request specs for the single package query
## Screenshots or screen recordings
I created a package with 10 pipelines.
Here are the results:
### No pagination params

GraphQL:

```graphql
query {
  package(id: "gid://gitlab/Packages::Package/788") {
    id
    pipelines {
      nodes {
        id
      }
      pageInfo {
        endCursor
        hasNextPage
        hasPreviousPage
        startCursor
      }
    }
  }
}
```

On `master`:

```sql
SELECT "packages_build_infos"."pipeline_id" FROM "packages_build_infos" WHERE "packages_build_infos"."package_id" = 788
SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (356, 357, 358, 359, 360, 361, 362, 363, 364, 365) ORDER BY "ci_pipelines"."id" DESC LIMIT 100
```

With this MR:

```sql
SELECT "packages_build_infos"."pipeline_id" FROM "packages_build_infos" WHERE "packages_build_infos"."package_id" = 788 AND "packages_build_infos"."pipeline_id" IS NOT NULL ORDER BY "packages_build_infos"."pipeline_id" DESC LIMIT 101
SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (365, 364, 363, 362, 361, 360, 359, 358, 357, 356) ORDER BY "ci_pipelines"."id" DESC LIMIT 100
```

On this MR, the build infos query is properly limited ✅
### First: 2

GraphQL:

```graphql
query {
  package(id: "gid://gitlab/Packages::Package/788") {
    id
    pipelines(first: 2) {
      nodes {
        id
      }
      pageInfo {
        endCursor
        hasNextPage
        hasPreviousPage
        startCursor
      }
    }
  }
}
```

On `master`:

```sql
SELECT "packages_build_infos"."pipeline_id" FROM "packages_build_infos" WHERE "packages_build_infos"."package_id" = 788
SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (356, 357, 358, 359, 360, 361, 362, 363, 364, 365) ORDER BY "ci_pipelines"."id" DESC LIMIT 2
```

With this MR:

```sql
SELECT "packages_build_infos"."pipeline_id" FROM "packages_build_infos" WHERE "packages_build_infos"."package_id" = 788 AND "packages_build_infos"."pipeline_id" IS NOT NULL ORDER BY "packages_build_infos"."pipeline_id" DESC LIMIT 3
SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (365, 364, 363) ORDER BY "ci_pipelines"."id" DESC LIMIT 2
```

On this MR, the build infos query is properly limited ✅
### First: 2, After: x

GraphQL:

```graphql
query {
  package(id: "gid://gitlab/Packages::Package/788") {
    id
    pipelines(first: 2, after: "eyJpZCI6IjM2NCJ9") {
      nodes {
        id
      }
      pageInfo {
        endCursor
        hasNextPage
        hasPreviousPage
        startCursor
      }
    }
  }
}
```

On `master`:

```sql
SELECT "packages_build_infos"."pipeline_id" FROM "packages_build_infos" WHERE "packages_build_infos"."package_id" = 788
SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (356, 357, 358, 359, 360, 361, 362, 363, 364, 365) AND ("ci_pipelines"."id" < 364) ORDER BY "ci_pipelines"."id" DESC LIMIT 2
```

With this MR:

```sql
SELECT "packages_build_infos"."pipeline_id" FROM "packages_build_infos" WHERE "packages_build_infos"."package_id" = 788 AND "packages_build_infos"."pipeline_id" IS NOT NULL AND (pipeline_id < 364) ORDER BY "packages_build_infos"."pipeline_id" DESC LIMIT 3
SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (363, 362, 361) AND ("ci_pipelines"."id" < 364) ORDER BY "ci_pipelines"."id" DESC LIMIT 2
```

On this MR, the build infos query is properly limited ✅
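As an aside, the `after` cursor used in this case is just Base64-encoded JSON, which is where the `pipeline_id < 364` condition comes from. Decoding it with the Ruby stdlib:

```ruby
require 'base64'
require 'json'

cursor = 'eyJpZCI6IjM2NCJ9' # the after: value from the query above
payload = JSON.parse(Base64.decode64(cursor))
payload
# => {"id"=>"364"}
```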
### Last: 2

GraphQL:

```graphql
query {
  package(id: "gid://gitlab/Packages::Package/788") {
    id
    pipelines(last: 2) {
      nodes {
        id
      }
      pageInfo {
        endCursor
        hasNextPage
        hasPreviousPage
        startCursor
      }
    }
  }
}
```

On `master`:

```sql
SELECT "packages_build_infos"."pipeline_id" FROM "packages_build_infos" WHERE "packages_build_infos"."package_id" = 788
SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (356, 357, 358, 359, 360, 361, 362, 363, 364, 365) ORDER BY "ci_pipelines"."id" ASC LIMIT 3
```

With this MR:

```sql
SELECT "packages_build_infos"."pipeline_id" FROM "packages_build_infos" WHERE "packages_build_infos"."package_id" = 788 AND "packages_build_infos"."pipeline_id" IS NOT NULL ORDER BY "packages_build_infos"."pipeline_id" ASC LIMIT 3
SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (356, 357, 358) ORDER BY "ci_pipelines"."id" ASC LIMIT 3
```

On this MR, the build infos query is properly limited ✅
### Last: 2, Before: x

GraphQL:

```graphql
query {
  package(id: "gid://gitlab/Packages::Package/788") {
    id
    pipelines(last: 2, before: "eyJpZCI6IjM1NyJ9") {
      nodes {
        id
      }
      pageInfo {
        endCursor
        hasNextPage
        hasPreviousPage
        startCursor
      }
    }
  }
}
```

On `master`:

```sql
SELECT "packages_build_infos"."pipeline_id" FROM "packages_build_infos" WHERE "packages_build_infos"."package_id" = 788
SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (356, 357, 358, 359, 360, 361, 362, 363, 364, 365) AND ("ci_pipelines"."id" > 357) ORDER BY "ci_pipelines"."id" ASC LIMIT 3
```

With this MR:

```sql
SELECT "packages_build_infos"."pipeline_id" FROM "packages_build_infos" WHERE "packages_build_infos"."package_id" = 788 AND "packages_build_infos"."pipeline_id" IS NOT NULL AND (pipeline_id > 357) ORDER BY "packages_build_infos"."pipeline_id" ASC LIMIT 3
SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (358, 359, 360) AND ("ci_pipelines"."id" > 357) ORDER BY "ci_pipelines"."id" ASC LIMIT 3
```

On this MR, the build infos query is properly limited ✅
## ⚙ How to set up and validate locally
- Let's create a package with 10 pipelines. In a rails console:

  ```ruby
  # creates a package
  def fixture_file_upload(*args, **kwargs)
    Rack::Test::UploadedFile.new(*args, **kwargs)
  end

  # note the package id here
  pkg = FactoryBot.create(:npm_package, project: Project.first)

  # creates 10 pipelines
  10.times do
    Packages::BuildInfo.create!(package_id: pkg.id, pipeline_id: FactoryBot.create(:ci_pipeline).id)
  end
  ```

- You can now play around with all the cases listed in the Screenshots section.
## 💿 Database review
### 🆙 Migration up

```shell
$ rails db:migrate
== 20211202135508 AddIndexOnPackagesBuildInfosPackageIdPipelineId: migrating ==
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:packages_build_infos, [:package_id, :pipeline_id], {:name=>"index_packages_build_infos_package_id_pipeline_id", :algorithm=>:concurrently})
   -> 0.0035s
-- execute("SET statement_timeout TO 0")
   -> 0.0007s
-- add_index(:packages_build_infos, [:package_id, :pipeline_id], {:name=>"index_packages_build_infos_package_id_pipeline_id", :algorithm=>:concurrently})
   -> 0.0151s
-- execute("RESET statement_timeout")
   -> 0.0006s
-- transaction_open?()
   -> 0.0000s
-- indexes(:packages_build_infos)
   -> 0.0016s
-- remove_index(:packages_build_infos, {:algorithm=>:concurrently, :name=>"idx_packages_build_infos_on_package_id"})
   -> 0.0078s
== 20211202135508 AddIndexOnPackagesBuildInfosPackageIdPipelineId: migrated (0.0327s)
```
### ⬇ Migration down

```shell
$ rails db:rollback
== 20211202135508 AddIndexOnPackagesBuildInfosPackageIdPipelineId: reverting ==
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:packages_build_infos, :package_id, {:name=>"idx_packages_build_infos_on_package_id", :algorithm=>:concurrently})
   -> 0.0027s
-- execute("SET statement_timeout TO 0")
   -> 0.0005s
-- add_index(:packages_build_infos, :package_id, {:name=>"idx_packages_build_infos_on_package_id", :algorithm=>:concurrently})
   -> 0.0050s
-- execute("RESET statement_timeout")
   -> 0.0006s
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:packages_build_infos, [:package_id, :pipeline_id], {:name=>"index_packages_build_infos_package_id_pipeline_id", :algorithm=>:concurrently})
   -> 0.0024s
-- remove_index(:packages_build_infos, {:name=>"index_packages_build_infos_package_id_pipeline_id", :algorithm=>:concurrently, :column=>[:package_id, :pipeline_id]})
   -> 0.0057s
== 20211202135508 AddIndexOnPackagesBuildInfosPackageIdPipelineId: reverted (0.0203s)
```
## 🔬 Explain plans
- Queries that load build infos objects: !75672 (comment 750579528)
## 🛃 MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.