Improve pipelines field in package related GraphQL types
🐞 Context
Packages can be uploaded to the Package Registry through CI pipelines. When that happen, we will save the CI Pipeline id
so that the backend knows that a given package has been published through a CI job.
To support the above, we use the following models:
flowchart LR
pkg([Packages::Package])-- 1:n -->bi([Packages::BuildInfo])
bi-- 1:1 -->pip([Ci::Pipeline])
Because for some formats, a package can be published multiple times (duplicates), we need to support that a package can be linked with many pipelines.
During the ci_*
database decomposition for packages, we noticed that we had a few problems with this intermediary model in GraphQL.
🔥 The Problem
In GraphQL, we are using the has_many :pipelines, through: :build_infos
shortcut that we have in the package
model and this bring problems such as the amount of build_info
objects pulled to get the pipelines.
This problem is deduplicated on several access levels. In other words, we can GraphQL query a package (and the related pipelines) at different levels:
- for the package itself
- for a project which can contain many packages
- for a group which can contain many packages
See the related SQL queries in #338861 (comment 701147512).
We can notice several problems:
- In some cases, we load all
build_info
rows and then proceed to load all thepipeline
rows. - We have a
n+1
situation at the group level.
This MR will try to solve all of these at once.
🚒 The Solution
The solution we went for is to stop using a has_many_through
association in GraphQL and simply use the build_info
object directly by exposing it in GraphQL.
Because we could be loading:
- many
build_info
objects - many
pipeline
objects
We want to trigger as fewer SQL queries as possible (ideally, 2, one for each object type).
This points us to use a batch loader.
Because we're trying to load a has_many
association for potentially several packages, loading many build_info
objects for each package led us to use a LATERAL
style query. In other words, we're going to load several "sets" of build_info
rows in a single query using LATERAL
.
Lastly, we want to handle pagination parameters properly. For example, if the query asks for the single most recent pipeline for each package, why would you load all build_info
rows for each packages? We can load the most recent one for each package.
The above is what issue #347220 (closed) is.
Lastly a word on a hard limit we're going to have. Since we could load sets of build_info
for many packages, we need to hard limit the size of each build_info
set and since a build_info
row is mapped to a single pipeline
that hard limit will get applied to the final pipelines
set returned.
🔬 What does this MR do and why?
- Update the
PackagePipelinesResolver
to have several ways to resolve thepipelines
field:-
object_field
. This will simply callpackage.pipelines
. -
old_finder
. This will use the existingPackages::BuildInfosFinder
. -
new_finder
. This will use the newly introducedPackages::BuildInfosForManyPackagesFinder
.- This finder has been introduced in !85073 (merged)
- This finder will return a structure that will then use the
ActiveRecordArrayConnection
introduced in !84387 (merged).
-
- Which resolve way is executed depends on the access level and the feature flag state:
Access level feature flag disabled ( master
like)feature flag enabled group object_field
new_finder
project object_field
new_finder
package old_finder
new_finder
- feature flag Rollout issue: #358432 (closed)
- Remove the
pipelines
override from thePackageDetailsType
. - Link the
PackagePipelinesResolver
to thePackageType
. - Updates the related documentation. Since this MR fixes the issues with the deprecated
pipelines
field. As such, that field is not marked as deprecated any more.
🖼 Screenshots or screen recordings
🔥 Using master
Package level
-
GraphQL query
{ package(id: "gid://gitlab/Packages::Package/40") { id pipelines { nodes { id } } } }
-
SQL triggered
SELECT "packages_build_infos"."pipeline_id" FROM "packages_build_infos" WHERE "packages_build_infos"."package_id" = 40 AND "packages_build_infos"."pipeline_id" IS NOT NULL ORDER BY "packages_build_infos"."pipeline_id" DESC LIMIT 100 SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (312, 311, 310, 309, 308) ORDER BY "ci_pipelines"."id" DESC LIMIT 100
-
We see some limits being applied. This was part of !75672 (merged)
Project level
-
GraphQL query
{ project(fullPath: "many/packages-1") { packages { nodes { pipelines { nodes { id } } } } } }
-
SQL triggered
SELECT "packages_build_infos".* FROM "packages_build_infos" WHERE "packages_build_infos"."package_id" IN (40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26) SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312)
-
⚠ We load allbuild_info
rows.
Group level
-
GraphQL query
{ group(fullPath: "many") { packages { nodes { id pipelines { nodes { id } } } } } }
-
SQL triggered
Many occurences of (
n+1
situation):SELECT "packages_build_infos".* FROM "packages_build_infos" WHERE "packages_build_infos"."package_id" IN (40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26) SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312)
-
⚠ We load allbuild_info
rows. -
⚠ We do this for each package. (n+1
situation)
🚒 Using this MR + feature flag enabled
Package level
-
GraphQL query
{ package(id: "gid://gitlab/Packages::Package/40") { id pipelines { nodes { id } } } }
-
SQL triggered
SELECT build_infos.* FROM "packages_packages", LATERAL (SELECT "packages_build_infos".* FROM "packages_build_infos" WHERE "packages_build_infos"."pipeline_id" IS NOT NULL AND "packages_build_infos"."package_id" = packages_packages.id ORDER BY "packages_build_infos"."id" DESC LIMIT 20) AS build_infos WHERE "packages_packages"."id" = 40 ORDER BY build_infos.id DESC SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (312, 311, 310, 309, 308)
-
2
SQL queries to load thebuild_infos
andpipelines
✅ -
Limits applied
✅ -
(not shown) Pagination options (such as
pipelines(first: 1)
) properly applied.-
GraphQL example
{ package(id: "gid://gitlab/Packages::Package/40") { id pipelines(first: 1) { nodes { id } } } }
-
Project level
-
GraphQL query
{ project(fullPath: "many/packages-1") { packages { nodes { pipelines { nodes { id } } } } } }
-
SQL triggered
SELECT build_infos.* FROM "packages_packages", LATERAL (SELECT "packages_build_infos".* FROM "packages_build_infos" WHERE "packages_build_infos"."pipeline_id" IS NOT NULL AND "packages_build_infos"."package_id" = packages_packages.id ORDER BY "packages_build_infos"."id" DESC LIMIT 20) AS build_infos WHERE "packages_packages"."id" IN (30, 29, 28, 27, 26) ORDER BY build_infos.id DESC SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (262, 261, 260, 259, 258, 257, 256, 255, 254, 253, 252, 251, 250, 249, 248, 247, 246, 245, 244, 243, 242, 241, 240, 239, 238)
-
2
SQL queries to load thebuild_infos
andpipelines
✅ -
Limits applied
✅ -
(not shown) Pagination options (such as
pipelines(first: 1)
) properly applied.-
GraphQL example
{ project(fullPath: "many/packages-1") { packages { nodes { pipelines(first: 1) { nodes { id } } } } } }
-
Group level
-
GraphQL query
{ group(fullPath: "many") { packages { nodes { id pipelines { nodes { id } } } } } }
-
SQL triggered
SELECT build_infos.* FROM "packages_packages", LATERAL (SELECT "packages_build_infos".* FROM "packages_build_infos" WHERE "packages_build_infos"."pipeline_id" IS NOT NULL AND "packages_build_infos"."package_id" = packages_packages.id ORDER BY "packages_build_infos"."id" DESC LIMIT 20) AS build_infos WHERE "packages_packages"."id" IN (40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26) ORDER BY build_infos.id DESC SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (312, 311, 310, 309, 308, 307, 306, 305, 304, 303, 302, 301, 300, 299, 298, 297, 296, 295, 294, 293, 292, 291, 290, 289, 288, 287, 286, 285, 284, 283, 282, 281, 280, 279, 278, 277, 276, 275, 274, 273, 272, 271, 270, 269, 268, 267, 266, 265, 264, 263, 262, 261, 260, 259, 258, 257, 256, 255, 254, 253, 252, 251, 250, 249, 248, 247, 246, 245, 244, 243, 242, 241, 240, 239, 238)
-
2
SQL queries to load thebuild_infos
andpipelines
✅ -
Limits applied
✅ -
(not shown) Pagination options (such as
pipelines(first: 1)
) properly applied.
Conclusions
- The nice thing with this MR is that it handles the 3 difference access levels in the same way. One
LATERAL
query forbuild_info
rows and oneSELECT
query forpipelines
. - We support the pagination parameters that the
pipelines
field can have for the 3 levels.
⚙ How to set up and validate locally
- Create a Group
- Create several Projects
- For each project, open a rails console and execute:
def fixture_file_upload(*args, **kwargs) Rack::Test::UploadedFile.new(*args, **kwargs) end pkg = FactoryBot.create(:npm_package, project: Project.find(<project_id>)) # creates 5 pipelines 5.times { Packages::BuildInfo.create!(package_id: pkg.id, pipeline_id: FactoryBot.create(:ci_pipeline).id) }
- Open
<host>/-/graphql-explorer
and play around with queries shown above.
🚦 MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.