Improve pipelines field in package related GraphQL types (!82496) · Merge requests · GitLab.org / GitLab

David Fernandez requested to merge 347220-packages-build-infos-graphql-field into master Mar 09, 2022

🐞 Context

Packages can be uploaded to the Package Registry through CI pipelines. When that happen, we will save the CI Pipeline id so that the backend knows that a given package has been published through a CI job.

To support the above, we use the following models:

flowchart LR
  pkg([Packages::Package])-- 1:n -->bi([Packages::BuildInfo])
  bi-- 1:1 -->pip([Ci::Pipeline])

Because for some formats, a package can be published multiple times (duplicates), we need to support that a package can be linked with many pipelines.

During the ci_* database decomposition for packages, we noticed that we had a few problems with this intermediary model in GraphQL.

🔥 The Problem

In GraphQL, we are using the has_many :pipelines, through: :build_infos shortcut that we have in the package model and this bring problems such as the amount of build_info objects pulled to get the pipelines.

This problem is deduplicated on several access levels. In other words, we can GraphQL query a package (and the related pipelines) at different levels:

for the package itself
for a project which can contain many packages
for a group which can contain many packages

See the related SQL queries in #338861 (comment 701147512).

We can notice several problems:

In some cases, we load all build_info rows and then proceed to load all the pipeline rows.
We have a n+1 situation at the group level.

This MR will try to solve all of these at once.

🚒 The Solution

The solution we went for is to stop using a has_many_through association in GraphQL and simply use the build_info object directly by exposing it in GraphQL.

Because we could be loading:

many build_info objects
many pipeline objects

We want to trigger as fewer SQL queries as possible (ideally, 2, one for each object type).

This points us to use a batch loader.

Because we're trying to load a has_many association for potentially several packages, loading many build_info objects for each package led us to use a LATERAL style query. In other words, we're going to load several "sets" of build_info rows in a single query using LATERAL.

Lastly, we want to handle pagination parameters properly. For example, if the query asks for the single most recent pipeline for each package, why would you load all build_info rows for each packages? We can load the most recent one for each package.

The above is what issue #347220 (closed) is.

Lastly a word on a hard limit we're going to have. Since we could load sets of build_info for many packages, we need to hard limit the size of each build_info set and since a build_info row is mapped to a single pipeline that hard limit will get applied to the final pipelines set returned.

🔬 What does this MR do and why?

Update the PackagePipelinesResolver to have several ways to resolve the pipelines field:
- object_field. This will simply call package.pipelines.
- old_finder. This will use the existing Packages::BuildInfosFinder.
- new_finder. This will use the newly introduced Packages::BuildInfosForManyPackagesFinder.
  - This finder has been introduced in !85073 (merged)
  - This finder will return a structure that will then use the ActiveRecordArrayConnection introduced in !84387 (merged).
Which resolve way is executed depends on the access level and the feature flag state:

Access level feature flag disabled (master like) feature flag enabled

group object_field new_finder

project object_field new_finder

package old_finder new_finder
- feature flag Rollout issue: #358432 (closed)
Remove the pipelines override from the PackageDetailsType.
Link the PackagePipelinesResolver to the PackageType.
Updates the related documentation. Since this MR fixes the issues with the deprecated pipelines field. As such, that field is not marked as deprecated any more.

Access level	feature flag disabled (`master` like)	feature flag enabled
group	`object_field`	`new_finder`
project	`object_field`	`new_finder`
package	`old_finder`	`new_finder`

🖼 Screenshots or screen recordings

🔥 Using `master`

Package level

GraphQL query

{
  package(id: "gid://gitlab/Packages::Package/40") {
    id
    pipelines {
      nodes {
        id
      }
    }
  }
}

SQL triggered

SELECT "packages_build_infos"."pipeline_id" FROM "packages_build_infos" WHERE "packages_build_infos"."package_id" = 40 AND "packages_build_infos"."pipeline_id" IS NOT NULL ORDER BY "packages_build_infos"."pipeline_id" DESC LIMIT 100

SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (312, 311, 310, 309, 308) ORDER BY "ci_pipelines"."id" DESC LIMIT 100

We see some limits being applied. This was part of !75672 (merged)

Project level

GraphQL query

{
  project(fullPath: "many/packages-1") {
    packages {
      nodes {
        pipelines {
          nodes {
            id
          }
        }
      }
    }
  }
}

SQL triggered

SELECT "packages_build_infos".* FROM "packages_build_infos" WHERE "packages_build_infos"."package_id" IN (40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26) 

SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312)

⚠ We load all build_info rows.

Group level

GraphQL query

{
  group(fullPath: "many") {
    packages {
      nodes {
        id
        pipelines {
          nodes {
            id
          }
        }
      }
    }
  }
}

SQL triggered

Many occurences of (n+1 situation):

SELECT "packages_build_infos".* FROM "packages_build_infos" WHERE "packages_build_infos"."package_id" IN (40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26)

SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312)

⚠ We load all build_info rows.
⚠ We do this for each package. (n+1 situation)

🚒 Using this MR + feature flag enabled

Package level

GraphQL query

{
  package(id: "gid://gitlab/Packages::Package/40") {
    id
    pipelines {
      nodes {
        id
      }
    }
  }
}

SQL triggered

SELECT build_infos.* FROM "packages_packages", LATERAL (SELECT "packages_build_infos".* FROM "packages_build_infos" WHERE "packages_build_infos"."pipeline_id" IS NOT NULL AND "packages_build_infos"."package_id" = packages_packages.id ORDER BY "packages_build_infos"."id" DESC LIMIT 20) AS build_infos WHERE "packages_packages"."id" = 40 ORDER BY build_infos.id DESC

SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (312, 311, 310, 309, 308)

2 SQL queries to load the build_infos and pipelines ✅
Limits applied ✅

(not shown) Pagination options (such as pipelines(first: 1)) properly applied.

GraphQL example

{
  package(id: "gid://gitlab/Packages::Package/40") {
    id
    pipelines(first: 1) {
      nodes {
        id
      }
    }
  }
}

Project level

GraphQL query

{
  project(fullPath: "many/packages-1") {
    packages {
      nodes {
        pipelines {
          nodes {
            id
          }
        }
      }
    }
  }
}

SQL triggered

SELECT build_infos.* FROM "packages_packages", LATERAL (SELECT "packages_build_infos".* FROM "packages_build_infos" WHERE "packages_build_infos"."pipeline_id" IS NOT NULL AND "packages_build_infos"."package_id" = packages_packages.id ORDER BY "packages_build_infos"."id" DESC LIMIT 20) AS build_infos WHERE "packages_packages"."id" IN (30, 29, 28, 27, 26) ORDER BY build_infos.id DESC

SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (262, 261, 260, 259, 258, 257, 256, 255, 254, 253, 252, 251, 250, 249, 248, 247, 246, 245, 244, 243, 242, 241, 240, 239, 238)

2 SQL queries to load the build_infos and pipelines ✅
Limits applied ✅

(not shown) Pagination options (such as pipelines(first: 1)) properly applied.

GraphQL example

{
  project(fullPath: "many/packages-1") {
    packages {
      nodes {
        pipelines(first: 1) {
          nodes {
            id
          }
        }
      }
    }
  }
}

Group level

GraphQL query

{
  group(fullPath: "many") {
    packages {
      nodes {
        id
        pipelines {
          nodes {
            id
          }
        }
      }
    }
  }
}

SQL triggered

SELECT build_infos.* FROM "packages_packages", LATERAL (SELECT "packages_build_infos".* FROM "packages_build_infos" WHERE "packages_build_infos"."pipeline_id" IS NOT NULL AND "packages_build_infos"."package_id" = packages_packages.id ORDER BY "packages_build_infos"."id" DESC LIMIT 20) AS build_infos WHERE "packages_packages"."id" IN (40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26) ORDER BY build_infos.id DESC

SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" IN (312, 311, 310, 309, 308, 307, 306, 305, 304, 303, 302, 301, 300, 299, 298, 297, 296, 295, 294, 293, 292, 291, 290, 289, 288, 287, 286, 285, 284, 283, 282, 281, 280, 279, 278, 277, 276, 275, 274, 273, 272, 271, 270, 269, 268, 267, 266, 265, 264, 263, 262, 261, 260, 259, 258, 257, 256, 255, 254, 253, 252, 251, 250, 249, 248, 247, 246, 245, 244, 243, 242, 241, 240, 239, 238)

2 SQL queries to load the build_infos and pipelines ✅
Limits applied ✅
(not shown) Pagination options (such as pipelines(first: 1)) properly applied.

Conclusions

The nice thing with this MR is that it handles the 3 difference access levels in the same way. One LATERAL query for build_info rows and one SELECT query for pipelines.
We support the pagination parameters that the pipelines field can have for the 3 levels.

⚙ How to set up and validate locally

Create a Group
Create several Projects

For each project, open a rails console and execute:

def fixture_file_upload(*args, **kwargs)
  Rack::Test::UploadedFile.new(*args, **kwargs)
end

pkg = FactoryBot.create(:npm_package, project: Project.find(<project_id>))

# creates 5 pipelines
5.times { Packages::BuildInfo.create!(package_id: pkg.id, pipeline_id: FactoryBot.create(:ci_pipeline).id) }

Open <host>/-/graphql-explorer and play around with queries shown above.

🚦 MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

I have evaluated the MR acceptance checklist for this MR.

Edited Apr 25, 2022 by David Fernandez

Improve pipelines field in package related GraphQL types

🐞 Context

🔥 The Problem

🚒 The Solution

🔬 What does this MR do and why?

🖼 Screenshots or screen recordings

🔥 Using master

Package level

Project level

Group level

🚒 Using this MR + feature flag enabled

Package level

Project level

Group level

Conclusions

⚙ How to set up and validate locally

🚦 MR acceptance checklist

Merge request reports

🔥 Using `master`