High database query timings when canceling pending pipelines
This is coming from https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12326#note_484146016, where a problematic query has been identified. The database group looked into this and this is a summary:
On GitLab.com, we observe high query timings related to canceling pending pipelines. During high traffic times, we can see query timings at about 400ms, but it can go up as high as 600ms.
In Thanos, we can see both the high query timings and their frequency (going up to 10/s):
SELECT "ci_pipelines".*
FROM "ci_pipelines"
WHERE "ci_pipelines"."project_id" = 1
AND ("ci_pipelines"."source" IN (1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12)
OR "ci_pipelines"."source" IS NULL)
AND "ci_pipelines"."ref" = 'some-ref'
AND "ci_pipelines"."id" NOT IN (
WITH RECURSIVE "base_and_descendants" AS (
(SELECT "ci_pipelines".* FROM "ci_pipelines" WHERE "ci_pipelines"."id" = 35)
UNION
(SELECT "ci_pipelines".* FROM "ci_pipelines", "base_and_descendants", "ci_sources_pipelines"
WHERE "ci_sources_pipelines"."pipeline_id" = "ci_pipelines"."id"
AND "ci_sources_pipelines"."source_pipeline_id" = "base_and_descendants"."id"
AND "ci_sources_pipelines"."source_project_id" = "ci_sources_pipelines"."project_id")
)
SELECT "id" FROM "base_and_descendants" AS "ci_pipelines")
AND "ci_pipelines"."sha" != 'abc123'
AND ("ci_pipelines"."status" IN ('created','waiting_for_resource','preparing','pending','running','scheduled'))
AND (
NOT EXISTS (
SELECT "ci_builds".*
FROM "ci_builds"
INNER JOIN "ci_builds_metadata"
ON "ci_builds_metadata"."build_id" = "ci_builds"."id"
WHERE "ci_builds"."type" = 'Ci::Build'
AND (ci_builds.commit_id = ci_pipelines.id)
AND ("ci_builds"."status" IN ('running','success','failed'))
AND (ci_builds_metadata.id NOT IN (
SELECT "ci_builds_metadata"."id"
FROM "ci_builds_metadata"
WHERE (ci_builds_metadata.build_id = ci_builds.id)
AND "ci_builds_metadata"."interruptible" = TRUE)
)
)
)
ORDER BY "ci_pipelines"."id" ASC
LIMIT 1000
A query plan can be found here: https://explain.depesz.com/s/WREh#l15
We found that the following query seems to originate from here in CancelPendingPipelines
, which is called from Ci::CreatePipelineService
.
We have not identified a straight-forward solution for this. Please see gitlab-org/database-team/team-tasks#131 (comment 503034740) for discussion of changes we can apply to the data model to help this query.
Technical proposal
- limit the pipelines query to search only for pipelines created in the last week.
- split the query into two parts: load the pipelines and use their ids to find the ones that don't have started uninterruptible builds
Future considerations: move the canceling process async and use a read replica to find the pipelines.
More details: #321695 (comment 608300470)