Update Runner/RunnerManager job_execution_status to read from ci_running_builds
What does this MR do and why?
Background
These two queries have been identified as contributing to LWLock contention in gitlab-com/gl-infra/production-engineering#28600 (comment 3285345014). They are:
Ci::Runner.with_executing_builds
Normalized query
SELECT "ci_runners".*
FROM "ci_runners"
WHERE
"ci_runners"."id" = $1
AND (
EXISTS (
SELECT
$2
FROM
"p_ci_builds"
WHERE
"p_ci_builds"."type" = $3
AND "p_ci_builds"."status" IN ($4, $5)
AND ("p_ci_builds".runner_id = "ci_runners".id)
)
)Ci::RunnerManager.with_executing_builds
Normalized query
SELECT "ci_runner_machines".*
FROM "ci_runner_machines"
WHERE
"ci_runner_machines"."id" = $1
AND (
EXISTS (
SELECT
$2
FROM
"p_ci_builds"
INNER JOIN "p_ci_runner_machine_builds" ON "p_ci_runner_machine_builds"."partition_id" IS NOT NULL
AND "p_ci_runner_machine_builds"."build_id" = "p_ci_builds"."id"
AND "p_ci_runner_machine_builds"."partition_id" = "p_ci_builds"."partition_id"
WHERE
"p_ci_builds"."type" = $3
AND "p_ci_builds"."status" IN ($4, $5)
AND (
"p_ci_builds".runner_id = "ci_runner_machines".runner_id
)
AND (
"p_ci_runner_machine_builds".runner_machine_id = "ci_runner_machines".id
)
)
)They are used in our APIs to determine the job_execution_status value (active or idle) for a Runner and RunnerManager, respectively. EXECUTING_STATUSES are defined as either running or canceling.
The primary problem with these queries is that there is no partition pruning applied to the p_ci_builds look up.
To improve them, we decided to remove their dependency on p_ci_builds altogether and read from ci_running_builds instead (a much smaller, non-partitioned table). The only downside to this is we will no longer count canceling builds towards an "active" status. But this is deemed a reasonable compromise since the canceling state is transient and does not contribute significantly to the overall state of a Runner/RunnerManager.
This MR
This MR replaces the use of with_executing_builds query with ids_with_running_jobs. It effectively means that a Runner/RunnerManager's job_execution_status is now only dependent on whether or not it has records in ci_running_builds. This significantly reduces the LWLocks needed by the queries and improves their overall performance.
This change is behind a feature flag: ci_read_job_execution_status_from_running_builds. Roll-out issue: #598784
References
- Resolves Investigate and improve query -6715295008265123... (#598582 - closed)
- FF roll-out issue: [FF] `ci_read_job_execution_status_from_running... (#598784)
Database query plans
See MR comments.
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Related to #598582 (closed)