Update Runner/RunnerManager job_execution_status to read from ci_running_builds

What does this MR do and why?

Background

These two queries have been identified as contributing to LWLock contention in gitlab-com/gl-infra/production-engineering#28600 (comment 3285345014). They are:

Ci::Runner.with_executing_builds

Normalized query
SELECT "ci_runners".*
FROM "ci_runners"
WHERE
    "ci_runners"."id" = $1
    AND (
        EXISTS (
            SELECT
                $2
            FROM
                "p_ci_builds"
            WHERE
                "p_ci_builds"."type" = $3
                AND "p_ci_builds"."status" IN ($4, $5)
                AND ("p_ci_builds".runner_id = "ci_runners".id)
        )
    )

Ci::RunnerManager.with_executing_builds

Normalized query
SELECT "ci_runner_machines".*
FROM "ci_runner_machines"
WHERE
    "ci_runner_machines"."id" = $1
    AND (
        EXISTS (
            SELECT
                $2
            FROM
                "p_ci_builds"
                INNER JOIN "p_ci_runner_machine_builds" ON "p_ci_runner_machine_builds"."partition_id" IS NOT NULL
                AND "p_ci_runner_machine_builds"."build_id" = "p_ci_builds"."id"
                AND "p_ci_runner_machine_builds"."partition_id" = "p_ci_builds"."partition_id"
            WHERE
                "p_ci_builds"."type" = $3
                AND "p_ci_builds"."status" IN ($4, $5)
                AND (
                    "p_ci_builds".runner_id = "ci_runner_machines".runner_id
                )
                AND (
                    "p_ci_runner_machine_builds".runner_machine_id = "ci_runner_machines".id
                )
        )
    )

They are used in our APIs to determine the job_execution_status value (active or idle) for a Runner and RunnerManager, respectively. EXECUTING_STATUSES are defined as either running or canceling.

The primary problem with these queries is that there is no partition pruning applied to the p_ci_builds look up.

To improve them, we decided to remove their dependency on p_ci_builds altogether and read from ci_running_builds instead (a much smaller, non-partitioned table). The only downside to this is we will no longer count canceling builds towards an "active" status. But this is deemed a reasonable compromise since the canceling state is transient and does not contribute significantly to the overall state of a Runner/RunnerManager.

This MR

This MR replaces the use of with_executing_builds query with ids_with_running_jobs. It effectively means that a Runner/RunnerManager's job_execution_status is now only dependent on whether or not it has records in ci_running_builds. This significantly reduces the LWLocks needed by the queries and improves their overall performance.

This change is behind a feature flag: ci_read_job_execution_status_from_running_builds. Roll-out issue: #598784

References

Database query plans

See MR comments.

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #598582 (closed)

Edited by Leaminn Ma

Merge request reports

Loading