Skip to content

Ci::Runner#can_pick? query for optimisation and review execution frequency

Problem

After investigating spikes over 70% of CPU usage on the database:

we find out this statement being executed thousands of times:

QueryId: -2372450153195223637

SELECT $1 AS one FROM ((SELECT "ci_runners".* FROM "ci_runners" INNER JOIN "ci_runner_projects" ON "ci_runner_projects"."runner_id" = "ci_runners"."id" WHERE "ci_runner_projects"."project_id" = $2)
UNION ALL
(SELECT "ci_runners".* FROM "ci_runners" INNER JOIN "ci_runner_namespaces" ON "ci_runner_namespaces"."runner_id" = "ci_runners"."id" INNER JOIN "namespaces" ON "namespaces"."id" = "ci_runner_namespaces"."namespace_id" AND "namespaces"."type" = $3 WHERE "namespaces"."id" IN (WITH RECURSIVE "base_and_ancestors" AS ((SELECT "namespaces".* FROM "namespaces" INNER JOIN "projects" ON "projects"."namespace_id" = "namespaces"."id" WHERE "namespaces"."type" = $4 AND "projects"."id" = $5)
UNION
(SELECT "namespaces".* FROM "namespaces", "base_and_ancestors" WHERE "namespaces"."type" = $6 AND "namespaces"."id" = "base_and_ancestors"."parent_id")) SELECT "namespaces"."id" FROM "base_and_ancestors" AS "namespaces"))
UNION ALL
(SELECT "ci_runners".* FROM "ci_runners" WHERE "ci_runners"."runner_type" = $7)) ci_runners WHERE "ci_runners"."id" = $8 LIMIT $9 /*application:web,controller:jobs,action:show,correlation_id:sqBrZDhW3F8*/

Proposal

This query originates from Ci::Runner#can_pick?. It is being used in a few places, in a RegisterJobService and while calculating if a build is stuck.

Removing this query from the RegisterJobService might be risky and difficult, but it is possible that the main source of the load is the def any_runners_online? method presented below. We use it to show "stuck" badge next to a build that can not be picked by runners.

I suggest adding a feature flag to this method, to exclude can_pick? method call to run an experiment. It will allow us to validate that this is the root cause. From there we will be able to work with product and engineers on improving performance.

app/models/ci/build.rb
713:    def any_runners_online?
714-      project.any_runners? { |runner| runner.active? && runner.online? && runner.can_pick?(self) }
715-    end
716-
717-    def stuck?
718:      pending? && !any_runners_online?
719-    end

Prometheus dashboard

Prometheus dashboard for this query on the primary database

PromQL:

sum(rate(pg_stat_statements_calls{queryid=~"-2372450153195223637", env="gprd", monitor="db", type="patroni"}[1m]) and ON (instance) (pg_replication_is_replica == 0))

Edited by Alberto Ramos