Ci::Runner#can_pick? query for optimisation and review execution frequency
Problem
After investigating spikes over 70% of CPU usage on the database:
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12116
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12106
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12095
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12080
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12055
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12054
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11944
we find out this statement being executed thousands of times:
QueryId: -2372450153195223637
SELECT $1 AS one FROM ((SELECT "ci_runners".* FROM "ci_runners" INNER JOIN "ci_runner_projects" ON "ci_runner_projects"."runner_id" = "ci_runners"."id" WHERE "ci_runner_projects"."project_id" = $2)
UNION ALL
(SELECT "ci_runners".* FROM "ci_runners" INNER JOIN "ci_runner_namespaces" ON "ci_runner_namespaces"."runner_id" = "ci_runners"."id" INNER JOIN "namespaces" ON "namespaces"."id" = "ci_runner_namespaces"."namespace_id" AND "namespaces"."type" = $3 WHERE "namespaces"."id" IN (WITH RECURSIVE "base_and_ancestors" AS ((SELECT "namespaces".* FROM "namespaces" INNER JOIN "projects" ON "projects"."namespace_id" = "namespaces"."id" WHERE "namespaces"."type" = $4 AND "projects"."id" = $5)
UNION
(SELECT "namespaces".* FROM "namespaces", "base_and_ancestors" WHERE "namespaces"."type" = $6 AND "namespaces"."id" = "base_and_ancestors"."parent_id")) SELECT "namespaces"."id" FROM "base_and_ancestors" AS "namespaces"))
UNION ALL
(SELECT "ci_runners".* FROM "ci_runners" WHERE "ci_runners"."runner_type" = $7)) ci_runners WHERE "ci_runners"."id" = $8 LIMIT $9 /*application:web,controller:jobs,action:show,correlation_id:sqBrZDhW3F8*/
Proposal
This query originates from Ci::Runner#can_pick?
. It is being used in a few places, in a RegisterJobService
and while calculating if a build is stuck.
Removing this query from the RegisterJobService
might be risky and difficult, but it is possible that the main source of the load is the def any_runners_online?
method presented below. We use it to show "stuck" badge next to a build that can not be picked by runners.
I suggest adding a feature flag to this method, to exclude can_pick?
method call to run an experiment. It will allow us to validate that this is the root cause. From there we will be able to work with product and engineers on improving performance.
app/models/ci/build.rb
713: def any_runners_online?
714- project.any_runners? { |runner| runner.active? && runner.online? && runner.can_pick?(self) }
715- end
716-
717- def stuck?
718: pending? && !any_runners_online?
719- end
Prometheus dashboard
Prometheus dashboard for this query on the primary database
PromQL:
sum(rate(pg_stat_statements_calls{queryid=~"-2372450153195223637", env="gprd", monitor="db", type="patroni"}[1m]) and ON (instance) (pg_replication_is_replica == 0))