/admin/runners page fails with an SQL Statement Timeout
Summary
/admin/runners
page is failing with 500 (internally caused by SQL statement timeout) on large installations like GitLab.com. This is caused because for each runner - each row in the table - GitLab is querying the total count of jobs handled by this Runner:
.table-section.section-5
.table-mobile-header{ role: 'rowheader' }= _('Jobs')
.table-mobile-content
= runner.builds.count(:all)
With a larger number of entries in ci_builds
table (as it is at GitLab.com) such queries are failing with a timeout, e.g.:
ActionView::Template::Error: PG::QueryCanceled: ERROR: canceling statement due to statement timeout: SELECT COUNT(*) FROM "ci_builds" WHERE "ci_builds"."type" IN ('Ci::Build') AND "ci_builds"."runner_id" = XXXXXX
which for GitLab.com can be found at https://sentry.gitlab.net/gitlab/gitlabcom/issues/535548/ (internal only).
Steps to reproduce
Chose any GitLab installation with a significant number of jobs stored in ci_builds
table (e.g. GitLab.com), login with an admin account and open /admin/runners
page. There is a big chance, that you will see 500
error page instead of the list of Runners.
What is the current bug behavior?
GitLab returns 500 page, caused by a SQL Statement Timeout. For GitLab.com (and probably other installations that are heavily using GitLab CI) this page is currently not usable at all.
What is the expected correct behavior?
GitLab renders list of all runners.
Relevant logs and/or screenshots
Runner 44028
bellow is one of Shared Runners on GitLab.com:
gitlabhq_production=> SELECT COUNT(*) FROM "ci_builds" WHERE "ci_builds"."type" IN ('Ci::Build') AND "ci_builds"."runner_id" = 44028;
ERROR: canceling statement due to statement timeout
gitlabhq_production=> EXPLAIN SELECT COUNT(*) FROM "ci_builds" WHERE "ci_builds"."type" IN ('Ci::Build') AND "ci_builds"."runner_id" = 44028;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=2493520.30..2493520.31 rows=1 width=8)
-> Index Only Scan using index_ci_builds_on_status_and_type_and_runner_id on ci_builds (cost=0.57..2482128.55 rows=4556699 width=0)
Index Cond: ((type = 'Ci::Build'::text) AND (runner_id = 44028))
(3 rows)
gitlabhq_production=> EXPLAIN ANALYZE SELECT COUNT(*) FROM "ci_builds" WHERE "ci_builds"."type" IN ('Ci::Build') AND "ci_builds"."runner_id" = 44028;
ERROR: canceling statement due to statement timeout
As it can be seen, the query is timing out even when the index is being used. I was unable to get the output of EXPLAIN ANALYZE
.
Queries are executed to feed the column marked on the screenshot bellow:
Output of checks
This bug happens on GitLab.com
Results of GitLab environment info
GitLab.com
Results of GitLab application Check
GitLab.com
Possible fixes
Looking on that page I'm not sure if the number of executed jobs is really so important number. GitLab doesn't print it for Project or Group runners, where Project or Group settings are entered. I know, it's a nice stat and it's awesome to look on the list and see that Runners on managed GitLab installation are handling thousands or even millions of jobs. But at this moment, existence of this column makes the whole runner's dashboard unusable.
Since we already have an index on ci_builds.runner_id
and the query is timing out even when using this index, I think the only solution is to remove this column.