Skip to content

/admin/runners page fails with an SQL Statement Timeout

Summary

/admin/runners page is failing with 500 (internally caused by SQL statement timeout) on large installations like GitLab.com. This is caused because for each runner - each row in the table - GitLab is querying the total count of jobs handled by this Runner:

.table-section.section-5
  .table-mobile-header{ role: 'rowheader' }= _('Jobs')
  .table-mobile-content
    = runner.builds.count(:all)

With a larger number of entries in ci_builds table (as it is at GitLab.com) such queries are failing with a timeout, e.g.:

ActionView::Template::Error: PG::QueryCanceled: ERROR:  canceling statement due to statement timeout: SELECT COUNT(*) FROM "ci_builds" WHERE "ci_builds"."type" IN ('Ci::Build') AND "ci_builds"."runner_id" = XXXXXX

which for GitLab.com can be found at https://sentry.gitlab.net/gitlab/gitlabcom/issues/535548/ (internal only).

Steps to reproduce

Chose any GitLab installation with a significant number of jobs stored in ci_builds table (e.g. GitLab.com), login with an admin account and open /admin/runners page. There is a big chance, that you will see 500 error page instead of the list of Runners.

What is the current bug behavior?

GitLab returns 500 page, caused by a SQL Statement Timeout. For GitLab.com (and probably other installations that are heavily using GitLab CI) this page is currently not usable at all.

What is the expected correct behavior?

GitLab renders list of all runners.

Relevant logs and/or screenshots

Runner 44028 bellow is one of Shared Runners on GitLab.com:

gitlabhq_production=> SELECT COUNT(*) FROM "ci_builds" WHERE "ci_builds"."type" IN ('Ci::Build') AND "ci_builds"."runner_id" = 44028;
ERROR:  canceling statement due to statement timeout
gitlabhq_production=> EXPLAIN SELECT COUNT(*) FROM "ci_builds" WHERE "ci_builds"."type" IN ('Ci::Build') AND "ci_builds"."runner_id" = 44028;
                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2493520.30..2493520.31 rows=1 width=8)
   ->  Index Only Scan using index_ci_builds_on_status_and_type_and_runner_id on ci_builds  (cost=0.57..2482128.55 rows=4556699 width=0)
         Index Cond: ((type = 'Ci::Build'::text) AND (runner_id = 44028))
(3 rows)

gitlabhq_production=> EXPLAIN ANALYZE SELECT COUNT(*) FROM "ci_builds" WHERE "ci_builds"."type" IN ('Ci::Build') AND "ci_builds"."runner_id" = 44028;
ERROR:  canceling statement due to statement timeout

As it can be seen, the query is timing out even when the index is being used. I was unable to get the output of EXPLAIN ANALYZE.

Queries are executed to feed the column marked on the screenshot bellow:

Screenshot_2018-10-30_Admin_Area

Output of checks

This bug happens on GitLab.com

Results of GitLab environment info

GitLab.com

Results of GitLab application Check

GitLab.com

Possible fixes

Looking on that page I'm not sure if the number of executed jobs is really so important number. GitLab doesn't print it for Project or Group runners, where Project or Group settings are entered. I know, it's a nice stat and it's awesome to look on the list and see that Runners on managed GitLab installation are handling thousands or even millions of jobs. But at this moment, existence of this column makes the whole runner's dashboard unusable.

Since we already have an index on ci_builds.runner_id and the query is timing out even when using this index, I think the only solution is to remove this column.

Edited by Tomasz Maczukin