Find a good value for maximum timeout for Shared Runners
With GitLab 10.7, we're releasing the Per runner maximum job timeout feature (gitlab-org/gitlab-ce!17221). We should do some calculations and specify a good value that we could set for:
gitlab-shared-runners-manager-X - these runners should be used only for GitLab CE/EE jobs (from the main projects and forks), so we should find the longest job duration and add a few minutes as a safety margin;
shared-runners-manager-X - these are general purpose runners available to all users. We should check the durations of jobs currently running on Shared Runners and the timeouts configured in projects that use them. Ideally we would get a histogram of these timings. With that data we could decide what maximum timeout value is a reasonable compromise.
I've prepared a code snippet to get the data from the production DB:
```ruby
# Based on:
# https://stackoverflow.com/questions/11784843/calculate-95th-percentile-in-ruby#answer-11785414
#
def percentile(values, percentile)
  return nil if values.empty?
  return values[0] if values.size < 2

  values_sorted = values.sort

  k = (percentile * (values_sorted.size - 1) + 1).floor - 1
  f = (percentile * (values_sorted.size - 1) + 1).modulo(1)

  values_sorted[k] + (f * (values_sorted[k + 1] - values_sorted[k]))
end

def avg(values)
  values.inject(0.0) { |sum, el| sum + el } / values.size
end

def summary(title, dry_run)
  raise "No block given" unless block_given?

  data = yield

  puts
  puts "### #{title}"
  puts

  if dry_run
    puts "  SQL: #{data.to_sql}"
    puts
    return
  end

  durations = data.map(&:duration)

  if durations.empty?
    puts "  No data returned"
    puts
    return
  end

  puts "  count:  #{durations.size}"
  puts "  avg:    #{avg(durations)}"
  puts "  min:    #{durations.min}"
  puts "  max:    #{durations.max}"
  puts "  median: #{percentile(durations, 0.5)}"
  puts "  95th:   #{percentile(durations, 0.95)}"
  puts
end

def projects_summary(dry_run)
  projects_stats = Project.all.select(
    'avg(build_timeout) as avg',
    'max(build_timeout) as max',
    'min(build_timeout) as min'
  )

  puts
  puts "### Projects stats"
  puts

  if dry_run
    puts "  SQL: #{projects_stats.to_sql}"
    puts
    return
  end

  if projects_stats.empty?
    puts "  No data returned"
    puts
    return
  end

  puts "  avg: #{projects_stats[0]['avg']}"
  puts "  min: #{projects_stats[0]['min']}"
  puts "  max: #{projects_stats[0]['max']}"
  puts
end

DRY_RUN = true

GITLAB_CE_EE_TIMEFRAME = 1.week
GENERAL_TIMEFRAME = 2.days

GITLAB_CE_EE_PROJECTS = [13083, 278964]
USED_SHARED_RUNNERS = [40786, 40788, 44028, 44949]

def ce_ee_jobs(status)
  projects = Project.where(id: GITLAB_CE_EE_PROJECTS)

  Ci::Build.where(project: projects)
           .where('finished_at > ?', GITLAB_CE_EE_TIMEFRAME.ago.midnight)
           .where('status = ?', status)
end

def other_jobs_on_shared_runners
  Ci::Build.joins(:runner)
           .where(ci_runners: { is_shared: true, id: USED_SHARED_RUNNERS })
           .where('project_id NOT IN (?)', GITLAB_CE_EE_PROJECTS)
           .where('finished_at > ?', GENERAL_TIMEFRAME.ago.midnight)
end

summary("GitLab CE/EE succeeded jobs", DRY_RUN) { ce_ee_jobs(:success) }
summary("GitLab CE/EE failed jobs", DRY_RUN) { ce_ee_jobs(:failed) }
summary("GitLab CE/EE cancelled jobs", DRY_RUN) { ce_ee_jobs(:cancelled) }
summary("Non GitLab CE/EE jobs on Shared Runners", DRY_RUN) { other_jobs_on_shared_runners }
projects_summary(DRY_RUN)
```
I've executed this in dry-run mode on the production DB to see what SQL queries will be generated:
```
### GitLab CE/EE succeeded jobs

  SQL: SELECT "ci_builds".* FROM "ci_builds" WHERE "ci_builds"."type" IN ('Ci::Build') AND "ci_builds"."project_id" IN (SELECT "projects"."id" FROM "projects" WHERE "projects"."id" IN (13083, 278964)) AND (finished_at > '2018-04-12 00:00:00.000000') AND (status = 'success')

### GitLab CE/EE failed jobs

  SQL: SELECT "ci_builds".* FROM "ci_builds" WHERE "ci_builds"."type" IN ('Ci::Build') AND "ci_builds"."project_id" IN (SELECT "projects"."id" FROM "projects" WHERE "projects"."id" IN (13083, 278964)) AND (finished_at > '2018-04-12 00:00:00.000000') AND (status = 'failed')

### GitLab CE/EE cancelled jobs

  SQL: SELECT "ci_builds".* FROM "ci_builds" WHERE "ci_builds"."type" IN ('Ci::Build') AND "ci_builds"."project_id" IN (SELECT "projects"."id" FROM "projects" WHERE "projects"."id" IN (13083, 278964)) AND (finished_at > '2018-04-12 00:00:00.000000') AND (status = 'cancelled')

### Non GitLab CE/EE jobs on Shared Runners

  SQL: SELECT "ci_builds".* FROM "ci_builds" INNER JOIN "ci_runners" ON "ci_runners"."id" = "ci_builds"."runner_id" WHERE "ci_builds"."type" IN ('Ci::Build') AND "ci_runners"."is_shared" = 't' AND "ci_runners"."id" IN (40786, 40788, 44028, 44949) AND (project_id NOT IN (13083,278964)) AND (finished_at > '2018-04-17 00:00:00.000000')

### Projects stats

  SQL: SELECT avg(build_timeout) as avg, max(build_timeout) as max, min(build_timeout) as min FROM "projects"
```
I wanted to check how heavy these queries could be, but when run through our ChatOps integration in Slack, EXPLAIN ANALYZE fails with /app/lib/chatops/database/read_only_connection.rb:43:in 'exec': ERROR: canceling statement due to statement timeout (PG::QueryCanceled).
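For reference, plain EXPLAIN only plans the query without executing it, so it normally returns well within the statement timeout; where a direct session on a replica is available (which the ChatOps connection may not allow), the timeout can also be raised per session. A minimal sketch, using a simplified version of the first generated query:

```sql
-- Raising the timeout only works where the connection permits SET;
-- plain EXPLAIN (without ANALYZE) does not execute the statement at all.
SET statement_timeout = '300s';

EXPLAIN
SELECT "ci_builds".*
FROM "ci_builds"
WHERE "ci_builds"."type" IN ('Ci::Build')
  AND "ci_builds"."project_id" IN (13083, 278964)
  AND finished_at > '2018-04-12 00:00:00'
  AND status = 'success';
```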
The last one still does a sequential scan across all projects:
```
                                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=318832.66..318833.16 rows=200 width=44)
   Sort Key: (ntile(10) OVER (?))
   ->  HashAggregate  (cost=318822.52..318825.02 rows=200 width=44)
         Group Key: ntile(10) OVER (?)
         ->  WindowAgg  (cost=301028.92..309332.60 rows=474496 width=8)
               ->  Sort  (cost=301028.92..302215.16 rows=474496 width=4)
                     Sort Key: projects.build_timeout
                     ->  Seq Scan on projects  (cost=0.00..256293.35 rows=474496 width=4)
                           Filter: (shared_runners_enabled AND (last_activity_at >= '2018-04-01 00:00:00+00'::timestamp with time zone))
(9 rows)
```
If I disable seq scans, Postgres thinks it'll be slightly slower:
```
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=347219.31..347219.81 rows=200 width=44)
   Sort Key: (ntile(10) OVER (?))
   ->  HashAggregate  (cost=347209.17..347211.67 rows=200 width=44)
         Group Key: ntile(10) OVER (?)
         ->  WindowAgg  (cost=329444.11..337734.47 rows=473735 width=8)
               ->  Sort  (cost=329444.11..330628.45 rows=473735 width=4)
                     Sort Key: projects.build_timeout
                     ->  Bitmap Heap Scan on projects  (cost=80714.10..284785.77 rows=473735 width=4)
                           Recheck Cond: (last_activity_at >= '2018-04-01 00:00:00+00'::timestamp with time zone)
                           Filter: shared_runners_enabled
                           ->  Bitmap Index Scan on index_projects_on_last_activity_at  (cost=0.00..80595.66 rows=481014 width=0)
                                 Index Cond: (last_activity_at >= '2018-04-01 00:00:00+00'::timestamp with time zone)
(12 rows)
```
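The histogram query itself isn't quoted in these comments; judging from the plans (ntile(10) ordered by build_timeout, filtered on shared_runners_enabled and recent activity), a sketch of it would look roughly like this (the column aliases are assumptions):

```sql
-- Sketch only: bucket recently active projects into deciles by build_timeout.
SELECT quantile,
       count(*)           AS projects,
       min(build_timeout) AS min_timeout,
       max(build_timeout) AS max_timeout
FROM (
  SELECT build_timeout,
         ntile(10) OVER (ORDER BY build_timeout) AS quantile
  FROM projects
  WHERE shared_runners_enabled
    AND last_activity_at >= '2018-04-01 00:00'
) AS stats
GROUP BY quantile
ORDER BY quantile;
```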
How frequently is this query going to be run? If it's once a day it's not really a problem but if it's on every job or every five minutes this isn't the direction we want to go in...
@_stark It's a one-time query. We need the data to decide what the best timeout value is to set on Shared Runners :)
```
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=732181.52..732307.04 rows=772 width=59)
   Group Key: stats.status, stats.quantile
   ->  Sort  (cost=732181.52..732200.83 rows=7725 width=27)
         Sort Key: stats.status, stats.quantile
         ->  Subquery Scan on stats  (cost=731450.92..731682.67 rows=7725 width=27)
               ->  WindowAgg  (cost=731450.92..731605.42 rows=7725 width=27)
                     ->  Sort  (cost=731450.92..731470.23 rows=7725 width=23)
                           Sort Key: ((ci_builds.finished_at - ci_builds.started_at))
                           ->  Bitmap Heap Scan on ci_builds  (cost=182716.13..730952.06 rows=7725 width=23)
                                 Recheck Cond: (((status)::text = ANY ('{success,failed,canceled}'::text[])) AND ((type)::text = 'Ci::Build'::text) AND (runner_id = ANY ('{37397,37398,157328,157329}'::integer[])) AND (project_id = ANY ('{13083,278964}'::integer[])))
                                 Filter: (finished_at >= '2018-04-12 00:00:00'::timestamp without time zone)
                                 ->  BitmapAnd  (cost=182716.13..182716.13 rows=397517 width=0)
                                       ->  Bitmap Index Scan on index_ci_builds_on_status_and_type_and_runner_id  (cost=0.00..88063.98 rows=3599615 width=0)
                                             Index Cond: (((status)::text = ANY ('{success,failed,canceled}'::text[])) AND ((type)::text = 'Ci::Build'::text) AND (runner_id = ANY ('{37397,37398,157328,157329}'::integer[])))
                                       ->  Bitmap Index Scan on index_ci_builds_on_project_id_and_id  (cost=0.00..94648.04 rows=7023388 width=0)
                                             Index Cond: (project_id = ANY ('{13083,278964}'::integer[]))
```
It's using the index on project_id as well as the index on (status, type, runner_id), but since they're separate indexes it has to scan them separately and then intersect the results. This is compounded by the fact that you're matching multiple statuses and multiple runner_ids, so the scan on that index is also not as effective as it could be.
If you could restrict it to a single status (perhaps running a separate query per status), that might be better. You might also try doing the two projects separately.
We may have to create an index on (project_id, runner_id) where status IN ('success', 'failed', 'canceled') and type = 'Ci::Build'. That would let you do each project individually quite efficiently.
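For illustration, a sketch of such a partial index (the index name is made up, and adding it would of course go through the usual migration review):

```sql
-- Hypothetical partial index: lets Postgres resolve project_id + runner_id lookups
-- directly for finished Ci::Build rows instead of intersecting two bitmap scans.
CREATE INDEX CONCURRENTLY index_ci_builds_on_project_and_runner_finished
  ON ci_builds (project_id, runner_id)
  WHERE status IN ('success', 'failed', 'canceled')
    AND type = 'Ci::Build';
```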
If I haven't messed anything up in the queries, it looks like at least 90% of jobs on GitLab.com's Shared Runners finish in less than 6 minutes!
I think I'll add standard deviation fields to these queries to check the extremes in each of the quantiles.
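The duration query isn't quoted above either, but based on the plan, a version extended with a standard deviation column could look roughly like this (the aliases and exact column list are assumptions):

```sql
-- Sketch only: per-status duration deciles with a stddev column for spotting extremes.
SELECT status,
       quantile,
       count(*)                                  AS jobs,
       min(duration)                             AS min_duration,
       max(duration)                             AS max_duration,
       stddev_samp(extract(epoch FROM duration)) AS stddev_seconds
FROM (
  SELECT status,
         finished_at - started_at                             AS duration,
         ntile(10) OVER (ORDER BY finished_at - started_at)   AS quantile
  FROM ci_builds
  WHERE type = 'Ci::Build'
    AND status IN ('success', 'failed', 'canceled')
    AND runner_id IN (37397, 37398, 157328, 157329)
    AND project_id IN (13083, 278964)
    AND finished_at >= '2018-04-12 00:00'
) AS stats
GROUP BY status, quantile
ORDER BY status, quantile;
```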
I needed to update the query for project build_timeouts, since most of the projects are using the default of 3600! I've additionally checked the histogram for projects with build_timeout < 3600 and executed another query for build_timeout > 3600. After this I executed the three queries once again, but changed the date requirement to last_activity_at >= '2018-01-01 00:00'.
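Following the same decile pattern as the sketch above, the split around the 3600 s default could also be summarised in one pass; this is just a sketch, not the exact query behind the tables described below:

```sql
-- Sketch only: recently active projects at/below vs. above the 3600 s default timeout.
SELECT count(*) FILTER (WHERE build_timeout <= 3600) AS at_or_below_default,
       count(*) FILTER (WHERE build_timeout >  3600) AS above_default
FROM projects
WHERE shared_runners_enabled
  AND last_activity_at >= '2018-01-01 00:00';
```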
The first table counts all projects with shared_runners_enabled = 't' that have been active since 2018-01-01. Between 99% and 100% of such projects use a timeout lower than or equal to 1h.
The second table is the histogram counting the same projects, but filtered with build_timeout > 3600, i.e. the last 1% of the previous histogram. Between 54% and 55% of these projects use a timeout lower than 3h. This means that roughly 99.54-99.55% of all active projects are using a timeout lower than 3h.
After a discussion we've decided that a 3h timeout for Shared Runners would be a good compromise. It's 3 times more than what most currently active projects are using and still feels reasonable.
To apply this we should:
find which projects will be affected when we set the 3h timeout (see the sketch after this list),
notify owners of such projects, specifying when we'll update the timeout,
apply the timeout to Shared Runners.
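A minimal sketch of how the affected projects could be listed, assuming the same activity window as in the analysis above (the exact criteria belong in the follow-up issue):

```sql
-- Sketch only: recently active projects whose configured timeout exceeds 3 h (10800 s).
SELECT id, build_timeout
FROM projects
WHERE shared_runners_enabled
  AND last_activity_at >= '2018-01-01 00:00'
  AND build_timeout > 10800
ORDER BY build_timeout DESC;
```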
I'll create a separate issue to track this (done: #4070 (closed)).
As for GitLab CE/EE jobs, 99% of them succeed in less than 30 minutes. The maximum duration of a successful job (counting jobs since 2018-04-01) was 1h40min.
We should check:
why some of the jobs ran so much longer than most,
what causes 95% of failed jobs to run for longer than 1h,
decide what the best timeout is. For now 45min seems reasonable, while still leaving some headroom for jobs that take longer than the usual <30min.
@stanhu @rymai Since you've been involved in making the pipelines faster, with our high-CPU machines history and other improvements in the .gitlab-ci.yml configuration, I'd like to hear your opinion on such a timeout :)
@tmaczukin Awesome data. 45 minutes seems like it doesn't give a lot of leeway if on average most jobs finish in 38 minutes. Most of the time it's a bit of a mystery why the job took so long: it could be a slow VM, slow clone on GitLab.com, many failed retries in the specs, etc.
Also, isn't it better to give a job 15 more minutes rather than retry and fail again? In other words, would we be more comfortable setting it at 60 minutes to start?
@tmaczukin Given the data, I guess 45 minutes should be reasonable, but I don't have a strong opinion regarding 45 vs 60 minutes to start with. This is a two-way door decision anyway, so we can just pick one and look at the data again in a few weeks.