The `interruptible` keyword allows GitLab to cancel pipelines that are out of date, in order to save compute costs. I took a quick glance at our CI files and couldn't see it being used.
For merge requests, automatically cancel all pipelines except the latest one. If people want to protect a pipeline from being canceled, they can play the `dont-interrupt-me` manual job.
For master, the `dont-interrupt-me` job starts automatically, as we don't want any master pipelines to be canceled.
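A minimal sketch of what this could look like in `.gitlab-ci.yml` (illustrative only; apart from `dont-interrupt-me`, the job names and exact `rules:` conditions here are assumptions, not our actual config):

```yaml
# Sketch assuming GitLab 12.4+ (`interruptible`, the `.pre` stage, `rules`).

# Jobs extend this template so their pipelines can be auto-canceled
# when out of date.
.interruptible:
  interruptible: true

# Hypothetical example job opted in to auto-cancellation.
rspec:
  extends: .interruptible
  script: bundle exec rspec

# Guard job: a non-interruptible job that, once running, protects the
# whole pipeline from auto-cancellation.
dont-interrupt-me:
  stage: .pre
  interruptible: false
  script: echo "This pipeline will not be auto-canceled."
  rules:
    # On master, run automatically so master pipelines are never canceled.
    - if: '$CI_COMMIT_BRANCH == "master"'
    # On merge requests, offer it as a manual opt-out from cancellation.
    - if: '$CI_MERGE_REQUEST_IID'
      when: manual
      allow_failure: true
```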
That's a good idea. However, the way I understand it, it only interrupts a pipeline that is in the pending state. In my experience, our runners are very fast to pick up jobs, so pipelines don't stay long in this state.
@rymai From what I understand from the documentation, `interruptible` can mark running jobs as if they were pending (i.e. interruptible), so if we set `interruptible` on all jobs, we can actually make it cancel running pipelines too!
I'm a bit excited about this, except that it might be a bit painful to apply it to all jobs. I wish there were a way to do this globally, but I guess not for now.
I thought it would do exactly what I was describing, but actually they're different. Setting `interruptible` on all jobs means only the latest pipeline will run, so if we keep pushing, no pipeline will ever finish.
What I was describing was waiting until the first running pipeline is done, so if we keep pushing, the old pipeline will still finish and the latest pipeline will be skipped.
So they cancel the exact opposite set of pipelines. Both have advantages and disadvantages, so I think we can just do whatever is easier first, and setting `interruptible` is probably easier, and it's also dogfooding.
Hmm, was the documentation improved recently, or did I really misread it entirely? 😛 Anyway, that's great news then!
It's probably much improved, because it seems pretty clear to me :P
I think this is useful for MRs, where the new commits are controlled (compared to master, which always gets new merges), but the question is: is it possible to have jobs interruptible on MRs but not on master?! 🤔
😞 Feels like we're back to duplicating the jobs if we want them to act differently on master and in merge requests, as this doesn't seem to be supported by `rules`.
Sad. I do think we should not add `interruptible` on master, otherwise we won't see green master pipelines during work days.
Now, maybe adding an extra job in the first stage for the merge request might be easier after all...
Actually! We have a workaround. Look at this in the documentation:
> Note: Once an uninterruptible job is running, the pipeline will never be canceled, regardless of the final job’s state.
For the master pipeline, we can insert a new job in the first stage which is not interruptible! That way master pipelines are unlikely to be interrupted, because that dummy job is the first thing they run.
For merge requests, we don't have this job, so the whole pipeline should be interruptible.
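Concretely, the blocking job could look something like this (a sketch only; the job name and `rules:` condition are illustrative):

```yaml
# Hypothetical master-only guard job. Per the docs note quoted above,
# once this non-interruptible job is running, the pipeline can no longer
# be auto-canceled. Merge request pipelines never get this job, so they
# stay fully interruptible.
dont-interrupt-master:
  stage: .pre                 # first stage, so it gets picked up immediately
  interruptible: false
  script: echo "Protecting this master pipeline from auto-cancellation."
  rules:
    - if: '$CI_COMMIT_BRANCH == "master"'
```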
@thaoyeager is the PM for CI, so they'd be best able to help here or can loop in the right folks.
@thaoyeager - just wanted to include you, so you could see the discussion here and perhaps provide some input. It looks like we may have to implement a relatively hacky workaround so that jobs aren't interrupted on master but are on dev branches. (We wouldn't want to abort a deploy to the website, for example, each time a new pipeline was scheduled)
As an aside - I am curious how Merge Trains and this keyword interact.
@joshlambert If we need an interim solution immediately for cost savings, I think we should proceed with @godfat's suggestion in #210058 (comment 305693536) to allow pipelines to always run uninterrupted on master; notably, this hack of adding a single job seems easy to remove once we have a more elegant solution, whether that's implementing #209864 (closed) or #211449 or something else.
@ogolowinski It appears Allow only forward deployments ensures the most current deployment is not overwritten by a less current one, whereas this issue is about aborting pipelines running on out-of-date commits (to avoid the cost of unnecessarily running a pipeline, at $4 per pipeline), so these appear to be different issues.
The reason I pinged you was to ask about the impact on merge trains when a pipeline is aborted after being added to the train.
@thaoyeager there are two use cases in which an MR is dropped from a train:
- Cancel means a user removed a merge request from a merge train (e.g. the user realized that the MR had a bug).
- Abort means the merge request was removed from a merge train by the system (e.g. there was a merge conflict that requires user interaction to resolve).
In both cases it does not automatically retry the MR; the user needs to do so manually. See #12136 (closed).
Awesome work - thanks everyone! I opened a follow-up for the www-gitlab-com side, which sees a lot of "suggestion-style" workflows but will probably have a lower impact/priority due to reduced CI pipeline cost: gitlab-com/www-gitlab-com#6995 (closed)
Team, we should evaluate whether we achieved the goal set out below.
Yeah, there is a "Validation" part in the issue to verify the impact of this change: #210058 (closed), but I wouldn't say this particular number is the goal. I'm curious @joshlambert how you came up with the $50-60K number? The total for March is $68K, hence saving $50-60K out of it seems a lot?
@rymai - to your point it's not going to save us that much. I think I was just referencing the total cost of the pipelines, rather than how much we could save.
@davis_townsend - Originally Engineering Productivity was expecting some cost savings, but there may not be much given the low number of cancelled jobs (compared to the total number of jobs). Have you been able to identify a cost savings for gitlab-org/gitlab pipelines since !27611 (merged) was merged on 2020-03-27? I'm not sure if there are charts or data we're not aware of that may provide better insight than the charts above do.
I've identified that the `total_build_duration` for cancelled pipelines since 2020-04-01 is on average about 16% of that of non-cancelled pipelines (25565.944257 / 156086.162524), and that cancelled pipelines represent about 17% of all pipeline executions over the same timeframe: 1222 / (1222 + 5674).
My hypothesis was a savings of about $300 per day based on build execution duration. This assumed a cost per job of $0.025 and a conservative estimate of 12,000 additional cancelled jobs per day (12,000 × $0.025 = $300), using calculation methods similar to the cost per job (runner price × execution time).
I realize some items are missing from these assumptions (when a job is cancelled, what type of runner it's on), but a range of $150-$300 per day seems reasonable based on the cost per job and the number of cancelled jobs.
The queries I was using for the `total_build_duration` analysis are below:

```sql
WITH cancelled_pipeline_build_durations AS (
    SELECT
        pipes.ci_pipeline_id,
        blds.ci_build_name,
        datediff('seconds', blds.started_at, blds.finished_at) AS build_duration
    FROM analytics.gitlab_dotcom_ci_builds blds
    JOIN analytics_staging.gitlab_dotcom_ci_stages stgs
        ON stgs.ci_stage_id = blds.ci_build_stage_id
    JOIN analytics_staging.gitlab_dotcom_ci_pipelines pipes
        ON pipes.ci_pipeline_id = stgs.pipeline_id
    WHERE ci_build_project_id = 278964
        AND pipes.status = 'canceled'
        AND date_trunc('day', blds.created_at)::date >= '2020-04-01'
),

completed_pipeline_build_durations AS (
    SELECT
        pipes.ci_pipeline_id,
        blds.ci_build_name,
        datediff('seconds', blds.started_at, blds.finished_at) AS build_duration
    FROM analytics.gitlab_dotcom_ci_builds blds
    JOIN analytics_staging.gitlab_dotcom_ci_stages stgs
        ON stgs.ci_stage_id = blds.ci_build_stage_id
    JOIN analytics_staging.gitlab_dotcom_ci_pipelines pipes
        ON pipes.ci_pipeline_id = stgs.pipeline_id
    WHERE ci_build_project_id = 278964
        AND pipes.status != 'canceled'
        AND date_trunc('day', blds.created_at)::date >= '2020-04-01'
),

aggregated_build_durations_by_pipeline AS (
    SELECT ci_pipeline_id, sum(build_duration) AS total_build_duration
    FROM cancelled_pipeline_build_durations
    GROUP BY 1
),

completed_aggregated_build_durations_by_pipeline AS (
    SELECT ci_pipeline_id, sum(build_duration) AS total_build_duration
    FROM completed_pipeline_build_durations
    GROUP BY 1
)

SELECT 'cancelled' AS pipeline_status, AVG(total_build_duration), count(*) AS number_of_pipelines
FROM aggregated_build_durations_by_pipeline
UNION
SELECT 'completed' AS pipeline_status, AVG(total_build_duration), count(*) AS number_of_pipelines
FROM completed_aggregated_build_durations_by_pipeline
```
@kwiebers There are a couple of things I'm still investigating, but you may be interested in some of this in the meantime. I can't seem to get my job count to line up with the job count from the query provided, so I need to find out why that's the case. Something else I noticed in the data is that some cancelled jobs don't have `started_at` times, which could affect the duration calculations a little bit, although assuming these should have a duration of 0, then that is correct today.
So we do see minor changes to failed and canceled jobs, which is expected, although it doesn't seem to be a huge difference. You can see from that graph that the job count of canceled jobs has gone up by about 4k, although the average CI build duration for canceled jobs has not gone down at all, which I assumed would be the case. The job count of failed jobs has also been going up, so I'm curious whether some of these canceled jobs are being counted as failed.
> Something else I noticed in the data is that some cancelled jobs don't have `started_at` times, which could affect the duration calculations a little bit, although assuming these should have a duration of 0, then that is correct today.
@davis_townsend Thanks for looking into it! I think this is actually the crux of the problem here. I've run some experiments:
I queried two jobs:

- One with a NULL `started_at`, an empty duration, and an empty estimated cost
- One with a non-NULL `started_at`, a duration of 895 (seconds), and an estimated cost of 0.023618045
We would expect an `AVG(duration)` of 895 / 2 = 447.5 and an `AVG(estimated_cost)` of 0.023618045 / 2 = 0.0118090225, right? Well, no: `AVG(duration)` actually returns 895, and `AVG(estimated_cost)` returns 0.023618055556!
So basically, the canceled jobs without a `started_at` aren't taken into account in the `AVG`!
However, the reason they aren't being counted is that there is no `runner_id` for these jobs, since they never get picked up, so they get excluded from the join results. This means the cost number should technically not be affected, but as mentioned, the average and counts are.
But the weird thing is that we see an increase from about 5k to 20k jobs that never get started, yet I don't see a corresponding drop in the job count of successful or failed jobs, which I would have expected. So it's like there are more un-started jobs, but there was also an equal increase in the overall number of jobs that made up for it, unless I'm missing something here.
@davis_townsend Also, I think `datediff(second, builds.created_at, builds.finished_at)` isn't correct, because jobs are created as soon as their pipeline is created, but depending on which stage they're in, they could start only after many minutes!
I think a better way to calculate a job's duration is `coalesce(datediff(second, builds.started_at, builds.finished_at), 0)`, which coalesces a missing duration (e.g. for jobs that are canceled before they even start) to 0, so that those jobs are taken into account in averages, etc.
Using what you mentioned, I get results similar to yours. My only point is that although there does seem to be a downtrend, it's hard to tell whether this is just fluctuation due to other changes or due to this change in particular, as it seems to have started before Mar 27th (on Mar 23rd).
But if we assume it's all due to this change, then it seems to be about 10K more cancelled jobs per day on average (mostly taken from the successful jobs pool), and we can take the difference in average cost per job of each status times the number of jobs to estimate the saved cost: 10,000 jobs × ($0.026/job - $0.003/job) = $230/day, or ~$84K/year.
I looked at the billing console data as well, but it's hard to see any change, since this is a smaller portion of the cost and it may have increased from general usage during the same time period. Without being able to directly tie this portion of usage back to the billing data, this estimation is probably the best we can do.
- Week of Mar 9th, we averaged 6,800 cancelled jobs per weekday.
- Week of Apr 6th, we averaged 22,600 cancelled jobs per weekday.
- The difference is around 22,600 - 6,800 = 15,800 cancelled jobs per weekday.
- Calculating per year, with 5 weekdays per week and 52 weeks per year: 15,800 (jobs) × 5 (days) × 52 (weeks) × $0.026 (cost per job) = $106,808 per year.
So this is roughly in line with @davis_townsend's projection above, with both calculation methods pointing to the ballpark of ~$84K-$100K savings per year.
I recommend going with the conservative side until we get better data on trends.
> However, the reason they aren't being counted is that there is no `runner_id` for these jobs, since they never get picked up, so they get excluded from the join results. This means the cost number should technically not be affected, but as mentioned, the average and counts are.
However, jobs that are canceled after being started would impact the average, which may or may not be what we want, as I think the average job cost should ideally reflect the average cost of successful (and potentially also failed) jobs only.
@davis_townsend @joshlambert I think we're good to confirm that leveraging `interruptible` in the main gitlab project is saving us $84K-$100K per year, or $7,000-$8,300 per month. We've made progress and will continue identifying other areas for cost efficiency.