It's possible to want an after_script step in a .gitlab-ci.yml where it's important that it always runs. However, the current behavior of after_script is that if a build times out, the cleanup step is skipped (even though this does not appear to be the original intent: https://gitlab.com/gitlab-org/gitlab-ci-multi-runner/issues/102#note_11619214).
Users may be able to use the Job Event webhook to shut down or clean up external environments. It should be assumed that no data outside the webhook payload will be available unless it is explicitly stored in global variables or by some other method, because the runner may have lost connectivity or may otherwise be unable to provide the data required for the desired action to be taken.
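For illustration, a minimal sketch of the kind of job users expect this to cover (job name and scripts are hypothetical); today the after_script below is skipped when the job is cancelled or times out:

```yaml
deploy_review:
  stage: deploy
  script:
    - ./provision-environment.sh   # creates external resources
    - ./run-tests.sh
  after_script:
    # expected to run on success, failure, cancel, and timeout,
    # but currently skipped on cancel and timeout
    - ./teardown-environment.sh
```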
This page may contain information related to upcoming products, features and functionality.
It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes.
Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.
@markglenfletcher: When I tested when: on_failure it didn't run that stage for me. I had two stages, build and build_cleanup with on_failure. The second stage didn't trigger when I cancelled.
@ayufan @zj I don't see output from after_script in my logs when I cancel. Do I need a newer GitLab CI runner? EDIT: I'm running the latest runner via Ubuntu apt-get
```yaml
stages:
  - test

test:
  stage: test
  script:
    - sleep 120
  after_script:
    - echo cleaning now
```
When I run normally, I do see 'echo cleaning now'. However, when I cancelled I expected to see the echo cleaning now text at the bottom of my logs.
Update: I don't think after_script: is working as intended.
If I understood correctly, the 'after_script' concept is the analogue of 'before_script'. It is global, it does not belong to a job, and you cannot apply tags to it as you can with jobs.
So, the indentation in your example is semantically misleading.
My experience was that the after_script commands were executed after every job (similarly to before_script). Therefore, it is not suitable for clean-up if you have a build and a test stage, because it will run after the build stage, and the test stage will not find anything to test.
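A minimal sketch of that pitfall, assuming a shared workspace (stage names and commands are illustrative): the global after_script fires after the build job as well, tearing down what the test stage needs:

```yaml
stages:
  - build
  - test

# A global after_script runs after *every* job, not once at the end of the pipeline.
after_script:
  - rm -rf build-output/   # illustrative teardown

build:
  stage: build
  script:
    - make build            # produces build-output/

test:
  stage: test
  script:
    - make test             # build-output/ was already torn down after the build job
```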
@zj is it possible that after_script is running at the right time, but its output isn't being written to the logs on the UI? My jobs are indeed being cleaned up when I cancel, even though the output isn't logged anywhere.
@mespak before_script can also be per job. It overrides the global before_script if defined. Again, I'm not sure if that's intended behaviour or not.
+1
In our case, we use Kitchen CI with kitchen-docker, and for each stage in Kitchen CI we have a corresponding stage in gitlab-ci.yml.
When we cancel the pipeline, the command kitchen destroy is never called.
Then we have to destroy all the Docker instances created by Kitchen CI manually.
We cannot apply an after_script after each stage that destroys and then recreates/reconverges (kitchen create + converge) / re-setups (create + converge + setup) / re-verifies (create + converge + setup + verify) and finally destroys... the pipeline duration would balloon.
To chime in on the canceling -- it could be for a multitude of reasons. We have jobs that are a part of deployments that would want some clean up when canceling.
We would benefit from this cleanup. So a use case is when a job is already erroring, and it's going to take a long time for it to finish, then we usually cancel it. The problem is that the cancellation does not trigger the after_script, and therefore cleanup does not take place.
The after_script doesn't appear to run on cancels from the UI.
Also, what's with the Zendesk URLs? I get "oops The page you were looking for doesn't exist" errors for both of them. If there's additional relevant info, can we keep it in this ticket and accessible by all?
I typically press cancel in the UI to free up resources from a build I know is going to fail, so that I can use them for the next build which includes a fix.
If I remember correctly, manual jobs can also be executed even if a pipeline has been canceled. A possible way to overcome the problem is to make the cleanup job manual (yes, you probably also need another one for cleanup when you don't cancel).
After cancelling a pipeline, you can run the cleanup job manually.
This is not a solution, but it may work until this is implemented.
Agreed that an additional job with when: on_cancel wouldn't solve the problem, because the additional job may land on a different runner. Often the desire for cleanup involves resources on the host that ran the cancelled job. Something similar to after_script would definitely solve the problem for us.
My specific use case is for docker containers, networks, and volumes that will get orphaned if a job is cancelled and then cause problems on future jobs that run on that host.
This feature would be very useful. see gitlab-ce#34861.
This would be great. I kick off a bunch of parent pipelines when a commit is made to a submodule, and it would be good to be able to cancel those if the submodule CI is canceled. An on_cancel: option would enable this.
I think the real issue here is that after_script doesn't actually fire on a timeout or cancellation. This should really be a bug-fix issue, making after_script always work as an after-script instead of only running in certain cases. An environment variable could be set so that the after-script can know the status of the build (success, fail, cancel, timeout).
I'm having an issue with a job freezing up and noticed that when a job is terminated from timing out (1 hour default) the after_script did not get fired then either. Over time canceled and timed-out jobs consume a lot of resources and have to be manually cleaned up.
Canceled/timed-out jobs should call after_script, or a to-be-introduced finally_script. CC @ayufan @markpundsack @bikebilly In fact, finally_script for canceled/timed-out jobs was promised as part of the 8.7 CI Plan https://gitlab.com/gitlab-org/gitlab-ce/issues/14717. It never made its way into GitLab. It's really needed. (We're EEP customers.)
There is currently no way to run docker run inside a job and have those containers reliably cleaned up on job cancel or timeout. Our job definition is something like this:
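(A hedged sketch of that kind of job definition; the image name is illustrative, the test command is the one discussed below, and the docker stop in after_script is exactly the part that never runs on cancel or timeout.)

```yaml
perl-tests:
  script:
    # start the test run in a sibling container via the host's docker.sock
    - docker run --rm --init --cidfile /tmp/docker-cid my-base-image bash -c 'prove -v -I perl -j5 -r t/perl/unit'
  after_script:
    # never reached on cancel/timeout, so the sibling container keeps running
    - docker stop "$(cat /tmp/docker-cid)" || true
```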
Since after_script won't run on timeout/cancel, docker stop won't run.
I also passed --init, hoping that GitLab CI would SIGTERM the processes inside the container. --init passes signals to its children, effectively allowing the test runner to react to ^C or SIGTERM by exiting, which the init process spots and then stops, stopping the container. Unfortunately, gitlab-runner SIGKILLs (9) its job containers, so all children are SIGKILLed too (remember that SIGKILL, unlike most other signals, cannot be caught or handled: the kernel kills the process right away without letting it know). The docker run process, a Docker client process, gets SIGKILLed too. No signal is ever passed to the underlying bash -c 'prove -v -I perl -j5 -r t/perl/unit' in the container, because that container is NOT a child of the job container.
```console
root@gitlab-ci-runner-3:~# docker ps
CONTAINER ID   IMAGE                                                      COMMAND                  CREATED         STATUS         PORTS   NAMES
3dd44c50639c   git.dreamhost.com:5001/dreamhost/ndn/ndn-base:build-8175   "bash -c 'mkdir $DBH…"   4 minutes ago   Up 4 minutes           ecstatic_mccarthy
38c30ccccf58   5d152d55326b                                               "docker-entrypoint.s…"   6 minutes ago   Up 6 minutes           runner-f2f196b4-project-8-concurrent-0-build-4
```

```console
root@gitlab-ci-runner-3:~# docker ps
CONTAINER ID   IMAGE                                                      COMMAND                  CREATED          STATUS          PORTS   NAMES
3dd44c50639c   git.dreamhost.com:5001/dreamhost/ndn/ndn-base:build-8175   "bash -c 'mkdir $DBH…"   13 minutes ago   Up 13 minutes           ecstatic_mccarthy
```
As we can see, my explanation matches the visuals. GitLab Runner SIGKILLs the job container and its children (the docker run client process) and that's it. No real signal (e.g. SIGTERM or SIGINT) is ever passed to the docker run process, making it unable to inform the init process of my container that it should stop working.
Lack of finally_script and SIGTERM-first is a serious flaw, making it impossible to use Docker-in-Docker reliably.
tl;dr with action points
GitLab Runner doesn't run after_script on timeout/cancel, making it impossible to stop Docker-in-Docker (via docker.sock mount) containers manually using docker stop.
Action point: implement finally_script, or make after_script run on cancel/timeout.
GitLab Runner uses SIGKILL, which makes it impossible for Docker-in-Docker (via docker.sock mount) containers to terminate gracefully when asked to.
Action point: SIGTERM your job container first, SIGKILL it after 15 seconds or so if no reaction.
Action point: start GitLab job containers with --init so SIGTERM signals are passed to children. (I think it's already a case, docker inspect [containerid] of a container that looks like runner-f2f196b4-project-8-concurrent-0-build-4 says Init: true).
Implementing (1) (finally_script) makes it possible to manually terminate DIND containers using the docker stop command. This is a full solution that always works IF a user is aware they always have to docker stop in finally_script. It's not common knowledge, but it's a technically complete solution.
Implementing (2) (the SIGTERM-first approach) makes it possible to terminate DIND containers that are executed with the --init switch and whose internal processes react to signals. In general, the community isn't well-informed about --init, so not everyone uses it, but many do. Also, while test runners typically respond to signals and terminate gracefully, not everything does - by design, coding incompetence (lol), or due to a hung process.
All in all, whether only (1) is implemented or both (1) and (2), every place in the GitLab documentation that discusses DIND should provide guidance on using docker stop in finally_script, and on using --init for convenience.
@ayufan @markpundsack @bikebilly We're EEP customers and need at least (1) really badly. It wasn't until yesterday/today that I realized why our GitLab runners have been unreliable for months, and why I observed 100% CPU usage across 8 cores on two runners multiple times while other runners (we have 6 in total) were completely idle. Now we know the reason: jobs are scheduled to runners based on how many jobs a particular runner is running, not by CPU or memory usage of the host. Some runners can be overloaded at 100% CPU due to unit tests run in DIND that were not terminated on cancel/timeout. But the runner thinks there are no jobs running and that it's the best candidate to pick up the next job. Then people get frustrated that a job that normally takes 20 minutes is taking 90 minutes, cancel it, rerun, and the retried job gets scheduled to that "perfect" unoccupied runner, which is in fact terribly overloaded. Other jobs time out, but the DIND containers remain, and it's just a disaster.
I also wanted this feature for my use case where, in one of our black-box test jobs, we spin up a set of Docker Compose services that need to be cleaned up after the job is done, whether it succeeds, fails, or is cancelled. It becomes a PITA for my team to manually delete the stuff on the CI machine.
Right now we have a fix for this by:
Suppose I have a job called blackbox, then
I create a when: manual job called cancel_blackbox in the first stage of my build.
```yaml
stages:
  - cancel_blackbox
  - blackbox
```
So, when I need to cancel, I run the cancel_blackbox instead of clicking the default cancel button in the CI build log UI. It may fix the problem you have @abuerer.
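A hedged sketch of that workaround (the docker-compose cleanup and run commands are illustrative):

```yaml
stages:
  - cancel_blackbox
  - blackbox

cancel_blackbox:
  stage: cancel_blackbox
  when: manual
  script:
    # tear down whatever blackbox may have left behind
    - docker-compose -f blackbox/docker-compose.yml down --volumes || true

blackbox:
  stage: blackbox
  script:
    - docker-compose -f blackbox/docker-compose.yml up --build --abort-on-container-exit
```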
Actually, I love this product and I want to help implement this feature in the CE version if possible. Who can I talk to to get started on this feature?
We give a deadline of likely 10s before we send SIGKILL, during which you can do a graceful shutdown. Making traps work will allow you to use graceful shutdown. Maybe the SIGKILL deadline will become configurable with some limit, like up to 5 minutes, but it will be 10s by default.
I've gone through and updated the description of this issue with details from the comments here and elsewhere. Please take a look if you're interested in this issue and jump into the discussion if you have any feedback.
There is a second issue discussed here related to GitLab's use of SIGKILL in Docker in Docker scenarios. I feel this is a separate issue so I have not included it in the resolution for making after_script work in this issue. I'm open to hearing if they are in fact fundamentally tied together.
I need this a lot. I have some jobs to view server logs, but there's no interactive interface to stop the log; there's only the Cancel button. After pressing the Cancel button, the log is still running on my server. More importantly, the gitlab-runner gets stuck with the SSH tunnel and no more jobs can be run; I usually have to run sudo gitlab-runner restart.
If there were a final_script, I could clean up the logging process. Presently, I have to use sleep in a background process to kill the log after a short period of time.
That said, every person who uses DIND with the socket as a volume will have to be aware of having to use docker stop in after/finally, or risk a "leak", and the GitLab documentation related to DIND should reflect that. Which is probably more work than simply implementing the SIGTERM, wait 30 seconds, SIGKILL approach. There's an MR for it, and while it took a weird approach, it's proof that the code change is extremely minimal. It's a waste of time to not go for this feature! :) gitlab-runner!987 (diffs)
@Nowaker I understand that that's an issue, but as best I can tell it's a separate issue from the one this issue is intended to resolve (that after_script, or something like it, is not called on timeout or cancel, regardless of whether Docker in Docker is used). Or am I misunderstanding?
Using the Kubernetes executor, a pod with build, helper and service containers keeps running if we cancel the job. That pod needs to be killed as well.
This is an important issue, but given limited capacity on the team we have a number of other gitlab-ce3857523 items to work through before we move on to gitlab-ce3857529a. Moving out to %12.0 for now.
This is really needed to deal with PRs and deploying tests to Kubernetes. Otherwise, a user can cancel things and tie up resources indefinitely. We're seeing folks adopt Jenkins over GitLab because Jenkins can safely always delete temporary resources during cancel.
I'm also looking for this (or something similar). My use case is that I'm provisioning infrastructure (say VMs or Packet.net servers via Terraform) to run my CI. That costs money and I want to be absolutely sure the infrastructure is removed when the CI job completes (except for some extreme scenarios).
after_script seems like a very basic way to achieve this.
Azure Pipelines has a very powerful syntax where you can divide your build in different steps and specify conditions for each step. This allows you to create steps which must always run, steps which should only run on failure, etc.
It would be great to see something similar in GitLab.
I believe our model for this will be started with gitlab-ce#47063
I think that's something different. It looks like gitlab-ce#47063 is about implementing something which is similar to job dependencies in Azure Pipelines. Given a job, you can specify its dependencies and the conditions for those dependencies (e.g. 'run when job A fails' or 'run when job A succeeds and job B succeeds').
Steps and conditions help you split a single job into multiple steps and specify conditions for each of them.
E.g., a job starts with creating a VM, a container, k8s objects, runs tests (and passes/fails based on the results of those tests), but always removes the VM/container/k8s objects once the job has completed.
We moved from Azure DevOps (then VSTS) to GitLab about 1.5 years ago because the CI/CD infrastructure in GitLab was far ahead at the time. However, Azure DevOps now provides CI/CD features which are missing in GitLab.
I think that fixing after_script and adding final_script, which you are talking about, are very different things and have nothing in common with the cancel-job event. You are thinking only about your own restricted cases.
But I want to share another case with you:
When somebody cancels a test execution job, the team needs a notification that the job was stopped.
The team has two notifications - start and stop - described in the job. So, here are the two cases:
A successful start-to-finish job:
before_script: notify about job started
script: job with test execution
after_script: notify about job finished (successful or not, does not matter)
A canceled job which has already started:
before_script: notify about job started
! CANCEL !
no notification that the job was stopped, whether by finishing or by interruption
So, I want to see on_cancel_script in my jobs. Thank you. I hope you will support this idea.
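A sketch of that request (on_cancel_script is the proposed keyword and does not exist today; the notify command is illustrative):

```yaml
run-tests:
  before_script:
    - ./notify.sh "test job started"
  script:
    - ./run-test-suite.sh
  after_script:
    - ./notify.sh "test job finished"
  # requested keyword, not implemented:
  on_cancel_script:
    - ./notify.sh "test job was cancelled"
```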
@jlenny @markpundsack @ayufan Can you think about scheduling this one? This feels like the biggest flaw of GitLab CI right now, in my opinion, and it needs fixing this year rather than in 2020 or 2021, if even that.
It seems like changing build behavior to always run after_script is a popular solution. This change is relatively simple to implement at the job level, but I think it raises a few questions:
If a Pipeline with 100 jobs is started and immediately cancelled, we potentially have 100 after_scripts that will still be queued to run, so is that Pipeline really cancelled? If that behavior is okay, we need to come up with a more descriptive command than "Cancel Pipeline", perhaps "Skip all Build scripts"? "Cancel Pipeline" might remain a true stop-the-presses command, but become a much more drastic action to take, i.e. "Crash Pipeline".
If we "Skip all Build scripts" (or whatever this cancel-but-run-after-script command is called), does that also continue to execute before_script for each Build? If we're talking about moving after_script into it's own special dimension outside of a Build script, it might be counter-intuitive to have it's pre- counterpart remain part of the Build.
Alternatively, we could leave before_script and after_script as part of the Build, to be abruptly cancelled at will, but provide a different method for setting up and tearing down resources at the beginning and end of execution of an entire Pipeline. This would look more like a distinct build or stage, before_pipeline / after_pipeline. While these section would have to do much more dynamic checking about resources that exist and figuring out how to ensure everything's been taken down, it might offer a much cleaner execution structure for the Pipeline as a whole.
Personally, the way I see the flow working is that for each job that is actually run you have a series of the steps, like:
global before_script
job before_script
job dependencies
job script
job after_script
global after_script
... now the problem at the moment is that if the job is cancelled during script, none of the others are called; everything is shut down at that particular point in time. The problem is that when you have a cleanup that you really have to do and you included it in either the job's or the global after_script, it will not be called.
... therefore I don't see your particular problem of creating xxx amount of after_scripts; rather, once the cancellation is identified, break the script part, but do continue with the after-scripts until the CI job is actually considered finalized.
So the execution of after_script for a job would depend on whether or not before_script or script has started? I think we would have to introduce intra-job states to be able to confidently know whether a job has started, and from that whether or not to run the after_script of that job.
I was under the impression that a single job is like a thread. It has prerequisites, an actual task and winding-down/cleanup. If that's not the case then yeah, having at least 3 states would be beneficial: startup for something that can't be interrupted, the actual task (defined by script, something that can be interrupted) and then the post-process (whatever needs to happen afterwards - artifacts, cleanup, both?).
The YAML is there to define the structure and some relations/dependencies. How it is built upon in its final form is irrelevant as long as it follows some understandable flow that ensures the most important tasks are performed for each of the jobs (before-task-after). At least that's what matters to me.
That's also what I've experienced working with, say, Busted for Lua testing. It allows defining a describe which explains a group of tests, then before_each for something that is run before each test, after_each as a way of cleanup, and finally it for the actual test. You might think I'm asking for a lot, but to me it's a simple trio of tasks that have to happen, and only one of them is interruptable (or something that could fail).
A lot of people at my company click "cancel" to cancel a deployment, but clicking cancel doesn't actually stop the deployment, since GitLab is just calling an API to initiate it. There is a way to cancel a deployment, but a script needs to be called.
In other words, I would expect, and want the "cancel" button, to still run the after scripts, or some teardown of some sort. I do not think anyone expects a cancel to be a "stop the world, kill the container". If a job went into a "canceling" mode, that would be very reasonable, I feel.
before/after pipeline steps are not nearly as useful, as the teardown is usually very step specific
Just adding another vote for this issue. We are using Terraform to create VMs on OpenStack, so if someone hits cancel or the job times out (my case), those VMs have to be cleaned up manually, which is a huge pain.
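For example, a hedged sketch of that Terraform flow (job layout is illustrative); the destroy in after_script is skipped on cancel or timeout:

```yaml
provision-and-test:
  script:
    - terraform init
    - terraform apply -auto-approve      # creates the OpenStack VMs
    - ./run-tests.sh
  after_script:
    # skipped today on cancel/timeout, leaving the VMs to be cleaned up by hand
    - terraform destroy -auto-approve
```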
I have a shell executor with before_script, script and after_script. If I cancel the pipeline, the job still continues through before_script and script, but not after_script. It seems a reliable (instant) termination of the job is required, rather than a solution based on after_script only.
Adding another vote for this issue. When we cancel a build and the cleanup script doesn't run, it 'poisons' the runner because the build leaves behind files owned by another user. When another build tries to start, it doesn't have permissions to remove those files, and so fails.
This is really needed. Our pipeline generates state in the deploy stage on which the provision and test jobs depend.
If pipelines are cancelled in the deploy, provision or test stage, some or all of that state remains, including resource reservations which at some point are exhausted.
Thanks for keeping us updated @jlenny. Can't wait for this one!
When it's implemented, the newly introduced feature interruptible could be extended to also cancel pipelines that are still running but became irrelevant (e.g. force-push happened on any branch, or any push happened on an MR branch). Or alternatively, a new one introduced called cancelable to achieve exactly that.
Sorry for the delay everyone. There's some active changes going into Pipeline processing, I talked with @steveazz about a couple ways to handle this.
The first is to make a change in the runner where cancellation of a build would mean canceling the script portion of the job and moving on to after_script. From the perspective of the Rails app, we could send the job directly to the cancelled state but allow for the runner to continue posting trace information to the job for some period of time after the job has been cancelled, for confirmation of the after_script work. The most obvious, if not best, idea for this window would be to use the length of the runner's hard timeout.
A disadvantage to this is that after_script work becomes effectively uncancellable. To address this, we could introduce a new pre-cancellation state for jobs to indicate that the build has been sent the instruction to cancel, but is still running its after_script portion. If we introduce this extra pre-cancellation state, there could be a two-step "cancel script, cancel after-script" process.
The second is to use the recently-introduced .post predefined stage to move post-job cleanup work into a separate, final job. An advantage of this is that moving the work into its own job gives a bit more control over configuring how your cleanup job runs. The pipeline's .post stage could have a single cleanup-ec2 job that runs with { needs: [ec2-deployment], when: on_failure }, etc. This would likely need some UI changes on the Rails side; cancelling a Pipeline would need to not cancel the specified cleanup jobs, and cancelling each job in a Pipeline one-by-one would be fairly tedious. I think this solution would be more effective and reliable in the case of a job timeout. A disadvantage is that, being decoupled from any particular job, we wouldn't have some specific information like $CI_JOB_NAME when we do cleanup.
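A rough sketch of that second option, using the example names from the paragraph above (the deploy and cleanup commands are illustrative):

```yaml
ec2-deployment:
  stage: deploy
  script:
    - ./deploy-to-ec2.sh

cleanup-ec2:
  stage: .post
  needs: [ec2-deployment]
  when: on_failure
  script:
    - ./teardown-ec2.sh
```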
Both approaches have pros and cons. I think the first is most like what everyone had in mind so far, but I think the second has an advantage in separation of work, and in turn reliability. It's also possible that we should implement both, as they're both viable and could be useful for particular workflows. I'm putting them both to the crowd to see if anyone has strong opinions on them, and see if it's obvious that one should be implemented first.
A separate stage and job is not a good option. There's certain data like Docker container ID that would be lost across stages. Passing things like that via artifacts is extremely ugly. Cleanup script part must run in the context of an original job to be effective, flexible and powerful, just like try-finally from programming languages.
I agree. The original request is the way to go. I don't know about Rails, but in Python it would just be a matter of having the before script and main script run as one sequence of steps in a subprocess, and having a finally clause to run the latter steps no matter why you were finishing up. There should never be a need to cancel a cleanup step; always let it run.
This feels like such a simple yet high value feature. We have been waiting well over a year, and this keeps getting bumped over many more questionable CI features.
@jlenny @erushton Given that having after_script run even after the pipeline is canceled or times out seems to give us the most benefit of what Drew suggested, and matches how people expect it to work, this would require some work on the Runner. As you both know, 12.5 is jam-packed for the Runner team, so it will be hard to deliver anything from the Runner side.
If we decide that we want after_script to be able to update the job trace (like it does now), we would need to make it so that the Runner can update the trace after the job was canceled/timed out. That way we can still have some background work done in 12.5 and have the Runner change (running after_script no matter what) in a later release.
Also, if we decide that the window during which the Runner can update the trace will be the same as the timeout of after_script (which is hardcoded to 5 minutes), we have to be careful of issues like gitlab-runner#2716 (closed).
There is also a proposal from Kamil in #15603 (comment 214835671) that can help us achieve something similar, but not with after_script.
How about having a cancel option and an abort option? Cancel moves to after_script, in which case the abort option is still active to cancel the after_script, and Abort simply aborts without calling after_script? Most of the time people should prefer cancel, but if after_script is stuck or if a job really needs to be cancelled, there's the abort option. I know that would work for me, not sure how it would work for others, but I agree with @nicholas.farrell that this is a pretty important feature, since there's no cleanup option this leaves us with poisoned runners if we weren't able to modify permissions and clean up files appropriately.
I agree with the above comments regarding the two proposals; we would really only benefit from after_script on cancelled jobs if it was run on the same instance that ran the job itself. This allows us to clean up resources local to the instance, like running docker containers.
Would the runner's post_build_script also be run in this case? It allows the runner administrator to define what needs to be cleaned up in addition to the project owners, which would make it easier for everyone (define a runner-wide cleanup of containers rather than having every project implement that themselves). Also, given that after_script typically runs after post_build_script, I think it would make sense to also include post_build_script in this new scenario.
Thanks for the responses everyone. I've updated the issue description with a proposal for implementing changes in Gitlab and the runner. We're always open to more feedback on current plans, and I'll also make updates here as the changes get made.
I disagree that the benefit of after_script on cancelled jobs depends on the same instance being available. While that is needed for things like driving docker directly, I always drive things through Kubernetes and need to clean up after jobs. The instance the job runs on is not important for cleaning up job resources launched in Kubernetes.
So, it sounds like it may need to be configurable?
I've marked this issue as blocked because of the unavoidable runner changes involved in the current proposal. Backend, Part 1 can be worked on, but we'll need to get the runner issue scheduled before we can properly schedule this to be resolved.
In the meantime we should get as much feedback on the proposal as possible.
@jlenny when you have reviewed the proposal I wonder if we should have some kind of kick-off meeting so we make sure everyone is on the same page since this involves backend, runner, frontend, UX
This has missed so many release targets. Can we please get this in? We have resources that are tied to physical hardware that get borrowed during CI and if someone hits the cancel button we have no way to "release" it back to be used again.
@andy Here's a workaround I currently use to make Cancel not as disastrous in consequences...
```yaml
- docker run -dv /var/run/docker.sock:/var/run/docker.sock docker sh -c "sleep 7200; docker stop \$(docker ps -aqf label=job-ndn-$CI_JOB_ID)"
- docker run --rm --init --cidfile /tmp/docker-cid -l job-ndn-$CI_JOB_ID (run whatever)
```
The first line runs a container that waits 7200 seconds and then runs docker stop on the container created by the second line. If the pipeline gets aborted while the second line is running, the first line will ensure it gets stopped - at some point. Note: all jobs use the Docker socket from the GitLab runner host system.
In your case, you want to find a way to schedule the release of resources that lives outside the job. You could expose /tmp/blah.txt from the host to all jobs. Each job would write some info to this file (echo "timestamp some-id" >> /tmp/blah.txt) before reserving the hardware, and then a cronjob running on the host would use /tmp/blah.txt to release hardware by ID and timestamp, as a safety belt for aborted pipelines. Something like that. Very clunky and over-engineered - but what else can we do without finally_script. :(
@crystalpoole & @thaoyeager I'm unassigning @drewcimino from this as he's no longer on the ~"group::continuous integration" and this is blocked so he can't work on it. Also the issue that this is blocked by is currently scheduled for %12.8 so I'd suggest moving this to %12.8 at the earliest.
Given the predecessor work has changed to %12.10, moving this to %13.0. @DarrenEastman whatever you can do to help prevent that from slipping again would be a huge help to delivering this super popular item.
I wonder if in order to simplify backend implementation we could reuse interruptible keyword. This has been proposed by @Nowaker in #15603 (comment 225836144), and I think this could be an interesting solution.
The problem is that currently interruptible defaults to false. I wonder if we should extend the keyword to support config beyond boolean values. Perhaps something like interruptible: graceful
As someone who has now been waiting for years for what I essentially consider to be a bugfix I am very confused by all the complicated alternative implementations proposed in this issue.
To me the issue is that after_script always runs on job success or failure, except when the failure is caused by a timeout. Timeouts are a totally normal failure condition and having them as an exception to after_script always being run when a job ends is what makes this complicated and unintuitive.
Running after_script on timeouts would be the simplest solution from a user perspective for the timeouts issue.
Having some way to also run it on cancelled jobs would be a nice bonus, but I think the timeout case is way more important.
All the other solutions proposed in this issue seems like over-engineered workarounds to me.
I'm picking this up on the runner side.
A couple of things seem unclear from the description:
what would the exact value of the new state be? So far I've used gracefully-canceled
should this issue include any work on supporting trace updates from after_script even on a canceled job? The runner currently stops sending logs, but it seems like GitLab stops handling them too.
Proper implementation could presumably look like something I described below:
Introduce teardown build status
Make it possible to configure interruptible: graceful, which will tell GitLab Runner that this job should be gracefully canceled
The behavior of how we process after_script should only change when someone explicitly adds this configuration entry to their .gitlab-ci.yaml
After cancelling, an interruptible: graceful build transitions to the teardown status, and GitLab Runner stops executing script: and starts executing after_script, with a proper annotation in the trace
After the teardown is complete, GitLab Runner notifies GitLab to change the build status from teardown to canceled
Introducing a teardown status might be a quite complex task, and I wonder if there is a simpler MVC here. Perhaps running after_script on the runner side, without actually sending build traces, and letting GitLab know what happened during the after_script execution is an option, but this might be suboptimal.
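For clarity, what a user would write under this proposal (interruptible: graceful is a proposed value, not something that exists today; the commands are illustrative):

```yaml
deploy:
  interruptible: graceful   # proposed: on cancel, stop script and run after_script
  script:
    - ./deploy.sh
  after_script:
    - ./cleanup.sh           # would run even when the job is cancelled or times out
```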
@grzesiek I see, then does it make sense to reorder a little the deliverables you proposed?
Make it possible to configure interruptible: graceful that will tell GitLab Runner that this job should be gracefully canceled
Expose interruptible to the runner through the requestJob endpoint
When receiving a cancel, the runner would run after_script (without the trace reflecting it) if it sees interruptible: graceful
Add support for the new teardown state. Add support in the runner for sending traces in this state.
I think it makes sense to start with running the step even without a trace, as it will add value to users, and then add the trace in a second iteration. Should we get input from product on that?
I can reflect it in the issues if you think this makes sense.
@jyavorska do you think this is a good solution (and better than the one in the description)?
For context: the idea is that users would set interruptible: graceful on their jobs whenever they want after_script to run upon cancellation.
Then there are the steps towards achieving it outlined above.
I edited the issues to the best of my understanding, @grzesiek you might want to take a look in this one if all the details make sense for the rails part of the solution.
As this issue is blocked by efforts in gitlab-runner#4843 (closed), which is currently scheduled for %13.1, we will not carry this issue over from %13.0 until the Runner issue it depends on is completed.
There is a discussion about this in the runner issue too: gitlab-runner#4843 (comment 341790237). Posting it here for visibility and completeness; we are still talking about technical details there and making some progress.
@thaoyeager We'll need to ensure this issue gets weighted accordingly, which we can do in %13.1 to prepare for a future milestone. I'll add the needs-weight label and put it into %13.1 for weighting purposes.
We at Oracle would love to have this feature as well. There are many times when folks kill pipelines, plus other rare crashes. It would be nice to have this feature, which would always run and ensure that the cleanup tasks are always executed. Currently we end up doing the resource cleanup manually once things pile up enough to cause pipeline instability.
Does not paint a good picture that customers have been asking for this feature for 3 years and nothing has been done :(
I will add my own use case to this. Our pipelines are auto-cancelled when a new commit comes in on a branch. Our CI jobs make use of external resources that they have to reserve. We need to release those resources on cancel, but cannot currently because after_script is not being run.
Reading back through the comments on this issue makes me concerned that fixing this is not a priority.
@thaoyeager as the pre-requisite runner change won't ship in Runner 13.2, gitlab-runner!2044 (closed) and is now slated for 13.3, I have updated the target milestone for this issue to 13.3.
@jyavorska For label hygiene reasons, I'm going to remove the ~"missed-SLO" label since this is now a direction feature, as that label typically pertains to missed bugs.
As a user, it 100% feels like a bug when the after_script doesn't run after you cancel the job.
(edit): I say this because looking at the number of times a fix has slipped, any suggestion that it's a feature and not a bug implies it will not be fixed.
@dmlary I was just mentioning so it ends up mentioned in our release post since lots of people are interested, it doesn't change on its own whether this is scheduled or not (and this one is already scheduled).
@thaoyeager The Runner team is getting close to merging the work needed on the Runner in gitlab-runner!2044 (closed). Before we go ahead with this I wanted to check with you folks whether the proposal still makes sense from a technical and product perspective, since it has changed quite a bit and our last discussion was a while ago.
Product:
We/Users are ok that they have to enable this feature with interruptible: graceful.
We will still upload artifacts on failure
Traces are still going to be updated
Technical:
Rails has to send when: graceful to the after_script step.
When the job is canceled and interruptible: graceful it is going to enter a new state, which is called teardown
The teardown state will still accept traces/artifacts and leave the CI_JOB_TOKEN valid.
Outstanding questions:
When the job is in the teardown state, will the minutes still count?
When the job is in the teardown state, will the job timeout still be in effect? For example, if the job timeout is 1 hour and the user canceled after 40 minutes, does that mean the Runner still has 20 minutes to upload the artifacts? Sometimes artifact upload can take a really long time for large files. The reason I'm asking is that we are discussing the timeout in gitlab-runner!2044 (comment 374833936)
I wonder why change the existing contract of after_script, and make its behavior dynamic based on other properties, when we could introduce finally_script (like try-finally in programming languages) like @ayufan once suggested. It would sure make things way simpler. And is very expressive about what it is.
I wonder, can we move away from depending on the name of a step, and rather have a simple when based workflow to define when to run it?
The suggestion here is that we are tying a special behaviour on after_script, where it would be simply better to not care about after_script (the name), rather have this control logic generic, and execute all steps sequentially based on the when.
We could also describe each step to have image:, timeout: (it already has), etc. Then Rails could be built in any way we want, to even add on_cancel_script/finally_script: in the future if needed, without having to change the Runner.
The suggestion here is that we are tying a special behaviour on after_script, where it would be simply better to not care about after_script (the name), rather have this control logic generic, and execute all steps sequentially based on the when.
Yes, and no. For after_script we will be looking at when directly and making the decision based on it; having the Runner respect it for every single step seems like a more complex change than a small iteration where we respect when for the after_script and then move on to having each step respect it. Does that make sense?
We could also describe each step to have image:, timeout: (it already has), etc. Then Rails could be built in any way we want, to even add on_cancel_script/finally_script: in the future if needed, without having to change the Runner.
I wonder why change the existing contract of after_script, and make its behavior dynamic based on other properties, when we could introduce finally_script (like try-finally in programming languages
We need to have a forward-compatible way where a newer Runner version can be used with an old GitLab version because in GitLab when a job is canceled now it will still accept artifacts, and trace uploads.
Thank you @Nowaker @ayufan for the feedback, you gave me a lot to think about. I know we have discussed when: always for after_script and having it always respected in gitlab-runner#4843 (comment 330853303), but it felt like using when: always would be a behavior change, running a step even if the job was canceled. That is why we have when: graceful.
having the Runner respect it for every single step seems like a more complex change than a small iteration where we respect when for the after_script and then move on to having each step respect it, does that make sense?
Yes, it will be, but we also do it once, and this logic can be fully controlled from Rails later. Today we only have script and after_script.
@thaoyeager @grzesiek I've scheduled a meeting to go over this one more time to make sure that we are on the same page and to cover any outstanding questions. If someone else from CI needs to be involved, please tell me, because I'm not sure who the DRI of this issue is going to be.
The agenda is as follows:
When will Rails send when: graceful to the Runner? If we look at #15603 (comment 379890531), at the moment the user has to set interruptible: graceful, but users want this to be the default behavior
Will Rails introduce a new job state called teardown, and does the Runner have to do anything about this?
When the job is in teardown state, how will timeout work?
Do we still need to enforce a timeout for artifact uploading?
When the job is in the teardown state, will the Runner still be able to upload traces and artifacts, so the CI_JOB_TOKEN is still valid?
As a user I would like interruptible: graceful to be the default, since I always want this behaviour. But I can live with the proposed solution, which could be made the default in a later release if you want. The reason against such flags is that they tend to make CI files larger with every added flag.
Regarding the timeout, it feels like artifacts are not as important as the after_script. It is important that after_script completes, to be sure the cleanup is completed. So for me the proposed timeout for artifacts is OK, especially if today the upload of artifacts is included in the timeout.
@cheryl.li @thaoyeager @DarrenEastman @erushton this has been slipping month to month for over a year now - do you feel there is now a very clear plan to wrap it up this month? Or is there something more we can do to understand and mitigate risks?
It seems like there's still some disconnect on the plan - nothing huge but certainly it's not clear. @steveazz is going to set up a call for early next week to wrap-up any lingering ambiguities.
@jyavorska @thaoyeager Perhaps we need a DRI on ~"group::continuous integration" and ~"group::runner" each, since it requires effort from both groups. We should also clarify whether this is still ~"workflow::blocked", as this is not clear. (I also wonder if we should split this into separate issues so each group can tackle their respective parts.)
Sure, all those things are possible @cheryl.li. Looks like @erushton and @steveazz are setting up a call which is a great first step. @thaoyeager would be the PM DRI already for the CI side. @DarrenEastman is the Runner PM DRI.
@steveazz Thanks for setting up that call for tomorrow but unfortunately I cannot make it. Maybe we can have the 2 PM DRIs get aligned with the engineers first?
We need to have interruptible: graceful because it can be considered a breaking change if we suddenly change when after_script gets executed.
GitLab Runner
When the job is canceled, it will stop what it's doing and move on to the after_script, checking whether when: graceful is set. If when is set to graceful, execute it.
No artifacts will be uploaded, since the when for artifacts (on_success and on_failure) is not graceful. In the future we will introduce artifacts:when:graceful so folks can decide which artifacts to upload when the job is canceled/timed out.
The Runner will assume that if the job is canceled it can still send job traces. Rails will have its own internal state to differentiate between a graceful cancellation and a hard cancellation (what we have now).
The Runner will return a feature called graceful as part of features.
Rails
Will introduce a new keyword interruptible: graceful; when the job is canceled, Rails will report to the Runner that the job is canceled and that the when of after_script is set to graceful.
When the job is canceled, Rails will transition the internal state to teardown but still report to the Runner that it's canceled.
When the job is in teardown, the CI_JOB_TOKEN is still valid, so it can still accept trace uploads.
As a future iteration, Rails will introduce artifacts:when:graceful so the runner can upload artifacts for gracefully canceled jobs.
We need to have interruptible: graceful because it can be considered a breaking change if we suddenly change when after_script gets executed.
Can we hide that behind Runner Feature Flag and switch over to this behaviour by default in the future after deprecation period of old one? We already run after_script on failure, and I don't really see a reason why we should not run it on cancel as well. I'm not really sure if this needs to be a special behaviour that we support.
Or, can we just introduce finally_script: to be executed always?
When the job is canceled, Rails will transition the internal state to teardown but still report to the Runner that it's canceled.
The teardown seems like a strange state name. teardown can happen for every state, and not particularly canceled. It does not represent that this needs to be canceled in the end. It would be better to be explicit and have canceling.
When the job is canceled, Rails will transition the internal state to teardown but still report to the Runner that it's canceled.
OK. It seems that we can continue sending 202 and 200 to client on PATCH /trace and PUT with the canceled state.
When the job is in teardown the CI_JOB_TOKEN is still valid so it can still accept trace uploads.
I guess canceling. And yes.
As future iteration rails will introduce artifacts:when:graceful so the runner can upload artifacts for gracefully canceled jobs.
Should it be able to upload artifacts? What would be the use case for this?
artifacts:when:graceful
This feels like a strange name to me, and a name clash with current usage, where we have always/on_success/on_failure. How does graceful fit into this? Is graceful considered a failure?
Summary
I believe we have two work streams here:
Make Runner to run after_script on cancel
Introduce in Rails a temporary state of canceling, that would return as X-Job-Status: canceled, and would continue accepting trace and updates during canceling. canceling on final UpdateJob would always transition into canceled. canceling would be considered a running? type of state, but explicitly checked to be running?. canceling in a composite status would behave exactly as running.
Can we hide that behind Runner Feature Flag and switch over to this behaviour by default in the future after deprecation period of old one? We already run after_script on failure, and I don't really see a reason why we should not run it on cancel as well. I'm not really sure if this needs to be a special behaviour that we support.
@ayufan, I did discuss this with @grzesiek. However, I think it's too much of an assumption for us to just change the behavior of the after_script even though it's something that is "expected".
Or, can we just introduce finally_script: to be executed always?
I'm fine with those, it's a bit more expressive for sure and doesn't collide with interruptible: true
The teardown seems like a strange state name. teardown can happen for every state, and not particularly canceled. It does not represent that this needs to be canceled in the end. It would be better to be explicit and have canceling.
The API between Runner and Rails will still return canceled; it's just an internal state inside of Rails. The name can really be anything; we didn't decide on the name because the person implementing it will know what name fits best.
Should it be able to upload artifacts? What would be the use case for this?
This feels a strange name to be, and a name clash to current usage where we have always/on_success/on_failure. How the graceful fits into this? Is a graceful considered a failure?
That's the point: graceful doesn't seem to fit anywhere, right? That's why we suggested giving it its own artifacts setting, if needed, in the future. At the moment always only triggers when the job fails or succeeds, but doesn't upload for cancelled jobs. If we start trying to upload these artifacts, is it going to cause problems/unexpected behavior for the user? What about timeouts?
The name can really be anything; we didn't decide on the name because the person implementing it will know what name fits best.
The teardown does not represent that something is in canceling state. teardown is a general term that can be executed after any state: teardown after success/failure/cancel.
However, I think it's too much of an assumption for us to just change the behavior of the after_script even though it's something that is "expected".
This falls under convention over configuration. Do we really, really, really need this to be configurable? Do we really want to add another option to configure that aspect?
Maybe, instead, we should run after_script always, but provide something that I think is missing today: an indication of the prior state of image/services/script when we finished execution, and make it possible for after_script to adapt based on what the user wants.
We run after_script always. We expose in after_script an indication of the execution stage at which execution stopped, and we give the status of that execution. Someone could then use after_script to send a notification and provide more details.
Then, we could just have after_script and not introduce finally_script, but:
Make after_script simple to understand: it is executed always, regardless of what you do
Expose details on the conditions under which the after_script is executed
Let the user decide what they want to do with after_script
We run after_script always. We expose in after_script an indication of the execution stage at which execution stopped, and we give the status of that execution. Someone could then use after_script to send a notification and provide more details.
I like this idea. Knowing the state of the job during any 'final' script (no matter how it's done) sounds like a must have.
However, I think it's too much of an assumption for us to just change the behavior of the after_script even though it's something that is "expected".
I've found myself using after_script on single jobs just to get better timing information / grouping something logically:
```yaml
before_script:
  - install dependencies
script:
  - build a thing
after_script:
  - publish a thing
```
where I want to also know how long it took, not as part of 'script'.
I might be using it wrong, but the behaviour is reinforced by how it currently works. I agree that breaking this might be a problem.
Or, can we just introduce finally_script: to be executed always?
I think something like this approach is clearer than interruptible: graceful.
I'm still catching up with this issue, so ignore if already discussed, but have we also considered:
```yaml
after_script:
  when: always
  script:
    - echo "execution status was $CI_EXECUTION_STATUS"
```
I agree as well. This is why we can use feature flags.
We could enable the feature flag by default for everyone with Runner %14.0. During %13.x, we would print a warning message that the after_script: behaviour will change and advise enabling the feature flag beforehand. Enabling the feature flag would make the warning go away.
We don't need when: graceful for my proposal above. after_script is executed on failure, on cancel or on success, and it is not configurable. You can decide if you want to do anything with the env variable.
@ajwalker If we end up going the feature-flag route we wouldn't care about the when in the after_script scenario, since we will always run it even if the job is canceled. This might end up being confusing when we implement gitlab-runner#25439, since after_script will have always, and at the moment always doesn't cover canceled jobs as well.
I'm all OK with having this behind a feature flag. @ayufan I'm guessing we are going to have 2 feature flags: 1. inside of Rails, 2. inside of the Runner. When the user turns on the feature flag in GitLab, it will set the environment variable (to turn on the Runner feature flag) automatically.
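Assuming it follows the existing Runner feature-flag mechanism (flags toggled via FF_* environment variables), opting in per job or per project could look roughly like this; the flag name below is hypothetical:

```yaml
variables:
  # hypothetical flag name; Runner feature flags are FF_* environment variables
  FF_RUN_AFTER_SCRIPT_ON_CANCEL: "true"

job:
  script:
    - ./do-work.sh
  after_script:
    - ./cleanup.sh
```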
@grzesiek Given that @ayufan is on PTO until August 14th, could you review his proposal and the discussion in this thread and let the Runner folks know if anything should change from your original sync call on July 24th? This would affect scope and their ability to deliver this issue in %13.3. Is there something raised that can be addressed in follow-up issues, for example?
I like the idea of having a simple after_script behavior. It is simple and elegant, but might not be backwards compatible. Using feature flags can help with the rollout, but because we never know how users are updating their GitLab installations in a cross-version manner, we might break compatibility this way. Perhaps it is okay.
One thing that is not clear to me is how do we surface the status of after_script execution? It is easy with before_script + script - we combine them, and the result equals build status. But how do we surface the status of after_script? I really like the idea of CI_EXECUTION_STATUS - this way we can make use of a build status within the after_script, but how do we surface errors in after_script to users?
Perhaps the solution here is to include after_script status into build status, but what happens if before_script + script is "failed" and after_script is "success"? How to differentiate this case from before_script + script = "success", but after_script = "failure"? Is that important?
I really like the idea of simplifying this, I'm not entirely sure if the new proposal is completely clear to everyone, as it is not entirely clear to me, but I might be missing something here, and I'm sorry in advance if I do! @ajwalker@ayufan @steveazz
I think that we might need to prepare a table of statuses transitions + interactions (like canceling a build) for new and old proposal to untangle this behavior a bit.
I really like the idea of CI_EXECUTION_STATUS - this way we can make use of a build status within the after_script, but how do we surface errors in after_script to users?
The existing behaviour of after_script is to ignore the returned error. I'm not sure whether that's the correct thing to do or not, but does anything we add here justify changing that?
I found an issue (#16579 (closed)) wanting to add CI_BUILD_STATUS for a similar reason to what we have here. I'm adding it here so I can keep track of it, as we may well solve it in the same MR.
One concern that I have is that the build timeout is currently for the whole execution (prepare, user script run, after_script, upload).
For this change, the current proposal is that if the build timeout is reached, we still execute after_script, and provide it a hard-coded 5-minute timeout.
In this scenario, with the build timeout being exceeded, after_script will work, but no operation after will, including artifact and metric uploads. A cancelled script from the UI would allow these operations to still function, but a build timeout would not (there's no time left for them to be executed). Is that okay?
It's a larger change, but it seems like it would be easier to reason about if we split the build timeout value into multiple values, some user configurable and some hard-coded.
I guess that what we are still missing here, might be:
How to ensure that we are not breaking compatibility in a way that is disruptive for users
How to ensure that after_script execution can still be surfaced through a build trace if we decide not to use interruptible: graceful and a running - teardown intermediate state before canceled.
How to ensure that we are not breaking compatibility in a way that is disruptive for users
@grzesiek Would a feature flag that we turn on in Rails be too strict? If we turn it on on GitLab.com are we going to end up breaking a lot of users? What if we do it as a feature flag on the Runner? The Runner's feature flags are a matter of setting a variable either through .gitlab-ci.yml/CI variables or through runner configuration. That way users can opt in on a job level if they want. Then in a major release (14.0?) we make this the default behavior.
How to ensure that after_script execution can still be surfaced through a build trace if we decide not to use interruptible: graceful and a running - teardown intermediate state before canceled.
Is it possible to have Rails always define that if a job is canceled AND the timeout hasn't been reached yet, it will still accept trace updates? So for a canceled job to be "not running" it also has to reach the specified timeout.
My only question regarding this would be for future use: when the runner starts respecting when, would it mean that always is also going to run for canceled jobs? It's not something we need to decide now, but it might be something that we need to keep in mind and take a decision on.
Is it possible to have Rails always define that if a job is canceled AND the timeout hasn't been reached yet, it will still accept trace updates? So for a canceled job to be "not running" it also has to reach the specified timeout.
How do we know that we can archive a build trace this way? I think that we either would need to refactor how we cancel builds or use interruptible: teardown (I somehow like teardown better than graceful, to be honest).
My gut feeling is that using interruptible: might actually be easier, because the alternative will also require significant changes on the Rails side, but I might be wrong here. It would be great to hear @ayufan's opinion here.
How do we know that we can archive a build trace this way?
Runner posts success or failure to indicate that a job is finished. We could post failure but the failure reason could be set to cancelled. But as you say, it looks like this would require a refactor on the GitLab side for how builds are cancelled.
Even for Runner I think this change would require quite a refactor as we also have logic around what statuses we send trace updates on and the behaviour doesn't seem very straightforward.
In this scenario, with the build timeout being exceeded, after_script will work, but no operation after will, including artifact and metric uploads. A cancelled script from the UI would allow these operations to still function, but a build timeout would not (there's no time left for them to be executed). Is that okay?
But as you say, it looks like this would require a refactor on the GitLab side for how builds are cancelled.
I think to retain traces/artifacts we still need to add canceling state to indicate the intent, and allow Runner to gracefully cancel.
The after_script does not contribute to status, so whether it is success or failure it does not make a difference.
Looking at this issue and the number of upvotes, I think people do expect that after_script is also executed on cancel. If so, do we need a separate way to configure it?
Even for Runner I think this change would require quite a refactor as we also have logic around what statuses we send trace updates on and the behaviour doesn't seem very straightforward.
I think that canceling for a running build is ambiguous. Likely we would need to improve cancel detection to have a different code for it?
I think that the main concern of mine here is that Rails side needs to know when a build is concluded. It means that it needs to know when it is entirely canceled and this becomes a final state, and when it is being canceled, namely - it is in a teardown state. I'm fine with running after_script always but we need to have a way to model this on a Rails side, because such behavior currently does not exist - a build is either running or canceled - canceled is a final state, no further trace appends or uploads are possible.
In order to make after_script useful we need to have a way to surface after_script output.
I think that how we implement this on the runner side will be a consequence of how we model this on the rails side, and this is where we should focus right now.
@DarrenEastman it looks like you applied the milestone here but the group is @thaoyeager's group. Did you intend to change the group to grouprunner? Or is Thao the one building this one and it's something you'd arranged?
Once we determine who is building this, my second question is if this is really doable in 13.4 if it is workflowblocked?
@jyavorska Oh - I just realized that I changed the milestone on the wrong issue. I meant to change the milestone on the Runner pre-req, gitlab-runner#4843 (closed).
On the Runner side, per the last synch meeting it seemed that we had a plan and an approach that could be delivered in 13.3, however there were subsequent discussions and so we did not complete the work in Runner 13.3. Right now we are targeting 13.4, but this assumes that both teams are in agreement with the implementation in the Runner code base.
If this issue is blocked by gitlab-runner#4843 (closed) which is now in %13.4 then for now I'll move this issue to the backlog with ~"candidate::13.5" label.
I'm still unclear on the order of implementation so correct me if the sequence I'm assuming has changed.
```mermaid
graph TD
  start[User presses cancel button] --> FF{Feature Flag turned on?}
  FF -->|yes| canceling[Send job status `canceling` to runner]
  canceling --> ui[Update UI]
  canceling --> runner[Runner understands `canceling` and start executing `after_script`]
  runner --> wait[wait for runner to update job status to `cancelled`]
  ui --> wait
  FF -->|no| cancelled[Send job status `cancelled` to runner]
  cancelled --> stop[runner stops what it's doing]
```
@thaoyeager we probably need some UX person to look at the canceling state.
runner will let rails know if it supports the canceling state. We still need to decide on what will happen if a runner is running the job and doesn't support the canceling state.
It was discussed that rails should store runner feature information to decide on which job status to send.
Check how old runners behave when we send canceling status.
When job is in canceling state
The CI_JOB_TOKEN is still valid so the runner can upload traces.
rails will wait for runner to send canceled job status.
To inform the runner of canceling, send a 200 OK with Job-Status: canceling
Runner implementation
```mermaid
graph TD
  job[running job] --> trace[update job trace]
  trace --> jobStatus{job status response from rails}
  jobStatus -->|cancelling| pause[stop user script and anything that comes before it]
  pause --> after_script
  after_script --> status[send job status update to be `canceled`]
  status --> finish
  jobStatus -->|cancelled| finish[stop everything and run cleanup]
```
runner will send a new feature that supports the canceling job status.
runner will send the canceled job status when it's finished
When the runner gets the canceled job status it will behave as it does now, aborting the build and stopping everything.
Sequencing
Ideally, all of these should be in separate merge requests.
Rails and runner work can be done concurrently; everything is going to be behind a feature flag. For the feature to be functioning we need both the rails and runner work to be finished, but it's not blocking in the sense that each team can start working on their own tasks independently.
Rails
Implement support for the canceling state and send it when the user clicks the cancel button. Allow the runner to append the trace for the canceling state.
Update UX/UI to show that the job is in the canceling state.
Runner
Add support for canceling state.
Run after_script when job status is in canceling state.
Introduce CI_JOB_STATUS to tell the user, within after_script, which state the job is in. This will allow users to opt out of running after_script for canceling jobs when the feature flag is turned on (see the sketch below). This is done in gitlab-runner!2342 (merged)
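A minimal sketch of how a job could use that variable, assuming the runner exposes CI_JOB_STATUS with the value canceled while after_script runs for a canceled job; the script paths are illustrative only:

```yaml
cleanup-aware-job:
  script:
    - ./run-tests.sh              # illustrative main work
  after_script:
    # CI_JOB_STATUS is set by the runner; `canceled` is only observed once the
    # runner starts executing after_script for canceled jobs.
    - |
      if [ "$CI_JOB_STATUS" = "canceled" ]; then
        echo "Job was canceled; skipping the usual cleanup"
        exit 0
      fi
      ./cleanup.sh                # illustrative cleanup step
```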
Out of scope for this issue
@jyavorska @thaoyeager @DarrenEastman at this stage, just supporting the scenario of canceling the job/pipeline is a lot of work and a lot of complexity. I think we should remove the timed-out scenario for now and maybe discuss it in a separate issue. The timeout is a different scenario because the timeout can be considered a security feature and it has its own edge cases.
Timeline
For the runner, I'm still confident that we can ship this in %13.4, I can't really speak for the rails team here.
@ajwalker@grzesiek@ayufan thank you so much for the meeting it was really helpful. Please let me know if I missed anything. I will update the issue description shortly if no one disagrees with this.
@thaoyeager@DarrenEastman it feels like there should be some MVC epic to tie all this together and ensure we're breaking it down appropriately into minimal iterations, each with their own issues. Maybe the two of you could sync up and make sure there's a clear plan.
@jyavorska if @thaoyeager thinks it will be helpful I am ok with joining a synch call. As it stands now, I am planning for the Runner team to deliver gitlab-runner#4843 (closed) in 13.4 as the updated proposal indicates that the Runner and Rails work can be performed concurrently.
I just stumbled upon this issue, since we migrated a functionality from the custom_executor to the docker_executor and noticed I have no way to trap these cancel/timeout cases for cleanup before the container gets removed
On a regular pipeline this happens:
Software license is applied to the container
The software builds some stuff
Software license is released from the container
When you cancel in the middle or a timeout event happens, the license gets stuck and thus we have to release it manually, which is really annoying.
On the custom_executor I was able to modify the cleanup stage, so I could forcefully release the license in most cases before removing the VM. I'd like to avoid rebuilding docker functionalities on the custom exec just to get that feature.
Hi Team, please can we confirm if we are on track for upcoming 13.8 release or could there be further work required? I have a customer who is looking for this functionality and I'm trying to provide an update on when it might land. Thanks in advance
@ricardoamarilla It's largely due to the efforts of @ayufan, who has kindly offered to help us implement this for our group even though he isn't on CI. I do see this issue is blocked by gitlab-runner#4843 (closed). Kamil, do you think this will make %13.9? We can defer it if you don't have bandwidth to tackle this. Thanks!
use case: Rarely, jobs being executed are interrupted, for example due to network errors. This left a dangling running docker container that needed manual clean-up, due to the after-script not being called. This issue would allow for a more graceful way to handle these errors.
@thaoyeager@cheryl.li This issue was scheduled for a previous milestone but missed it, leading GitLab bot to push it from one missed milestone into the next. Can you take a look at this one and if needed, remove the milestone to properly schedule for a later release?
I think this issue is very important and should definitely be tackled in due time, but the missed milestone situation is misleading to customers at this point.
@ayufan do you still have bandwidth to work on this?
@mbruemmer Thanks for the ping; I've removed the milestone for now to avoid the Bot moving it along as you say. We will add a correct milestone based on @ayufan's reply.
@cheryl.li Since I haven't been tracking this issue since it's being handled outside of our CI team's capacity, I'm also going to remove the misleading missed:[milestone] labels for the ones where I can tell (from MR progress/comments) that no work was being done on this issue.
@cheryl.li@jreporter I see that the other issue that blocks this one is due for 14.1 now (was originally 14.0). Once 4823 is delivered, would we be able to prioritize this issue?
@joshikripa thanks for checking in, I believe that is the only blocking issue although will need @cheryl.li and @thaoyeager to confirm technical scope & priority
Link to request: https://gitlab.my.salesforce.com/0014M00001hHU6P
Why interested: To allow after_script to run on cancelled pipelines
Current solution for this problem: None
Impact to the customer of not having this: -
PM to mention: @thaoyeager
I think we need to talk about that. Now, I'm swamped with Sharding and maybe this is a final moment to say that I will not be able to finish this or give it a low confidence score if I would commit.
Why interested: Run after_script for cancelled jobs (!2443 (merged)). The customer has problems where half of the work gets done, and when users try to run the job again it causes problems.
Current solution for this problem: None
Impact to customer: After job script fix is nice to have but not currently blocking them.
PM to mention:@jreporter I see the discussion above so this is more for awareness.
The after_script cancellation issue (#15603) came up again on our cadence call.
No cleanup is performed and they spend a bunch of time cleaning this up. They would like GitLab to change the way things are killed or add this FR. The after_script already has a bunch of these things which would help them. It's impacting all 250 users.
@steveazz @ayufan This issue depends on gitlab-runner#4843 (closed) which used to be worked on in gitlab-runner!2132 (closed) but there hasn't been any activity for over a year. What's the process in cases like this? Should a GitLab employee (eg. @ajwalker in this case) implement the changes they requested and push them to the branch so that the MR can be merged? Or should someone in the community (eg. me) pick up the commit and fix it up, and create a new MR to replace the dead one?
@steveazz You also had a "competing" solution at gitlab-runner!2013 (closed) which you closed in favour of the now-dead one; maybe you could re-open it?
I've just started hitting this issue myself as I have a job that essentially contains script: docker-compose up and after_script: docker-compose down -v and a simple cancel or timeout of this job makes the gitlab-runner server unable to ever run this job again until I log into the server and kill all the dead-in-the-water (ship joke :P) docker containers.
@1ace I'm not sure I understand why gitlab-runner!2132 (closed) would be needed for this feature; graceful termination of the Docker container and running after_script on a cancelled job are separate and not really related. Am I missing something?
Why interested: Customer has a post-job trigger that does not work w/ after_script if job is cancelled or times out. They'd like this functionality to allow trigger from pipeline job in YAML for cancel / timed out pipelines
When using Unity, each Runner must be licensed. Currently, the customer does this using "pet" servers that are licensed on creation. However, as the customer moves to Kubernetes runners, they'll need the ability to license and revoke licenses for each Runner they use. Currently, Unity requires that the machine (docker container, VM, or bare metal) return the license itself. The customer is not able to return a license using another machine, so this feature would be useful in returning licenses on cancelled/failed builds. Otherwise, those licenses are "lost" and the customer has to open a Support ticket with Unity to get them back.
For Kubernetes Runners, the docker volume isn't easy to retain in the event of a failure, so there isn't a good workaround for cancelling builds with Kubernetes Runners. This feature would give the customer more control of their licenses until they can support Unity's Floating Licenses.
We faced the same problem when running jobs that start a docker-compose.yml to run some integration tests: if the job is canceled, the services started from the docker-compose.yml remain running!
@arihantar While I do not work for GitLab, realistically speaking most workarounds involve additional periodic background tasks of some form (e.g. cron, systemd-timers, daemons of some kind) to clean up the mess one makes when a custom job with side effects (like starting docker containers that aren't started as image or services) times out.
For example we (@timetac) have a script that checks for the runtime of a container with our custom prefixes and then will forcefully stop it if it's over the default GitLab job timeout.
However, that kind of scenario is undesirable (because you need to maintain several components in several places) and typically only works if you fully control your own infrastructure (e.g. your own runners and also the bare metal under them).
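A minimal sketch of that kind of out-of-band cleanup, assuming a Linux host with GNU date and cron (or a systemd timer) available; the container-name prefix and the one-hour limit are assumptions, not details from the comment above:

```bash
#!/usr/bin/env bash
# Sketch: forcefully remove containers matching an assumed prefix once they
# have been running longer than the assumed job timeout.
set -euo pipefail

PREFIX="ci-sidecar-"        # assumed naming convention for CI-spawned containers
MAX_AGE_SECONDS=3600        # assumed default GitLab job timeout (1h)

now=$(date +%s)
for id in $(docker ps --quiet --filter "name=${PREFIX}"); do
  started=$(docker inspect --format '{{.State.StartedAt}}' "$id")
  age=$(( now - $(date -d "$started" +%s) ))
  if [ "$age" -gt "$MAX_AGE_SECONDS" ]; then
    echo "Removing stale container ${id} (running for ${age}s)"
    docker rm --force "$id"
  fi
done
```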
To keep everything in the same place, my workaround has been to add a scheduled pipeline to run every hour that checks for anything that needs to be cleaned up.
E.g., the after script normally does something like
az group delete test-$CI_PIPELINE_ID
but the job on the scheduled pipeline does
```
for group in test-*:
    if $group older than timeout:
        az group delete $group
```
Do you mean we have to monitor k8s (or Azure) and terminate the process externally, without depending on auto-killing by the timeout value? (From the GitLab side, do I need to remove the timeout value from the ci.yml?)
It depends what clean up you were normally doing in the after_script. In my case I was deleting a temporary Azure group that had been deployed to run some tests. If your case is stopping some kubernetes deployment, then that is what you'll have to do in the scheduled job.
The key difference is the scheduled job has to search for any "things" that weren't cleaned up because it doesn't have the context of the job/pipeline that created them. It also has to make sure that they're not still being used, hence the check that they're older than the timeout of a job that might still be using them.
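A sketch of what such a scheduled cleanup job could look like, assuming the normal job tags each resource group with a `created` Unix-timestamp tag when it creates it; the tag name, the test- prefix handling, and the one-hour limit are assumptions layered on top of the workaround described above:

```bash
#!/usr/bin/env bash
# Scheduled-pipeline cleanup: delete leftover test-* resource groups that are
# older than the job timeout and therefore cannot still be in use.
set -euo pipefail

MAX_AGE_SECONDS=3600    # assumed job timeout
now=$(date +%s)

az group list \
  --query "[?starts_with(name, 'test-')].{name:name, created:tags.created}" \
  --output tsv |
while read -r name created; do
  [ -z "${created}" ] && continue                  # skip groups without the tag
  if [ $(( now - created )) -gt "$MAX_AGE_SECONDS" ]; then
    echo "Deleting stale resource group ${name}"
    az group delete --name "${name}" --yes --no-wait
  fi
done
```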
Hi @mbobin - can you review the current proposal to see if it's still accurate or if anything is missing? Depending on your feedback here, we can then figure out how to break up this effort across a few milestones to ensure we have a way to easily manage this effort.
Hi @bprescott_ - thanks for the ping here. I've reached out on the blocking issue to grouprunner to determine the priority there to help unblock us here.
Another premium, self-managed customer here as well. Sometimes, at our company, we run migrations that we want to cancel on our non-production pipelines. It'd be nice if we could have an on_cancel pipeline to find the process IDs of the queries that are running to terminate them.
I work on the DevSecOps team for a simulation program. Anytime our simulation's source code and/or build parameters are modified, we execute a set of regression runs on our simulation as part of the merge request pipeline. We can have up to many hundreds/thousands of regression runs that need to be executed. To execute them, we have set up gitlab-runner on the head node of our cluster and associated a job with it. The DevSecOps team only has a small number of the cluster's processors reserved for performing these runs.
The issue we are running into is that whenever someone pushes up a change during a merge request, it kicks off the merge request pipeline again. However, the cluster job submits the regression runs using our cluster's workload management program, and as soon as the request is made, the command is finished. When the cluster job is cancelled due to a more recent merge request pipeline getting kicked off (due to a push), we are unable to tell the cluster's workload management program to cancel the regression runs associated with the cluster job. This could be resolved if the feature talked about in this thread was implemented.
We are devising a work-around at the moment, but ideally this feature gets implemented.
If money motivates you: We have an Ultimate subscription with 20,000 users.
We're a paying EE customer (180 users) and would like to see this added as well. It's pretty essential for any kind of situation where a GitLab CI job needs to interface with external systems or hardware. You need to be able to properly shut down outside jobs, release locks you claimed, etc.
Hello! 700 seat Premium customer requesting urgent attention to this fix (they have been vocal about it for over a year). They are currently doing resource cleanup manually in reaction to pipeline instability. This results in a poor developer experience (pipeline performance) in addition to the added manual work.
I see that it is slated for FY'23 but also still in backlog - is there any more detail wrt prioritization and timing?
Hi @jmiklos - this issue is currently blocked and we're engaging with the Runner team to figure out how to get it unblocked. At that point we will be able to schedule by applying a milestone.
Thanks @jheimbuck_gl , understood there is a blocker - are you tracking that on another issue? Please link, if so.
Communicating any/all updates as you receive them would be greatly appreciated. The customer is extremely concerned with this bug due to its impact on their business and the large number of missed release targets for this issue already. Thanks in advance.
FYI, lack of progress could impact the customers upcoming renewal.
Running into this when users cancel jobs that start a vagrant VM and then our VM is left in an unknown state. Users have to manually engage with the gitlab-runner in order to clean up state.
Also surprised this has been open for 5 years with no acceptable workaround offered either.
Cluster jobs can be named. As such, each batch job we send to the cluster through our pipeline job is named ${CI_PIPELINE_ID}-${CI_JOB_ID}-${UNIQUE_BATCH_JOB_ID}.
I created a personal project whose sole purpose is to run an infinitely running pipeline which:
1. Queries the cluster and asks it for the names of all cluster jobs currently active or still in queue.
   1.1. Note: We also ask for the cluster job id associated with each cluster job.
2. Since each cluster job has its associated pipeline ID embedded into its name, we construct the list of all pipeline IDs that are associated with cluster jobs.
   2.1. Note: We carved out a dedicated partition of the cluster to perform these jobs, so it's not possible for jobs to have names not in the aforementioned format.
3. Query the GitLab server, via the restful pipeline API, for the status of all pipelines associated with the pipeline IDs in our list.
4. Construct a list of all pipelines which are cancelled (or in a condition where we would want to stop the cluster jobs).
5. Construct a list of all cluster jobs which need to be cancelled. The cancel command takes cluster job ids, not cluster job names. Thus, using (4) and (1.1), we form the list of cluster job ids to be cancelled and then send the cancel command with this list to the cluster.
6. Sleep 30 seconds.
Why a personal project? Probably didn't need to be a personal project, but I don't remember the motivation for it.
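A condensed sketch of that watchdog loop, under assumptions not stated in the comment above: a Slurm-like scheduler (squeue/scancel), jq available, and GITLAB_URL, PROJECT_ID, and API_TOKEN provided to the job; cluster job names follow the <pipeline id>-<job id>-<batch id> convention:

```bash
#!/usr/bin/env bash
# Watchdog: cancel cluster jobs whose originating GitLab pipeline was canceled.
set -euo pipefail

while true; do
  to_cancel=()
  # (1) Active/queued cluster jobs as "<cluster job id> <job name>".
  while read -r cluster_id name; do
    pipeline_id=${name%%-*}   # (2) pipeline id is the first segment of the name
    # (3) Ask GitLab for that pipeline's status.
    status=$(curl -sf --header "PRIVATE-TOKEN: ${API_TOKEN}" \
      "${GITLAB_URL}/api/v4/projects/${PROJECT_ID}/pipelines/${pipeline_id}" \
      | jq -r .status || echo unknown)
    # (4)/(5) Collect cluster job ids whose pipeline is cancelled.
    [ "$status" = "canceled" ] && to_cancel+=("$cluster_id")
  done < <(squeue --noheader --format="%i %j")
  if [ "${#to_cancel[@]}" -gt 0 ]; then
    scancel "${to_cancel[@]}"
  fi
  sleep 30   # (6)
done
```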
@kevikev You should be able to query the cancelled job to figure out which runner it ran on. Funnily enough, there's another bug (#273378 (closed)) where we had to do a similar workaround. Our approach for that bug was to write another personal project with an infinitely running pipeline, whose job is to constantly query GitLab for the X most recent pipelines, check if they were cancelled, and if they were, forcibly cancel all their child pipelines. In your use case, you can query for the X most recently cancelled jobs, determine if any of them are associated with starting a vagrant VM, and if so, go to the machines where those jobs ran and kill the VMs (or perform some other sort of action). This is assuming that which runner a job ran on is part of the data that can be returned from an API call.
Whatever you end up deciding on, note that no one is assigned to this backlog item, and given the history of this backlog item, it will probably stay that way forever.
After reading 5 years worth of comments, I think Gitlab should prioritize this issue a little more.
It's clearly a make or break situation for a lot of people
(But for real, this is just how these long-running asks seem to be for GitLab. The LFS/archive one in #15079 (closed) was the same way for a bit and also sat open for five years before it saw much action...)
A Large, Self-Managed Customer (SF Link - internal only) with ~2000 Ultimate seats has indicated their interest in this issue in this ZD ticket (internal only) to see a release in the current milestone of 15.6.
The real issue is that this basic feature being missing undermines GitLab's messaging that their CI platform is ready to compete directly with Jenkins, Circle CI, Github Actions, and so on.
There is a blocking dependency which needs to be completed before this feature can be implemented. I recommend advocating for the dependency in order to see this to fruition.
@jheimbuck_gl - One of my Ultimate customers is looking for a little clarity around this issue.
They are looking for a runner process which ends jobs gracefully (handling signals HUP, QUIT, KILL, etc.) rather than an abrupt "kill -9" which leaves debris, containers, and pods laying around. Would this issue address that need or should I open another issue for that?
@bmiller1 this issue could address that if they are using that after_script to do the cleanup but it'd have to be a bit of general purpose scripting I think. So while this could get to the outcome they want there may be a simpler way to get there so let's open an issue for their ask. cc @DarrenEastman since that would potentially be a grouprunner issue
@jheimbuck_gl, @DarrenEastman, we have a 985 seat Premium customer who is watching this issue, would like to know what the progress is for completing this issue. Not having this resolved is causing this customer, and I'm sure many others given the comments and upvotes, increased inefficiency, rework, cleanup, etc, that could all be eliminated with the completion of this issue. Can y'all provide an update as to where this issue stands, what work has been completed and whats left to be completed, plus some idea as to a target milestone for delivery?
@laurie_howitt thanks for the ping. The team is currently reviewing the initial POC and what is left to do in the 15.8 milestone, after which we will create any further issues for the rails code base and/or runner code base that need to be completed to resolve this.
James Heimbuck changed title from "Ensure after_script is called for cancelled and timed out pipelines" to "Ensure after_script is called for timed out pipelines"
James Heimbuck changed the description
We recently split up this issue for the timeout case (this issue) and the cancelation case which we are researching now! The team completed a POC and we should be scheduling that issue for an upcoming milestone soon to simplify the process for running after_script when a job is canceled.
thanks for the ping @yofeldstein. We recently split this into two issues, one for the timeout case (this issue) and one for the cancelation case which we are working on now.
As a former Linux/Unix developer, I appreciate the complexity of handling zombie processes and their zombie progeny. However, the value to customers with complex pipelines is hard to overstate. Manual clean-up of timed-out pipelines can take many hours and is fraught with error - not to mention users may lack the correct permissions in some cases.
I work with performance testing, and much of the work is to find the optimal configuration for a given stack, as well as ensuring services survive overload. But GitLab does not handle this well, because the container becomes unresponsive for some time and the screen buffers overflow, so we do not know a thing.
Why interested: The use case is unlocking terraform states and possibly stopping Code Deploy deployments that are taking too long in AWS, although in general, clean up could be pretty useful across a variety of different scenarios.
@avonbertoldi as far as I can see they are unrelated. gitlab-runner!4335 (merged) is about having a timeout for after_script, and this one is about calling after_script in case of a script timeout.
Why interested: Customer is evaluating Ultimate and has followed this issue since inception.
Impact to the customer of not having this: Could impact Ultimate sale or use of GitLab going forward. Hurts confidence that GitLab listens to customer feedback when they will inevitably be contributing to more feature requests in the future.
Questions: Is this feasible as a GitLab feature? Timeline?
It's not moving forward no matter how many Platinum customers have requested it, but I still love the team spirit on this issue: we are even celebrating anniversaries.
RUNNER_SCRIPT_TIMEOUT makes sure that your script doesn't run until job timeout and gives time for after_script and artifact uploading etc.
If your job timeout is 1hr, set the variable RUNNER_SCRIPT_TIMEOUT: 50m and it will ensure that after_script can run before the job timeout is exceeded. We simply cannot run after_script if the job timeout is exceeded: there's no time left to run anything.
We could potentially make RUNNER_SCRIPT_TIMEOUT always be 5 minutes relative to the end of job timeout, always ensuring after_script is given time to execute. I'd like to hear feedback on this proposal. My worry is that it will affect existing long running jobs where after_script and artifact uploading isn't required, as we'll suddenly be cutting them 5 minutes earlier.
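For reference, a minimal sketch of the configuration being described; the job name and script contents are illustrative:

```yaml
build:
  timeout: 1h                      # overall job timeout
  variables:
    RUNNER_SCRIPT_TIMEOUT: 50m     # cap the main script so ~10m remain
  script:
    - ./long-running-build.sh      # illustrative
  after_script:
    - ./cleanup.sh                 # has time to run before the job timeout
```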
What is the default job timeout? I've never set one.
Would it not be convenient to set job timeout to RUNNER_SCRIPT_TIMEOUT + RUNNER_AFTER_SCRIPT_TIMEOUT if either is set? And then maybe defaulting to 1 or 5 minutes for the RUNNER_AFTER_SCRIPT_TIMEOUT if that wasn't set.
And if someone configures all three, well, they knew what they did.
@ajwalker What you've proposed sounds like an excellent interim fix.
I would probably make it a variable that can be globally raised for environments that may take additional time. I can't think of a reason why an additional 5-10 allowed minutes on a job would be a killer for any environment.
What is the default job timeout? I've never set one.
The job timeout is set by the Runner administrators. Our Linux SaaS runners have a default timeout of 1hr I believe. I think this differs for MacOS and Windows, where jobs typically run longer and can be difficult to shorten.
The job timeout is important for Runner administration. It's a way of controlling cloud costs and provides a nice boundary for us to manage updates and maintenance around. If we stop a Runner from accepting jobs, we know that it will be completely free of workloads in 1 hour, for example.
Users can set a job's timeout to a value smaller than the Runner manager's config by using the timeout property of a job. The job timeout is a hard cutoff for any part of the job execution though, so if it's set to 5 minutes and a job's docker image takes 6 minutes to download, the job will be terminated before it has even begun. This is why we recently added control over script and after_script timeouts.
I can't think of a reason why an additional 5-10 allowed minutes on a job would be a killer for any environment.
I'll create an MR for this and share it with my team. I can't think of many downsides to this default either, and should a timeout occur, I can update the error message to link to documentation on how to better control the job, script, and after_script timeouts
@ajwalker Did you get around to creating the MR with the updated defaults you've mentioned? I couldn't find it linked in the related items section, so I thought I'd ask.
I'll create an MR for this and share it with my team. I can't think of many downsides to this default either, and should a timeout occur, I can update the error message to link to documentation on how to better control the job, script, and after_script timeouts
Why we are interested: We conduct (among other things) nightly tests in GitLab CI, in which devices are connected in the device farm (similar to &10158 (closed) but not in MRs, but on the main branch). For this, a test is set up in the device farm, which then waits for a device to become available. If no device becomes available, there is a timeout in the GitLab CI job, which cancels it. Again, at the moment there is no way to clean up the test and the device in the device farm, which the aforementioned Issue 15603 would improve.
Current solution for this problem: Manual cleanup which is very time consuming
Actually @ssichynskyi, gitlab-runner!4491 (merged) should be enough to close this issue. Since you're interested in this issue could you please check if the latest runner fixes this issue for you? If it does, you can have the honour of closing out this long-running issue
@avonbertoldi would love to. As I understand, this is not a full fix, but rather a workaround that helps (using some additional config) to mitigate the problem. Therefore, could you help me with:
the docs on how this actually works (what else is required)
when and how to test this, as this is not obvious from the contribution page
@ssichynskyi Actually, gitlab-runner!4491 (merged) will automatically (with no additional config required) guarantee 5m of the total script runtime to after_script. So no special config is required. This change is in %16.8, so you can use your preferred 16.8 os/arch runner binary (here) or docker image (here). An easy way to test is to create a job with:
a specific (short) timeout
a simple after_script (like echo "hello after_script")
a script that will blow past the configured timeout (can be as simple as sleep)
Run that job, and you should expect to see the hello after_script in the job trace.
Note: I'm not sure what version of runner is deployed on GitLab.com, so you're probably better off running your own runner.
Let me know if you need more help getting that going.
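A sketch of such a test job, assuming a 16.8+ runner; note that (as the follow-up below shows) the job timeout has to be longer than the 5 minutes reserved for after_script, so a 10-minute timeout is used here:

```yaml
timeout-test:
  timeout: 10m                     # longer than the 5m reserved for after_script
  script:
    - sleep 3600                   # blows past the timeout on purpose
  after_script:
    - echo "hello after_script"    # expected to appear in the job trace
```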
```
...
Executing "step_script" stage of the job script
Using docker image sha256:5badd1edd27f6e432b6c58d4926c6528591d197a187f09d83f6137f7478c3dd3 for docker:git with digest docker@sha256:69d81311a8aa6233893bb5a9a20016115e6e5c1986fa70330ecfa066aeee8339 ...
$ sleep 90
Terminated
ERROR: Job failed: execution took longer than 1m0s seconds
```
@ssichynskyi my bad. The default time set aside for after_script is 5m, so if the job's timeout is less than that, the mechanism won't kick in. Here's the result with a 6m timeout and sleep 3600:
```
Running with gitlab-runner development version (HEAD) on pekara-linux-shell XXXXXX, system ID: XXXXXXX
Resolving secrets                                        00:00
Preparing the "shell" executor                           00:00
Using Shell (bash) executor...
Preparing environment                                    00:00
Running on conbini...
Getting source from Git repository                       00:01
Fetching changes with git depth set to 20...
Reinitialized existing Git repository in /home/avb/workspace/gitlab/runner/builds/RiRtW8MT/0/avonbertoldi/test-project/.git/
Checking out dae32610 as detached HEAD (ref is after_script_and_timeout)...
Skipping Git submodules setup
Executing "step_script" stage of the job script          00:59
$ sleep 360
Running after_script                                     00:00
Running after script...
$ echo "Status of Testing stage is '$CI_JOB_STATUS'"
Status of Testing stage is 'failed'
Cleaning up project directory and file based variables   00:00
ERROR: Job failed: script timeout context: context deadline exceeded
```
Alternatively you can set e.g. RUNNER_SCRIPT_TIMEOUT=30s in the job's variables to change the default:
```
Running with gitlab-runner development version (HEAD) on pekara-linux-shell XXXXX, system ID: XXXXX
Resolving secrets                                        00:00
Preparing the "shell" executor                           00:00
Using Shell (bash) executor...
Preparing environment                                    00:00
Running on conbini...
Getting source from Git repository                       00:01
Fetching changes with git depth set to 20...
Reinitialized existing Git repository in /home/avb/workspace/gitlab/runner/builds/RiRtW8MT/0/avonbertoldi/test-project/.git/
Checking out 68ca1ddb as detached HEAD (ref is after_script_and_timeout)...
Skipping Git submodules setup
Executing "step_script" stage of the job script          00:30
$ sleep 60
Running after_script                                     00:00
Running after script...
$ echo "Status of Testing stage is '$CI_JOB_STATUS'"
Status of Testing stage is 'failed'
Cleaning up project directory and file based variables   00:00
ERROR: Job failed: script timeout context: context deadline exceeded
```
@avonbertoldi, thanks! I did a couple of tests and I find that the current implementation is a bit confusing and introduces a breaking change because this 5-min time delta is subtracted from the overall job timeout. The closer this timeout is to 5 min, the more the impact.
Example: If I have a job with timeout of 6 minutes and an after_script, now the job will run for only 1 min before termination because of timeout and launching the after_script.
Another problem is configurability of this timeout. If my after_script takes longer than 5 min, this change is useless.
So, to summarize, I believe that:
after_script timeout shall be configurable
after_script timeout shall be added to job timeout, not subtracted
Note that this 5 minutes is reserved even for jobs without an after_script defined. We now have jobs with 10 minute timeouts set failing in 5 because of this logic.
I also think that after_script timeout should be simply a configuration which adds to overall timeout. When this additional timeout is not configured, then we can have current behavior.
Since the conservative approach in this implementation by GitLab exists to simplify their Ops needs, perhaps an instance-wide upper limit makes sense. That way GitLab.com employees can still ensure that when they drain their Kubernetes pools the jobs end in a reasonable time interval.
@ssichynskyi@mitar the after_script timeout is configurable by setting RUNNER_AFTER_SCRIPT_TIMEOUT on the job. Similarly the script timeout is also configurable by setting RUNNER_SCRIPT_TIMEOUT on the job. IIRC, subtracting RUNNER_AFTER_SCRIPT_TIMEOUT from the overall timeout rather than adding it was a technical limitation. 5m is a default that works in most cases since most CI jobs are well longer than 5 minutes. Note that even if there is no after_script defined on a job, there will still be artifact and cache uploading and sometimes other cleanup, all of which must run within the RUNNER_AFTER_SCRIPT_TIMEOUT envelope. As a general rule job timeouts should not be set to be very close to the actual job runtime since this will leave no time for cache and artifact uploading, etc, and is more likely to result in sporadic job failures due to timeouts when a job randomly takes longer than usual.
@ghostlyrics Neither this settings nor the implementation were added for GitLab employee's convenience. We don't use kubernetes in our shard runners, so there's no "draining of kubernetes pools".
With timeout:, RUNNER_AFTER_SCRIPT_TIMEOUT and RUNNER_SCRIPT_TIMEOUT you've got all the knobs you could possibly need to configure the time parameters of your jobs.
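A sketch combining those three knobs on one job; the values and script names are illustrative only:

```yaml
integration-tests:
  timeout: 45m                         # hard ceiling for the whole job
  variables:
    RUNNER_SCRIPT_TIMEOUT: 35m         # budget for the main script
    RUNNER_AFTER_SCRIPT_TIMEOUT: 5m    # budget for cleanup, before uploads
  script:
    - ./run-integration-tests.sh       # illustrative
  after_script:
    - ./teardown-environment.sh        # illustrative
```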
We added RUNNER_SCRIPT_TIMEOUT and RUNNER_AFTER_SCRIPT_TIMEOUT in gitlab-runner!4335 (merged) which is not being reverted.
Customers will still be able to ensure that after_script is called, but it won't be the default behaviour. We're reverting the change that effectively made it the default.
With the functionality we now have, I still think we can close this issue. We're providing variables that allow affected customers to work around the problem. Any fix/introduction of a default behaviour is unfortunately going to be a breaking change.
Along with the revert of a default timeout, we're also adding a log message that points customers to documentation on how to better control script timeouts when they occur.
gitlab-foss#34861 (closed) was closed as a duplicate of this issue, as carlosaya noted below. I'd really appreciate not waiting another 8 years for the sake of a typo when someone migrated the issue.
We have an epic for handling this for cancelations: &10158 (closed). I think the most recent issue tracking the current unit of work is #437789 (closed).
I appreciate this is hard to track. But this issue closing won't affect this other, but very similar, problem.
Trying to follow the thread here. The original issue I was interested in was gitlab-foss#34861 (closed) (Run a job when build is cancelled). That issue was then marked as a duplicate of this issue (perhaps incorrectly?). It seems like the raised MR to close this out is only related to running a script after timeout.
I am looking for a way to run a job when a pipeline is cancelled. We are using Custom Executor which spins up VM's on demand for jobs. When users cancel their pipelines the VM's become orphaned and I need a way to tear them down.
Have I missed something in gitlab-runner!4335 (merged) that would address this need? Or is there an existing solution or workaround I should be aware of?
I don't think a project's CI scripts should be concerned with tearing down the VM the runner sets up for it (unless the setup is in the CI project config as well, not just "use image/snapshot X"); Docker-using jobs don't need to docker stop the container its image ends up inside of. I think your Custom Executor needs to tear it down when a job ends (for whatever reason). I'd look at how Docker or other VM/AWS executors do it.
Thanks for the response. Right you are! I forgot that we customized our cleanup task so that it only ran after all stages/jobs in a pipeline have completed which is what is causing our issues with cancelled jobs. I'll update our logic to cleanup when a job is cancelled as well. Easy
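For the Custom Executor case discussed above, the per-job teardown normally lives in the executor's cleanup stage rather than in the project's CI. A minimal config.toml sketch with illustrative script paths; the point is cleanup_exec, which is expected to run after every job, including canceled and timed-out ones:

```toml
[[runners]]
  name     = "vm-runner"                       # illustrative
  executor = "custom"
  [runners.custom]
    config_exec  = "/opt/ci/config.sh"
    prepare_exec = "/opt/ci/provision-vm.sh"   # spins up the VM for the job
    run_exec     = "/opt/ci/run-in-vm.sh"
    cleanup_exec = "/opt/ci/destroy-vm.sh"     # tears the VM down when the job ends
```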
8 years later we still don't have the ability to execute a job on pipeline cancelation. gitlab-foss#34861 (closed) was closed as a duplicate of this issue, as carlosaya noted above. Please just put us out of our misery so we can select another product, leaving these basic functionalities hanging for almost a decade is seriously hindering organisations.
Since you were the one closing the issue, could you please clarify whether this issue will be reopened since the changes were reverted, or will there be a new set of changes in the scope of the separate issue?
Well, @ajwalker, sorry, but I can't find anything saying To run after_script after manually canceled script execution do X, Y, and Z. How should setting the RUNNER_SCRIPT_TIMEOUT and RUNNER_AFTER_SCRIPT_TIMEOUT variables allow users to run their custom after_script commands after the script section was manually canceled?
@d2375310 Think of it as a knob or switch instead of a hack :wink: The mechanism is there to enable it; the policy to enable it is still a work in progress.
Sorry guys, but it seems I've been mistaken. I saw this
after_script keyword will run for cancelled jobs
from the 16.8 announcement page and somehow came to this issue, and thought that you had reverted this announced change about the canceled jobs, but I see now, that I was wrong, and found the correct issue - #437789 (closed).
The only question I have now is in which version of GitLab this feature ("after_script keyword will run for timed-out jobs") was added; can someone please tell me?
I think this depends on the use case, and arguably, if this is a valid analogy for yours, then perhaps it's not our fix that is the problem, but your script.
It's worth remembering the context here: Your script has exceeded the job timeout.
In most cases, your script shouldn't exceed the time allocated for it to run.
Where this is unintentional we're providing a mechanism for a best effort attempt to clean it up.
Where it is intentional (such as a long running fuzzing job), these control knobs are probably what you're looking for.
If you're hitting the job timeout, and adjusting the script timeout isn't a robust enough solution, I'd recommend having a separate job in the pipeline that can clean up any resources that were created etc.
Having said that, these control knobs were added in this way (as job variables) because this path had very little friction and we knew that it was useful for certain use cases. Us finding a solution doesn't need to end here if it's not suitable for everyone.
A common suggestion I keep hearing is that after_script should run even if it exceeds the job timeout. We didn't do this because it makes the existing job timeout meaningless: you're still running the job if after_script can run after the job timeout. Having this would be a breaking change. However, I suppose if setting RUNNER_SCRIPT_TIMEOUT still isn't working for you, and we think that a hard-coded after_script timeout, that always runs, is somehow the solution for that, we could discuss adding a flag to enable that behaviour. If you'd be interested in that, I'd recommend opening an issue with the proposal and use case it solves that RUNNER_SCRIPT_TIMEOUT does not.
@ajwalker @jreporter Why is GitLab documentation always so bad? How is after_script going to work in 17.0+? Can you please explain in detail here AND also fix your documentation everywhere? In the GitLab 17.0 deprecations announcement I see the following:
The after_script CI/CD keyword is used to run additional commands after the main script section of a job.... In 17.0, the keyword will be updated to also run commands after job cancellation. Make sure that your CI/CD configuration that uses the after_script keyword is able to handle running for cancelled jobs as well.
If a job times out or is cancelled, the after_script commands do not execute. An issue exists to add support for executing after_script commands for timed-out or cancelled jobs.
It is super confusing. Why do you produce a lot of work for everyone who has to read GitLab (breaking) changes updates?
@dzavalkin-aboutyou - I am sorry that it is unclear! We will get an MR up to make this more explicit about time out and not just cancellation
Given that the 17.0 release has not completed, we have not released any changes. At the end of 17.0 we expect after_script to run for cancelled statuses; it will not run on time out.
@rutshah@allison.browne - can we please make sure to prioritize this issue as follow up to 17.0 and make sure the docs are updated and clear? Thank you!
I'm sorry @jreporter, but do you think it is OK that we all now see a huge banner on gitlab.com about the version 17.0 upgrade, and the first breaking changes window ends today (i.e. some breaking changes are already applied, as far as I understand):
The first breaking change window begins 2024-04-22 09:00UTC and ends 2024-04-24 22:00UTC.
but the documentation about the upgrade is still not ready? Do you think that is an acceptable level of product quality?
@jreporter The reason why the documentation should be available early is so that we can prepare for the breaking changes and have solutions ready without service interruptions for users. While I don't agree with the tone, I do strongly agree with the frustration voiced by the above users.
I get the frustration, also had my share of eyerolls when looking through breaking changes. But the tone is unacceptable, that's not how you talk to people. Have some patience and respect.
So, @jreporter edited her comment to make me look like a fool? In her original reply she wrote that since release 17.0 is not out yet - why documentation should be ready...
To all people sending me negative comments/emojis - good luck preparing for breaking changes with such high quality documentation...
@jreporter you can just delete all my comments/the whole discussion tree, this discussion doesn't have any value for anyone, especially in the future. I came here not to argue but to just get the basics done by your team - documentation being up to date. I hope it will be done soon for this change and you will apply this approach in the future for all breaking changes...
The reason why the documentation should be available early is so that we can prepare for the breaking changes and have solutions ready without service interruptions for users.
We have some changes to infrastructure/the product that arrive on GitLab.com (within the scheduled windows).
We have many deprecation issues scheduled for 17.0 that change the source code and documentation.
As a developer working on a deprecation issue, we typically remove the necessary source code of a deprecated feature along with any documentation that references it at the same time. This keeps the change and documentation in sync, as occasionally what we plan for and ultimately have to do can differ slightly. We continuously release the docs until 17.0 is released. This means that if you view the docs today some sections will already have been updated, but if we haven't yet picked up that unit of work, it won't be. It entirely depends on our progress as we work towards 17.0.
As @jreporter said, the release is not yet available. We're making these changes as we approach 17.0.
Having said that, I agree that if the information in the deprecation notice is lacking and an example of an upcoming change is necessary, the docs are the obvious place for that, and the earlier the better.
@jreporter Do you know if this concern has been raised before? It's probably worth opening an issue for if not? We can't cement the documentation to reflect changes before the release, but we could probably add a note in relevant sections. The only problem is the note would have to make it clear that it affects "17.0" for example and not a current release, but viewing the 17.0 docs would still suggest this until we've made the change. Potentially less confusing than what we have now though?
@ajwalker Thanks for the explanation. I understand that docs are just as much part of the code as the application code is and that you guys do update docs when you work on issues. And parts of the docs of an unreleased version are not necessarily finished yet. The version drop down does say "not yet released" after all.
FYI, my "eyerolling" moments so far weren't about differences between the regular docs and deprecation issues or deprecation docs. It was about how there are several pages for changes in different places in the docs tree, depending on what part of GitLab they cover. And I missed one of them once. It was a feeling of "...there's another one?". Not directly relevant to the issue(s) here.
@Thomas.Naeveke@SaberStrat - thank you so much for the additional information and clarifications! This helped me understand the friction points you experience and propose what we can do to make this better in terms of documentation. I have taken @ajwalker's suggestion and opened an issue here gitlab-com/Product#13322 for our teams to action in future major releases.
@dzavalkin-aboutyou - I did not edit any comments. I also did not intend to offend or embarrass you. Thank you for explaining your use case and challenges with our major release and breaking changes. I hope we can iterate forward to make this less stressful next time!
I would love this issue to be fixed. We have a CI pipeline that sometimes reaches the timeout of the job; we would like to add a script in the after_script section to be able to extract some debug information and identify what is blocking the CI pipeline.
At the moment, we run a script in the background that waits until X seconds before the timeout and then runs some debugging, but it's not very helpful.
I configured the timeout of a job to 30 minutes, then configured RUNNER_SCRIPT_TIMEOUT to 25 minutes for the script section, and we also have an after_script section.
When the job is executed, it reaches the timeout, but the after_script doesn't get executed.