Users may want an after_script step in .gitlab-ci.yml that is guaranteed to always run. However, the current behavior of after_script is that if a build times out, the cleanup step is skipped (even though this does not appear to be the original intent: https://gitlab.com/gitlab-org/gitlab-ci-multi-runner/issues/102#note_11619214).
Users may be able to use the Job Events webhook to shut down or clean up external environments. It should be assumed that no data outside the webhook payload will be available (unless explicitly stored in global variables or via some other method), because the runner may have lost connectivity or may otherwise be unable to provide the data required for the desired action to be taken.
Further details
Permissions and Security
Documentation
Availability & Testing
What does success look like, and how can we measure that?
What is the type of buyer?
Is this a cross-stage feature?
Links / references
This page may contain information related to upcoming products, features and functionality.
It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes.
Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.
@markglenfletcher: When I tested when: on_failure, it didn't run that stage for me. I had two stages, a build and a build_cleanup with on_failure. The second stage didn't trigger when I cancelled.
@ayufan @zj I don't see output from after_script in my logs when I cancel. Do I need a newer GitLab CI runner? EDIT: I'm running the latest runner via Ubuntu apt-get.
```yaml
stages:
  - test

test:
  stage: test
  script:
    - sleep 120
  after_script:
    - echo cleaning now
```
When I run normally, I do see 'cleaning now' in the log. However, when I cancelled, I expected to see the 'cleaning now' text at the bottom of my logs.
Update: I don't think after_script: is working as intended.
If I understood correctly, the after_script concept is the analogue of before_script. It is global, it does not belong to a job, and you cannot apply tags to it as you can with jobs.
So, the indentation in your example is semantically misleading.
My experience was that the after_script commands were executed after every job (similarly to before_script). Therefore, it is not suitable for clean-up if you have a build and a test stage, because it will run after the build stage, and the test stage will then find nothing to test.
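For illustration, a minimal sketch of the behaviour described above (job names and commands are hypothetical):

```yaml
stages:
  - build
  - test

# Global after_script: runs after EVERY job, not once per pipeline,
# so this cleanup also runs between the build and test stages.
after_script:
  - rm -rf build/

build_job:
  stage: build
  script:
    - make build

test_job:
  stage: test
  script:
    - make test   # the workspace may already have been cleaned up
```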
@zj is it possible that after_script is running at the right time, but its output isn't being written to the logs on the UI? My jobs are indeed being cleaned up when I cancel, even though the output isn't logged anywhere.
@mespak before_script can also be per job. It overrides the global before_script if defined. Again, I'm not sure if that's intended behaviour or not.
+1
In our case, we use Kitchen CI with kitchen-docker, and for each stage in Kitchen CI we have a corresponding stage in gitlab-ci.yml.
When we cancel the pipeline, the kitchen destroy command is never called.
Then we have to manually destroy all the Docker instances created by Kitchen CI.
We cannot work around this with an after_script that destroys after each stage, because every subsequent stage would then have to rebuild everything: re-converge (kitchen create + converge), re-setup (create + converge + setup), re-verify (create + converge + setup + verify), and finally destroy. The duration of the pipeline would grow exponentially.
To chime in on the cancelling -- it could be for a multitude of reasons. We have jobs that are part of deployments that would want some cleanup when cancelling.
We would benefit from this cleanup. A use case: when a job is already erroring and it's going to take a long time to finish, we usually cancel it. The problem is that the cancellation does not trigger the after_script, and therefore cleanup does not take place.
The after_script doesn't appear to run on cancels from the UI.
Also, what's with the Zendesk URLs? I get "oops The page you were looking for doesn't exist" errors for both of them. If there's additional relevant info, can we keep it in this ticket, accessible by all?
I typically press cancel in the UI to free up resources from a build I know is going to fail, so that I can use them for the next build which includes a fix.
If I remember correctly, manual jobs can be executed even if a pipeline has been cancelled. A possible way to work around the problem is to make the cleanup job manual (and yes, you probably also need another cleanup job for when you don't cancel).
After cancelling a pipeline, you can run the cleanup job manually.
This is not a solution, but it may work until the feature is implemented.
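A hedged sketch of the workaround described above (job names and scripts are hypothetical): a manual cleanup job that can still be triggered by hand after cancelling the pipeline, plus an automatic one for the normal path.

```yaml
stages:
  - test
  - cleanup

test_job:
  stage: test
  script:
    - ./run_tests.sh

# Run this by hand after cancelling the pipeline.
cleanup_manual:
  stage: cleanup
  when: manual
  script:
    - ./cleanup.sh

# Covers success and failure, but not cancel/timeout.
cleanup_auto:
  stage: cleanup
  when: always
  script:
    - ./cleanup.sh
```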
Agreed that an additional job with when: on_cancel wouldn't solve the problem, because the additional job may land on a different runner. Oftentimes the desired cleanup involves resources on the host that ran the cancelled job. Something similar to after_script would definitely solve the problem for us.
My specific use case is Docker containers, networks, and volumes that get orphaned if a job is cancelled and then cause problems for future jobs that run on that host.
This feature would be very useful. See gitlab-ce#34861.
This would be great. I kick off a bunch of parent pipelines when a commit is made to a submodule, and it would be good to be able to cancel those if the submodule CI is cancelled. An on_cancel: option would enable this.
I think the real issue here is that after_script doesn't actually fire on a timeout or cancellation. This should really be a bugfix issue: make after_script always work as an after-script, instead of only running in certain cases. An environment variable could be set so that the after-script knows the status of the build (success, fail, cancel, timeout).
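As a sketch of that suggestion: GitLab later introduced a CI_JOB_STATUS variable along these lines, exposed to after_script in newer runners; treat the job below as a hypothetical illustration, not the author's proposal verbatim.

```yaml
test_job:
  script:
    - ./run_tests.sh
  after_script:
    # Branch the cleanup on how the job ended
    # (CI_JOB_STATUS is 'success', 'failed', or 'canceled').
    - if [ "$CI_JOB_STATUS" = "canceled" ]; then ./cleanup_after_cancel.sh; fi
    - ./cleanup_common.sh
```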
I'm having an issue with a job freezing up, and noticed that when a job is terminated by timing out (1 hour default), the after_script did not fire either. Over time, cancelled and timed-out jobs consume a lot of resources and have to be cleaned up manually.
Cancelled/timed-out jobs should call after_script, or a to-be-introduced finally_script. CC @ayufan @markpundsack @bikebilly. In fact, finally_script for cancelled/timed-out jobs was promised as part of the 8.7 CI plan: https://gitlab.com/gitlab-org/gitlab-ce/issues/14717. It never made its way into GitLab. It's really needed. (We're EEP customers.)
There is currently no way to run docker run inside a job and have those containers reliably removed on job cancel or timeout. Our job definition is something like this:
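(A hypothetical reconstruction of the kind of job being described; the image, names, and paths are illustrative, not the author's actual config:)

```yaml
test:
  script:
    # Launch the test container through the host's Docker socket.
    # --init puts an init process in the container to forward
    # signals to its children.
    - docker run --init --name "tests-$CI_JOB_ID" registry.example.com/base-image bash -c 'prove -v -I perl -j5 -r t/perl/unit'
  after_script:
    # Never reached on cancel/timeout, so the container leaks.
    - docker stop "tests-$CI_JOB_ID"
```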
Since after_script won't run on timeout/cancel, docker stop won't run.
I also passed --init, hoping that GitLab CI would SIGTERM the processes inside the container. --init forwards signals to its children, effectively allowing the test runner to react to ^C or SIGTERM by exiting, which the init process notices, causing the container to stop. Unfortunately, gitlab-runner SIGKILLs (9) its job containers, so all children are SIGKILLed too (remember: SIGKILL, unlike all other signals, cannot be caught; the kernel kills the process right away without letting it react). The docker run process, a Docker client process, gets SIGKILLed too. No signal is ever passed to the underlying bash -c 'prove -v -I perl -j5 -r t/perl/unit' in the container, because that container is NOT a child of the job container.
```
root@gitlab-ci-runner-3:~# docker ps
CONTAINER ID   IMAGE                                                       COMMAND                  CREATED         STATUS         PORTS   NAMES
3dd44c50639c   git.dreamhost.com:5001/dreamhost/ndn/ndn-base:build-8175   "bash -c 'mkdir $DBH…"   4 minutes ago   Up 4 minutes           ecstatic_mccarthy
38c30ccccf58   5d152d55326b                                                "docker-entrypoint.s…"   6 minutes ago   Up 6 minutes           runner-f2f196b4-project-8-concurrent-0-build-4
```

After cancelling the job:

```
root@gitlab-ci-runner-3:~# docker ps
CONTAINER ID   IMAGE                                                       COMMAND                  CREATED          STATUS          PORTS   NAMES
3dd44c50639c   git.dreamhost.com:5001/dreamhost/ndn/ndn-base:build-8175   "bash -c 'mkdir $DBH…"   13 minutes ago   Up 13 minutes           ecstatic_mccarthy
```
As we can see, my explanation matches the output. GitLab Runner SIGKILLs the job container and its children (the docker run client process), and that's it. No real signal (e.g. SIGTERM or SIGINT) is ever passed to the docker run process, so it cannot tell the init process of my container that it should stop working.
The lack of finally_script and a SIGTERM-first policy is a serious flaw, making it impossible to use Docker-in-Docker reliably.
tl;dr with action points
GitLab Runner doesn't run after_script on timeout/cancel, making it impossible to stop Docker-in-Docker (via docker.sock mount) containers manually using docker stop.
Action point: implement finally_script, or make after_script run on cancel/timeout.
GitLab Runner uses SIGKILL, which makes it impossible for Docker-in-Docker (via docker.sock mount) containers to terminate gracefully when asked to.
Action point: SIGTERM your job container first, SIGKILL it after 15 seconds or so if no reaction.
Action point: start GitLab job containers with --init so SIGTERM signals are passed to children. (I think this is already the case: docker inspect [containerid] of a container named like runner-f2f196b4-project-8-concurrent-0-build-4 says Init: true.)
Implementing (1) (finally_script) makes it possible to manually terminate DIND containers using the docker stop command. This is a full solution that always works IF users are aware they must always docker stop in finally_script. It's not common knowledge, but it is a technically complete solution.
Implementing (2) (the SIGTERM-first approach) makes it possible to terminate DIND containers that are executed with the --init switch and whose internal processes react to signals. In general, the community isn't well informed about --init, so not everyone uses it, but many do. Also, while test runners typically respond to signals and terminate gracefully, not everything does -- by design, by coding sloppiness, or due to a hung process.
All in all, whether only (1) is implemented or both (1) and (2), every place in the GitLab documentation that discusses DIND should provide guidance on using docker stop in finally_script, and on using --init for convenience.
@ayufan @markpundsack @bikebilly We're EEP customers and need at least (1) really badly. It wasn't until yesterday that I realized why our GitLab runners have been unreliable for months, and why I observed 100% CPU usage across 8 cores on two runners multiple times while other runners (we have 6 in total) were completely idle. Now we know the reason: jobs are scheduled to runners based on how many jobs a particular runner is running, not on the CPU or memory usage of the host. Some runners can be pegged at 100% CPU by unit tests run in DIND that were not terminated on cancel/timeout, but the runner thinks no jobs are running and looks like the best candidate to pick up the next job. Then people get frustrated that a job that normally takes 20 minutes is taking 90 minutes, cancel it, and rerun -- and the retried job gets scheduled to that "perfect" unoccupied runner, which is in fact terribly overloaded. Other jobs time out, but the DIND containers remain, and it's just a disaster.
I also want this feature. My use case: in one of our black-box testing jobs we spin up a set of Docker Compose services that need to be cleaned up when the job is done, whether it succeeded, failed, or was cancelled. It becomes a pain for my team to manually delete the leftovers on the CI machine.
Right now we work around this as follows.
Suppose I have a job called blackbox. Then:
I create a when: manual job called cancel_blackbox in the first stage of my pipeline:
```yaml
stages:
  - cancel_blackbox
  - blackbox
```
So, when I need to cancel, I run cancel_blackbox instead of clicking the default cancel button in the CI build log UI. It may fix the problem you have, @abuerer. A fuller sketch follows.
Actually, I love this product and I want to help implement this feature in the CE version if possible. Who can I talk to about starting on this feature?
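A hedged sketch of how that workaround might look in full (the compose project name and commands are hypothetical):

```yaml
stages:
  - cancel_blackbox
  - blackbox

# Run this manual job INSTEAD of pressing Cancel: it tears down
# whatever the blackbox job created on the host. Manual jobs are
# allowed to fail by default, so the pipeline proceeds past this
# stage when nobody triggers it.
cancel_blackbox:
  stage: cancel_blackbox
  when: manual
  script:
    - docker-compose -p blackbox down --volumes

blackbox:
  stage: blackbox
  script:
    - docker-compose -p blackbox up --abort-on-container-exit
```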
We give a deadline, likely 10s, before we send SIGKILL, during which you can do a graceful shutdown. Making traps work will allow you to make use of graceful shutdown. Maybe the SIGKILL timeout will become configurable with some limit, like up to 5 minutes, but it will default to 10s.
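With a SIGTERM-first grace period like that, a job script could trap the signal and clean up before the SIGKILL lands. A sketch (image and script names are hypothetical); the tests run in the background with wait so the shell handles the signal promptly:

```yaml
test:
  script:
    # Stop the helper container if the runner sends SIGTERM.
    - trap 'docker stop "helper-$CI_JOB_ID"' TERM
    - docker run -d --name "helper-$CI_JOB_ID" example/helper-image
    - ./run_tests.sh & wait $!
```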
I've gone through and updated the description of this issue with details from the comments here and elsewhere. Please take a look if you're interested in this issue and jump into the discussion if you have any feedback.
There is a second issue discussed here related to GitLab's use of SIGKILL in Docker in Docker scenarios. I feel this is a separate issue so I have not included it in the resolution for making after_script work in this issue. I'm open to hearing if they are in fact fundamentally tied together.
I need this a lot. I have some jobs that view server logs, but there's no interactive interface to stop the log; there's only the Cancel button. After pressing the Cancel button, the log process is still running on my server. More importantly, gitlab-runner gets stuck with the SSH tunnel and no more jobs can run; I usually have to run sudo gitlab-runner restart.
If there were a final_script, I could clean up the logging process. Presently, I have to use sleep in a background process to kill the log after a short period of time, roughly as sketched below.
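The workaround amounts to a background watchdog that kills the log tail after a fixed period. A sketch (the host, path, and timeout are hypothetical):

```yaml
view_logs:
  script:
    - ssh deploy@server tail -f /var/log/app.log &
    - TAIL_PID=$!
    # Watchdog: after_script never runs on cancel, so kill the
    # tail after 10 minutes no matter what happens.
    - ( sleep 600; kill "$TAIL_PID" 2>/dev/null ) &
    - wait "$TAIL_PID"
```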
That said, every person who uses DIND with the socket mounted as a volume will have to be aware of having to use docker stop in after/finally, or risk a "leak", and the GitLab documentation on DIND should reflect that. That is probably more work than simply implementing the SIGTERM, wait 30 seconds, SIGKILL approach. There's an MR for it, and while it took a weird approach, it's proof that the code change is extremely minimal. It's a waste of time not to go for this feature! :) gitlab-runner!987 (diffs)
@Nowaker I understand that that's an issue, but as best I can tell it's separate from the one this issue is intended to resolve (that after_script, or something like it, is not called on timeout or cancel, regardless of whether Docker in Docker is used). Or am I misunderstanding?
Using the Kubernetes executor, a pod with build, helper, and service containers continues to run if we cancel the job. That pod needs to be killed as well.
This is an important issue, but given limited capacity on the team we have a number of other gitlab-ce3857523 items to work through before we move on to gitlab-ce3857529a. Moving out to %12.0 for now.
This is really needed to deal with PRs and deploying tests to Kubernetes. Otherwise, a user can cancel things and tie up resources indefinitely. We're seeing folks adopt Jenkins over GitLab because Jenkins can always safely delete temporary resources on cancel.
I'm also looking for this (or something similar). My use case is that I'm provisioning infrastructure (say, VMs or Packet.net servers via Terraform) to run my CI on. That costs money, and I want to be absolutely sure the infrastructure is removed when the CI job completes (except in some extreme scenarios).
after_script seems like a very basic way to achieve this.
Azure Pipelines has a very powerful syntax where you can divide your build in different steps and specify conditions for each step. This allows you to create steps which must always run, steps which should only run on failure, etc.
It would be great to see something similar in GitLab.
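From memory, the Azure Pipelines syntax being referred to looks roughly like this (step names and scripts are hypothetical); condition: always() makes a step run even when earlier steps fail or the run is cancelled:

```yaml
steps:
  - script: ./provision.sh
    displayName: Provision test resources
  - script: ./run_tests.sh
    displayName: Run tests
  - script: ./teardown.sh
    displayName: Tear down resources
    condition: always()   # run even on failure or cancellation
```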
I believe our model for this will be started with gitlab-ce#47063
I think that's something different. It looks like gitlab-ce#47063 is about implementing something similar to job dependencies in Azure Pipelines: given a job, you can specify its dependencies and the conditions on those dependencies (e.g. 'run when job A fails' or 'run when job A succeeds and job B succeeds').
Steps and conditions, by contrast, help you split a single job into multiple steps and specify conditions per step.
E.g., a job starts by creating a VM, a container, and k8s objects, runs tests (and passes/fails based on the results of those tests), but always removes the VM/container/k8s objects once the job has completed.
We moved from Azure DevOps (then VSTS) to GitLab about 1.5 years ago because GitLab's CI/CD infrastructure was far ahead at the time. However, Azure DevOps now provides CI/CD features that are missing in GitLab.
I think the after_script fix and the final_script addition you are talking about are very different things, neither of which covers the cancel-the-job event. You are thinking only about your own restricted cases.
But I want to share another case with you:
When somebody cancels a test-execution job, the team needs a notification that the job was stopped.
The team has two notifications, start and stop, described in the job. So, here are the two cases:
A successful start-to-finish job:
before_script: notify that the job started
script: the job with the test execution
after_script: notify that the job finished (successfully or not, it does not matter)
A cancelled job which had already started:
before_script: notify that the job started
! CANCEL !
no notification that the job was stopped, whether it finished or was interrupted
So, I want to see on_cancel_script in my jobs. Thank you. I hope you will support this idea.
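For illustration, the requested keyword might look like this; on_cancel_script is hypothetical and does not exist in GitLab CI:

```yaml
test_job:
  before_script:
    - ./notify.sh "job started"
  script:
    - ./run_tests.sh
  after_script:
    - ./notify.sh "job finished"
  # Hypothetical keyword proposed above:
  on_cancel_script:
    - ./notify.sh "job was cancelled"
```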
@jlenny @markpundsack @ayufan Can you think about scheduling this one? This is currently the biggest flaw of GitLab CI, in my opinion, and needs fixing this year rather than in 2020 or 2021, if even then.
It seems like changing build behavior to always run after_script is a popular solution. This change is relatively simple to implement at the job level, but I think it raises a few questions:
If a Pipeline with 100 jobs is started and immediately cancelled, we potentially have 100 after_scripts still queued to run -- so is that Pipeline really cancelled? If that behavior is okay, we need to come up with a more descriptive command than "Cancel Pipeline", perhaps "Skip all Build scripts"? "Cancel Pipeline" might remain a true stop-the-presses command, but become a much more drastic action to take, i.e. "Crash Pipeline".
If we "Skip all Build scripts" (or whatever this cancel-but-run-after_script command is called), does that also continue to execute before_script for each Build? If we're talking about moving after_script into its own special dimension outside of a Build script, it might be counter-intuitive for its pre- counterpart to remain part of the Build.
Alternatively, we could leave before_script and after_script as part of the Build, to be abruptly cancelled at will, but provide a different method for setting up and tearing down resources at the beginning and end of execution of an entire Pipeline. This would look more like a distinct build or stage: before_pipeline / after_pipeline. While these sections would have to do much more dynamic checking of which resources exist and how to ensure everything is torn down, it might offer a much cleaner execution structure for the Pipeline as a whole.
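A sketch of that alternative; before_pipeline and after_pipeline are hypothetical keywords proposed here, not existing GitLab syntax:

```yaml
# Would run once, before any job starts.
before_pipeline:
  script:
    - ./provision_shared_resources.sh

# Would run once, after all jobs finish, are cancelled, or time out.
after_pipeline:
  script:
    - ./teardown_shared_resources.sh
```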
Personally, the way I see the flow working is that each job that actually runs has a series of steps, like:
global before_script
job before_script
job dependencies
job script
job after_script
global after_script
... now the problem at the moment is that if the job is cancelled during script, none of the others are called; everything is shut down at that particular point in time. The problem is that when you have cleanup that you really have to do, and you put it in either the job's or the global after_script, it will not be called.
... therefore I don't see your particular problem of creating xxx after_scripts; rather, once the cancellation is detected, break out of the script part, but do continue with the after steps until the CI job is actually considered finalized.
So the execution of after_script for a job would depend on whether before_script or script has started? I think we would have to introduce intra-job states to be able to confidently know whether a job has started, and from that whether or not to run its after_script.
I was under the impression that a single job is like a thread. It has prerequisites, an actual task, and winding-down/cleanup. If that's not the case, then yes, having at least 3 states would be beneficial: startup for something that can't be interrupted, the actual task (defined by script, something that can be interrupted), and then the post-processing (whatever needs to happen afterwards: artifacts, cleanup, or both).
The YAML is there to define the structure and some relations/dependencies. How it is built upon in its final form is irrelevant, as long as it follows an understandable flow that ensures the most important tasks are performed for each job (before, task, after). At least that's what matters to me.
That's also what I've experienced working with, say, Busted for Lua testing. It allows defining a describe that explains a group of tests, a before_each that runs before each test, an after_each for cleanup, and finally it for the actual test. You might think I'm asking for a lot, but to me it's a simple trio of tasks that have to happen, and only one of them is interruptible (or can fail).
A lot of people at my company click "cancel" to cancel a deployment, but clicking cancel doesn't actually stop the deployment, since GitLab is just calling an API to initiate it. There is a way to cancel a deployment, but a script needs to be called.
In other words, I would expect, and want, the "cancel" button to still run the after scripts, or a teardown of some sort. I do not think anyone expects a cancel to be a "stop the world, kill the container". If a job went into a "canceling" state, that would seem very reasonable.
Before/after pipeline steps are not nearly as useful, as the teardown is usually very step-specific.
Just adding another vote for this issue. We are using Terraform to create VMs on OpenStack, so if someone hits cancel or the job times out (my case), those VMs have to be manually cleaned up, which is a huge pain.
I have a shell executor with before_script, script, and after_script. If I cancel the pipeline, the job still continues through before_script and script, but not after_script. It seems a reliable (instant) termination of the job is required, rather than a solution based on after_script alone.