This feature is now a candidate for 16.4 at the earliest as we still need to refine the implementation proposal.
Description
Allow configurable after_script timeouts so that post-script cleanup can run in after_script even when it takes longer than 5 minutes.
This is useful for semantically separating cleanup from the main script, but also when you want something to always run even if the script failed and that something takes more than 5 minutes.
I'm currently just moving it into a script action because I don't really need it to fire on failure (waiting for an ECS deployment to finish draining previous tasks).
Proposal
The after_script timeout is currently hardcoded to 5 minutes. It would be nice to make this optionally configurable just like the main script actions are.
In the original issue about adding after_script, it was noted that the timeout was planned to be an arbitrary value at first, with a configuration option to follow later.
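Purely as an illustration of what this could look like once configurable (this is not existing syntax; the actual keyword or mechanism still needs to be decided during refinement):

```yaml
# Hypothetical example only: `after_script_timeout` is an illustrative keyword,
# not an implemented GitLab CI option at the time of writing.
deploy-job:
  script:
    - ./deploy.sh
  after_script:
    - ./teardown.sh              # may legitimately take longer than 5 minutes
  after_script_timeout: 30m      # hypothetical per-job override of the hardcoded limit
```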
Links to related issues and merge requests / references
We have rather heavy test environment tear-down actions (shutting down the WebLogic server, collecting all logs) followed by an upload of test execution reports (500 MB) to a dedicated server, which needs longer than 5 minutes to complete.
P.S.: the fact that the after_script timeout is not reported in any way by the Runner makes the issue that much harder to pin down.
The fact that the 5-minute limit is not mentioned in the documentation, and especially that no error message is shown anywhere, is really a showstopper...
Guys, it's a real flaw that this limitation is not mentioned in the documentation and that there is no error message when after_script is killed by the timeout. Please make the option configurable, document it properly, or at least throw an error/warning that the task was killed due to the timeout. I've already spent a few days trying to figure out why the script was randomly stopping...
This bug has been open for 2 years, and to be honest I am frustrated with GitLab pushing easy-to-fix bugs aside in favor of chasing trending tools.
The solution can be done in 2 phases:
First, change it to 20 minutes.
Later, add a configurable option, even though that part is more than a simple hardcoding change.
@steveazz I just saw you talking about this very topic on another issue. I understand the limit is 5 minutes because the after_script is meant for a cleanup rather than a large planned set of work. What are your thoughts on just increasing the maximum a small amount? cc @joshlambert
I'm not sure increasing it is ideal. As pointed out, the after_script should only be used for cleanup/deleting anything that the job created; it shouldn't be used for anything else. I'm not sure I understand the need to make it longer just to delete environments and clean up. after_script is also called on failures; now, if a job failed, that means either the user's code is wrong or something is wrong with the environment. If something is wrong with the environment, the after_script is most likely just going to time out as well because it would hang, which leads to cascading problems: the pipelines take longer to fail and the queues get backed up. Even if we allow the customer to configure this, it's just going to cause them the same problems. Having configuration also leads to the problems we describe in https://about.gitlab.com/handbook/product/#convention-over-configuration
I'm really curious why users require after_script to be longer than 5m. What is so big about their cleanup that it needs more than 5 minutes? Why does a customer need more than 5 minutes to delete files and clean up the environment?
One scenario I see is uploading logs on failure. If we want to do that, we can always use artifacts, specifically artifacts:when: on_failure, which is designed to upload files to be used for debugging after the job has finished.
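For reference, a minimal sketch of that artifacts-on-failure approach (the job name and paths are illustrative):

```yaml
integration-test:
  script:
    - ./run-tests.sh
  artifacts:
    when: on_failure        # upload only when the job fails
    paths:
      - logs/
    expire_in: 1 week
```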
I feel like we are not really communicating well the intent of after_script:
Sometimes there is a need for longer tear-down actions in the after_script.... I fail to see the reason for such a limitation; maybe if it were something like 10 minutes it would make more sense, but 5 can be too short.
The after_script is a good place to start collecting logs, and most of the time you want to collect logs after the setup has been torn down, which can sometimes take time, especially if you want to do it right.
I am often puzzled by the limitations and usability decisions GitLab makes, as if they were working at a purely theoretical level without collecting any field feedback at all, or in this case ignoring it for 2 years even though the fix is very simple.
> The after_script is a good place to start collecting logs, and most of the time you want to collect logs after the setup has been torn down, which can sometimes take time, especially if you want to do it right.
Isn't this something that artifacts on failure can achieve? How are you gathering the logs?
First, we are self-hosting GitLab, and that creates the following problem.
We generate several gigabytes of logs per job run and push them to S3.
We know that artifacts can be pushed directly to S3 via configuration.
But the problem is the retrieval of those logs: it absolutely kills the GitLab server, so we are not using the artifacts mechanism for logs and push them directly to S3 instead.
Personally, I think the artifact mechanism is good for logs up to about 150 MB.
Still, before we collect logs we need to tear down our setup, which takes some time; for now it is about 4 minutes and might take even more in the future.
The point is that 5 minutes can sometimes be too short, and you can see in this thread that I am not the only one complaining about it, so there is a need.
I took a look at the gitlab-runner code; changing it to something like 10 or 20 minutes is as simple as changing the 5 to a 20.
Making it an option is a little trickier but still possible, I assume. Like most of the runner's basic configuration regarding timeouts, description, etc., you would want it to be configurable from GitLab, therefore you need GitLab to support it, and the API of course.
So a quick fix is to change the timeout and help GitLab users, whether they are using EE or CE, not to jump through hoops just to collect logs or do some teardown actions.
Still, in general, as I said, I do not understand the logic behind GitLab's decisions; many times decisions seem to be taken as if user feedback were being ignored completely.
Another example we suffer from: gitlab#18283 (closed)
After an upgrade I just noticed that the default behavior had changed, that there is no option for backward compatibility, and that it has been like this for 2 years and does not seem to bother anyone.
Thanks for the detailed explanation @tal13, I appreciate it; that makes the picture a lot clearer on what is happening and helps us make a better decision.
I think we can start by bumping this to 10 minutes for now and see if that improves the situation for most users. Increasing it much higher might just lead to longer pipelines for most users, but we can always increase it again later if we see that it simply is not enough.
What are your thoughts @ayufan @tmaczukin, should we increase the after_script timeout to 10 minutes instead? I've already raised the concerns about doing this in #2716 (comment 219927166), but given the strong need from the rest of the community we should consider increasing it.
Adding to @tal13's comment above:
We have run into a similar issue: our after_script originally included uploading build reports to Artifactory (GitLab just can't hold all our test job reports, which exceed 500 MB each) along with shutting down the WebLogic server that we had just run our tests against.
That very soon started taking more than 5 minutes, resulting in half of the artifacts suddenly getting lost without any error or warning.
As a workaround we had to implement the following bash function:
This is really hurting us. It took me days to figure out that this is what was ending my CI job, and only after a lucky Google search. There is nothing about timing out in the debug logs of the GitLab runner itself or of the job, and as we already know from this ticket there is nothing in the documentation either.
Our use case is pretty common and straightforward. We're testing our Terraform automation by building an ephemeral product deployment. We don't want to be left with 100 AWS objects after every test, so the after_script: does a terraform destroy to remove the AWS objects we created. This typically takes 20-40 minutes due to AWS operation times. I would think this is a common enough thing to warrant a configurable after_script timeout option.
Would just like to point out a workaround (or maybe even intended functionality) we have used to address some of the problems listed here.
If you are using the after_script declaration to manage ephemeral deployments like @jwitko mentioned, or in general you are trying to have a clean-up job that runs regardless of job success/failure, it might be better to use the when: always configuration for a "cleanup" job definition.
This cleanup job could be its own stage at the end of the pipeline that has the when: always configuration option set. This setup has no timeout issues.
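For concreteness, a minimal sketch of that pattern (stage and job names are illustrative):

```yaml
stages:
  - deploy
  - test
  - cleanup

cleanup-review-env:
  stage: cleanup
  when: always                    # runs even when earlier jobs failed
  script:
    - ./teardown-environment.sh   # bounded by the normal job timeout, not the 5-minute after_script limit
```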
There are a couple of potential minor downsides:
The job/stage needs to be defined immediately after the job you want to "clean up". I imagine for most cases this isn't an issue, but the when: always job runs regardless of the status of all prior stages/jobs before it in the pipeline. This is important to remember, because if the pipeline fails before the "creation" job, your cleanup job will run anyway even though it has nothing to clean up!
Since you are defining a new job, it may mean duplicating some things like environment variables. The script context was already missing from an after_script definition, so I can't imagine this is actually an issue for people.
This does have the added benefit of being more explicit in the pipeline UI, and it allows you to be much more flexible in terms of what the cleanup job is or isn't allowed to do. You can now use all the CI features that normal jobs have (e.g. if your cleanup job fails, you can have it throw an error and stop the pipeline; this may be desirable in some cases, since after_script currently cannot do this).
I do think the 5-minute limit is safe and probably good practice, since more advanced cleanup jobs should opt for their own defined job with when: always. Ephemeral deployment pipelines certainly seem to fit into this advanced category, as you'll want more control over the job and assurance that the cleanup actually happens. But I wouldn't mind seeing an option to override the hardcoded limit, along with a warning in the documentation.
And to expand on the comment about a newly defined job and throwing a job failure, consider the terraform destroy example posted. If terraform destroy is in an after_script and there is some odd bug where terraform apply succeeds but terraform destroy doesn't, the pipeline and job will still succeed. Instead, you'll definitely want to know that the destroy command failed and throw a pipeline failure. Otherwise you'll be merging code that can't be destroyed in case of a future rollback! Or you'll have a surprise bill at the end of the month from your cloud provider for the hundreds of ephemeral environments that were never shut down :)
That kind of job failure logic currently can't work in an after_script.
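A rough sketch of the terraform destroy case as a regular job (names are illustrative); because it is a normal job, a failed destroy fails the pipeline, which an after_script cannot do:

```yaml
destroy-ephemeral-env:
  stage: cleanup
  when: always
  script:
    - terraform init
    - terraform destroy -auto-approve   # a non-zero exit code here fails the job and the pipeline
```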
By chance I landed here... We have a complicated .gitlab-ci.yml structure with a lot of overloads and overrides.
So we are forced to use after_script for a Docker build.
It took me a long time to find out that the after_script has a timeout which is not the same as the pipeline timeout. There are no hints about it anywhere, and the job is not even marked as failed.
The end of our pipeline log looks like this, and I wondered why the Docker image wasn't there:
```
INFO[0242] Adding exposed port: 8080/tcp
INFO[0242] No files changed in this command, skipping snapshotting.
INFO[0242] Pushing layer image-location to cache now
Cleaning up file based variables
Job succeeded
```
For me, it's completely unacceptable that the job finishes with no error or notification.
Some details to add on why this feature is very important for me:
I use an included template job which I extend into a local job for reusability.
And I use after_script and before_script to allow my colleagues to customise the job.
For example, I set up an auto-preview which deploys a review version of our app.
For one of the projects we need to seed the database, so I wait and seed in the after_script, but that's not possible because of the hardcoded 5-minute timeout.
> I'm really curious why users require after_script to be longer than 5m. What is so big about their cleanup that it needs more than 5 minutes? Why does a customer need more than 5 minutes to delete files and clean up the environment?
@steveazz here's another use case:
I need after_script to sync some directories into S3 because of a number of other GitLab limitations I must work around:
Cannot use cache:key as configured for S3, because I want a directory of objects in S3, not a zip of the whole directory stored in S3 as one object.
Running as non-root via exec su <user> -c "XYY": upon return the job finishes as a success, regardless of any lines in script: [] after that point.
Cannot run a separate stage because it would be another container, and I'm trying to avoid copying gigabytes around in the first place, so artifacts are out.
Basically, the way random parts of GitLab work differently from their sister options makes this stuff very frustrating. s3 sync ./dir/* s3://bucket would be about perfect for after_script.
Background: self-hosted GitLab; I'm using the Kubernetes runner on our own cluster (which GitLab itself isn't in).
We use resource_group to mitigate concurrent use of cloud environments and the after_script to tear down the infrastructure in case something goes wrong during testing. It is common for the tear down of cloud infrastructure to take longer than 5 minutes. Please allow for the timeout in the after_script to be configurable rather than hard-coded to 5 minutes or even 10 minutes. Don't just change the problem so that it hurts fewer people. Allowing the timeout to be configurable solves the problem.
My after_script is collecting docker container logs so I can debug failures. I have no idea why five invocations of docker-compose logs service > service.log need more than 5 minutes to capture 136 megabytes of log data, but this silent timeout prevents me from capturing the logs of another two services and invoking docker-compose down.
In our use case the teardown would need to be able to run for about 2-3 hours. We have a pipeline that builds the entire infrastructure from nothing to ensure that it can be done this way (using Terraform and Terraspace). The teardown needs to occur regardless of whether the script fails; otherwise it leaves infrastructure (EC2 instances, load balancers, etc.) behind, which prevents the next setup and teardown from working correctly.
To work around this we are currently allowing the spin-up stage to fail, running the teardown in a secondary stage, and re-issuing the error code from the spin-up stage at the end of it. This is an ugly process and prevents developers from getting feedback immediately upon failure, defeating a fail-fast, fail-early approach.
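For illustration, a rough sketch of that workaround (job names, script names, and passing the status via a dotenv artifact are just one way to express it, not our exact configuration):

```yaml
stages:
  - deploy
  - cleanup

spin-up:
  stage: deploy
  allow_failure: true                 # let the pipeline continue so the teardown can run
  script:
    - ./build-infrastructure.sh
  after_script:
    - echo "SPIN_UP_STATUS=$CI_JOB_STATUS" > spinup.env
  artifacts:
    when: always                      # export the status even when spin-up failed
    reports:
      dotenv: spinup.env

teardown:
  stage: cleanup
  when: always
  script:
    - ./destroy-infrastructure.sh           # may run for hours; only the regular job timeout applies
    - '[ "$SPIN_UP_STATUS" = "success" ]'   # re-raise the spin-up failure at the very end
```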
We ended up working around the problem by avoiding after_script altogether and haven't looked back. Instead of using an after_script block, we just structure an equivalent of a try/finally in whatever program is being called from the script block. In a Python or Java program, this would be a try/finally structure; in bash or another shell script, you could use trap exit_fn EXIT.
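A minimal sketch of that shell variant, embedded in a job's script section (job and script names are illustrative); the EXIT trap runs even when the tests fail and is only bounded by the regular job timeout:

```yaml
run-tests:
  script:
    - |
      cleanup() {
        ./teardown-environment.sh   # "finally" block: always runs on shell exit
      }
      trap cleanup EXIT
      ./run-tests.sh
```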
Who came up with this design??? Why hardcode it to 5 minutes? There are a lot of use cases that need more than 5 minutes to complete. This needs to be configurable.
We're running a paid self-hosted instance of GitLab. We develop games and were using the after_script section to package the build of the game into a zip file to be used by later steps. We wanted to change the Unity backend, which caused the zip step to take longer than five minutes. The solution is simple: move the zip step into the script section. However, the fact that the system does not show any warning or error message to tell you why it failed meant we lost a lot of time chasing dead ends while troubleshooting this. We wasted a month on this.
I also struggled with this issue. In the end, our workaround was to "embed" a custom after_script within the script section. This also has its drawbacks, like having to specify all the commands in variables. Here you can see a simplified code snippet for our .job-template:
```yaml
.job-template:
  variables:
    CMD_SCRIPT: ''
    CMD_POST_STEPS: ''
  script:
    # The actions that the user wants to execute in the script section are defined
    # under the CMD_SCRIPT variable. This provides the ability to override only
    # a part of the script section (the actions defined under CMD_SCRIPT) without
    # nuking the other elements defined here, like the post actions.
    - |
      bash -exc "${CMD_SCRIPT}" && EXIT_CODE=$? || EXIT_CODE=$?
    # This section (post steps) is always executed, even if there were errors in the
    # commands above. The semantics of the commands executed here are self-explanatory.
    - |-
      bash -xc "echo 'Executing post steps'
      ${CMD_POST_STEPS}"
    # Return success if and only if the script section (CMD_SCRIPT) has succeeded
    - '[ ${EXIT_CODE} = 0 ]'

my-job:
  extends: .job-template
  variables:
    CMD_SCRIPT: |
      echo "Executing the script"
      pytest ./tests -n 2
    CMD_POST_STEPS: |
      tar czvf <path-to-test-results>
      <push-test-results-to-nexus>
```
The full code can be found here. I also wrote an article discussing the issue and the workaround.
Of course, all this complicated code would not be needed if GitLab CI got rid of this annoying timeout.
@yofeldstein It's currently a candidate for 15.10. The one caveat is that we still need to spend time investigating a potential solution, thus the current workflow label = workflow::refinement.
Our main annoyance was not that it had a 5-minute limit; it was that the UI did not tell us why it failed. A customizable timeout would be great, but even just having it print an error message saying that our after_script was terminated because it took longer than five minutes would have saved us a lot of time.
We are also being bitten by this issue, in a context similar to #2716 (comment 1135458959). We need to upload a significant amount of data in after_script and can't really use artifacts or another workaround for this easily without making significant changes to our CI setup.
Having the timeout be configurable (or set to a high default like 30 minutes) would solve our problems.
In my production workflow, all projects that output binaries upload them using a templated after_script. We can't use artifacts as our files exceed the artifact size limit.
Whether due to internet lossage, GitLab being a bit slower, or our servers getting tired, the after_script has gradually been taking longer and now regularly hits the 5-minute mark. But because an after_script error doesn't count as a pipeline failure, the pipeline continues as though the output had been published. This is causing empty files to get deployed into production, which, all things considered, is undesirable.
Please can a fix, a workaround, or even a monkey patch for this be made available?