This feature is now a candidate for 16.4 at the earliest, as we still need to refine the implementation proposal.
Description
Allow configurable after_script timeouts so that post-script cleanup can be done there even when it takes longer than 5 minutes.
This is useful for semantically separating the phases of a job, but also when you want something to always run even if the script failed, yet that step takes more than 5 minutes.
I'm currently just moving it into a script action because I don't really need it to fire on failure (waiting for an ECS deployment to finish draining previous tasks).
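For illustration, a minimal sketch of that workaround, with hypothetical script names: the long-running wait is placed at the end of script (covered by the job-level timeout) instead of in after_script:

```yaml
deploy:
  timeout: 30m                    # job-level timeout covers every script step
  script:
    - ./deploy-to-ecs.sh          # hypothetical deploy step
    # long-running cleanup lives here instead of after_script, at the cost
    # of not running when an earlier script step fails
    - ./wait-for-ecs-draining.sh  # hypothetical wait for old tasks to drain
```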
Proposal
The after_script timeout is currently hardcoded to 5 minutes. It would be nice to make this optionally configurable, just like the timeouts for the main script actions are.
In the original issue that proposed adding after_script, it was noted that this was planned to start as an arbitrary timeout, with a configuration option to be added later.
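A minimal sketch of what such an option could look like in .gitlab-ci.yml; note that after_script_timeout is a hypothetical keyword used only to illustrate the idea, and the exact name and placement (job keyword vs. runner configuration) is exactly what still needs to be refined:

```yaml
test:
  script:
    - ./run-tests.sh
  after_script:
    - ./collect-logs.sh        # cleanup that can exceed the current 5-minute cap
  # hypothetical keyword, for illustration only
  after_script_timeout: 15m
```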
Links to related issues and merge requests / references
We have rather heavy test-environment tear-down actions (shut down the WebLogic server, collect all logs) followed by an upload of test execution reports (500 MB) to a dedicated server, which needs longer than 5 minutes to complete.
P.S.: the fact that the after_script timeout is not reported in any way by the Runner makes the issue that much harder to pin down.
The fact that the 5-minute limit is not mentioned in the documentation, and especially that no error message is shown anywhere, is really a showstopper...
Guys, it's a real flaw that this limitation is not mentioned in the docs and that no error message is shown when after_script is killed by the timeout. Please make the option configurable, document it properly, or at least throw an error/warning that the task was killed due to the timeout. I've already spent a few days trying to figure out why the script was being stopped seemingly at random...
This bug has been open for 2 years, and to be honest I am frustrated that GitLab pushes easy-to-fix bugs aside in favor of chasing trending tools.
The solution can be done in 2 phases:
change it to 20 minutes;
later, add a configuration option, even though that part is no longer a simple hard-coded change.
@steveazz I just saw you talking about this very topic on another issue. I understand the limit is 5 minutes because after_script is meant for cleanup rather than a large planned set of work. What are your thoughts on just increasing the maximum by a small amount? cc @joshlambert
I'm not sure increasing it would be ideal. As pointed out, after_script should only be used for cleanup/deleting anything that the job created; it shouldn't be used for anything else, and I'm not sure I understand the need for a higher limit just to delete environments and clean up. after_script is also called on failures: if a job failed, either the user's code is wrong or something is wrong with the environment, and if something is wrong with the environment the after_script is most likely just going to hang and time out as well. That leads to cascading problems, because pipelines take longer to fail and the queues get backed up, and even if we let customers configure this it is just going to cause them the same problems. Having configuration also leads to the problems described in https://about.gitlab.com/handbook/product/#convention-over-configuration
I'm really curious why users require after_script to be longer than 5 minutes: what is so big about their cleanup that it needs more than 5 minutes? Why does a customer need more than 5 minutes to delete files and clean up the environment?
One scenario I can see is uploading logs on failure; if we want to do that, we can always use artifacts, specifically artifacts:when: on_failure, which is designed to upload files to be used for debugging after the job has finished.
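For reference, a minimal example of that artifacts-based approach, assuming the job writes its logs into a logs/ directory:

```yaml
test:
  script:
    - ./run-tests.sh
  artifacts:
    when: on_failure   # only upload artifacts when the job fails
    paths:
      - logs/          # assumed log directory produced by the job
    expire_in: 1 week
```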
I feel like we are not communicating the intent of after_script well enough.
Sometimes there is a need for longer tear-down actions in the after_script... I fail to see the point of such a limitation; maybe if it were something like 10 minutes it would make more sense, but 5 can be too short.
The after_script is a good place to start collecting logs, and most of the time you want to collect logs after the setup has been torn down, which can sometimes take time, especially if you want to do it right.
I am often puzzled by the limitations and usability decisions GitLab makes, as if they are working at a purely theoretical level without collecting any field feedback at all, or, in this case, ignoring it for 2 years even though the fix is very simple.
Isn't collecting those logs something that artifacts on failure can achieve? How are you gathering the logs?
First, we are self-hosting GitLab, and that creates the following problem.
We generate several gigabytes of logs per job run and push them to S3.
We know that artifacts can be pushed directly to S3 via configuration.
But the problem is the retrieval of those logs: it absolutely kills the GitLab server, so we are not using the artifacts mechanism for logs and instead push them directly to S3.
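Roughly, the approach looks like the sketch below, assuming the AWS CLI is available in the job image, with an illustrative bucket name and hypothetical script names:

```yaml
integration-tests:
  script:
    - ./run-integration-tests.sh
  after_script:
    - ./tear-down-environment.sh   # hypothetical tear-down step, about 4 minutes today
    # push multi-gigabyte logs straight to S3 instead of GitLab artifacts;
    # together with the tear-down this easily exceeds the 5-minute after_script limit
    - aws s3 cp logs/ "s3://example-ci-logs/$CI_JOB_ID/" --recursive
```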
Personally, I think the artifact mechanism is good for logs up to 150 MB.
Still, before we collect logs we need to tear down our setup, which takes some time; for now it is about 4 minutes and might take even longer in the future.
The point is that 5 minutes can sometimes be too short, and you can see in this thread that I am not the only one complaining about it, so there is a need.
I took a look at the gitlab-runner code: changing the timeout to something like 10 or 20 minutes is as simple as changing the 5 to a 20.
Making it a configurable option is a little trickier, though still possible. I assume that, like most of the Runner's basic job configuration (timeouts, description, ...), you would want it configurable from GitLab, so GitLab and the API would need to support it as well.
So a quick fix is to change the timeout and help GitLab users, whether they are on EE or CE, avoid jumping through hoops just to collect logs or run some tear-down actions.
Still, in general, as I said, I often fail to follow the logic of GitLab's decisions; many times they seem to be made as if user feedback is being ignored completely.
Another example we suffer from is gitlab#18283 (closed).
After an upgrade I noticed that the default behavior had changed with no option for backward compatibility; it has been like that for 2 years and it does not seem to bother anyone.
Thanks for the detailed explanation @tal13, I appreciate it. That makes the picture of what is happening a lot clearer and helps us make a better decision.
I think we can start by bumping this to 10 minutes for now and see if that improves the situation for most users. Increasing it much higher might just lead to longer pipelines for most users, but we can always increase it again later if we see that it simply is not enough.
What are your thoughts @ayufan @tmaczukin, should we increase the after_script timeout to 10 minutes instead? I've already raised my concerns about doing this in #2716 (comment 219927166), but given the strong need from the rest of the community we should consider increasing it.
Adding to @tal13's comment above:
We have run into a similar issue: our after_script originally included uploading build reports to Artifactory (GitLab just can't hold all our test job reports, which exceed 500 MB each), along with shutting down the WebLogic server that we had just run our tests against.
That very soon started taking more than 5 minutes, resulting in half of the artifacts suddenly getting lost without any error or warning.
As a workaround we had to implement the following bash function: