Add `podTerminationGracePeriodSeconds` and `cleanupGracePeriodSeconds` configuration options to the Kubernetes executor
Description
In !383 (merged), a setting was introduced to allow users to set the terminationGracePeriodSeconds
property of a pod. The motivation for this was to be able to control how quickly pods were cleaned up after completion of a build. This is a useful feature (although I can't imagine using anything other than the default of 0 seconds).
However, the setting applies to the deletion of the pod in any situation. Our specific problem is that we are using cluster autoscaler, which will honour the terminationGracePeriodSeconds
property of a pod when scaling down[1]. We want to give our jobs a non-zero amount of time to complete before the pod is killed when scaling down, but we also want complete jobs cleaned up ASAP. It is not currently possible to have both.
[1] actually it does allow overriding this value, but will not allow us to increase the value from the pod.
Proposal
Introduce an additional configuration option CleanupGracePeriodSeconds
, which is used in the DeleteSpec
when cleaning up pods.
I have posted below a naive implementation which does exactly what we need, but would be a breaking change for users who have executors configured with TerminationGracePeriodSeconds
> 0, as they would find that their job pods would be cleaned up immediately (though I would argue that this would be a pleasant surprise!).
There are two fully backwards-compatible options I can think of:
-
A Introduce an additional flag e.g.
AllowOverrideCleanupGracePeriodSeconds
. Iffalse
, theTerminationGracePeriodSeconds
would be used in cleanup and on the pod property (i.e. the existing behaviour); iftrue
, the new behaviour is activated. Eventually this flag can be dropped and the new behaviour can prevail. -
Use pointers for the grace periods in
KubernetesConfig
here. This allows us to defaultCleanupGracePeriodSeconds
toTerminationGracePeriodSeconds
. This is not ideal, asCleanupGracePeriodSeconds
really should default to 0 as this is what most or all people will want.
I don't believe either of these are better solutions than the naive one, though possibly a necessary evil. I'm happy to implement either or both of them.
EDIT:
- Introduce two new params:
PodTerminationGracePeriodSeconds
(applied as a property of the job pod) andCleanupGracePeriodSeconds
(used when cleaning up the pod) but only use the values of these params if no value is specified for the originalTerminationGracePeriodSeconds
.
Option 3 is the what I went with in !2130 (merged)
Links to related issues and merge requests / references
Naive implementation (technically breaking change):
https://gitlab.com/dbevacqua/gitlab-runner/-/commit/e9bd725ce983c74f422e9f77312668b01272a765