Skip to content

Add `podTerminationGracePeriodSeconds` and `cleanupGracePeriodSeconds` configuration options to the Kubernetes executor

Description

In !383 (merged), a setting was introduced to allow users to set the terminationGracePeriodSeconds property of a pod. The motivation for this was to be able to control how quickly pods were cleaned up after completion of a build. This is a useful feature (although I can't imagine using anything other than the default of 0 seconds).

However, the setting applies to the deletion of the pod in any situation. Our specific problem is that we are using cluster autoscaler, which will honour the terminationGracePeriodSeconds property of a pod when scaling down[1]. We want to give our jobs a non-zero amount of time to complete before the pod is killed when scaling down, but we also want complete jobs cleaned up ASAP. It is not currently possible to have both.

[1] actually it does allow overriding this value, but will not allow us to increase the value from the pod.

Proposal

Introduce an additional configuration option CleanupGracePeriodSeconds, which is used in the DeleteSpec when cleaning up pods.

I have posted below a naive implementation which does exactly what we need, but would be a breaking change for users who have executors configured with TerminationGracePeriodSeconds > 0, as they would find that their job pods would be cleaned up immediately (though I would argue that this would be a pleasant surprise!).

There are two fully backwards-compatible options I can think of:

  1. A Introduce an additional flag e.g. AllowOverrideCleanupGracePeriodSeconds. If false, the TerminationGracePeriodSeconds would be used in cleanup and on the pod property (i.e. the existing behaviour); if true, the new behaviour is activated. Eventually this flag can be dropped and the new behaviour can prevail.

  2. Use pointers for the grace periods in KubernetesConfig here. This allows us to default CleanupGracePeriodSeconds to TerminationGracePeriodSeconds. This is not ideal, as CleanupGracePeriodSeconds really should default to 0 as this is what most or all people will want.

I don't believe either of these are better solutions than the naive one, though possibly a necessary evil. I'm happy to implement either or both of them.

EDIT:

  1. Introduce two new params: PodTerminationGracePeriodSeconds (applied as a property of the job pod) and CleanupGracePeriodSeconds (used when cleaning up the pod) but only use the values of these params if no value is specified for the original TerminationGracePeriodSeconds.

Option 3 is the what I went with in !2130 (merged)

Links to related issues and merge requests / references

!2130 (merged)

Naive implementation (technically breaking change):

https://gitlab.com/dbevacqua/gitlab-runner/-/commit/e9bd725ce983c74f422e9f77312668b01272a765

Edited by Darren Eastman