update kube-job retry parameters

Closes #2539

Closes #2216

With this MR, all of our kube-job Jobs will run without any time limit, and their number of retries will not be limited.

This MR essentially reintroduces !4700, with one difference: it drops the part that had led !4700 to cause regressions around the pivot Job. !4700 was setting the spec.activeDeadlineSeconds of Pods to 300s, which prevented any Pod of a kube-job from running for more than 5 minutes; this is a problem for the pivot Job, which sometimes runs longer than that and does not always recover from being interrupted.
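For reference, the field in question lives in the Job's Pod template, not at the Job level itself. A minimal sketch of what the !4700 setting looked like on a rendered Job (the 300s value is the one mentioned above; name, image and command are purely illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: kube-job-example              # hypothetical name, for illustration only
spec:
  template:
    spec:
      # the Pod-level field !4700 was setting: it limits each individual Pod run
      # to 5 minutes, which interrupted pivot Job Pods that legitimately run longer
      activeDeadlineSeconds: 300
      restartPolicy: Never
      containers:
        - name: job
          image: registry.example.org/kube-job:latest        # placeholder
          command: ["/bin/bash", "-c", "/scripts/run-job.sh"] # placeholder
```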

What this MR does is:

  • increase the spec.activeDeadlineSeconds and the spec.backoffLimit of kube-job Jobs to a pseudo-infinite value
    • this is sufficient to ensure that we no longer end up with a kube-job Kustomization failed on a Job that does not retry
  • do not set the spec.activeDeadlineSeconds of Pods (i.e. spec.template.spec.activeDeadlineSeconds of Jobs)
    • in !4700, the idea behind setting this field was to avoid the case where a Pod run would stay stuck on something for a long time, preventing retries
    • but we have now realized that we can't do that without risking reintroducing the issue described above for the pivot Job (and possibly other kube-jobs: we don't want to make a change that would require assuming a duration that all of our kube-jobs fit within, and/or assuming how well each kube-job script supports re-runs)
  • for some Jobs for which we know that interruptions and re-runs are fine, we set the spec.activeDeadlineSeconds of Pods (i.e. spec.template.spec.activeDeadlineSeconds of Jobs) via POD_ACTIVE_DEADLINE_SECONDS (see the sketch after this list)
    • we do this for Jobs that only wait for resources to be ready
    • this makes those Jobs robust to a hypothetical scenario where they would get stuck
  • additionally, this MR removes the override of JOB_BACKOFF_LIMIT that had been introduced for some Jobs in !4643 to increase their number of retries and improve their robustness; now that backoffLimit is pseudo-infinite, there is no point in keeping it
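
To illustrate the outcome, here is a sketch of what a kube-job Job roughly looks like after this change (the pseudo-infinite numbers are placeholders rather than the exact values used by the chart; names, images and commands are illustrative):

```yaml
# Default kube-job Job: no time limit, pseudo-infinite retries
apiVersion: batch/v1
kind: Job
metadata:
  name: kube-job-example              # hypothetical name
spec:
  activeDeadlineSeconds: 1000000000   # placeholder "pseudo-infinite" value
  backoffLimit: 1000000000            # placeholder "pseudo-infinite" value
  template:
    spec:
      # no Pod-level activeDeadlineSeconds here by default: Pods such as the pivot
      # Job's may legitimately run for more than 5 minutes and must not be killed
      restartPolicy: Never
      containers:
        - name: job
          image: registry.example.org/kube-job:latest        # placeholder
          command: ["/bin/bash", "-c", "/scripts/run-job.sh"] # placeholder
---
# "Wait-only" kube-job Job, where interruption and re-run are known to be safe:
# the Pod-level deadline is set via POD_ACTIVE_DEADLINE_SECONDS
apiVersion: batch/v1
kind: Job
metadata:
  name: kube-job-wait-example         # hypothetical name
spec:
  activeDeadlineSeconds: 1000000000
  backoffLimit: 1000000000
  template:
    spec:
      activeDeadlineSeconds: 600      # placeholder; the actual value comes from POD_ACTIVE_DEADLINE_SECONDS
      restartPolicy: Never
      containers:
        - name: job
          image: registry.example.org/kube-job:latest        # placeholder
          command: ["/bin/bash", "-c", "kubectl wait --for=condition=Ready pods --all"]  # placeholder
```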

CI configuration

Below you can choose test deployment variants to run in this MR's CI.

Click to open the CI configuration

Legend:

| Icon | Meaning | Available values |
|------|---------|------------------|
| ☁️ | Infra Provider | capd, capo, capm3 |
| 🚀 | Bootstrap Provider | kubeadm (alias kadm), rke2, okd, ck8s |
| 🐧 | Node OS | ubuntu, suse, na, leapmicro |
| 🛠️ | Deployment Options | light-deploy, dev-sources, ha, misc, maxsurge-0, logging, no-logging |
| 🎬 | Pipeline Scenarios | Available scenario list and description |
  • 🎬 preview ☁️ capd 🚀 kadm 🐧 ubuntu

  • 🎬 preview ☁️ capo 🚀 rke2 🐧 suse

  • 🎬 preview ☁️ capm3 🚀 rke2 🐧 ubuntu

  • ☁️ capd 🚀 kadm 🛠️ light-deploy 🐧 ubuntu

  • ☁️ capd 🚀 rke2 🛠️ light-deploy 🐧 suse

  • ☁️ capo 🚀 rke2 🐧 suse

  • ☁️ capo 🚀 rke2 🐧 leapmicro

  • ☁️ capo 🚀 kadm 🐧 ubuntu

  • ☁️ capo 🚀 rke2 🎬 rolling-update 🛠️ ha 🐧 ubuntu

  • ☁️ capo 🚀 kadm 🎬 wkld-k8s-upgrade 🐧 ubuntu

  • ☁️ capo 🚀 rke2 🎬 rolling-update-no-wkld 🛠️ ha 🐧 suse

  • ☁️ capo 🚀 rke2 🎬 sylva-upgrade-from-1.4.x 🛠️ ha 🐧 ubuntu

  • ☁️ capo 🚀 rke2 🎬 sylva-upgrade-from-1.4.x 🛠️ ha,misc 🐧 ubuntu

  • ☁️ capo 🚀 rke2 🛠️ ha,misc 🐧 ubuntu

  • ☁️ capm3 🚀 rke2 🐧 suse

  • ☁️ capm3 🚀 kadm 🐧 ubuntu

  • ☁️ capm3 🚀 ck8s 🐧 ubuntu

  • ☁️ capm3 🚀 kadm 🎬 rolling-update-no-wkld 🛠️ ha,misc 🐧 ubuntu

  • ☁️ capm3 🚀 rke2 🎬 wkld-k8s-upgrade 🛠️ ha 🐧 suse

  • ☁️ capm3 🚀 kadm 🎬 rolling-update 🛠️ ha 🐧 ubuntu

  • ☁️ capm3 🚀 rke2 🎬 sylva-upgrade-from-1.4.x 🛠️ ha 🐧 suse

  • ☁️ capm3 🚀 rke2 🛠️ misc,ha 🐧 suse

  • ☁️ capm3 🚀 rke2 🎬 sylva-upgrade-from-1.4.x 🛠️ ha,misc 🐧 suse

  • ☁️ capm3 🚀 kadm 🎬 rolling-update 🛠️ ha 🐧 suse

  • ☁️ capm3 🚀 ck8s 🎬 rolling-update 🛠️ ha 🐧 ubuntu

  • ☁️ capm3 🚀 rke2|okd 🎬 no-update 🐧 ubuntu|na

Global config for deployment pipelines

  • autorun pipelines
  • allow failure on pipelines
  • record sylvactl events

Notes:

  • Enabling autorun will make deployment pipelines run automatically without human interaction
  • Disabling allow failure will make deployment pipelines mandatory for pipeline success
  • If both autorun and allow failure are disabled, deployment pipelines will need manual triggering but will block the pipeline
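
For illustration only, here is a hypothetical sketch of how such toggles typically map onto GitLab CI keywords for a deployment trigger job; this is not the project's actual pipeline definition, and the job name and include path are made up:

```yaml
# Hypothetical deployment trigger job, only to illustrate the toggles above
deploy-capd-kadm-ubuntu:
  stage: deploy
  trigger:
    include: .gitlab/ci/deploy-pipeline.yml   # made-up path
    strategy: depend                          # reflect the child pipeline status in this job
  # "autorun pipelines" disabled -> when: manual (someone has to start the job)
  # "autorun pipelines" enabled  -> when: on_success (starts automatically)
  when: manual
  # "allow failure" enabled  -> allow_failure: true (MR pipeline can pass even if this fails)
  # "allow failure" disabled -> allow_failure: false (this job becomes blocking)
  allow_failure: true
```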

Be aware: after a configuration change, the pipeline is not triggered automatically. Please run it manually (by clicking the "Run pipeline" button in the Pipelines tab) or push new code.
