Need to improve how the cluster-machines-ready Job times out
Today the cluster-machines-ready Job has fairly basic retry parameters. They can be tuned a bit with the cluster_machines_ready.wait_timeout environment value, but this possibly isn't good enough.
This is a pain because on timeout, a manual Job delete is necessary (ideally followed by a flux reconcile).
We should discuss and see what we can do:
- ensure that activeDeadlineSeconds is aligned with cluster_machines_ready.wait_timeout?
- have a rough adaptation of the default for cluster_machines_ready.wait_timeout based on the number of nodes?
- increase activeDeadlineSeconds?
- increase backoffLimit?
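
To make the first two options concrete, here is a rough sketch of how the values could be derived together. Everything in it is an assumption for discussion: the function names, the base and per-node numbers, the safety margin, and the idea that wait_timeout is expressed in seconds are all hypothetical and not taken from the current chart. The only Kubernetes behaviour it relies on is that activeDeadlineSeconds bounds the whole Job across all retries, and that a Job runs at most backoffLimit + 1 attempts.

```python
# Hypothetical sketch: derive Job timing values from the number of nodes.
# All constants below are placeholders, not values used by the project.

def suggested_wait_timeout(node_count, base_seconds=600, per_node_seconds=120):
    """Return a wait timeout (assumed to be in seconds) that grows with cluster size."""
    return base_seconds + per_node_seconds * max(node_count, 0)


def job_deadlines(node_count, backoff_limit=3, margin_seconds=60):
    """Return wait_timeout plus an activeDeadlineSeconds aligned with it."""
    wait_timeout = suggested_wait_timeout(node_count)
    # activeDeadlineSeconds covers every attempt of the Job, so it has to
    # leave room for (backoff_limit + 1) runs of wait_timeout plus a margin.
    attempts = backoff_limit + 1
    active_deadline = attempts * (wait_timeout + margin_seconds)
    return {
        "wait_timeout": wait_timeout,
        "activeDeadlineSeconds": active_deadline,
        "backoffLimit": backoff_limit,
    }


if __name__ == "__main__":
    for nodes in (1, 3, 10):
        print(nodes, job_deadlines(nodes))
```

With this kind of coupling, raising backoffLimit or wait_timeout alone would no longer let activeDeadlineSeconds cut the Job short before the last retry has had its full timeout.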
/cc @ader1990
Edited by Thomas Morin