Need to improve how the cluster-machines-ready Job times out

Today the cluster-machines-ready Job has fairly basic retry parameters. They can be tuned a bit with the cluster_machines_ready.wait_timeout environment value, but this possibly isn't good enough.

This is a pain because once the Job times out, it has to be deleted manually (ideally followed by a flux reconcile).

We should discuss and see what we can do:

  • ensure that activeDeadlineSeconds is aligned with cluster_machines_ready.wait_timeout?
  • have a rough adaptation of the default for cluster_machines_ready.wait_timeout based on the number of nodes?
  • increase activeDeadlineSeconds?
  • increase backoffLimit?
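
To make the first two options concrete, here is a rough sketch of how the values could be derived. Everything in it is an assumption for discussion: the function names, the base/per-node/cap constants, and the idea that cluster_machines_ready.wait_timeout is expressed in seconds and applies to each attempt. The only Kubernetes behaviour it relies on is that activeDeadlineSeconds bounds the whole Job (all retries included) and takes precedence over backoffLimit.

```python
def suggested_wait_timeout(node_count: int,
                           base_seconds: int = 600,
                           per_node_seconds: int = 120,
                           max_seconds: int = 7200) -> int:
    """Hypothetical default for wait_timeout, scaled by node count.

    The constants are placeholders, not measured values.
    """
    if node_count < 0:
        raise ValueError("node_count must be non-negative")
    # Linear growth with the number of nodes, capped at max_seconds.
    return min(base_seconds + per_node_seconds * node_count, max_seconds)


def aligned_active_deadline(wait_timeout_seconds: int,
                            backoff_limit: int,
                            margin_seconds: int = 60) -> int:
    """activeDeadlineSeconds large enough for every attempt to run fully.

    A Job makes up to backoff_limit + 1 attempts. The delay between
    retries is not modelled here; it is only covered by margin_seconds.
    """
    attempts = backoff_limit + 1
    return wait_timeout_seconds * attempts + margin_seconds


if __name__ == "__main__":
    timeout = suggested_wait_timeout(node_count=3)
    deadline = aligned_active_deadline(timeout, backoff_limit=2)
    print(f"wait_timeout={timeout}s activeDeadlineSeconds={deadline}s")
```

If activeDeadlineSeconds ends up smaller than a single wait_timeout, the Job is terminated before the first attempt can finish, which is the misalignment the first bullet is about.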

/cc @ader1990

Edited Mar 18, 2025 by Thomas Morin