Improve how the cluster-machines-ready Job times out

Today the cluster-machines-ready Job has fairly basic retry parameters. They can be tuned somewhat with the cluster_machines_ready.wait_timeout environment value, but this may not be good enough.

This is a pain because, on timeout, the Job has to be deleted manually (ideally followed by a flux reconcile).
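For reference, the manual recovery described above looks roughly like the sketch below. The namespace and the Flux resource name are placeholders, and it assumes the Job is managed by a Flux Kustomization (if it comes from a HelmRelease, the reconcile command differs). By default the script only prints the commands; run it with `RUN=` to execute them.

```shell
#!/bin/sh
# Sketch of the manual workaround after a cluster-machines-ready timeout.
# ASSUMPTIONS: JOB_NAMESPACE and FLUX_KUSTOMIZATION are hypothetical
# placeholders; replace them with the values used in your deployment.
JOB_NAMESPACE="${JOB_NAMESPACE:-my-namespace}"
FLUX_KUSTOMIZATION="${FLUX_KUSTOMIZATION:-my-kustomization}"

# Dry run by default: commands are echoed instead of executed.
# Set RUN to an empty string (RUN= ./recover.sh) to really run them.
RUN="${RUN-echo}"

recover_job() {
  # 1. Delete the timed-out Job so that it can be recreated.
  $RUN kubectl delete job cluster-machines-ready -n "$JOB_NAMESPACE" --ignore-not-found
  # 2. Ask Flux to reconcile right away instead of waiting for the next interval.
  $RUN flux reconcile kustomization "$FLUX_KUSTOMIZATION" -n "$JOB_NAMESPACE"
}

recover_job
```

Any improvement to the timeout handling should aim to make these two steps unnecessary.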

We should discuss and see what we can do: