effectively set the dependency on 'cluster-machines-ready' unit, for relevant units

One issue slipped into !2032 (merged): I used the inexisting Release.Upgrade value instead of Release.IsUpgrade making the "have all units wait on it during upgrades" part of the MR ineffective. Credit to @feleouet for noticing the issue.

This MR corrects that.

However, when implementing this MR, the initial idea "have all units wait on cluster-machines-ready during upgrades" had to be refined into "have the units that the 'cluster' unit depends on reconcile before 'cluster', and have the rest of the units reconcile after all CAPI things are ready".

This is explained in this comment. Having "all units depend on cluster-machines-ready is just impossible: for instance a unit like cert-manager, on which the cluster unit indirectly depends on (because capX units depend on cert-manager) can't depend on cluster-machines-ready without creating a circular dependency.

  • The first attempt at an implementation of this MR consisted in introducing a named template that let us determine if a unit is among the units that the cluster unit depends on, directly or indirectly. This was a recursive implementation, and getting it to work wasn't entirely trivial. This was implemented with the great help of @feleouet who patiently listened, challenged implementation approaches and provided a simple but powerful simplification (implemented separately in !2055 (merged)). The major drawback of this approach was that it was heavily recursive and very costly (for all unit inheriting from base-deps we were calling a has-dependency-on named template, which was recursive and hence calling itself, and was also calling unit-def-from-templates which was also interpreting the unit and also calling has-dependency-on). With this implemention, the runtime of helm template was multiplied by 3 (9.8s vs 3.3s on my laptop), and worse the Flux Helm controller was hitting CPU throttling and was in practice not able to compute the content of the manifests.

  • The second implementation (current content of this MR) is doing things differently:

    • the list of units that the cluster unit has a dependency on, directly or indirectly, is computed once in _internal.cluster_unit_deps
    • in the base-deps unit template, whether or not the current unit should depend on the cluster-machines-ready unit is determined using this pre-computed _internal.cluster_unit_deps list
    • for this approach to be possible, _internal.cluster_unit_deps cannot be computed in _internal: computing 'cluster_unit_deps' in _internal would result in fully interpreting 'unit_template.base-deps', resulting in interpreting 'depends_on.cluster-machines-ready' which itself relies on 'cluster_unit_deps', causing a variable interpretation loop
    • (with this approach helm template runs essentially as fast as before)

Contributes to #1157 (closed)

Edited by Thomas Morin

Merge request reports

Loading