rancher webhook deployment breaks sylva-units pivot
I observed the following deployment CI job failure:
(https://gitlab.com/sylva-projects/sylva-core/-/jobs/6080883635)
Command timeout exceeded, waiting the following resources to progress:
HelmRelease/sylva-system/sylva-units UpgradeFailed - Helm upgrade failed for release sylva-system/sylva-units with chart sylva-units@0.0.0-git+84688764d845.1: failed to create resource: Internal error occurred: failed calling webhook "rancher.cattle.io.secrets": failed to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation/secrets?timeout=15s": no endpoints available for service "rancher-webhook"
- lastTransitionTime: "2024-02-02T09:58:43Z"
  message: Failed to upgrade after 1 attempt(s)
  observedGeneration: 2
  reason: RetriesExceeded
  status: "True"
  type: Stalled
- lastTransitionTime: "2024-02-02T09:58:42Z"
  message: 'Helm upgrade failed for release sylva-system/sylva-units with chart
    sylva-units@0.0.0-git+84688764d845.1: failed to create resource: Internal
    error occurred: failed calling webhook "rancher.cattle.io.secrets": failed
    to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation/secrets?timeout=15s":
    no endpoints available for service "rancher-webhook"'
  observedGeneration: 2
  reason: UpgradeFailed
  status: "False"
What I understand is that:
- the sylva-units HelmRelease is initially installed fine on the mgmt cluster from the bootstrap phase
- the Rancher unit is installed, and among other things:
  - it declares an admission webhook backed by rancher-webhook
  - it defines a rancher-webhook Service
  - it starts the rancher-webhook Deployment
- in parallel with that, the bootstrap units update the sylva-units HelmRelease to enable the cluster unit (right after pivot); the failure we see is this HelmRelease update failing
- it fails because:
  - on the first attempt, the rancher webhook is already declared, but no rancher-webhook pod is ready yet, so the Service has no endpoints
  - there is no second attempt
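
The "no second attempt" part is consistent with the `Failed to upgrade after 1 attempt(s)` / `RetriesExceeded` status above, since Flux defaults to zero upgrade retries. As a sketch only, upgrade remediation might let a later attempt succeed once a rancher-webhook pod is ready. It assumes the HelmRelease uses the `helm.toolkit.fluxcd.io/v2beta2` API and that retrying is acceptable for sylva-units. Neither is confirmed here, and the retry count is illustrative.

```yaml
# Hypothetical fragment, not the actual sylva-units HelmRelease.
# Required fields such as spec.chart and spec.interval are omitted.
apiVersion: helm.toolkit.fluxcd.io/v2beta2   # assumed API version
kind: HelmRelease
metadata:
  name: sylva-units
  namespace: sylva-system
spec:
  upgrade:
    remediation:
      # Number of retries after a failed upgrade (default: 0).
      # The value 3 is an arbitrary example.
      retries: 3
```

Retrying would only narrow the timing window. It would not remove the underlying race between the webhook declaration and the rancher-webhook pod becoming ready.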
Some comments:
- this issue may or may not occur depending on the exact timing of things
- this issue isn't specific to Rancher: it may happen with any other webhook that is active on a lot of resources
- this issue is not specific to the bootstrap phase: @loic.nicolle brought up another failure (https://gitlab.com/sylva-projects/sylva-core/-/jobs/6078723543#L921), and I suspect that it is the same issue
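
The point about other webhooks can be illustrated with a generic webhook declaration. This sketch is reconstructed from the error message above (webhook name, Service, path, timeout) and is not copied from Rancher's manifests. The `metadata.name`, `operations`, `sideEffects` and `failurePolicy` values are assumptions. `failurePolicy: Fail` is inferred from the API server rejecting the request instead of ignoring the unreachable webhook.

```yaml
# Illustrative only; clientConfig.caBundle is omitted.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: example-webhook              # hypothetical name
webhooks:
  - name: rancher.cattle.io.secrets
    clientConfig:
      service:
        namespace: cattle-system
        name: rancher-webhook
        path: /v1/webhook/mutation/secrets
        port: 443
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]   # assumed
        resources: ["secrets"]
    # With Fail, a matching request is rejected while the Service
    # has no ready endpoints (assumed from the observed error).
    failurePolicy: Fail
    timeoutSeconds: 15
    sideEffects: None                # assumed
    admissionReviewVersions: ["v1"]
```

Any webhook declared this way on a commonly created resource type would block Helm installs or upgrades that create such resources until its backing pod is ready.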
/cc @feleouet