rancher webhook deployment breaks sylva-units pivot

I observed the following deployment CI job failure:

(https://gitlab.com/sylva-projects/sylva-core/-/jobs/6080883635)

Command timeout exceeded, waiting the following resources to progress:
    HelmRelease/sylva-system/sylva-units               UpgradeFailed - Helm upgrade failed for release sylva-system/sylva-units with chart sylva-units@0.0.0-git+84688764d845.1: failed to create resource: Internal error occurred: failed calling webhook "rancher.cattle.io.secrets": failed to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation/secrets?timeout=15s": no endpoints available for service "rancher-webhook"
    - lastTransitionTime: "2024-02-02T09:58:43Z"
      message: Failed to upgrade after 1 attempt(s)
      observedGeneration: 2
      reason: RetriesExceeded
      status: "True"
      type: Stalled
    - lastTransitionTime: "2024-02-02T09:58:42Z"
      message: 'Helm upgrade failed for release sylva-system/sylva-units with chart
        sylva-units@0.0.0-git+84688764d845.1: failed to create resource: Internal
        error occurred: failed calling webhook "rancher.cattle.io.secrets": failed
        to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation/secrets?timeout=15s":
        no endpoints available for service "rancher-webhook"'
      observedGeneration: 2
      reason: UpgradeFailed
      status: "False"

What I understand is that:

  • the sylva-units HelmRelease is initially installed fine on the mgmt cluster from the bootstrap phase
  • the Rancher unit is installed, and among other things:
    • it declares a webhook (rancher.cattle.io.secrets) served by rancher-webhook
    • it defines a rancher-webhook Service
    • it starts the rancher-webhook Deployment
  • in parallel with that, the bootstrap units update the sylva-units HelmRelease to enable the cluster unit (right after pivot)
  • the failure we see is this update of the HelmRelease failing
  • it fails because:
    • on the first attempt, the rancher webhook is already declared, but no rancher-webhook pod is ready yet, so the API server finds no endpoint for the rancher-webhook Service
    • (and there is no second attempt)
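
To illustrate the "no second attempt" point: the `Failed to upgrade after 1 attempt(s)` / `RetriesExceeded` status suggests that upgrade retries are at the Flux default of 0 for this HelmRelease. The fragment below is only a sketch of where that setting lives, not the actual sylva-core manifest. The `apiVersion` depends on the installed Flux version, and the `retries` value is an arbitrary illustration.

```yaml
# Partial HelmRelease fragment, for illustration only (chart/source fields omitted).
apiVersion: helm.toolkit.fluxcd.io/v2beta2   # assumption: adjust to the Flux version in use
kind: HelmRelease
metadata:
  name: sylva-units
  namespace: sylva-system
spec:
  upgrade:
    remediation:
      # Number of retries after a failed upgrade before the release is marked Stalled.
      # Flux defaults to 0; the value 5 here is hypothetical.
      retries: 5
```

With a non-zero value, an upgrade that fails while the webhook has no ready endpoint would be retried later, by which time the rancher-webhook pod may have become ready. Whether this is the right fix for sylva-units is left open here.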

Some comments:

  • this issue may or may not occur depending on the exact timing of things
  • this issue isn't specific to rancher (it may happen with any other webhook that is active on a lot of resources)
  • this issue is not specific to the bootstrap phase: @loic.nicolle brought up another failure, https://gitlab.com/sylva-projects/sylva-core/-/jobs/6078723543#L921, and I suspect that it is the same issue

/cc @feleouet
