fleet-agent CrashLoopBackOff after skip-version upgrade (1.4 to 1.6 / Rancher 2.10 to 2.12)

After a direct Sylva 1.4 to 1.6 skip-version upgrade (Rancher 2.10.3 to 2.12.3), the management cluster's local fleet-agent enters CrashLoopBackOff. Workload cluster fleet-agents are not affected.

Job artifacts: update-workload-cluster (#14231181608), Sylva-projects / sylva-core

Error: environment variable CATTLE_ELECTION_LEASE_DURATION not set

Fleet v0.13.x (shipped with Rancher 2.12.x) added a mandatory check in the fleet-agent-register init container for three election env vars:

  • CATTLE_ELECTION_LEASE_DURATION
  • CATTLE_ELECTION_RENEW_DEADLINE
  • CATTLE_ELECTION_RETRY_PERIOD
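
To confirm which of these variables a given fleet-agent is missing, the init container's env list can be inspected directly. The following is a minimal sketch, not a supported tool: the function name `missing_election_vars` is hypothetical, it assumes the manifest has been exported as JSON, and it only looks at plain `env` entries (variables injected through `envFrom` would not be detected).

```python
# The three variables checked by the fleet-agent-register init container (from this issue).
REQUIRED_ELECTION_VARS = (
    "CATTLE_ELECTION_LEASE_DURATION",
    "CATTLE_ELECTION_RENEW_DEADLINE",
    "CATTLE_ELECTION_RETRY_PERIOD",
)


def missing_election_vars(workload, container_name="fleet-agent-register"):
    """Return the required election vars not set on the named init container.

    `workload` is a parsed manifest with a pod template (spec.template.spec).
    Hypothetical helper: only plain `env` entries are considered.
    """
    pod_spec = workload.get("spec", {}).get("template", {}).get("spec", {})
    for container in pod_spec.get("initContainers", []):
        if container.get("name") == container_name:
            present = {entry.get("name") for entry in container.get("env", [])}
            return [name for name in REQUIRED_ELECTION_VARS if name not in present]
    raise LookupError(f"init container {container_name!r} not found")


# Illustrative stand-in for a stale manifest: the init container has no env at all.
stale = {
    "spec": {
        "template": {
            "spec": {"initContainers": [{"name": "fleet-agent-register"}]}
        }
    }
}
print(missing_election_vars(stale))
```

In practice the input would come from something like `kubectl get deployment fleet-agent -n <namespace> -o json`, parsed with `json.load`. The namespace is left as a placeholder because it is not stated in this issue; for the local cluster it is presumably `cattle-fleet-local-system`, but that is an assumption.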

The management cluster's fleet-agent bootstrap manifest is stored by Rancher. When upgrading directly from Rancher 2.10.3 to 2.12.3, Rancher does not regenerate this manifest, so the fleet-agent-register init container still comes from the stale Rancher 2.10.3 manifest, which sets none of the election vars. The v0.13.x check therefore fails and the local fleet-agent crash-loops.

Workload cluster fleet-agents are managed by fleet-controller, which regenerates the Deployment spec from scratch on upgrade. The new v0.13.x Deployment spec includes all three election vars. The job logs (update-workload-cluster, #14231181608, Sylva-projects / sylva-core) confirm that workload cluster fleet-agents are not impacted by a Rancher upgrade from minor version n to n+2.

cc: @tmmorin @marc.bailly1