keycloak-postgres unit timing out
I have seen the following timeout on keycloak-postgres unit, on many CI jobs:
2025/11/03 16:29:44.959427 Unit timeout exceeded: unit Kustomization/keycloak-postgresql did not became ready after 5m0s
Timed-out waiting for the following resources to be ready:
IDENTIFIER STATUS REASON MESSAGE
Kustomization/sylva-system/keycloak-postgresql InProgress Kustomization generation is 1, but latest observed generation is -1
╰┄╴┬┄┄[Conditions]
├┄╴Reconciling True ProgressingWithRetry Running health checks for revision 0.0.0-git-e6894956@sha256:2316ceff21c20ce4933caede35ca5555e0db4233e9528b97df0a52eff4c02176 with a timeout of 30s
├┄╴Ready False HealthCheckFailed health check failed after 22.475049ms: failed early due to stalled resources: [Cluster/keycloak/keycloak-postgresql status: 'Failed']
╰┄╴Healthy False HealthCheckFailed failed early due to stalled resources: [Cluster/keycloak/keycloak-postgresql status: 'Failed']
After analyzing the failure in https://gitlab.com/sylva-projects/sylva-core/-/jobs/11953178294, I gathered the following timeline:
- 16:24:44 unit starts to reconcile
- (the 3 pods start one after the other)
- 16:29:19 -- last failed check done by flux (there is one every minute)
$ k -n flux-system logs kustomize-controller-7fc95f8cf7-mxb5j manager |grep 'keycloak-postgres' | jq | tail -20
...
{
"level": "error",
"ts": "2025-11-03T16:29:19.578Z",
"msg": "Reconciliation failed after 323.657306ms, next try in 1m0s",
...
"error": "health check failed after 22.475049ms: failed early due to stalled resources: [Cluster/keycloak/keycloak-postgresql status: 'Failed']"
}
- 16:29:20 -- the Cluster/keycloak/keycloak-postgresql becomes ok
kubectl get cluster.postgres -n keycloak -o yaml --show-managed-fields keycloak-postgresql | yq .status.conditions[1]
lastTransitionTime: "2025-11-03T16:29:20Z"
message: Cluster is Ready
reason: ClusterIsReady
status: "True"
type: Ready
- 16:29:44 -- sylvactl stops waiting (we're 5 minutes after the unit reconciliation start)
So what happens is simply that the Cluster.postgres resources becomes ready a bit too late compared to the per-unit timeout. The deployment would have passed if we had been luckier and there had been a flux periodic check between 16:29:20 and 16:29:44.
Since this is only a matter of a few second, and since 5 minutes really isn't very look, I think that our unit-timeout simply is a bit too short.
EDIT: when adjusting this timeout, I noticed that we had tuned the keycloak-postgres unit timeout already, but when refactoring this into the new keycloak-postgresql unit (new name), the sylvactl/unitTimeout we lost it