keycloak-postgres unit timing out

I have seen the following timeout on keycloak-postgres unit, on many CI jobs:

2025/11/03 16:29:44.959427 Unit timeout exceeded: unit Kustomization/keycloak-postgresql did not became ready after 5m0s
Timed-out waiting for the following resources to be ready:
IDENTIFIER                                      STATUS     REASON                 MESSAGE
Kustomization/sylva-system/keycloak-postgresql  InProgress                        Kustomization generation is 1, but latest observed generation is -1
╰┄╴┬┄┄[Conditions]
   ├┄╴Reconciling                               True       ProgressingWithRetry   Running health checks for revision 0.0.0-git-e6894956@sha256:2316ceff21c20ce4933caede35ca5555e0db4233e9528b97df0a52eff4c02176 with a timeout of 30s
   ├┄╴Ready                                     False      HealthCheckFailed      health check failed after 22.475049ms: failed early due to stalled resources: [Cluster/keycloak/keycloak-postgresql status: 'Failed']
   ╰┄╴Healthy                                   False      HealthCheckFailed      failed early due to stalled resources: [Cluster/keycloak/keycloak-postgresql status: 'Failed']

After analyzing the failure in https://gitlab.com/sylva-projects/sylva-core/-/jobs/11953178294, I gathered the following timeline:

  • 16:24:44 unit starts to reconcile
  • (the 3 pods start one after the other)
  • 16:29:19 -- last failed check done by flux (there is one every minute)
$ k -n flux-system logs kustomize-controller-7fc95f8cf7-mxb5j manager |grep 'keycloak-postgres' | jq | tail -20
...
{
  "level": "error",
  "ts": "2025-11-03T16:29:19.578Z",
  "msg": "Reconciliation failed after 323.657306ms, next try in 1m0s",
...
  "error": "health check failed after 22.475049ms: failed early due to stalled resources: [Cluster/keycloak/keycloak-postgresql status: 'Failed']"
}
  • 16:29:20 -- the Cluster/keycloak/keycloak-postgresql becomes ok
kubectl get cluster.postgres -n keycloak -o yaml --show-managed-fields keycloak-postgresql | yq .status.conditions[1]
lastTransitionTime: "2025-11-03T16:29:20Z"
message: Cluster is Ready
reason: ClusterIsReady
status: "True"
type: Ready
  • 16:29:44 -- sylvactl stops waiting (we're 5 minutes after the unit reconciliation start)

So what happens is simply that the Cluster.postgres resources becomes ready a bit too late compared to the per-unit timeout. The deployment would have passed if we had been luckier and there had been a flux periodic check between 16:29:20 and 16:29:44.

Since this is only a matter of a few second, and since 5 minutes really isn't very look, I think that our unit-timeout simply is a bit too short.

EDIT: when adjusting this timeout, I noticed that we had tuned the keycloak-postgres unit timeout already, but when refactoring this into the new keycloak-postgresql unit (new name), the sylvactl/unitTimeout we lost it

Edited Nov 05, 2025 by Thomas Morin
Assignee Loading
Time tracking Loading