Add minio-monitoring-tenant unit healthCheck for StatefulSet/monitoring-pool-monitoring
What does this MR do and why?
While debugging the Nightly - CAPM3 HA Ubuntu pipeline of 2024-09-03T20:30 (capm3-ha-kubeadm-virt-ubuntu variant), we noticed the update-management-cluster job failing with:
2024/09/03 22:34:51.192082 Kustomization/minio-monitoring-tenant state changed: HealthCheckFailed - health check failed after 29.456577ms: failed early due to stalled resources: [StatefulSet/minio-monitoring-tenant/monitoring-pool-0 status: 'Failed']
2024/09/03 22:34:55.289621 Command timeout exceeded
Timed-out waiting for the following resources to be ready:
IDENTIFIER STATUS REASON MESSAGE
Kustomization/sylva-system/minio-monitoring-tenant InProgress Kustomization generation is 3, but latest observed generation is 2
╰┄╴HelmRelease/sylva-system/minio-monitoring-tenant Ready Resource is Ready
╰┄╴Tenant/minio-monitoring-tenant/monitoring Ready Resource is current
├┄╴StatefulSet/minio-monitoring-tenant/monitoring-pool-0 InProgress Ready: 1/2
┆ ╰┄╴Pod/minio-monitoring-tenant/monitoring-pool-0-0 Failed Pod could not be scheduled
┆ ├┄╴┬┄┄[Conditions]
┆ ┆ ╰┄╴PodScheduled False Unschedulable 0/4 nodes are available: 4 node(s) didn't match pod anti-affinity rules. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod..
┆ ╰┄╴┬┄┄[Events]
┆ ├┄╴2024-09-03 21:52:04 Warning FailedScheduling 0/4 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 3 node(s) didn't match pod anti-affinity rules. preemption: 0/4 nodes are available: 1 Preemption is not helpful for scheduling, 3 No preemption victims found for incoming pod..
┆ ├┄╴2024-09-03 22:05:04 (x24 over 18m33s) Warning FailedScheduling 0/4 nodes are available: 1 node(s) were unschedulable, 3 node(s) didn't match pod anti-affinity rules. preemption: 0/4 nodes are available: 1 Preemption is not helpful for scheduling, 3 No preemption victims found for incoming pod..
┆ ├┄╴2024-09-03 22:07:30 (x4 over 13m40s) Warning FailedScheduling 0/3 nodes are available: 3 node(s) didn't match pod anti-affinity rules. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod..
┆ ╰┄╴2024-09-03 22:30:03 (x9 over 37m48s) Warning FailedScheduling 0/4 nodes are available: 4 node(s) didn't match pod anti-affinity rules. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod..
╰┄╴StatefulSet/minio-monitoring-tenant/monitoring-pool-monitoring InProgress Ready: 3/4
╰┄╴Pod/minio-monitoring-tenant/monitoring-pool-monitoring-2 Failed Pod could not be scheduled
├┄╴┬┄┄[Conditions]
┆ ╰┄╴PodScheduled False Unschedulable 0/4 nodes are available: 4 node(s) didn't match pod anti-affinity rules. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod..
╰┄╴┬┄┄[Events]
├┄╴2024-09-03 21:38:52 Warning FailedScheduling 0/4 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 2 node(s) were unschedulable. preemption: 0/4 nodes are available: 2 No preemption victims found for incoming pod, 2 Preemption is not helpful for scheduling..
├┄╴2024-09-03 21:38:57 (x2 over 2.806753s) Warning FailedScheduling 0/4 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 2 node(s) were unschedulable. preemption: 0/4 nodes are available: 2 No preemption victims found for incoming pod, 2 Preemption is not helpful for scheduling..
├┄╴2024-09-03 21:42:18 (x2 over 10.63034s) Warning FailedScheduling 0/2 nodes are available: 2 node(s) didn't match pod anti-affinity rules. preemption: 0/2 nodes are available: 2 No preemption victims found for incoming pod..
├┄╴2024-09-03 21:45:13 Warning FailedScheduling 0/3 nodes are available: 3 node(s) didn't match pod anti-affinity rules. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod..
├┄╴2024-09-03 21:45:37 Warning FailedScheduling 0/4 nodes are available: 1 node(s) had untolerated taint {node.cluster.x-k8s.io/uninitialized: }, 1 node(s) were unschedulable, 2 node(s) didn't match pod anti-affinity rules. preemption: 0/4 nodes are available: 2 No preemption victims found for incoming pod, 2 Preemption is not helpful for scheduling..
├┄╴2024-09-03 22:00:47 (x23 over 14m58s) Warning FailedScheduling 0/4 nodes are available: 1 node(s) were unschedulable, 3 node(s) didn't match pod anti-affinity rules. preemption: 0/4 nodes are available: 1 Preemption is not helpful for scheduling, 3 No preemption victims found for incoming pod..
├┄╴2024-09-03 22:07:30 (x4 over 13m41s) Warning FailedScheduling 0/3 nodes are available: 3 node(s) didn't match pod anti-affinity rules. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod..
├┄╴2024-09-03 22:10:50 (x2 over 18m46s) Warning FailedScheduling 0/4 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 3 node(s) didn't match pod anti-affinity rules. preemption: 0/4 nodes are available: 1 Preemption is not helpful for scheduling, 3 No preemption victims found for incoming pod..
╰┄╴2024-09-03 22:30:03 (x9 over 37m48s) Warning FailedScheduling 0/4 nodes are available: 4 node(s) didn't match pod anti-affinity rules. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod..
This underlying problem is already tracked in #1584 (closed), which reports that none of the 6 pods deployed through Kustomization/minio-monitoring-tenant can be scheduled on the same node as another such pod.
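For context, below is a minimal sketch of the kind of hard pod anti-affinity constraint at play; the label selector is illustrative and not necessarily the exact one set on the MinIO tenant pods:

```yaml
# Illustrative only: a required podAntiAffinity of this shape prevents two
# tenant pods from being scheduled on the same node, so 6 such pods can never
# all fit on a 4-node cluster. The label key/value are assumptions.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            v1.min.io/tenant: monitoring   # hypothetical label, for illustration
        topologyKey: kubernetes.io/hostname
```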
But it was strange that only the update-management-cluster stage was affected by this issue.
As it turned out, the fact that Kustomization/minio-monitoring-tenant was ready in the deploy-management-cluster stage was pure luck: for this unit the Kustomization health checks only cover StatefulSet/monitoring-pool-0 (see https://gitlab.com/sylva-projects/sylva-core/-/blob/b4682033cd2fbe12cb0003b81c11453aafec507f/charts/sylva-units/values.yaml#L5414-5418), but nothing covers the new pool introduced in !2748 (merged), so the pods of StatefulSet/monitoring-pool-monitoring are not waited for before the unit is declared ready.
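For reference, the relevant part of the unit definition currently looks roughly like this (paraphrased from the linked values.yaml lines; the exact surrounding structure of the sylva-units chart is assumed, the health check fields follow the Flux Kustomization API):

```yaml
# charts/sylva-units/values.yaml (approximate excerpt)
minio-monitoring-tenant:
  # ...
  kustomization_spec:
    healthChecks:
      - apiVersion: apps/v1
        kind: StatefulSet
        name: monitoring-pool-0
        namespace: minio-monitoring-tenant
    # no entry for StatefulSet/monitoring-pool-monitoring, so Flux never
    # waits for that pool's pods
```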
And luck played a role because:
- in deploy-management-cluster we see
# management-cluster-dump/minio-monitoring-tenant/events.log
2024-09-03T21:05:07.945133Z 2024-09-03T21:05:48.411562Z default-scheduler-default-scheduler-mgmt-1438613365-kubeadm-capm3-virt-management-cp-1 Pod monitoring-pool-monitoring-1 2 FailedScheduling 0/4 nodes are available: 4 node(s) didn't match pod anti-affinity rules. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod..
2024-09-03T21:05:07.945596Z 2024-09-03T21:05:48.412079Z default-scheduler-default-scheduler-mgmt-1438613365-kubeadm-capm3-virt-management-cp-1 Pod monitoring-pool-monitoring-3 2 FailedScheduling 0/4 nodes are available: 4 node(s) didn't match pod anti-affinity rules. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod..
so the two pods left unscheduled by kube-scheduler both belonged to StatefulSet/monitoring-pool-monitoring, which has no health check, while
- in update-management-cluster we see
# management-cluster-dump/minio-monitoring-tenant/events.log
2024-09-03T21:52:15Z 2024-09-03T22:35:02Z default-scheduler- Pod monitoring-pool-0-0 11 FailedScheduling 0/4 nodes are available: 4 node(s) didn't match pod anti-affinity rules. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod..
2024-09-03T21:52:15Z 2024-09-03T22:35:02Z default-scheduler- Pod monitoring-pool-monitoring-2 11 FailedScheduling 0/4 nodes are available: 4 node(s) didn't match pod anti-affinity rules. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod..
here one of the unscheduled pods, Pod/monitoring-pool-0-0, belongs to StatefulSet/monitoring-pool-0, which is covered by the health check, so the check fails.
This MR adds a health check for StatefulSet/monitoring-pool-monitoring.
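Concretely, the change amounts to adding a second entry to the unit's health checks, roughly as follows (exact placement and indentation in charts/sylva-units/values.yaml may differ):

```yaml
# charts/sylva-units/values.yaml (sketch of the addition)
healthChecks:
  - apiVersion: apps/v1
    kind: StatefulSet
    name: monitoring-pool-0
    namespace: minio-monitoring-tenant
  - apiVersion: apps/v1            # new entry: also wait for the second pool
    kind: StatefulSet
    name: monitoring-pool-monitoring
    namespace: minio-monitoring-tenant
```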
CC: @marc.bailly1