Add minio-monitoring-tenant unit healthCheck for StatefulSet/monitoring-pool-monitoring
What does this MR do and why?
While debugging the Nightly - CAPM3 HA Ubuntu pipeline of 2024-09-03T20:30 (capm3-ha-kubeadm-virt-ubuntu variant), we noticed the update-management-cluster job failing with:
2024/09/03 22:34:51.192082 Kustomization/minio-monitoring-tenant state changed: HealthCheckFailed - health check failed after 29.456577ms: failed early due to stalled resources: [StatefulSet/minio-monitoring-tenant/monitoring-pool-0 status: 'Failed']
2024/09/03 22:34:55.289621 Command timeout exceeded
Timed-out waiting for the following resources to be ready:
IDENTIFIER STATUS REASON MESSAGE
Kustomization/sylva-system/minio-monitoring-tenant InProgress Kustomization generation is 3, but latest observed generation is 2
╰┄╴HelmRelease/sylva-system/minio-monitoring-tenant Ready Resource is Ready
╰┄╴Tenant/minio-monitoring-tenant/monitoring Ready Resource is current
├┄╴StatefulSet/minio-monitoring-tenant/monitoring-pool-0 InProgress Ready: 1/2
┆ ╰┄╴Pod/minio-monitoring-tenant/monitoring-pool-0-0 Failed Pod could not be scheduled
┆ ├┄╴┬┄┄[Conditions]
┆ ┆ ╰┄╴PodScheduled False Unschedulable 0/4 nodes are available: 4 node(s) didn't match pod anti-affinity rules. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod..
┆ ╰┄╴┬┄┄[Events]
┆ ├┄╴2024-09-03 21:52:04 Warning FailedScheduling 0/4 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 3 node(s) didn't match pod anti-affinity rules. preemption: 0/4 nodes are available: 1 Preemption is not helpful for scheduling, 3 No preemption victims found for incoming pod..
┆ ├┄╴2024-09-03 22:05:04 (x24 over 18m33s) Warning FailedScheduling 0/4 nodes are available: 1 node(s) were unschedulable, 3 node(s) didn't match pod anti-affinity rules. preemption: 0/4 nodes are available: 1 Preemption is not helpful for scheduling, 3 No preemption victims found for incoming pod..
┆ ├┄╴2024-09-03 22:07:30 (x4 over 13m40s) Warning FailedScheduling 0/3 nodes are available: 3 node(s) didn't match pod anti-affinity rules. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod..
┆ ╰┄╴2024-09-03 22:30:03 (x9 over 37m48s) Warning FailedScheduling 0/4 nodes are available: 4 node(s) didn't match pod anti-affinity rules. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod..
╰┄╴StatefulSet/minio-monitoring-tenant/monitoring-pool-monitoring InProgress Ready: 3/4
╰┄╴Pod/minio-monitoring-tenant/monitoring-pool-monitoring-2 Failed Pod could not be scheduled
├┄╴┬┄┄[Conditions]
┆ ╰┄╴PodScheduled False Unschedulable 0/4 nodes are available: 4 node(s) didn't match pod anti-affinity rules. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod..
╰┄╴┬┄┄[Events]
├┄╴2024-09-03 21:38:52 Warning FailedScheduling 0/4 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 2 node(s) were unschedulable. preemption: 0/4 nodes are available: 2 No preemption victims found for incoming pod, 2 Preemption is not helpful for scheduling..
├┄╴2024-09-03 21:38:57 (x2 over 2.806753s) Warning FailedScheduling 0/4 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 2 node(s) were unschedulable. preemption: 0/4 nodes are available: 2 No preemption victims found for incoming pod, 2 Preemption is not helpful for scheduling..
├┄╴2024-09-03 21:42:18 (x2 over 10.63034s) Warning FailedScheduling 0/2 nodes are available: 2 node(s) didn't match pod anti-affinity rules. preemption: 0/2 nodes are available: 2 No preemption victims found for incoming pod..
├┄╴2024-09-03 21:45:13 Warning FailedScheduling 0/3 nodes are available: 3 node(s) didn't match pod anti-affinity rules. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod..
├┄╴2024-09-03 21:45:37 Warning FailedScheduling 0/4 nodes are available: 1 node(s) had untolerated taint {node.cluster.x-k8s.io/uninitialized: }, 1 node(s) were unschedulable, 2 node(s) didn't match pod anti-affinity rules. preemption: 0/4 nodes are available: 2 No preemption victims found for incoming pod, 2 Preemption is not helpful for scheduling..
├┄╴2024-09-03 22:00:47 (x23 over 14m58s) Warning FailedScheduling 0/4 nodes are available: 1 node(s) were unschedulable, 3 node(s) didn't match pod anti-affinity rules. preemption: 0/4 nodes are available: 1 Preemption is not helpful for scheduling, 3 No preemption victims found for incoming pod..
├┄╴2024-09-03 22:07:30 (x4 over 13m41s) Warning FailedScheduling 0/3 nodes are available: 3 node(s) didn't match pod anti-affinity rules. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod..
├┄╴2024-09-03 22:10:50 (x2 over 18m46s) Warning FailedScheduling 0/4 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 3 node(s) didn't match pod anti-affinity rules. preemption: 0/4 nodes are available: 1 Preemption is not helpful for scheduling, 3 No preemption victims found for incoming pod..
╰┄╴2024-09-03 22:30:03 (x9 over 37m48s) Warning FailedScheduling 0/4 nodes are available: 4 node(s) didn't match pod anti-affinity rules. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod..
This underlying problem is already tracked in #1584 (closed), which reports that none of the 6 pods deployed through Kustomization/minio-monitoring-tenant can be scheduled on the same node as another such pod.
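For context, below is a minimal sketch of the kind of hard pod anti-affinity constraint at play; the label selector is illustrative and not necessarily the exact one set on the MinIO tenant pods:

```yaml
# Illustrative only: a required podAntiAffinity of this shape prevents two
# tenant pods from being scheduled on the same node, so 6 such pods can never
# all fit on a 4-node cluster. The label key/value are assumptions.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            v1.min.io/tenant: monitoring   # hypothetical label, for illustration
        topologyKey: kubernetes.io/hostname
```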
But it was strange that only the update-management-cluster stage was affected by this issue.
As it turned out, the fact that Kustomization/minio-monitoring-tenant was ready in the deploy-management-cluster stage was pure luck: for this unit the Kustomization health checks only cover StatefulSet/monitoring-pool-0 (see https://gitlab.com/sylva-projects/sylva-core/-/blob/b4682033cd2fbe12cb0003b81c11453aafec507f/charts/sylva-units/values.yaml#L5414-5418), but nothing covers the new pool introduced in !2748 (merged), so the pods of StatefulSet/monitoring-pool-monitoring are not waited for before the unit is declared ready.
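For reference, the relevant part of the unit definition currently looks roughly like this (paraphrased from the linked values.yaml lines; the exact surrounding structure of the sylva-units chart is assumed, the health check fields follow the Flux Kustomization API):

```yaml
# charts/sylva-units/values.yaml (approximate excerpt)
minio-monitoring-tenant:
  # ...
  kustomization_spec:
    healthChecks:
      - apiVersion: apps/v1
        kind: StatefulSet
        name: monitoring-pool-0
        namespace: minio-monitoring-tenant
    # no entry for StatefulSet/monitoring-pool-monitoring, so Flux never
    # waits for that pool's pods
```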
And luck played a role because:
- in deploy-management-cluster we see
# management-cluster-dump/minio-monitoring-tenant/events.log
2024-09-03T21:05:07.945133Z 2024-09-03T21:05:48.411562Z default-scheduler-default-scheduler-mgmt-1438613365-kubeadm-capm3-virt-management-cp-1 Pod monitoring-pool-monitoring-1 2 FailedScheduling 0/4 nodes are available: 4 node(s) didn't match pod anti-affinity rules. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod..
2024-09-03T21:05:07.945596Z 2024-09-03T21:05:48.412079Z default-scheduler-default-scheduler-mgmt-1438613365-kubeadm-capm3-virt-management-cp-1 Pod monitoring-pool-monitoring-3 2 FailedScheduling 0/4 nodes are available: 4 node(s) didn't match pod anti-affinity rules. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod..
so the two pods left unscheduled by kube-scheduler both belonged to StatefulSet/monitoring-pool-monitoring, which has no health check, while
- in update-management-cluster we see
# management-cluster-dump/minio-monitoring-tenant/events.log
2024-09-03T21:52:15Z 2024-09-03T22:35:02Z default-scheduler- Pod monitoring-pool-0-0 11 FailedScheduling 0/4 nodes are available: 4 node(s) didn't match pod anti-affinity rules. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod..
2024-09-03T21:52:15Z 2024-09-03T22:35:02Z default-scheduler- Pod monitoring-pool-monitoring-2 11 FailedScheduling 0/4 nodes are available: 4 node(s) didn't match pod anti-affinity rules. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod..
here one of the unscheduled pods, Pod/monitoring-pool-0-0, belongs to StatefulSet/monitoring-pool-0, which is covered by the health check, so the check fails.
This MR adds a health check for StatefulSet/monitoring-pool-monitoring.
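Concretely, the change amounts to adding a second entry to the unit's health checks, roughly as follows (exact placement and indentation in charts/sylva-units/values.yaml may differ):

```yaml
# charts/sylva-units/values.yaml (sketch of the addition)
healthChecks:
  - apiVersion: apps/v1
    kind: StatefulSet
    name: monitoring-pool-0
    namespace: minio-monitoring-tenant
  - apiVersion: apps/v1            # new entry: also wait for the second pool
    kind: StatefulSet
    name: monitoring-pool-monitoring
    namespace: minio-monitoring-tenant
```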
CC: @marc.bailly1