Add custom metrics for cluster monitoring

What does this MR do and why?

To raise proper alerts and take the required action when a client cluster's Prometheus goes down, we first need specific metrics that expose the cluster state and help analyze possible failures. This MR introduces custom resource metrics for the cluster resource, both at the k8s level and at the Rancher level (to check whether the cluster was successfully enrolled).

Below is what these metrics look like:

sylva_cluster_info{customresource_group="cluster.x-k8s.io", customresource_kind="Cluster", customresource_version="v1beta1", job="kube-state-metrics", namespace="cattle-monitoring-system", platform_tag="Sylva", pod="rancher-monitoring-kube-state-metrics-67fdc6978b-fmjjd", receive="true", service="rancher-monitoring-kube-state-metrics", status="Pending", sylva_cluster_name="no-prom-test", sylva_cluster_namespace="no-prom-test", tenant_id="default-tenant"} <= wrong cluster created for testing purposes
sylva_cluster_info{customresource_group="cluster.x-k8s.io", customresource_kind="Cluster", customresource_version="v1beta1", job="kube-state-metrics", namespace="cattle-monitoring-system", platform_tag="Sylva", pod="rancher-monitoring-kube-state-metrics-67fdc6978b-fmjjd", receive="true", service="rancher-monitoring-kube-state-metrics", status="Provisioned", sylva_cluster_name="test-cluster", sylva_cluster_namespace="test-cluster", tenant_id="default-tenant"}
sylva_cluster_info{customresource_group="cluster.x-k8s.io", customresource_kind="Cluster", customresource_version="v1beta1",  namespace="cattle-monitoring-system", platform_tag="Sylva", pod="rancher-monitoring-kube-state-metrics-67fdc6978b-fmjjd", receive="true", service="rancher-monitoring-kube-state-metrics", status="Provisioned", sylva_cluster_name="test-mgmt", sylva_cluster_namespace="sylva-system", tenant_id="default-tenant"}
rancher_cluster_info{customresource_group="provisioning.cattle.io", customresource_kind="Cluster", customresource_version="v1", endpoint="http", instance="100.72.61.178:8080", job="kube-state-metrics", namespace="cattle-monitoring-system", platform_tag="Sylva", pod="rancher-monitoring-kube-state-metrics-67fdc6978b-fmjjd", rancher_cluster_name="local", ready="true", receive="true", service="rancher-monitoring-kube-state-metrics", tenant_id="default-tenant"}
rancher_cluster_info{customresource_group="provisioning.cattle.io", customresource_kind="Cluster", customresource_version="v1", endpoint="http", instance="100.72.61.178:8080", job="kube-state-metrics", namespace="cattle-monitoring-system", platform_tag="Sylva", pod="rancher-monitoring-kube-state-metrics-67fdc6978b-fmjjd", rancher_cluster_name="test-cluster-capi", ready="true", receive="true", service="rancher-monitoring-kube-state-metrics", tenant_id="default-tenant"}
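As an illustration of how the `status` label on the samples above could be consumed, here is a minimal Python sketch that parses sample lines in this format and lists the clusters that are not provisioned. It is not part of this MR. The helper names (`parse_sample`, `unhealthy_clusters`) are hypothetical, the label sets are shortened for readability, and treating `Provisioned` as the only healthy status is an assumption based on the samples shown here. The real alerting logic lives in the rules, not in a script like this.

```python
import re

# Matches label pairs such as status="Pending"
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line):
    """Split 'name{k="v", ...}' into (metric_name, labels dict)."""
    name, _, rest = line.partition("{")
    # Keep only the text before the last closing brace, so trailing
    # annotations after the label set are ignored.
    labels = dict(LABEL_RE.findall(rest.rpartition("}")[0]))
    return name.strip(), labels


def unhealthy_clusters(lines):
    """Return (namespace, name) for sylva_cluster_info series whose
    status is not 'Provisioned' (assumed to be the healthy state)."""
    result = []
    for line in lines:
        name, labels = parse_sample(line)
        if name == "sylva_cluster_info" and labels.get("status") != "Provisioned":
            result.append(
                (labels.get("sylva_cluster_namespace"), labels.get("sylva_cluster_name"))
            )
    return result


# Shortened versions of the samples above (most labels omitted).
samples = [
    'sylva_cluster_info{status="Pending", sylva_cluster_name="no-prom-test", sylva_cluster_namespace="no-prom-test"}',
    'sylva_cluster_info{status="Provisioned", sylva_cluster_name="test-cluster", sylva_cluster_namespace="test-cluster"}',
    'rancher_cluster_info{rancher_cluster_name="local", ready="true"}',
]

print(unhealthy_clusters(samples))
```

With these inputs only the `no-prom-test` cluster is reported, which matches the intentionally broken test cluster in the first sample.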

For alerts, a separate MR has been raised (sylva-projects/sylva-elements/helm-charts/sylva-thanos-rules!52 (merged)), which handles the rules/alerts configuration.

Besides these new metrics, I noticed some inconsistencies in the Flux dashboards (they were not able to show data) caused by invalid metrics. To fix that, I updated the APIs to the latest version in the Flux-related custom state metrics and updated the pod monitor so that it is deployed in the flux-system namespace.

cc: @alinhg

Related reference(s)

Closes #1428 (closed)

Test coverage

CI configuration

CI pipelines perform an update for both management and workload clusters. By default, this update will NOT perform a ClusterAPI rolling update (deletion and creation of new K8s nodes).

In some cases, it may be relevant to perform more complex tests.

These features can be activated in an MR by adding one of the following labels to the MR; they will apply to the next pipelines.

  • with the label ci-featuretest-rolling-update, pipelines will perform a node rolling update in the -update jobs (without version upgrades)
  • with the label ci-featuretest-upgrade-from-1.1.1, pipelines will perform an upgrade from Sylva 1.1.1 to your dev branch (including a k8s version upgrade resulting in a node rolling update)
Edited by Bogdan Antohe
