Update dependency https://gitlab.com/sylva-projects/sylva-elements/helm-charts/sylva-thanos-rules.git to v0.2.0 (main) (!4627)

This MR contains the following updates:

Package	Update	Change
https://gitlab.com/sylva-projects/sylva-elements/helm-charts/sylva-thanos-rules.git	minor	`0.1.4` -> `0.2.0`

Release Notes

sylva-projects/sylva-elements/helm-charts/sylva-thanos-rules (https://gitlab.com/sylva-projects/sylva-elements/helm-charts/sylva-thanos-rules.git)

`v0.2.0`: sylva-thanos-rules: 0.2.0

sylva-thanos-rules

Generate ConfigMap object for consumption by Thanos Ruler
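
As a rough illustration of what such a ConfigMap might look like once rendered, the sketch below wraps one rule from this release (MetalLB-BGP_Session_Down) in a Prometheus-style rule group. Only the alert name, `for`, severity, type and description text come from these release notes. The ConfigMap name, namespace, data key, group name, the `expr`, and the choice to carry severity and type as labels and the description as an annotation are all assumptions made for illustration, not taken from the chart.

```yaml
# Hypothetical rendering: metadata, group name, expr and field placement are assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: sylva-thanos-rules        # assumed name
  namespace: example-monitoring   # assumed namespace
data:
  metallb.yml: |
    groups:
      - name: metallb             # assumed group name
        rules:
          - alert: MetalLB-BGP_Session_Down
            # Assumed expression built on the MetalLB speaker gauge
            # metallb_bgp_session_up (1 = session up, 0 = session down).
            expr: metallb_bgp_session_up == 0
            for: 5m
            labels:
              severity: error
              type: network
            annotations:
              description: >-
                MetalLB speaker {{ $labels.instance }} in cluster
                "{{ $labels.cluster}}" has BGP session {{ $labels.peer }}
                down for more than 5 minutes.
```

Thanos Ruler would then typically load the mounted file through its `--rule-file` flag; how the chart actually wires the mount is not described in these notes.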

Details about rules

rules/_helper_kubernetes_metadata.yml

Alert Name	For	Severity	Type	Description
k8s-Metamonitoring_configuration_error_kube_namespace_labels	45m	error	deployment	Metric "kube_namespace_labels" from cluster "{{ $labels.capi_cluster_name }}" is not exposed by "kube-state-metrics".
k8s-Metamonitoring_configuration_error_rancher_project_info	45m	error	deployment	Metric "rancher_project_info" from the management cluster is not exposed by "kube-state-metrics".

rules/clusters_state_rules.yml

Alert Name	For	Severity	Type	Description
Sylva_cluster_Prometheus_not_Sending_Data_management	45m	critical	deployment	Prometheus server from the management cluster has not sent data in the last 45m.
Sylva_cluster_Prometheus_not_Sending_Data	45m	critical	deployment	Prometheus server from cluster "{{ $labels.capi_cluster_name }}" in namespace "{{ $labels.capi_cluster_namespace }}" has not sent data in the last 45m.
Sylva_clusters_different_number	45m	critical	deployment	Some cluster is not properly provisioned in Rancher, check all clusters to see if cattle-agent is properly deployed
Sylva_clusters_metric_absent	45m	error	deployment	Metric "capi_cluster_info" from the management cluster is not exposed by "kube-state-metrics".

rules/kubernetes_capacity.yml

Alert Name	For	Severity	Type	Description
k8s-Cluster_CPU_Overcommitted	5m	warning	k8s	Kubernetes cluster "{{ $labels.cluster }}" has allocated over 75% of the allocatable CPUs. Node failures may cause Pods to be unschedulable due to lack of resources
k8s-Cluster_Memory_Overcommitted	5m	warning	k8s	Kubernetes cluster "{{ $labels.cluster }}" has allocated over 75% of the allocatable Memory. Node failures may cause Pods to be unschedulable due to lack of resources
k8s-Cluster_Too_Many_Pods	15m	warning	k8s	Kubernetes cluster "{{ $labels.cluster }}" number of pods over 90% of Pod number limit. Value: {{ humanize $value }}%
k8s-Node_Too_Many_Pods	15m	warning	k8s	Kubernetes cluster "{{ $labels.cluster }}" node {{ $labels.node }} number of pods over 90% of Pod number limit. Value: {{ humanize $value }}%
k8s-Kube_Quota_Almost_Full	15m	warning	k8s	Namespace {{ $labels.namespace }} in cluster {{ $labels.cluster }} is using {{ $value
k8s-Kube_Quota_Exceeded	15m	error	k8s	Namespace {{ $labels.namespace }} in cluster {{ $labels.cluster }} is using {{ $value

rules/kubernetes_cluster_components.yml

Alert Name	For	Severity	Type	Description
k8s-Version_Mismatch	4h	warning	k8s	Kubernetes cluster "{{ $labels.cluster }}" has different versions of Kubernetes components running. Value: {{ $value }}
k8s-Client_Errors	15m	warning	k8s	Kubernetes cluster "{{ $labels.cluster }}" API server client "{{ $labels.instance }}" job "{{ $labels.job }}" is experiencing errors. Value: {{ printf "%0.0f" $value }}%
k8s-Client_Certificate_Expiration	5m	warning	k8s	A client certificate used to authenticate to Kubernetes apiserver is expiring in less than 7 days on cluster {{ $labels.cluster }}.
k8s-Client_Certificate_Expiration	5m	critical	k8s	A client certificate used to authenticate to Kubernetes apiserver is expiring in less than 24h on cluster {{ $labels.cluster }}.
k8s-API_Global_Error_Rate_High	15m	warning	k8s	Kubernetes cluster "{{ $labels.cluster }}" API server is returning errors for over 3% of requests. Value: {{ humanize $value }}%
k8s-API_Error_Rate_High	15m	warning	k8s	Kubernetes cluster "{{ $labels.cluster }}" API server is returning errors for 10% of requests for {{ $labels.verb }} {{ $labels.resource }} {{ $labels.subresource }}. Value: {{ humanize $value }}%
k8s-Aggregated_API_Errors	15m	warning	k8s	Kubernetes aggregated API {{ $labels.name }}/{{ $labels.namespace }} in cluster "{{ $labels.cluster }}" has reported errors. It has appeared unavailable {{ $value
k8s-Aggregated_API_Down	15m	warning	k8s	Kubernetes aggregated API {{ $labels.name }}/{{ $labels.namespace }} in cluster "{{ $labels.cluster }}" has been only {{ $value
k8s-API_Endpoint_Down	15m	error	k8s	Kubernetes API endpoint {{ $labels.instance }} in cluster {{ $labels.cluster }} is unreachable.
k8s-API_Down	15m	critical	k8s	Kubernetes API in cluster {{ $labels.cluster }} is unreachable.
k8s-API_Terminated_Requests	15m	warning	k8s	Kubernetes API in cluster {{ $labels.cluster }} has terminated {{ $value
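
Several alerts in these tables appear twice under the same name with different severities (for example k8s-Client_Certificate_Expiration at warning and critical). A minimal sketch of how such a pair could be expressed as two rules that differ only in threshold and severity is shown below. The expressions are assumptions modeled on the upstream kubernetes-mixin alert of the same purpose, with `cluster` added to the grouping; the group name and the `job="apiserver"` selector are likewise hypothetical and may not match the chart.

```yaml
groups:
  - name: kubernetes_cluster_components   # assumed group name
    rules:
      - alert: k8s-Client_Certificate_Expiration
        # Assumed expr: 1st percentile of remaining client certificate
        # lifetime below 7 days (604800 seconds).
        expr: |
          apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0
          and on (cluster, job)
          histogram_quantile(0.01, sum by (cluster, job, le) (
            rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m])
          )) < 604800
        for: 5m
        labels:
          severity: warning
          type: k8s
        annotations:
          description: >-
            A client certificate used to authenticate to Kubernetes apiserver
            is expiring in less than 7 days on cluster {{ $labels.cluster }}.
      - alert: k8s-Client_Certificate_Expiration
        # Same assumed expr with a 24h threshold (86400 seconds).
        expr: |
          apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0
          and on (cluster, job)
          histogram_quantile(0.01, sum by (cluster, job, le) (
            rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m])
          )) < 86400
        for: 5m
        labels:
          severity: critical
          type: k8s
        annotations:
          description: >-
            A client certificate used to authenticate to Kubernetes apiserver
            is expiring in less than 24h on cluster {{ $labels.cluster }}.
```

Keeping the alert name identical across severities lets Alertmanager inhibition rules suppress the warning while the critical variant is firing, if such rules are configured.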

rules/kubernetes_jobs.yml

Alert Name	For	Severity	Type	Description
k8s-CronJob_Status_Failed	5m	warning	k8s	CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} in cluster "{{ $labels.cluster }}" failed. Last job has failed multiple times. Value: {{ $value }}
k8s-CronJob_Taking_Too_Long	0m	warning	k8s	CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} in cluster "{{ $labels.cluster }}" is taking too long to complete - is over deadline. Value: {{ humanizeDuration $value }}
k8s-Job_not_Completed	15m	warning	k8s	Job {{ $labels.namespace }}/{{ $labels.job_name }} in cluster "{{ $labels.cluster }}" is taking more than 12h to complete.
k8s-Job_Failed	15m	warning	k8s	Job {{ $labels.namespace }}/{{ $labels.job_name }} in cluster "{{ $labels.cluster }}" failed to complete. Removing failed job after investigation should clear this alert.

rules/kubernetes_nodes.yml

Alert Name	For	Severity	Type	Description
k8s-Node_Kubelet_Down	5m	critical	k8s	Kubelet on {{ $labels.node }} in cluster "{{ $labels.cluster }}" is not reachable
k8s-Node_Status_OutOfDisk	5m	warning	k8s	Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" is almost out of disk space
k8s-Node_Status_MemoryPressure	5m	warning	k8s	Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" is under memory pressure.
k8s-Node_Status_DiskPressure	5m	warning	k8s	Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" is under disk pressure
k8s-Node_Status_PIDPressure	5m	warning	k8s	Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" is under PID pressure
k8s-Node_Status_NotReady	5m	critical	k8s	Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" has not been Ready for more than an hour
k8s-Node_Status_NetworkUnavailable	5m	warning	k8s	Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" has NetworkUnavailable condition.
k8s-Node_Status_Ready_flapping	5m	warning	k8s	Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" readiness status changed {{ $value }} times in the last 15 minutes.

rules/kubernetes_pods.yml

Alert Name	For	Severity	Type	Description
k8s-Pod_Status_not_Ready	15m	warning	k8s	Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" has been in a non-ready state for longer than 15 minutes.
k8s-Pod_Status_OOMKilled	0m	critical	k8s	Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster}}" has been restarted due to OOMKilled reason in the last hour. Value: {{ humanize $value }}
k8s-Pod_Status_Crashlooping	15m	warning	k8s	Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster}}" was restarted more than 5 times within the last hour. Value: {{ humanize $value }}
k8s-Pod_Init_Container_Status_Crashlooping	15m	warning	k8s	Init Container from Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" was restarted more than 5 times within the last hour. Value: {{ humanize $value }}
k8s-Pod_Container_Status_Waiting	1h	warning	k8s	Container "{{ $labels.container }}" from Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" has been in waiting state for longer than 1 hour.
k8s-Statefulset_Replicas_not_Ready	15m	warning	k8s	Statefulset {{ $labels.namespace }}/{{ $labels.statefulset }} in cluster "{{ $labels.cluster }}" has replicas in "notReady" state. Value: {{ humanize $value }}
k8s-Statefulset_Generation_Mismatch	15m	warning	k8s	Statefulset {{ $labels.namespace }}/{{ $labels.statefulset }} in cluster "{{ $labels.cluster }}" generation does not match. This indicates that the StatefulSet has failed but has not been rolled back
k8s-Statefulset_Update_not_Rolled_Out	15m	warning	k8s	Statefulset {{ $labels.namespace }}/{{ $labels.statefulset }} in cluster "{{ $labels.cluster }}" update has not been rolled out
k8s-Statefulset_Replicas_Mismatch	15m	warning	k8s	Statefulset {{ $labels.namespace }}/{{ $labels.statefulset }} in cluster "{{ $labels.cluster }}" has not matched the expected number of replicas. Value: {{ $value }} / {{ printf `kube_statefulset_spec_replicas{statefulset="%s", cluster="%s"}` $labels.statefulset $labels.cluster
k8s-Statefulset_Replicas_not_Updated	15m	warning	k8s	Statefulset {{ $labels.namespace }}/{{ $labels.statefulset }} in cluster "{{ $labels.cluster }}" replicas are not updated and available for deployment
k8s-ReplicaSet_Replicas_Mismatch	15m	warning	k8s	ReplicaSet {{ $labels.namespace }}/{{ $labels.replicaset }} in cluster "{{ $labels.cluster }}" has not matched the expected number of replicas
k8s-Deployment_Replicas_not_Ready	15m	warning	k8s	Deployment {{ $labels.namespace }}/{{ $labels.deployment }} in cluster "{{ $labels.cluster }}" has replicas in "notReady" state. Value: {{ humanize $value }}
k8s-Deployment_Replicas_Mismatch	15m	warning	k8s	Deployment {{ $labels.namespace }}/{{ $labels.deployment }} in cluster "{{ $labels.cluster }}" has not matched the expected number of replicas. Value: {{ $value }} / {{ printf `kube_deployment_spec_replicas{deployment="%s", cluster="%s"}` $labels.deployment $labels.cluster
k8s-Deployment_Generation_Mismatch	15m	warning	k8s	Deployment {{ $labels.namespace }}/{{ $labels.deployment }} in cluster "{{ $labels.cluster }}" generation does not match expected one
k8s-Deployment_Replicas_not_Updated	15m	warning	k8s	Deployment {{ $labels.namespace }}/{{ $labels.deployment }} in cluster "{{ $labels.cluster }}" replicas are not updated and available for deployment
k8s-Deployment_Rollout_Stuck	15m	warning	k8s	Deployment {{ $labels.namespace }}/{{ $labels.deployment }} in cluster "{{ $labels.cluster }}" is not progressing for longer than 15 minutes.
k8s-Daemonset_Rollout_Stuck	15m	warning	k8s	Daemonset {{ $labels.namespace }}/{{ $labels.daemonset }} in cluster "{{ $labels.cluster }}" has less than 100% of desired pods scheduled and ready. Value: {{ humanize $value }}%
k8s-Daemonset_not_Scheduled	15m	warning	k8s	Daemonset {{ $labels.namespace }}/{{ $labels.daemonset }} in cluster "{{ $labels.cluster }}" has unscheduled pods. Value: {{ humanize $value }}
k8s-Daemonset_Misscheduled	15m	warning	k8s	Daemonset pods {{ $labels.namespace }}/{{ $labels.daemonset }} in cluster "{{ $labels.cluster }}" are running where they are not supposed to. Value: {{ humanize $value }}
k8s-Daemonset_Generation_Mismatch	15m	warning	k8s	Daemonset {{ $labels.namespace }}/{{ $labels.daemonset }} in cluster "{{ $labels.cluster }}" generation does not match expected one

rules/kubernetes_storage.yml

Alert Name	For	Severity	Type	Description
k8s-Persistent_Volume_Disk_Space_Usage_High	5m	warning	k8s	PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} in cluster "{{ $labels.cluster }}" disk space usage is over 80% used. Value: {{ printf "%0.2f" $value }}%
k8s-Persistent_Volume_Disk_Space_Usage_High	5m	critical	k8s	PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} in cluster "{{ $labels.cluster }}" disk space usage is over 90% used. Value: {{ printf "%0.2f" $value }}%
k8s-Persistent_Volume_Full_in_4_days	5m	warning	k8s	PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} in cluster "{{ $labels.cluster }}" will fill up in 4 days at the current rate of utilization. Value: {{ printf "%0.2f" $value }}% available
k8s-Persistent_Volume_Inodes_Usage_High	5m	warning	k8s	PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} in cluster "{{ $labels.cluster }}" inodes usage is over 80% used. Value: {{ printf "%0.2f" $value }}%
k8s-Persistent_Volume_Inodes_Usage_High	5m	critical	k8s	PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} in cluster "{{ $labels.cluster }}" inodes usage is over 90% used. Value: {{ printf "%0.2f" $value }}%
k8s-Persistent_Volume_Errors	5m	warning	k8s	PersistentVolume "{{ $labels.persistentvolume }}" in cluster "{{ $labels.cluster }}" has status "{{ $labels.phase }}"
k8s-Persistent_Volume_Claim_Orphan	3h	warning	k8s	PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} in cluster "{{ $labels.cluster }}" is not used by any pod
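
The `Value:` suffix in these descriptions is rendered from `$value`, which is the result of the rule expression for the firing series. The sketch below shows that relationship for the 80% disk space alert. The expression is an assumption built on the kubelet volume metrics `kubelet_volume_stats_used_bytes` and `kubelet_volume_stats_capacity_bytes`, and the group name is hypothetical; the chart's real expression may differ.

```yaml
groups:
  - name: kubernetes_storage   # assumed group name
    rules:
      - alert: k8s-Persistent_Volume_Disk_Space_Usage_High
        # Assumed expr: used space as a percentage of capacity.
        # The resulting number becomes $value in the description below.
        expr: |
          100 * kubelet_volume_stats_used_bytes
            / kubelet_volume_stats_capacity_bytes > 80
        for: 5m
        labels:
          severity: warning
          type: k8s
        annotations:
          description: >-
            PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }}
            in cluster "{{ $labels.cluster }}" disk space usage is over 80% used.
            Value: {{ printf "%0.2f" $value }}%
```

With this assumed expression, a volume at 83.456% usage would render as `Value: 83.46%`, since `printf "%0.2f"` rounds to two decimal places.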

rules/longhorn.yml

Alert Name	For	Severity	Type	Description
Longhorn-Volume_Status_Critical	5m	critical	storage	Longhorn volume "{{ $labels.volume }} / {{ $labels.pvc }}" on node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" is in state "Faulted" for more than 5 minutes. Possible values are: 0=unknown, 1=healthy, 2=degraded, 3=faulted
Longhorn-Volume_Status_Warning	5m	warning	storage	Longhorn volume "{{ $labels.volume }} / {{ $labels.pvc }}" on node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" is in state "Degraded" for more than 5 minutes. Possible values are: 0=unknown, 1=healthy, 2=degraded, 3=faulted
Longhorn-Volume_Status_Unknown	5m	warning	storage	Longhorn volume "{{ $labels.volume }} / {{ $labels.pvc }}" on node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" is in state "Unknown" for more than 5 minutes. Possible values are: 0=unknown, 1=healthy, 2=degraded, 3=faulted
Longhorn-Node_Storage_Warning	5m	warning	storage	The used storage of node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" is at "{{ $value
Longhorn-Disk_Storage_Warning	5m	warning	storage	The used storage of disk "{{ $labels.disk }}" on node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" is at "{{ $value
Longhorn-Node_Down	5m	critical	storage	There are "{{ $value
Longhorn-Instance_Manager_CPU_Usage_Warning	5m	warning	storage	Longhorn instance manager "{{ $labels.instance_manager }}" on node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" has CPU Usage / CPU request at "{{ $value
Longhorn-Node_CPU_Usage_Warning	5m	warning	storage	Longhorn node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" has CPU Usage / CPU capacity at "{{ $value

rules/metallb.yml

Alert Name	For	Severity	Type	Description
MetalLB-BGP_Session_Down	5m	error	network	MetalLB speaker {{ $labels.instance }} in cluster "{{ $labels.cluster}}" has BGP session {{ $labels.peer }} down for more than 5 minutes.
MetalLB-BGP_All_Sessions_Down	5m	critical	network	MetalLB in "{{ $labels.cluster}}" all {{ $value }} BGP sessions are down for more than 5 minutes.
MetalLB-Address_Pool_High_Usage	5m	info	network	MetalLB pool "{{ $labels.pool }}" in cluster "{{ $labels.cluster}}" has more than 75% of the total addresses used.
MetalLB-Config_Stale	5m	warning	network	MetalLB {{ $labels.instance }} container "{{ $labels.container }}" in cluster "{{ $labels.cluster}}" has a stale config.
MetalLB-Config_not_Loaded	5m	warning	network	MetalLB {{ $labels.instance }} container "{{ $labels.container }}" in cluster "{{ $labels.cluster}}" config not loaded.

rules/node_exporter.yml

Alert Name	For	Severity	Type	Description
Node-Recently_Rebooted	0m	info	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" has been rebooted in the last 30 minutes. Value: {{ humanizeDuration $value }} uptime.
Node-CPU_High_Usage	30m	info	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - CPU usage has exceeded the threshold of 90% for more than 30 minutes. Value: {{ humanize $value }}
Node-CPU_High_steal	15m	info	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - CPU %steal has exceeded the threshold of 10% for more than 15 minutes. Value: {{ humanize $value }}
Node-CPU_High_steal	15m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - CPU %steal has exceeded the threshold of 30% for more than 15 minutes. Value: {{ humanize $value }}
Node-CPU_High_iowait	15m	info	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - CPU %iowait has exceeded the threshold of 10% for more than 15 minutes. Value: {{ humanize $value }}
Node-CPU_High_iowait	15m	error	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - CPU %iowait has exceeded the threshold of 30% for more than 15 minutes. Value: {{ humanize $value }}
Node-Memory_Major_Pages_Faults	15m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - memory major page faults are occurring at very high rate. Value: {{ humanize $value }}
Node-Memory_High_Usage	10m	info	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - more than 80% memory used for more than 10 minutes. Value: {{ humanize $value }}
Node-Memory_High_Usage	10m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - more than 90% memory used for more than 10 minutes. Value: {{ humanize $value }}
Node-Disk_Space_High_Usage	15m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - mountpoint "{{ $labels.mountpoint }}" has more than 80% used disk space for more than 15m. Value: {{ humanize $value }}
Node-Disk_Space_High_Usage	15m	error	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - mountpoint "{{ $labels.mountpoint }}" has more than 90% used disk space for more than 15m. Value: {{ humanize $value }}
Node-Disk_Will_Fill_Up_In_4h	5m	error	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - mountpoint "{{ $labels.mountpoint }}" will fill up in 4 hours at the current rate of utilization. Value: {{ printf `node_filesystem_avail_bytes{fstype=~"ext.*
Node-High_Disk_Inodes_High_Usage	15m	info	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - mountpoint "{{ $labels.mountpoint }}" has more than 60% used inodes for more than 15m. Value: {{ humanize $value }}
Node-High_Disk_Inodes_High_Usage	15m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - mountpoint "{{ $labels.mountpoint }}" has more than 70% used inodes for more than 15m. Value: {{ humanize $value }}
Node-Load_High	15m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - system load per core is above 2 for the last 15 minutes. This might indicate this instance resources saturation and can cause it becoming unresponsive. Value: {{ humanize $value }}
Node-fds_Near_Limit_Process	5m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - "{{ $labels.job }}" has more than 80% used file descriptors for more than 5m. Value: {{ humanize $value }}
Node-fds_Near_Limit	5m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - more than 80% used file descriptors for more than 5m. Value: {{ humanize $value }}
Node-Network_High_Receive_Drop	0m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - interface "{{ $labels.device }}" has high drop in network reception. Value: {{ humanize $value }}
Node-Network_High_Transmit_Drop	0m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - interface "{{ $labels.device }}" has high drop in network transmission. Value: {{ humanize $value }}
Node-Network_High_Receive_Errors	30m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - interface "{{ $labels.device }}" has encountered receive errors. Value: {{ humanize $value }}
Node-Network_High_Transmit_Errors	30m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - interface "{{ $labels.device }}" has encountered transmit errors. Value: {{ humanize $value }}
Node-Network_Interface_Flapping	0m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - interface "{{ $labels.device }}" changing its up status often. Value: {{ humanize $value }}
Node-Too_Many_OOM_Kills	0m	error	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - several OOM kills detected in the past 1h. Value: {{ humanize $value }}. Find out which process by running `dmesg
Node-Clock_Not_Synchronising	10m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - clock is not synchronising. Ensure NTP is configured on this host.
Node-Clock_Skew_Detected	10m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - clock is out of sync by more than 0.05s. Ensure NTP is configured correctly on this host.
Node-Host_Conntrack_Limit	10m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - more than 75% of conntrack entries are used.
Node-EDAC_Correctable_Errors_Detected	0m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - EDAC Correctable Errors detected.
Node-EDAC_Uncorrectable_Errors_Detected	0m	error	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - EDAC Uncorrectable Errors detected.
Node-Filesystem_Device_Error	0m	critical	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - device "{{ $labels.device }}" filesystem error.
Node-Disk_Queue_Length_High	10m	info	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - device "{{ $labels.device }}" has disk queue length greater than 1 for more than 10 minutes. Value: {{ humanize $value }}
Node-Disk_IO_Time_Weighted_Seconds	10m	info	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - device "{{ $labels.device }}" has high disk io queue (aqu-sq) for more than 10 minutes. Value: {{ humanize $value }}

`v0.1.5`: sylva-thanos-rules: 0.1.5

sylva-thanos-rules

Generate ConfigMap object for consumption by Thanos Ruler

Details about rules

rules/_helper_kubernetes_metadata.yml

Alert Name	For	Severity	Type	Description
k8s-Metamonitoring_configuration_error_kube_namespace_labels	45m	error	deployment	Metric "kube_namespace_labels" from cluster "{{ $labels.capi_cluster_name }}" is not exposed by "kube-state-metrics".
k8s-Metamonitoring_configuration_error_rancher_project_info	45m	error	deployment	Metric "rancher_project_info" from the management cluster is not exposed by "kube-state-metrics".

rules/clusters_state_rules.yml

Alert Name	For	Severity	Type	Description
Sylva_cluster_Prometheus_not_Sending_Data_management	45m	critical	deployment	Prometheus server from the management cluster has not sent data in the last 45m.
Sylva_cluster_Prometheus_not_Sending_Data	45m	critical	deployment	Prometheus server from cluster "{{ $labels.capi_cluster_name }}" in namespace "{{ $labels.capi_cluster_namespace }}" has not sent data in the last 45m.
Sylva_clusters_different_number	45m	critical	deployment	Some cluster is not properly provisioned in Rancher, check all clusters to see if cattle-agent is properly deployed
Sylva_clusters_metric_absent	45m	error	deployment	Metric "capi_cluster_info" from the management cluster is not exposed by "kube-state-metrics".

rules/kubernetes_capacity.yml

Alert Name	For	Severity	Type	Description
k8s-Cluster_CPU_Overcommitted	5m	warning	k8s	Kubernetes cluster "{{ $labels.cluster }}" has allocated over 75% of the allocatable CPUs. Node failures may cause Pods to be unschedulable due to lack of resources
k8s-Cluster_Memory_Overcommitted	5m	warning	k8s	Kubernetes cluster "{{ $labels.cluster }}" has allocated over 75% of the allocatable Memory. Node failures may cause Pods to be unschedulable due to lack of resources
k8s-Cluster_Too_Many_Pods	15m	warning	k8s	Kubernetes cluster "{{ $labels.cluster }}" number of pods over 90% of Pod number limit. Value: {{ humanize $value }}%
k8s-Node_Too_Many_Pods	15m	warning	k8s	Kubernetes cluster "{{ $labels.cluster }}" node {{ $labels.node }} number of pods over 90% of Pod number limit. Value: {{ humanize $value }}%
k8s-Kube_Quota_Almost_Full	15m	warning	k8s	Namespace {{ $labels.namespace }} in cluster {{ $labels.cluster }} is using {{ $value
k8s-Kube_Quota_Exceeded	15m	error	k8s	Namespace {{ $labels.namespace }} in cluster {{ $labels.cluster }} is using {{ $value

rules/kubernetes_cluster_components.yml

Alert Name	For	Severity	Type	Description
k8s-Version_Mismatch	4h	warning	k8s	Kubernetes cluster "{{ $labels.cluster }}" has different versions of Kubernetes components running. Value: {{ $value }}
k8s-Client_Errors	15m	warning	k8s	Kubernetes cluster "{{ $labels.cluster }}" API server client "{{ $labels.instance }}" job "{{ $labels.job }}" is experiencing errors. Value: {{ printf "%0.0f" $value }}%
k8s-Client_Certificate_Expiration	5m	warning	k8s	A client certificate used to authenticate to Kubernetes apiserver is expiring in less than 7 days on cluster {{ $labels.cluster }}.
k8s-Client_Certificate_Expiration	5m	critical	k8s	A client certificate used to authenticate to Kubernetes apiserver is expiring in less than 24h on cluster {{ $labels.cluster }}.
k8s-API_Global_Error_Rate_High	15m	warning	k8s	Kubernetes cluster "{{ $labels.cluster }}" API server is returning errors for over 3% of requests. Value: {{ humanize $value }}%
k8s-API_Error_Rate_High	15m	warning	k8s	Kubernetes cluster "{{ $labels.cluster }}" API server is returning errors for 10% of requests for {{ $labels.verb }} {{ $labels.resource }} {{ $labels.subresource }}. Value: {{ humanize $value }}%
k8s-Aggregated_API_Errors	15m	warning	k8s	Kubernetes aggregated API {{ $labels.name }}/{{ $labels.namespace }} in cluster "{{ $labels.cluster }}" has reported errors. It has appeared unavailable {{ $value
k8s-Aggregated_API_Down	15m	warning	k8s	Kubernetes aggregated API {{ $labels.name }}/{{ $labels.namespace }} in cluster "{{ $labels.cluster }}" has been only {{ $value
k8s-API_Endpoint_Down	15m	error	k8s	Kubernetes API endpoint {{ $labels.instance }} in cluster {{ $labels.cluster }} is unreachable.
k8s-API_Down	15m	critical	k8s	Kubernetes API in cluster {{ $labels.cluster }} is unreachable.
k8s-API_Terminated_Requests	15m	warning	k8s	Kubernetes API in cluster {{ $labels.cluster }} has terminated {{ $value

rules/kubernetes_jobs.yml

Alert Name	For	Severity	Type	Description
k8s-CronJob_Status_Failed	5m	warning	k8s	CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} in cluster "{{ $labels.cluster }}" failed. Last job has failed multiple times. Value: {{ $value }}
k8s-CronJob_Taking_Too_Long	0m	warning	k8s	CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} in cluster "{{ $labels.cluster }}" is taking too long to complete - is over deadline. Value: {{ humanizeDuration $value }}
k8s-Job_not_Completed	15m	warning	k8s	Job {{ $labels.namespace }}/{{ $labels.job_name }} in cluster "{{ $labels.cluster }}" is taking more than 12h to complete.
k8s-Job_Failed	15m	warning	k8s	Job {{ $labels.namespace }}/{{ $labels.job_name }} in cluster "{{ $labels.cluster }}" failed to complete. Removing failed job after investigation should clear this alert.

rules/kubernetes_nodes.yml

Alert Name	For	Severity	Type	Description
k8s-Node_Kubelet_Down	5m	critical	k8s	Kubelet on {{ $labels.node }} in cluster "{{ $labels.cluster }}" is not reachable
k8s-Node_Status_OutOfDisk	5m	warning	k8s	Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" is almost out of disk space
k8s-Node_Status_MemoryPressure	5m	warning	k8s	Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" is under memory pressure.
k8s-Node_Status_DiskPressure	5m	warning	k8s	Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" is under disk pressure
k8s-Node_Status_PIDPressure	5m	warning	k8s	Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" is under PID pressure
k8s-Node_Status_NotReady	5m	critical	k8s	Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" has not been Ready for more than an hour
k8s-Node_Status_NetworkUnavailable	5m	warning	k8s	Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" has NetworkUnavailable condition.
k8s-Node_Status_Ready_flapping	5m	warning	k8s	Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" readiness status changed {{ $value }} times in the last 15 minutes.

rules/kubernetes_pods.yml

Alert Name	For	Severity	Type	Description
k8s-Pod_Status_not_Ready	15m	warning	k8s	Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" on node "{{ $labels.node }}" has been in a non-ready state for longer than 15 minutes.
k8s-Pod_Status_OOMKilled	0m	critical	k8s	Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster}}" on node "{{ $labels.node }}" has been restarted due to OOMKilled reason in the last hour. Value: {{ humanize $value }}
k8s-Pod_Status_Crashlooping	15m	warning	k8s	Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster}}" on node "{{ $labels.node }}" was restarted more than 5 times within the last hour. Value: {{ humanize $value }}
k8s-Pod_Init_Container_Status_Crashlooping	15m	warning	k8s	Init Container from Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" on node "{{ $labels.node }}" was restarted more than 5 times within the last hour. Value: {{ humanize $value }}
k8s-Pod_Container_Status_Waiting	1h	warning	k8s	Container "{{ $labels.container }}" from Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" on node "{{ $labels.node }}" has been in waiting state for longer than 1 hour.
k8s-Statefulset_Replicas_not_Ready	15m	warning	k8s	Statefulset {{ $labels.namespace }}/{{ $labels.statefulset }} in cluster "{{ $labels.cluster }}" has replicas in "notReady" state. Value: {{ humanize $value }}
k8s-Statefulset_Generation_Mismatch	15m	warning	k8s	Statefulset {{ $labels.namespace }}/{{ $labels.statefulset }} in cluster "{{ $labels.cluster }}" generation does not match. This indicates that the StatefulSet has failed but has not been rolled back
k8s-Statefulset_Update_not_Rolled_Out	15m	warning	k8s	Statefulset {{ $labels.namespace }}/{{ $labels.statefulset }} in cluster "{{ $labels.cluster }}" update has not been rolled out
k8s-Statefulset_Replicas_Mismatch	15m	warning	k8s	Statefulset {{ $labels.namespace }}/{{ $labels.statefulset }} in cluster "{{ $labels.cluster }}" has not matched the expected number of replicas. Value: {{ $value }} / {{ printf `kube_statefulset_spec_replicas{statefulset="%s", cluster="%s"}` $labels.statefulset $labels.cluster
k8s-Statefulset_Replicas_not_Updated	15m	warning	k8s	Statefulset {{ $labels.namespace }}/{{ $labels.statefulset }} in cluster "{{ $labels.cluster }}" replicas are not updated and available for deployment
k8s-ReplicaSet_Replicas_Mismatch	15m	warning	k8s	ReplicaSet {{ $labels.namespace }}/{{ $labels.replicaset }} in cluster "{{ $labels.cluster }}" has not matched the expected number of replicas
k8s-Deployment_Replicas_not_Ready	15m	warning	k8s	Deployment {{ $labels.namespace }}/{{ $labels.deployment }} in cluster "{{ $labels.cluster }}" has replicas in "notReady" state. Value: {{ humanize $value }}
k8s-Deployment_Replicas_Mismatch	15m	warning	k8s	Deployment {{ $labels.namespace }}/{{ $labels.deployment }} in cluster "{{ $labels.cluster }}" has not matched the expected number of replicas. Value: {{ $value }} / {{ printf `kube_deployment_spec_replicas{deployment="%s", cluster="%s"}` $labels.deployment $labels.cluster
k8s-Deployment_Generation_Mismatch	15m	warning	k8s	Deployment {{ $labels.namespace }}/{{ $labels.deployment }} in cluster "{{ $labels.cluster }}" generation does not match expected one
k8s-Deployment_Replicas_not_Updated	15m	warning	k8s	Deployment {{ $labels.namespace }}/{{ $labels.deployment }} in cluster "{{ $labels.cluster }}" replicas are not updated and available for deployment
k8s-Deployment_Rollout_Stuck	15m	warning	k8s	Deployment {{ $labels.namespace }}/{{ $labels.deployment }} in cluster "{{ $labels.cluster }}" is not progressing for longer than 15 minutes.
k8s-Daemonset_Rollout_Stuck	15m	warning	k8s	Daemonset {{ $labels.namespace }}/{{ $labels.daemonset }} in cluster "{{ $labels.cluster }}" has less than 100% of desired pods scheduled and ready. Value: {{ humanize $value }}%
k8s-Daemonset_not_Scheduled	15m	warning	k8s	Daemonset {{ $labels.namespace }}/{{ $labels.daemonset }} in cluster "{{ $labels.cluster }}" has unscheduled pods. Value: {{ humanize $value }}
k8s-Daemonset_Misscheduled	15m	warning	k8s	Daemonset pods {{ $labels.namespace }}/{{ $labels.daemonset }} in cluster "{{ $labels.cluster }}" are running where they are not supposed to. Value: {{ humanize $value }}
k8s-Daemonset_Generation_Mismatch	15m	warning	k8s	Daemonset {{ $labels.namespace }}/{{ $labels.daemonset }} in cluster "{{ $labels.cluster }}" generation does not match expected one

rules/kubernetes_storage.yml

Alert Name	For	Severity	Type	Description
k8s-Persistent_Volume_Disk_Space_Usage_High	5m	warning	k8s	PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} in cluster "{{ $labels.cluster }}" disk space usage is over 80% used. Value: {{ printf "%0.2f" $value }}%
k8s-Persistent_Volume_Disk_Space_Usage_High	5m	critical	k8s	PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} in cluster "{{ $labels.cluster }}" disk space usage is over 90% used. Value: {{ printf "%0.2f" $value }}%
k8s-Persistent_Volume_Full_in_4_days	5m	warning	k8s	PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} in cluster "{{ $labels.cluster }}" will fill up in 4 days at the current rate of utilization. Value: {{ printf "%0.2f" $value }}% available
k8s-Persistent_Volume_Inodes_Usage_High	5m	warning	k8s	PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} in cluster "{{ $labels.cluster }}" inodes usage is over 80% used. Value: {{ printf "%0.2f" $value }}%
k8s-Persistent_Volume_Inodes_Usage_High	5m	critical	k8s	PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} in cluster "{{ $labels.cluster }}" inodes usage is over 90% used. Value: {{ printf "%0.2f" $value }}%
k8s-Persistent_Volume_Errors	5m	warning	k8s	PersistentVolume "{{ $labels.persistentvolume }}" in cluster "{{ $labels.cluster }}" has status "{{ $labels.phase }}"
k8s-Persistent_Volume_Claim_Orphan	3h	warning	k8s	PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} in cluster "{{ $labels.cluster }}" is not used by any pod
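
The same alert name appears twice in the table above, once per severity tier. As a rough illustration of how such a pair could be written in a Prometheus-style rule file, here is a minimal sketch. The alert name, `for` duration, severity, type and description are taken from the table; the group name, the label keys `severity` and `type`, and the PromQL expressions (based on the kubelet volume metrics) are assumptions, not the chart's actual definitions.

```yaml
# Hypothetical sketch only: group name, label keys and expr are assumptions.
groups:
  - name: kubernetes_storage
    rules:
      # Warning tier: fires after usage stays above 80% for 5 minutes.
      - alert: k8s-Persistent_Volume_Disk_Space_Usage_High
        expr: |
          100 * kubelet_volume_stats_used_bytes
            / kubelet_volume_stats_capacity_bytes > 80
        for: 5m
        labels:
          severity: warning
          type: k8s
        annotations:
          description: >-
            PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }}
            in cluster "{{ $labels.cluster }}" disk space usage is over 80% used.
            Value: {{ printf "%0.2f" $value }}%
      # Critical tier: same alert name, higher threshold and severity.
      - alert: k8s-Persistent_Volume_Disk_Space_Usage_High
        expr: |
          100 * kubelet_volume_stats_used_bytes
            / kubelet_volume_stats_capacity_bytes > 90
        for: 5m
        labels:
          severity: critical
          type: k8s
        annotations:
          description: >-
            PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }}
            in cluster "{{ $labels.cluster }}" disk space usage is over 90% used.
            Value: {{ printf "%0.2f" $value }}%
```

With thresholds written this way, a volume above 90% would match both rules, so an Alertmanager inhibition rule or an upper bound on the warning expression may be needed to avoid duplicate notifications.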

rules/longhorn.yml

Alert Name	For	Severity	Type	Description
Longhorn-Volume_Status_Critical	5m	critical	storage	Longhorn volume "{{ $labels.volume }} / {{ $labels.pvc }}" on node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" is in state "Faulted" for more than 5 minutes. Possible values are: 0=unknown, 1=healthy, 2=degraded, 3=faulted
Longhorn-Volume_Status_Warning	5m	warning	storage	Longhorn volume "{{ $labels.volume }} / {{ $labels.pvc }}" on node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" is in state "Degraded" for more than 5 minutes. Possible values are: 0=unknown, 1=healthy, 2=degraded, 3=faulted
Longhorn-Volume_Status_Unknown	5m	warning	storage	Longhorn volume "{{ $labels.volume }} / {{ $labels.pvc }}" on node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" is in state "Unknown" for more than 5 minutes. Possible values are: 0=unknown, 1=healthy, 2=degraded, 3=faulted
Longhorn-Node_Storage_Warning	5m	warning	storage	The used storage of node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" is at "{{ $value
Longhorn-Disk_Storage_Warning	5m	warning	storage	The used storage of disk "{{ $labels.disk }}" on node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" is at "{{ $value
Longhorn-Node_Down	5m	critical	storage	There are "{{ $value
Longhorn-Instance_Manager_CPU_Usage_Warning	5m	warning	storage	Longhorn instance manager "{{ $labels.instance_manager }}" on node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" has CPU Usage / CPU request at "{{ $value
Longhorn-Node_CPU_Usage_Warning	5m	warning	storage	Longhorn node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" has CPU Usage / CPU capacity at "{{ $value

rules/metallb.yml

Alert Name	For	Severity	Type	Description
MetalLB-BGP_Session_Down	5m	error	network	MetalLB speaker {{ $labels.instance }} in cluster "{{ $labels.cluster}}" has BGP session {{ $labels.peer }} down for more than 5 minutes.
MetalLB-BGP_All_Sessions_Down	5m	critical	network	MetalLB in "{{ $labels.cluster}}" all {{ $value }} BGP sessions are down for more than 5 minutes.
MetalLB-Address_Pool_High_Usage	5m	info	network	MetalLB pool "{{ $labels.pool }}" in cluster "{{ $labels.cluster}}" has more than 75% of the total addresses used.
MetalLB-Config_Stale	5m	warning	network	MetalLB {{ $labels.instance }} container "{{ $labels.container }}" in cluster "{{ $labels.cluster}}" has a stale config.
MetalLB-Config_not_Loaded	5m	warning	network	MetalLB {{ $labels.instance }} container "{{ $labels.container }}" in cluster "{{ $labels.cluster}}" config not loaded.
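
To make the 75% threshold in the address pool row concrete, the following sketch shows one way that rule might be expressed. The alert name, duration, severity, type and description come from the table; the group name, the label keys and the expression are assumptions. The expression assumes the MetalLB allocator metrics `metallb_allocator_addresses_in_use_total` and `metallb_allocator_addresses_total` are scraped, and the chart's real rule may differ.

```yaml
# Hypothetical sketch only: group name, label keys and expr are assumptions.
groups:
  - name: metallb
    rules:
      - alert: MetalLB-Address_Pool_High_Usage
        # Ratio of allocated to total addresses per pool, compared with 75%.
        expr: |
          metallb_allocator_addresses_in_use_total
            / metallb_allocator_addresses_total > 0.75
        for: 5m
        labels:
          severity: info
          type: network
        annotations:
          description: >-
            MetalLB pool "{{ $labels.pool }}" in cluster "{{ $labels.cluster }}"
            has more than 75% of the total addresses used.
```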

rules/node_exporter.yml

Alert Name	For	Severity	Type	Description
Node-Recently_Rebooted	0m	info	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" has been rebooted in the last 30 minutes. Value: {{ humanizeDuration $value }} uptime.
Node-CPU_High_Usage	30m	info	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - CPU usage has exceeded the threshold of 90% for more than 30 minutes. Value: {{ humanize $value }}
Node-CPU_High_steal	15m	info	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - CPU %steal has exceeded the threshold of 10% for more than 15 minutes. Value: {{ humanize $value }}
Node-CPU_High_steal	15m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - CPU %steal has exceeded the threshold of 30% for more than 15 minutes. Value: {{ humanize $value }}
Node-CPU_High_iowait	15m	info	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - CPU %iowait has exceeded the threshold of 10% for more than 15 minutes. Value: {{ humanize $value }}
Node-CPU_High_iowait	15m	error	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - CPU %iowait has exceeded the threshold of 30% for more than 15 minutes. Value: {{ humanize $value }}
Node-Memory_Major_Pages_Faults	15m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - memory major page faults are occurring at very high rate. Value: {{ humanize $value }}
Node-Memory_High_Usage	10m	info	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - more than 80% memory used for more than 10 minutes. Value: {{ humanize $value }}
Node-Memory_High_Usage	10m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - more than 90% memory used for more than 10 minutes. Value: {{ humanize $value }}
Node-Disk_Space_High_Usage	15m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - mountpoint "{{ $labels.mountpoint }}" has more than 80% used disk space for more than 15m. Value: {{ humanize $value }}
Node-Disk_Space_High_Usage	15m	error	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - mountpoint "{{ $labels.mountpoint }}" has more than 90% used disk space for more than 15m. Value: {{ humanize $value }}
Node-Disk_Will_Fill_Up_In_4h	5m	error	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - mountpoint "{{ $labels.mountpoint }}" will fill up in 4 hours at the current rate of utilization. Value: {{ printf `node_filesystem_avail_bytes{fstype=~"ext.*
Node-High_Disk_Inodes_High_Usage	15m	info	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - mountpoint "{{ $labels.mountpoint }}" has more than 60% used inodes for more than 15m. Value: {{ humanize $value }}
Node-High_Disk_Inodes_High_Usage	15m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - mountpoint "{{ $labels.mountpoint }}" has more than 70% used inodes for more than 15m. Value: {{ humanize $value }}
Node-Load_High	15m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - system load per core is above 2 for the last 15 minutes. This might indicate resource saturation on this instance and can cause it to become unresponsive. Value: {{ humanize $value }}
Node-fds_Near_Limit_Process	5m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - "{{ $labels.job }}" has more than 80% used file descriptors for more than 5m. Value: {{ humanize $value }}
Node-fds_Near_Limit	5m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - more than 80% used file descriptors for more than 5m. Value: {{ humanize $value }}
Node-Network_High_Receive_Drop	0m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - interface "{{ $labels.device }}" has high drop in network reception. Value: {{ humanize $value }}
Node-Network_High_Transmit_Drop	0m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - interface "{{ $labels.device }}" has high drop in network transmission. Value: {{ humanize $value }}
Node-Network_High_Receive_Errors	30m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - interface "{{ $labels.device }}" has encountered receive errors. Value: {{ humanize $value }}
Node-Network_High_Transmit_Errors	30m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - interface "{{ $labels.device }}" has encountered transmit errors. Value: {{ humanize $value }}
Node-Network_Interface_Flapping	0m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - interface "{{ $labels.device }}" is changing its up status often. Value: {{ humanize $value }}
Node-Too_Many_OOM_Kills	0m	error	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - several OOM kills detected in the past 1h. Value: {{ humanize $value }}. Find out which process by running `dmesg
Node-Clock_Not_Synchronising	10m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - clock is not synchronising. Ensure NTP is configured on this host.
Node-Clock_Skew_Detected	10m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - clock is out of sync by more than 0.05s. Ensure NTP is configured correctly on this host.
Node-Host_Conntrack_Limit	10m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - more than 75% of conntrack entries are used.
Node-EDAC_Correctable_Errors_Detected	0m	warning	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - EDAC Correctable Errors detected.
Node-EDAC_Uncorrectable_Errors_Detected	0m	error	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - EDAC Uncorrectable Errors detected.
Node-Filesystem_Device_Error	0m	critical	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - device "{{ $labels.device }}" filesystem error.
Node-Disk_Queue_Length_High	10m	info	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - device "{{ $labels.device }}" has disk queue length greater than 1 for more than 10 minutes. Value: {{ humanize $value }}
Node-Disk_IO_Time_Weighted_Seconds	10m	info	system	Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - device "{{ $labels.device }}" has high disk io queue (aqu-sq) for more than 10 minutes. Value: {{ humanize $value }}

Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

♻️ Rebasing: Whenever MR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this MR and you won't be reminded about this update again.

If you want to rebase/retry this MR, check this box

This MR has been generated by Renovate Bot Sylva instance.

CI configuration couldn't be handled by the MR description. A dedicated comment has been posted to control it.

If no checkbox is checked, a default pipeline will be enabled (capm3, or capo if capo label is set)

v0.1.5: sylva-thanos-rules: 0.1.5

Merge Requests integrated in this release

Contributors

sylva-thanos-rules

Generate ConfigMap object for consumption by Thanos Ruler

rules/_helper_kubernetes_metadata.yml

rules/clusters_state_rules.yml

rules/kubernetes_capacity.yml

rules/kubernetes_cluster_components.yml

rules/kubernetes_jobs.yml

rules/kubernetes_nodes.yml

rules/kubernetes_pods.yml

rules/kubernetes_storage.yml

rules/longhorn.yml

rules/metallb.yml

rules/node_exporter.yml
