Update dependency https://gitlab.com/sylva-projects/sylva-elements/helm-charts/sylva-thanos-rules.git to v0.3.0 (main)
This MR contains the following updates:
| Package | Update | Change |
|---|---|---|
| https://gitlab.com/sylva-projects/sylva-elements/helm-charts/sylva-thanos-rules.git | minor | 0.2.3 -> 0.3.0 |
⚠️ Warning: Some dependencies could not be looked up. Check the Dependency Dashboard for more information.
Release Notes
sylva-projects/sylva-elements/helm-charts/sylva-thanos-rules (https://gitlab.com/sylva-projects/sylva-elements/helm-charts/sylva-thanos-rules.git)
v0.3.0: sylva-thanos-rules: 0.3.0
Merge Requests integrated in this release
3 merge requests were integrated in this repo between 0.2.2 and 0.3.0. These notes don't account for the MRs merged in secondary repos.
Other
- Add monitoring stack alerting rules !109
CI
- Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.43 renovate !106 !110
Contributors
1 person contributed.
sylva-thanos-rules
Generates a ConfigMap object for consumption by Thanos Ruler.
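As an orientation, here is a minimal sketch of the kind of object the chart renders; the metadata, the selector label, and the embedded rule are illustrative assumptions rather than the chart's actual output.

```yaml
# Illustrative sketch only: object name, labels and rule content are assumptions,
# not the chart's rendered output.
apiVersion: v1
kind: ConfigMap
metadata:
  name: sylva-thanos-rules          # assumed name
  labels:
    thanos-ruler: rules             # assumed label matched by Thanos Ruler's rule selector
data:
  etcd.yml: |
    groups:
      - name: etcd
        rules:
          - alert: etcd-Members_No_Leader
            expr: etcd_server_has_leader == 0
            for: 5m
            labels:
              severity: warning
```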
Details about rules
rules/_helper_kubernetes_metadata.yml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| k8s-Metamonitoring_configuration_error_kube_namespace_labels | 45m | error | deployment | Metric "kube_namespace_labels" from cluster "{{ $labels.capi_cluster_name }}" is not exposed by "kube-state-metrics". |
| k8s-Metamonitoring_configuration_error_rancher_project_info | 45m | error | deployment | Metric "rancher_project_info" from the management cluster is not exposed by "kube-state-metrics". |
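As an illustration of how such "metric not exposed" checks are typically written, the sketch below uses PromQL's `absent()`; this is an assumption about the shape of the rule, not the chart's exact expression (which is likely scoped per cluster).

```yaml
groups:
  - name: kubernetes_metadata_metamonitoring   # assumed group name
    rules:
      - alert: k8s-Metamonitoring_configuration_error_rancher_project_info
        # absent() returns a one-element vector when no series of this metric exists
        expr: absent(rancher_project_info)
        for: 45m
        labels:
          severity: error
          type: deployment
```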
rules/clusters_state_rules.yml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| Sylva_cluster_Prometheus_not_Sending_Data_management | 45m | critical | deployment | Prometheus server from the management cluster has not sent data in the last 45m. |
| Sylva_cluster_Prometheus_not_Sending_Data | 45m | critical | deployment | Prometheus server from cluster "{{ $labels.capi_cluster_name }}" in namespace "{{ $labels.capi_cluster_namespace }}" has not sent data in the last 45m. |
| Sylva_clusters_different_number | 45m | critical | deployment | A cluster is not properly provisioned in Rancher; check all clusters to see if the cattle-agent is properly deployed. |
| Sylva_clusters_metric_absent | 45m | error | deployment | Metric "capi_cluster_info" from the management cluster is not exposed by "kube-state-metrics". |
rules/etcd.yml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| etcd-Members_Down | 5m | warning | etcd | etcd in cluster "{{ $labels.cluster }}" members are down. |
| etcd-Members_Insufficient | 5m | warning | etcd | etcd in cluster "{{ $labels.cluster }}" has insufficient members. Value: {{ $value }} |
| etcd-Members_No_Leader | 5m | warning | etcd | etcd in cluster "{{ $labels.cluster }}" member {{ $labels.instance }} has no leader. |
| etcd-High_Number_of_Leader_Changes | 5m | warning | etcd | etcd in cluster "{{ $labels.cluster }}" {{ $value }} leader changes within the last 15 minutes. Frequent elections may be a sign of insufficient resources, high network latency, or disruptions by other components and should be investigated. |
| etcd-gRPC_High_Number_of_Failed_Requests | 10m | warning | etcd | etcd in cluster "{{ $labels.cluster }}" instance "{{ $labels.instance }}" gRPC requests failed for "{{ $labels.grpc_method }}". Value: {{ $value }} |
| etcd-gRPC_High_Number_of_Failed_Requests | 5m | critical | etcd | etcd in cluster "{{ $labels.cluster }}" instance "{{ $labels.instance }}" gRPC requests failed for "{{ $labels.grpc_method }}". Value: {{ $value }} |
| etcd-Members_Communication_Slow | 10m | warning | etcd | etcd in cluster "{{ $labels.cluster }}" "{{ $labels.instance }}" to "{{ $labels.To }}" member communication is taking too long. Value: {{ $value }}s. |
| etcd-High_Number_of_Failed_Proposals | 15m | warning | etcd | etcd in cluster "{{ $labels.cluster }}" instance "{{ $labels.instance }}" has proposal failures within the last 30 minutes. Value: {{ $value }} |
| etcd-High_Fsync_Duration | 10m | warning | etcd | etcd in cluster "{{ $labels.cluster }}" instance "{{ $labels.instance }}" has high 99th percentile fsync durations. Value: {{ $value }}s |
| etcd-High_Commit_Duration | 10m | warning | etcd | etcd in cluster "{{ $labels.cluster }}" instance "{{ $labels.instance }}" has high 99th percentile commit durations. Value: {{ $value }}s |
| etcd-HTTP_High_Number_of_Failed_Requests | 10m | warning | etcd | etcd in cluster "{{ $labels.cluster }}" instance "{{ $labels.instance }}" HTTP requests failed for "{{ $labels.method }}". Value: {{ $value }} |
| etcd-HTTP_High_Number_of_Failed_Requests | 10m | critical | etcd | etcd in cluster "{{ $labels.cluster }}" instance "{{ $labels.instance }}" HTTP requests failed for "{{ $labels.method }}". Value: {{ $value }} |
| etcd-HTTP_Requests_Slow | 10m | warning | etcd | etcd in cluster "{{ $labels.cluster }}" instance "{{ $labels.instance }}" HTTP requests for "{{ $labels.method }}" are slow. Value: {{ $value }}s |
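The etcd rules appear to follow the familiar upstream etcd mixin patterns; as a hedged example, the fsync-duration alert could look roughly like the excerpt below (the 0.5s threshold and the absence of job matchers are assumptions).

```yaml
# Excerpt of a single rule entry; illustrative expression, not necessarily the chart's exact rule.
- alert: etcd-High_Fsync_Duration
  # 99th percentile of WAL fsync latency over 5m; the 0.5s threshold is an assumption
  expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.5
  for: 10m
  labels:
    severity: warning
    type: etcd
```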
rules/kubernetes_capacity.yml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| k8s-Cluster_CPU_Overcommitted | 5m | warning | k8s | Kubernetes cluster "{{ $labels.cluster }}" has allocated over 75% of the allocatable CPUs. Node failures may cause Pods to be unschedulable due to lack of resources. |
| k8s-Cluster_Memory_Overcommitted | 5m | warning | k8s | Kubernetes cluster "{{ $labels.cluster }}" has allocated over 75% of the allocatable Memory. Node failures may cause Pods to be unschedulable due to lack of resources. |
| k8s-Cluster_Too_Many_Pods | 15m | warning | k8s | Kubernetes cluster "{{ $labels.cluster }}" number of pods is over 90% of the Pod number limit. Value: {{ humanize $value }}% |
| k8s-Node_Too_Many_Pods | 15m | warning | k8s | Kubernetes cluster "{{ $labels.cluster }}" node {{ $labels.node }} number of pods is over 90% of the Pod number limit. Value: {{ humanize $value }}% |
| k8s-Kube_Quota_Almost_Full | 15m | warning | k8s | Namespace {{ $labels.namespace }} in cluster {{ $labels.cluster }} is using {{ $value |
| k8s-Kube_Quota_Exceeded | 15m | error | k8s | Namespace {{ $labels.namespace }} in cluster {{ $labels.cluster }} is using {{ $value |
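The quota alerts are presumably built on kube-state-metrics' `kube_resourcequota` series, along the lines of the upstream kubernetes-mixin; the exact expression and threshold below are assumptions.

```yaml
# Excerpt of a single rule entry; expression borrowed from the upstream
# kubernetes-mixin pattern, not necessarily the chart's exact rule.
- alert: k8s-Kube_Quota_Almost_Full
  expr: |
    kube_resourcequota{type="used"}
      / ignoring(instance, job, type)
    (kube_resourcequota{type="hard"} > 0)
      > 0.9
  for: 15m
  labels:
    severity: warning
    type: k8s
```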
rules/kubernetes_cluster_components.yml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| k8s-Version_Mismatch | 4h | warning | k8s | Kubernetes cluster "{{ $labels.cluster }}" has different versions of Kubernetes components running. Value: {{ $value }} |
| k8s-Client_Errors | 15m | warning | k8s | Kubernetes cluster "{{ $labels.cluster }}" API server client "{{ $labels.instance }}" job "{{ $labels.job }}" is experiencing errors. Value: {{ printf "%0.0f" $value }}% |
| k8s-Client_Certificate_Expiration | 5m | warning | k8s | A client certificate used to authenticate to Kubernetes apiserver is expiring in less than 7 days on cluster {{ $labels.cluster }}. |
| k8s-Client_Certificate_Expiration | 5m | critical | k8s | A client certificate used to authenticate to Kubernetes apiserver is expiring in less than 24h on cluster {{ $labels.cluster }}. |
| k8s-API_Global_Error_Rate_High | 15m | warning | k8s | Kubernetes cluster "{{ $labels.cluster }}" API server is returning errors for over 3% of requests. Value: {{ humanize $value }}% |
| k8s-API_Error_Rate_High | 15m | warning | k8s | Kubernetes cluster "{{ $labels.cluster }}" API server is returning errors for 10% of requests for {{ $labels.verb }} {{ $labels.resource }} {{ $labels.subresource }}. Value: {{ humanize $value }}% |
| k8s-Aggregated_API_Errors | 15m | warning | k8s | Kubernetes aggregated API {{ $labels.name }}/{{ $labels.namespace }} in cluster "{{ $labels.cluster }}" has reported errors. It has appeared unavailable {{ $value |
| k8s-Aggregated_API_Down | 15m | warning | k8s | Kubernetes aggregated API {{ $labels.name }}/{{ $labels.namespace }} in cluster "{{ $labels.cluster }}" has been only {{ $value |
| k8s-API_Endpoint_Down | 15m | error | k8s | Kubernetes API endpoint {{ $labels.instance }} in cluster {{ $labels.cluster }} is unreachable. |
| k8s-API_Down | 15m | critical | k8s | Kubernetes API in cluster {{ $labels.cluster }} is unreachable. |
| k8s-API_Terminated_Requests | 15m | warning | k8s | Kubernetes API in cluster {{ $labels.cluster }} has terminated {{ $value |
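For the API error-rate alerts above, a sketch of the usual approach (rate of 5xx responses over all apiserver requests) is shown below; the grouping and exact threshold handling are assumptions.

```yaml
# Excerpt: illustrative expression for the global API error-rate alert.
- alert: k8s-API_Global_Error_Rate_High
  expr: |
    sum by (cluster) (rate(apiserver_request_total{code=~"5.."}[5m]))
      / sum by (cluster) (rate(apiserver_request_total[5m])) * 100
      > 3
  for: 15m
  labels:
    severity: warning
    type: k8s
```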
rules/kubernetes_jobs.yml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| k8s-CronJob_Status_Failed | 5m | warning | k8s | CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} in cluster "{{ $labels.cluster }}" failed. Last job has failed multiple times. Value: {{ $value }} |
| k8s-CronJob_Taking_Too_Long | 0m | warning | k8s | CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} in cluster "{{ $labels.cluster }}" is taking too long to complete - it is over its deadline. Value: {{ humanizeDuration $value }} |
| k8s-Job_not_Completed | 15m | warning | k8s | Job {{ $labels.namespace }}/{{ $labels.job_name }} in cluster "{{ $labels.cluster }}" is taking more than 12h to complete. |
| k8s-Job_Failed | 15m | warning | k8s | Job {{ $labels.namespace }}/{{ $labels.job_name }} in cluster "{{ $labels.cluster }}" failed to complete. Removing failed job after investigation should clear this alert. |
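The job alerts above presumably rely on kube-state-metrics job series; for example, the failed-job alert could be as simple as the sketch below (an assumption, not the chart's exact expression).

```yaml
# Excerpt: hypothetical expression for the failed-job alert.
- alert: k8s-Job_Failed
  expr: kube_job_status_failed > 0
  for: 15m
  labels:
    severity: warning
    type: k8s
```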
rules/kubernetes_nodes.yml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| k8s-Node_Kubelet_Down | 5m | critical | k8s | Kubelet on {{ $labels.node }} in cluster "{{ $labels.cluster }}" is not reachable |
| k8s-Node_Status_OutOfDisk | 5m | warning | k8s | Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" is almost out of disk space |
| k8s-Node_Status_MemoryPressure | 5m | warning | k8s | Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" is under memory pressure. |
| k8s-Node_Status_DiskPressure | 5m | warning | k8s | Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" is under disk pressure |
| k8s-Node_Status_PIDPressure | 5m | warning | k8s | Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" is under PID pressure |
| k8s-Node_Status_NotReady | 5m | error | k8s | Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" has not been Ready for more than an hour |
| k8s-Node_Status_NetworkUnavailable | 5m | warning | k8s | Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" has NetworkUnavailable condition. |
| k8s-Node_Status_Ready_flapping | 5m | warning | k8s | Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" readiness status changed {{ $value }} times in the last 15 minutes. |
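Node condition alerts of this kind are usually driven by `kube_node_status_condition`; a hedged sketch for the NotReady case follows (the longer observation window implied by "more than an hour" is omitted here).

```yaml
# Excerpt: simplified sketch; the real rule likely also evaluates the condition
# over a longer window to match the "more than an hour" wording.
- alert: k8s-Node_Status_NotReady
  expr: kube_node_status_condition{condition="Ready", status="true"} == 0
  for: 5m
  labels:
    severity: error
    type: k8s
```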
rules/kubernetes_pods.yml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| k8s-Pod_Status_not_Ready | 15m | warning | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" on node "{{ $labels.node }}" has been in a non-ready state for longer than 15 minutes. |
| k8s-Pod_Status_OOMKilled | 0m | error | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster}}" on node "{{ $labels.node }}" has been restarted due to OOMKilled reason in the last hour. Value: {{ humanize $value }} |
| k8s-Pod_Status_Crashlooping | 15m | warning | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster}}" on node "{{ $labels.node }}" was restarted more than 5 times within the last hour. Value: {{ humanize $value }} |
| k8s-Pod_Init_Container_Status_Crashlooping | 15m | warning | k8s | Init Container from Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" on node "{{ $labels.node }}" was restarted more than 5 times within the last hour. Value: {{ humanize $value }} |
| k8s-Pod_Container_Status_Waiting | 1h | warning | k8s | Container "{{ $labels.container }}" from Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" on node "{{ $labels.node }}" has been in waiting state for longer than 1 hour. |
| k8s-Statefulset_Replicas_not_Ready | 15m | warning | k8s | Statefulset {{ $labels.namespace }}/{{ $labels.statefulset }} in cluster "{{ $labels.cluster }}" has replicas in "notReady" state. Value: {{ humanize $value }} |
| k8s-Statefulset_Generation_Mismatch | 15m | warning | k8s | Statefulset {{ $labels.namespace }}/{{ $labels.statefulset }} in cluster "{{ $labels.cluster }}" generation does not match. This indicates that the StatefulSet has failed but has not been rolled back |
| k8s-Statefulset_Update_not_Rolled_Out | 15m | warning | k8s | Statefulset {{ $labels.namespace }}/{{ $labels.statefulset }} in cluster "{{ $labels.cluster }}" update has not been rolled out |
| k8s-Statefulset_Replicas_Mismatch | 15m | warning | k8s | Statefulset {{ $labels.namespace }}/{{ $labels.statefulset }} in cluster "{{ $labels.cluster }}" has not matched the expected number of replicas. Value: {{ $value }} / {{ printf kube_statefulset_spec_replicas{statefulset="%s", cluster="%s"} $labels.statefulset $labels.cluster |
| k8s-Statefulset_Replicas_not_Updated | 15m | warning | k8s | Statefulset {{ $labels.namespace }}/{{ $labels.statefulset }} in cluster "{{ $labels.cluster }}" replicas are not updated and available for deployment |
| k8s-ReplicaSet_Replicas_Mismatch | 15m | warning | k8s | ReplicaSet {{ $labels.namespace }}/{{ $labels.replicaset }} in cluster "{{ $labels.cluster }}" has not matched the expected number of replicas |
| k8s-Deployment_Replicas_not_Ready | 15m | warning | k8s | Deployment {{ $labels.namespace }}/{{ $labels.deployment }} in cluster "{{ $labels.cluster }}" has replicas in "notReady" state. Value: {{ humanize $value }} |
| k8s-Deployment_Replicas_Mismatch | 15m | warning | k8s | Deployment {{ $labels.namespace }}/{{ $labels.deployment }} in cluster "{{ $labels.cluster }}" has not matched the expected number of replicas. Value: {{ $value }} / {{ printf kube_deployment_spec_replicas{deployment="%s", cluster="%s"} $labels.deployment $labels.cluster |
| k8s-Deployment_Generation_Mismatch | 15m | warning | k8s | Deployment {{ $labels.namespace }}/{{ $labels.deployment }} in cluster "{{ $labels.cluster }}" generation does not match expected one |
| k8s-Deployment_Replicas_not_Updated | 15m | warning | k8s | Deployment {{ $labels.namespace }}/{{ $labels.deployment }} in cluster "{{ $labels.cluster }}" replicas are not updated and available for deployment |
| k8s-Deployment_Rollout_Stuck | 15m | warning | k8s | Deployment {{ $labels.namespace }}/{{ $labels.deployment }} in cluster "{{ $labels.cluster }}" is not progressing for longer than 15 minutes. |
| k8s-Daemonset_Rollout_Stuck | 15m | warning | k8s | Daemonset {{ $labels.namespace }}/{{ $labels.daemonset }} in cluster "{{ $labels.cluster }}" has less than 100% of desired pods scheduled and ready. Value: {{ humanize $value }}% |
| k8s-Daemonset_not_Scheduled | 15m | warning | k8s | Daemonset {{ $labels.namespace }}/{{ $labels.daemonset }} in cluster "{{ $labels.cluster }}" has unscheduled pods. Value: {{ humanize $value }} |
| k8s-Daemonset_Misscheduled | 15m | warning | k8s | Daemonset pods {{ $labels.namespace }}/{{ $labels.daemonset }} in cluster "{{ $labels.cluster }}" are running where they are not supposed to. Value: {{ humanize $value }} |
| k8s-Daemonset_Generation_Mismatch | 15m | warning | k8s | Daemonset {{ $labels.namespace }}/{{ $labels.daemonset }} in cluster "{{ $labels.cluster }}" generation does not match expected one |
| k8s-Pod_Resource_Allocation_Requests_Too_Low_CPU | 15m | info | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" CPU utilization is greater than 80% the resource requests. Consider updating the allocated value. |
| k8s-Pod_Resource_Allocation_Requests_Too_Low_CPU | 15m | warning | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" CPU utilization is greater than 90% the resource requests. Consider updating the allocated value. |
| k8s-Pod_Resource_Allocation_Limits_Too_Low_CPU | 15m | info | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" CPU utilization is greater than 80% the resource limits. Consider updating the allocated value. |
| k8s-Pod_Resource_Allocation_Limits_Too_Low_CPU | 15m | warning | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" CPU utilization is greater than 90% the resource limits. Consider updating the allocated value. |
| k8s-Pod_Resource_Allocation_Requests_Too_Low_Memory | 15m | info | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" memory utilization is greater than 80% of the resource requests. Consider updating the allocated value. |
| k8s-Pod_Resource_Allocation_Requests_Too_Low_Memory | 15m | warning | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" memory utilization is greater than 90% of the resource requests. Consider updating the allocated value. |
| k8s-Pod_Resource_Allocation_Limits_Too_Low_Memory | 15m | info | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" memory utilization is greater than 80% of the resource limits. Consider updating the allocated value. |
| k8s-Pod_Resource_Allocation_Limits_Too_Low_Memory | 15m | warning | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" memory utilization is greater than 90% of the resource limits. Consider updating the allocated value. |
| k8s-Pod_Resource_Allocation_Requests_Too_High_CPU | 15m | info | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" CPU utilization is smaller than 20% the resource requests. Consider updating the allocated value. |
| k8s-Pod_Resource_Allocation_Requests_Too_High_CPU | 15m | warning | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" CPU utilization is smaller than 10% the resource requests. Consider updating the allocated value. |
| k8s-Pod_Resource_Allocation_Limits_Too_High_CPU | 15m | info | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" CPU utilization is smaller than 20% the resource limits. Consider updating the allocated value. |
| k8s-Pod_Resource_Allocation_Limits_Too_High_CPU | 15m | warning | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" CPU utilization is smaller than 10% the resource limits. Consider updating the allocated value. |
| k8s-Pod_Resource_Allocation_Requests_Too_High_Memory | 15m | info | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" memory utilization is smaller than 20% of the resource requests. Consider updating the allocated value. |
| k8s-Pod_Resource_Allocation_Requests_Too_High_Memory | 15m | warning | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" memory utilization is smaller than 10% of the resource requests. Consider updating the allocated value. |
| k8s-Pod_Resource_Allocation_Limits_Too_High_Memory | 15m | info | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" memory utilization is smaller than 20% of the resource limits. Consider updating the allocated value. |
| k8s-Pod_Resource_Allocation_Limits_Too_High_Memory | 15m | warning | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" memory utilization is smaller than 10% of the resource limits. Consider updating the allocated value. |
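As one hedged example from the table above, the crash-looping alert ("restarted more than 5 times within the last hour") maps naturally onto the container restart counter:

```yaml
# Excerpt: illustrative expression; the label joins used to expose node/cluster
# in the description are omitted.
- alert: k8s-Pod_Status_Crashlooping
  expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
  for: 15m
  labels:
    severity: warning
    type: k8s
```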
rules/kubernetes_storage.yml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| k8s-Persistent_Volume_Disk_Space_Usage_High | 5m | info | k8s | PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} in cluster "{{ $labels.cluster }}" disk space usage is over 80% used. Value: {{ printf "%0.2f" $value }}% |
| k8s-Persistent_Volume_Disk_Space_Usage_High | 5m | warning | k8s | PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} in cluster "{{ $labels.cluster }}" disk space usage is over 90% used. Value: {{ printf "%0.2f" $value }}% |
| k8s-Persistent_Volume_Full_in_4_days | 5m | info | k8s | PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} in cluster "{{ $labels.cluster }}" will fill up in 4 days at the current rate of utilization. Value: {{ printf "%0.2f" $value }}% available |
| k8s-Persistent_Volume_Inodes_Usage_High | 5m | info | k8s | PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} in cluster "{{ $labels.cluster }}" inodes usage is over 80% used. Value: {{ printf "%0.2f" $value }}% |
| k8s-Persistent_Volume_Inodes_Usage_High | 5m | warning | k8s | PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} in cluster "{{ $labels.cluster }}" inodes usage is over 90% used. Value: {{ printf "%0.2f" $value }}% |
| k8s-Persistent_Volume_Errors | 5m | warning | k8s | PersistentVolume "{{ $labels.persistentvolume }}" in cluster "{{ $labels.cluster }}" has status "{{ $labels.phase }}" |
| k8s-Persistent_Volume_Claim_Orphan | 3h | info | k8s | PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} in cluster "{{ $labels.cluster }}" is not used by any pod |
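The PVC usage alerts above are presumably based on kubelet volume stats; a sketch for the 80% info-level threshold follows (metric choice and filtering are assumptions).

```yaml
# Excerpt: hypothetical expression for the 80% disk-space usage alert.
- alert: k8s-Persistent_Volume_Disk_Space_Usage_High
  expr: |
    kubelet_volume_stats_used_bytes
      / kubelet_volume_stats_capacity_bytes * 100
      > 80
  for: 5m
  labels:
    severity: info
    type: k8s
```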
rules/kubernetes_storage_ephemeral.yml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| k8s-Ephemeral_Storage_Container_Usage_at_Limit | 5m | warning | k8s | Ephemeral storage usage of pod/container "{{ $labels.pod_name }}"/"{{ $labels.exported_container }}" in namespace "{{ $labels.pod_namespace }}" on node "{{ $labels.node_name }}" in cluster "{{ $labels.cluster }}" is at {{ $value }}% of the limit. |
| k8s-Ephemeral_Storage_Container_Usage_Reaching_Limit | 15m | warning | k8s | Ephemeral storage limit of pod/container "{{ $labels.pod_name }}"/"{{ $labels.exported_container }}" in namespace "{{ $labels.pod_namespace }}" on node "{{ $labels.node_name }}" in cluster "{{ $labels.cluster }}" is expected to be reached within 12 hours. Currently, {{ $value }}% is used. |
| k8s-Ephemeral_Storage_Volume_Filled_Up | 5m | warning | k8s | Ephemeral storage volume "{{ $labels.volume_name }}" of pod "{{ $labels.pod_name }}" in namespace "{{ $labels.pod_namespace }}" in cluster "{{ $labels.cluster }}" high usage. Value: {{ $value }}% |
| k8s-Ephemeral_Storage_Volume_Filling_Up | 5m | warning | k8s | Ephemeral storage volume "{{ $labels.volume_name }}" of pod "{{ $labels.pod_name }}" in namespace "{{ $labels.pod_namespace }}" in cluster "{{ $labels.cluster }}" is expected to be filled up within 12 hours. Currently, {{ $value }}% is used |
| k8s-Ephemeral_Storage_on_Node_Filling_Up | 5m | warning | k8s | Ephemeral storage on node "{{ $labels.node_name }}" in cluster "{{ $labels.cluster }}" is greater than 75%. Value: {{ $value }}% |
| k8s-Ephemeral_Storage_on_Node_Filling_Up | 5m | warning | k8s | Ephemeral storage on node "{{ $labels.node_name }}" in cluster "{{ $labels.cluster }}" is greater than 90%. Value: {{ $value }}% |
rules/longhorn.yml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| Longhorn-Volume_Status_Critical | 5m | error | storage | Longhorn volume "{{ $labels.volume }} / {{ $labels.pvc }}" on node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" is in state "Faulted" for more than 5 minutes. Possible values are: 0=unknown, 1=healthy, 2=degraded, 3=faulted |
| Longhorn-Volume_Status_Warning | 5m | warning | storage | Longhorn volume "{{ $labels.volume }} / {{ $labels.pvc }}" on node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" is in state "Degraded" for more than 5 minutes. Possible values are: 0=unknown, 1=healthy, 2=degraded, 3=faulted |
| Longhorn-Volume_Status_Unknown | 5m | warning | storage | Longhorn volume "{{ $labels.volume }} / {{ $labels.pvc }}" on node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" is in state "Unknown" for more than 5 minutes. Possible values are: 0=unknown, 1=healthy, 2=degraded, 3=faulted |
| Longhorn-Node_Storage_Warning | 5m | warning | storage | The used storage of node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" is at "{{ $value |
| Longhorn-Disk_Storage_Warning | 5m | warning | storage | The used storage of disk "{{ $labels.disk }}" on node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" is at "{{ $value |
| Longhorn-Node_Down | 5m | error | storage | There are "{{ $value |
| Longhorn-Instance_Manager_CPU_Usage_Warning | 5m | info | storage | Longhorn instance manager "{{ $labels.instance_manager }}" on node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" has CPU Usage / CPU request at "{{ $value |
| Longhorn-Node_CPU_Usage_Warning | 5m | info | storage | Longhorn node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" has CPU Usage / CPU capacity at "{{ $value |
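The value mapping quoted in the descriptions above (0=unknown, 1=healthy, 2=degraded, 3=faulted) matches Longhorn's `longhorn_volume_robustness` gauge, so the critical alert is presumably something like:

```yaml
# Excerpt: sketch based on the value mapping quoted above; not necessarily
# the chart's exact expression.
- alert: Longhorn-Volume_Status_Critical
  expr: longhorn_volume_robustness == 3   # 3 = faulted
  for: 5m
  labels:
    severity: error
    type: storage
```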
rules/metallb.yml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| MetalLB-BGP_Session_Down | 5m | error | network | MetalLB speaker {{ $labels.instance }} in cluster "{{ $labels.cluster}}" has BGP session {{ $labels.peer }} down for more than 5 minutes. |
| MetalLB-BGP_All_Sessions_Down | 5m | critical | network | MetalLB in "{{ $labels.cluster}}" all {{ $value }} BGP sessions are down for more than 5 minutes. |
| MetalLB-Address_Pool_High_Usage | 5m | info | network | MetalLB pool "{{ $labels.pool }}" in cluster "{{ $labels.cluster}}" has more than 75% of the total addresses used. |
| MetalLB-Config_Stale | 5m | warning | network | MetalLB {{ $labels.instance }} container "{{ $labels.container }}" in cluster "{{ $labels.cluster}}" has a stale config. |
| MetalLB-Config_not_Loaded | 5m | warning | network | MetalLB {{ $labels.instance }} container "{{ $labels.container }}" in cluster "{{ $labels.cluster}}" config not loaded. |
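MetalLB's speaker exports a per-peer session gauge, so the BGP session alert is presumably of the form below (an assumption, shown for illustration only).

```yaml
# Excerpt: hypothetical expression for the per-peer BGP session alert.
- alert: MetalLB-BGP_Session_Down
  expr: metallb_bgp_session_up == 0
  for: 5m
  labels:
    severity: error
    type: network
```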
rules/monitoring_stack_components.yml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| Monitoring-Prometheus_Bad_Config | 10m | critical | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" has failed to load its configuration. |
| Monitoring-Prometheus_SD_Refresh_Failure | 20m | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" has failed to refresh SD with mechanism "{{ $labels.mechanism }}". |
| Monitoring-Prometheus_Kubernetes_List_Watch_Failures | 15m | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" service discovery is experiencing failures with LIST/WATCH requests to the Kubernetes API in the last 5 minutes. Value: {{ printf "%.0f" $value }} |
| Monitoring-Prometheus_Notification_Queue_Running_Full | 15m | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" alert notification queue predicted to run full in less than 30 minutes. |
| Monitoring-Prometheus_Error_Sending_Alerts_to_Some_Alertmanagers | 15m | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" has {{ printf "%.1f" $value }}% of alerts sent to Alertmanager "{{ $labels.alertmanager }}" affected by errors. |
| Monitoring-Prometheus_not_Connected_to_Alertmanagers | 10m | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" is not connected to any Alertmanagers. |
| Monitoring-Prometheus_TSDB_Reloads_Failing | 4h | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" experienced TSDB reload failures in the last 3h. Value: {{ $value }} |
| Monitoring-Prometheus_TSDB_Compactions_Failing | 4h | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" experienced compaction failures in the last 3h. Value: {{ $value }} |
| Monitoring-Prometheus_not_Ingesting_Samples | 10m | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" is not ingesting samples. |
| Monitoring-Prometheus_Duplicate_Timestamps | 10m | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" is dropping samples/s with different values but duplicated timestamps. Value: {{ printf "%.4g" $value }} |
| Monitoring-Prometheus_Out_of_Order_Timestamps | 10m | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" is dropping samples/s with timestamps arriving out of order. Value: {{ printf "%.4g" $value }} |
| Monitoring-Prometheus_Remote_Storage_Failures | 15m | critical | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" failed to send a high number of samples to "{{ $labels.remote_name }}:{{ $labels.url }}". Value: {{ printf "%.1f" $value }}% |
| Monitoring-Prometheus_Remote_Write_Behind | 15m | critical | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" remote write is behind for "{{ $labels.remote_name }}:{{ $labels.url }}". Value: {{ printf "%.1f" $value }}s |
| Monitoring-Prometheus_Remote_Write_Desired_Shards | 15m | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" remote write desired shards calculation wants to run {{ $value }} shards for queue "{{ $labels.remote_name}}:{{ $labels.url }}", which is more than the max of "{{ printf prometheus_remote_storage_shards_max{instance="%s"} $labels.instance |
| Monitoring-Prometheus_Rule_Failures | 15m | critical | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" failed to evaluate rules in the last 5m in rule group "{{ $labels.rule_group }}". Value: {{ printf "%.0f" $value }} |
| Monitoring-Prometheus_Missing_Rule_Evaluations | 15m | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" missed rule group evaluations in the last 5m in rule group "{{ $labels.rule_group }}". Value: {{ printf "%.0f" $value }} |
| Monitoring-Prometheus_Target_Limit_Hit | 15m | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" dropped targets because the number of targets exceeds the configured target_limit. Value: {{ printf "%.0f" $value }} |
| Monitoring-Prometheus_Label_Limit_Hit | 15m | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" dropped targets because some samples exceeded the configured label_limit, label_name_length_limit or label_value_length_limit. Value: {{ printf "%.0f" $value }} |
| Monitoring-Prometheus_Scrape_Body_Size_Limit_Hit | 15m | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" failed scrapes in the last 5m because some targets exceeded the configured body_size_limit. Value: {{ printf "%.0f" $value }} |
| Monitoring-Prometheus_Scrape_Sample_Limit_Hit | 15m | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" failed scrapes in the last 5m because some targets exceeded the configured sample_limit. Value: {{ printf "%.0f" $value }} |
| Monitoring-Prometheus_Target_Sync_Failure | 5m | critical | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" targets failed to sync because invalid configuration was supplied. Value: {{ printf "%.0f" $value }} |
| Monitoring-Prometheus_High_Query_Load | 15m | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" query API has less than 20% available capacity in its query engine for the last 15 minutes. |
| Monitoring-PrometheusOperator_List_Errors | 15m | warning | monitoring | Controller "{{ $labels.controller }}" in cluster "{{ $labels.cluster }}" has errors while performing "List" operations. |
| Monitoring-PrometheusOperator_Watch_Errors | 15m | warning | monitoring | Controller "{{ $labels.controller }}" in cluster "{{ $labels.cluster }}" has errors while performing "Watch" operations. |
| Monitoring-PrometheusOperator_Sync_Failed | 10m | warning | monitoring | Controller "{{ $labels.controller }}" in cluster "{{ $labels.cluster }}" has failed sync operations for {{ $value }} objects. |
| Monitoring-PrometheusOperator_Reconcile_Errors | 10m | warning | monitoring | Controller "{{ $labels.controller }}" in cluster "{{ $labels.cluster }}" has failed reconciling operations. Value: {{ $value |
| Monitoring-PrometheusOperator_Status_Update_Errors | 10m | warning | monitoring | Controller "{{ $labels.controller }}" in cluster "{{ $labels.cluster }}" has failed status update operations. Value: {{ $value |
| Monitoring-PrometheusOperator_Node_Lookup_Errors | 10m | warning | monitoring | Controller "{{ $labels.controller }}" in cluster "{{ $labels.cluster }}" errors while reconciling Prometheus. |
| Monitoring-PrometheusOperator_Not_Ready | 5m | warning | monitoring | Controller "{{ $labels.controller }}" in cluster "{{ $labels.cluster }}" is not ready to reconcile resources. |
| Monitoring-PrometheusOperator_Rejected_Resources | 5m | warning | monitoring | Controller "{{ $labels.controller }}" in cluster "{{ $labels.cluster }}" rejected "{{ $labels.resource }}" resources. Value: {{ printf "%0.0f" $value }} |
| Monitoring-Alertmanager_Failed_Reload | 10m | critical | monitoring | Alertmanager "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" has failed to load its configuration. |
| Monitoring-Alertmanager_Members_Inconsistent | 15m | critical | monitoring | Alertmanager "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" has only found {{ $value }} members of the Alertmanager cluster. |
| Monitoring-Alertmanager_Failed_to_Send_Alerts | 5m | warning | monitoring | Alertmanager "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" failed to send {{ $value |
| Monitoring-Alertmanager_Cluster_Failed_to_Send_Alerts | 5m | critical | monitoring | Alertmanager in cluster "{{ $labels.cluster }}" has high notification failure rate to "{{ $labels.integration }}". Value: {{ $value |
| Monitoring-Alertmanager_Config_Inconsistent | 20m | critical | monitoring | Alertmanager instances in cluster "{{ $labels.cluster }}" have different configurations. |
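Most of these meta-monitoring alerts mirror the upstream prometheus and prometheus-operator mixins; for instance, the bad-config alert is conventionally written as below (shown as an assumption, not the chart's verbatim rule).

```yaml
# Excerpt: conventional expression from the upstream prometheus mixin.
- alert: Monitoring-Prometheus_Bad_Config
  expr: max_over_time(prometheus_config_last_reload_successful[5m]) == 0
  for: 10m
  labels:
    severity: critical
    type: monitoring
```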
rules/node_exporter.yml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| Node-Recently_Rebooted | 0m | info | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" has been rebooted in the last 30 minutes. Value: {{ humanizeDuration $value }} uptime. |
| Node-CPU_High_Usage | 30m | info | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - CPU usage has exceeded the threshold of 90% for more than 30 minutes. Value: {{ humanize $value }} |
| Node-CPU_High_steal | 15m | info | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - CPU %steal has exceeded the threshold of 10% for more than 15 minutes. Value: {{ humanize $value }} |
| Node-CPU_High_steal | 15m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - CPU %steal has exceeded the threshold of 30% for more than 15 minutes. Value: {{ humanize $value }} |
| Node-CPU_High_iowait | 15m | info | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - CPU %iowait has exceeded the threshold of 10% for more than 15 minutes. Value: {{ humanize $value }} |
| Node-CPU_High_iowait | 15m | error | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - CPU %iowait has exceeded the threshold of 30% for more than 15 minutes. Value: {{ humanize $value }} |
| Node-Memory_Major_Pages_Faults | 15m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - memory major page faults are occurring at very high rate. Value: {{ humanize $value }} |
| Node-Memory_High_Usage | 10m | info | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - more than 80% memory used for more than 10 minutes. Value: {{ humanize $value }} |
| Node-Memory_High_Usage | 10m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - more than 90% memory used for more than 10 minutes. Value: {{ humanize $value }} |
| Node-Disk_Space_High_Usage | 15m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - mountpoint "{{ $labels.mountpoint }}" has more than 80% used disk space for more than 15m. Value: {{ humanize $value }} |
| Node-Disk_Space_High_Usage | 15m | error | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - mountpoint "{{ $labels.mountpoint }}" has more than 90% used disk space for more than 15m. Value: {{ humanize $value }} |
| Node-Disk_Will_Fill_Up_In_4h | 5m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - mountpoint "{{ $labels.mountpoint }}" will fill up in 4 hours at the current rate of utilization. Value: {{ printf `node_filesystem_avail_bytes{fstype=~"ext.* |
| Node-High_Disk_Inodes_High_Usage | 15m | info | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - mountpoint "{{ $labels.mountpoint }}" has more than 60% used inodes for more than 15m. Value: {{ humanize $value }} |
| Node-High_Disk_Inodes_High_Usage | 15m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - mountpoint "{{ $labels.mountpoint }}" has more than 70% used inodes for more than 15m. Value: {{ humanize $value }} |
| Node-Load_High | 15m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - system load per core is above 2 for the last 15 minutes. This might indicate resource saturation on this instance and can cause it to become unresponsive. Value: {{ humanize $value }} |
| Node-fds_Near_Limit_Process | 5m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - "{{ $labels.job }}" has more than 80% used file descriptors for more than 5m. Value: {{ humanize $value }} |
| Node-fds_Near_Limit | 5m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - more than 80% used file descriptors for more than 5m. Value: {{ humanize $value }} |
| Node-Network_High_Receive_Drop | 0m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - interface "{{ $labels.device }}" has high drop in network reception. Value: {{ humanize $value }} |
| Node-Network_High_Transmit_Drop | 0m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - interface "{{ $labels.device }}" has high drop in network transmission. Value: {{ humanize $value }} |
| Node-Network_High_Receive_Errors | 30m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - interface "{{ $labels.device }}" has encountered receive errors. Value: {{ humanize $value }} |
| Node-Network_High_Transmit_Errors | 30m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - interface "{{ $labels.device }}" has encountered transmit errors. Value: {{ humanize $value }} |
| Node-Network_Interface_Flapping | 0m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - interface "{{ $labels.device }}" is changing its up status often. Value: {{ humanize $value }} |
| Node-Network_Bond_Interface_Misconfigured | 5m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - bond "{{ $labels.master }}" is misconfigured. Check bonding slaves configuration. |
| Node-Network_Bond_Interface_Down | 5m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - bond "{{ $labels.master }}" has interface(s) down. Value: {{ $value }} |
| Node-Too_Many_OOM_Kills | 0m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - several OOM kills detected in the past 1h. Value: {{ humanize $value }}. Find out which process by running `dmesg |
| Node-Clock_Not_Synchronising | 10m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - clock is not synchronising. Ensure NTP is configured on this host. |
| Node-Clock_Skew_Detected | 10m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - clock is out of sync by more than 0.05s. Ensure NTP is configured correctly on this host. |
| Node-Host_Conntrack_Limit | 10m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - more than 75% of conntrack entries are used. |
| Node-EDAC_Correctable_Errors_Detected | 0m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - EDAC Correctable Errors detected. |
| Node-EDAC_Uncorrectable_Errors_Detected | 0m | error | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - EDAC Uncorrectable Errors detected. |
| Node-Filesystem_Device_Error | 0m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - device "{{ $labels.device }}" filesystem error. |
| Node-Disk_Queue_Length_High | 10m | info | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - device "{{ $labels.device }}" has disk queue length greater than 1 for more than 10 minutes. Value: {{ humanize $value }} |
| Node-Disk_IO_Time_Weighted_Seconds | 10m | info | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - device "{{ $labels.device }}" has high disk io queue (aqu-sq) for more than 10 minutes. Value: {{ humanize $value }} |
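These node alerts are built on node_exporter metrics; as one hedged example, the 80% used disk-space warning corresponds to less than 20% available space, roughly as sketched below (the filesystem filters are assumptions).

```yaml
# Excerpt: illustrative expression for the 80% used disk-space warning.
- alert: Node-Disk_Space_High_Usage
  expr: |
    node_filesystem_avail_bytes{fstype!~"tmpfs"}
      / node_filesystem_size_bytes{fstype!~"tmpfs"} * 100
      < 20
  for: 15m
  labels:
    severity: warning
    type: system
```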
Configuration
- [ ] If you want to rebase/retry this MR, check this box
This MR has been generated by the Sylva instance of Renovate Bot.
CI configuration can't be handled via the MR description. A dedicated comment has been posted to control it.
If no checkbox is checked, a default pipeline will be enabled (capm3, or capo if the capo label is set).