Update dependency https://gitlab.com/sylva-projects/sylva-elements/helm-charts/sylva-thanos-rules.git to v0.3.0 (main)
This MR contains the following updates:
| Package | Update | Change |
|---|---|---|
| https://gitlab.com/sylva-projects/sylva-elements/helm-charts/sylva-thanos-rules.git | minor | 0.2.3 -> 0.3.0 |
⚠️ Warning: Some dependencies could not be looked up. Check the Dependency Dashboard for more information.
Release Notes
sylva-projects/sylva-elements/helm-charts/sylva-thanos-rules (https://gitlab.com/sylva-projects/sylva-elements/helm-charts/sylva-thanos-rules.git)
v0.3.0: sylva-thanos-rules: 0.3.0
Merge Requests integrated in this release
3 merge requests were integrated in this repo between 0.2.2 and 0.3.0. These notes don't account for the MRs merged in secondary repos.
Other
- Add monitoring stack alerting rules !109
CI
- Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.43 renovate !106 !110
Contributors
1 person contributed.
sylva-thanos-rules
Generates a ConfigMap object for consumption by Thanos Ruler.
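As an orientation, here is a minimal sketch of the kind of object the chart renders; the metadata, the selector label, and the embedded rule are illustrative assumptions rather than the chart's actual output.

```yaml
# Illustrative sketch only: object name, labels and rule content are assumptions,
# not the chart's rendered output.
apiVersion: v1
kind: ConfigMap
metadata:
  name: sylva-thanos-rules          # assumed name
  labels:
    thanos-ruler: rules             # assumed label matched by Thanos Ruler's rule selector
data:
  etcd.yml: |
    groups:
      - name: etcd
        rules:
          - alert: etcd-Members_No_Leader
            expr: etcd_server_has_leader == 0
            for: 5m
            labels:
              severity: warning
```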
Details about rules
rules/_helper_kubernetes_metadata.yml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| k8s-Metamonitoring_configuration_error_kube_namespace_labels | 45m | error | deployment | Metric "kube_namespace_labels" from cluster "{{ $labels.capi_cluster_name }}" is not exposed by "kube-state-metrics". |
| k8s-Metamonitoring_configuration_error_rancher_project_info | 45m | error | deployment | Metric "rancher_project_info" from the management cluster is not exposed by "kube-state-metrics". |
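As an illustration of how such "metric not exposed" checks are typically written, the sketch below uses PromQL's `absent()`; this is an assumption about the shape of the rule, not the chart's exact expression (which is likely scoped per cluster).

```yaml
groups:
  - name: kubernetes_metadata_metamonitoring   # assumed group name
    rules:
      - alert: k8s-Metamonitoring_configuration_error_rancher_project_info
        # absent() returns a one-element vector when no series of this metric exists
        expr: absent(rancher_project_info)
        for: 45m
        labels:
          severity: error
          type: deployment
```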
rules/clusters_state_rules.yml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| Sylva_cluster_Prometheus_not_Sending_Data_management | 45m | critical | deployment | Prometheus server from the management cluster has not sent data in the last 45m. |
| Sylva_cluster_Prometheus_not_Sending_Data | 45m | critical | deployment | Prometheus server from cluster "{{ $labels.capi_cluster_name }}" in namespace "{{ $labels.capi_cluster_namespace }}" has not sent data in the last 45m. |
| Sylva_clusters_different_number | 45m | critical | deployment | A cluster is not properly provisioned in Rancher; check all clusters to see if the cattle-agent is properly deployed. |
| Sylva_clusters_metric_absent | 45m | error | deployment | Metric "capi_cluster_info" from the management cluster is not exposed by "kube-state-metrics". |
rules/etcd.yml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| etcd-Members_Down | 5m | warning | etcd | etcd in cluster "{{ $labels.cluster }}" members are down. |
| etcd-Members_Insufficient | 5m | warning | etcd | etcd in cluster "{{ $labels.cluster }}" has insufficient members. Value: {{ $value }} |
| etcd-Members_No_Leader | 5m | warning | etcd | etcd in cluster "{{ $labels.cluster }}" member {{ $labels.instance }} has no leader. |
| etcd-High_Number_of_Leader_Changes | 5m | warning | etcd | etcd in cluster "{{ $labels.cluster }}" {{ $value }} leader changes within the last 15 minutes. Frequent elections may be a sign of insufficient resources, high network latency, or disruptions by other components and should be investigated. |
| etcd-gRPC_High_Number_of_Failed_Requests | 10m | warning | etcd | etcd in cluster "{{ $labels.cluster }}" instance "{{ $labels.instance }}" gRPC requests failed for "{{ $labels.grpc_method }}". Value: {{ $value }} |
| etcd-gRPC_High_Number_of_Failed_Requests | 5m | critical | etcd | etcd in cluster "{{ $labels.cluster }}" instance "{{ $labels.instance }}" gRPC requests failed for "{{ $labels.grpc_method }}". Value: {{ $value }} |
| etcd-Members_Communication_Slow | 10m | warning | etcd | etcd in cluster "{{ $labels.cluster }}" "{{ $labels.instance }}" to "{{ $labels.To }}" member communication is taking too long. Value: {{ $value }}s. |
| etcd-High_Number_of_Failed_Proposals | 15m | warning | etcd | etcd in cluster "{{ $labels.cluster }}" instance "{{ $labels.instance }}" has proposal failures within the last 30 minutes. Value: {{ $value }} |
| etcd-High_Fsync_Duration | 10m | warning | etcd | etcd in cluster "{{ $labels.cluster }}" instance "{{ $labels.instance }}" has high 99th percentile fsync durations. Value: {{ $value }}s |
| etcd-High_Commit_Duration | 10m | warning | etcd | etcd in cluster "{{ $labels.cluster }}" instance "{{ $labels.instance }}" has high 99th percentile commit durations. Value: {{ $value }}s |
| etcd-HTTP_High_Number_of_Failed_Requests | 10m | warning | etcd | etcd in cluster "{{ $labels.cluster }}" instance "{{ $labels.instance }}" HTTP requests failed for "{{ $labels.method }}". Value: {{ $value }} |
| etcd-HTTP_High_Number_of_Failed_Requests | 10m | critical | etcd | etcd in cluster "{{ $labels.cluster }}" instance "{{ $labels.instance }}" HTTP requests failed for "{{ $labels.method }}". Value: {{ $value }} |
| etcd-HTTP_Requests_Slow | 10m | warning | etcd | etcd in cluster "{{ $labels.cluster }}" instance "{{ $labels.instance }}" HTTP requests for "{{ $labels.method }}" are slow. Value: {{ $value }}s |
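The etcd rules appear to follow the familiar upstream etcd mixin patterns; as a hedged example, the fsync-duration alert could look roughly like the excerpt below (the 0.5s threshold and the absence of job matchers are assumptions).

```yaml
# Excerpt of a single rule entry; illustrative expression, not necessarily the chart's exact rule.
- alert: etcd-High_Fsync_Duration
  # 99th percentile of WAL fsync latency over 5m; the 0.5s threshold is an assumption
  expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.5
  for: 10m
  labels:
    severity: warning
    type: etcd
```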
rules/kubernetes_capacity.yml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| k8s-Cluster_CPU_Overcommitted | 5m | warning | k8s | Kubernetes cluster "{{ $labels.cluster }}" has allocated over 75% of the allocatable CPUs. Node failures may cause Pods to be unschedulable due to lack of resources. |
| k8s-Cluster_Memory_Overcommitted | 5m | warning | k8s | Kubernetes cluster "{{ $labels.cluster }}" has allocated over 75% of the allocatable Memory. Node failures may cause Pods to be unschedulable due to lack of resources. |
| k8s-Cluster_Too_Many_Pods | 15m | warning | k8s | Kubernetes cluster "{{ $labels.cluster }}" number of pods is over 90% of the Pod number limit. Value: {{ humanize $value }}% |
| k8s-Node_Too_Many_Pods | 15m | warning | k8s | Kubernetes cluster "{{ $labels.cluster }}" node {{ $labels.node }} number of pods is over 90% of the Pod number limit. Value: {{ humanize $value }}% |
| k8s-Kube_Quota_Almost_Full | 15m | warning | k8s | Namespace {{ $labels.namespace }} in cluster {{ $labels.cluster }} is using {{ $value |
| k8s-Kube_Quota_Exceeded | 15m | error | k8s | Namespace {{ $labels.namespace }} in cluster {{ $labels.cluster }} is using {{ $value |
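The quota alerts are presumably built on kube-state-metrics' `kube_resourcequota` series, along the lines of the upstream kubernetes-mixin; the exact expression and threshold below are assumptions.

```yaml
# Excerpt of a single rule entry; expression borrowed from the upstream
# kubernetes-mixin pattern, not necessarily the chart's exact rule.
- alert: k8s-Kube_Quota_Almost_Full
  expr: |
    kube_resourcequota{type="used"}
      / ignoring(instance, job, type)
    (kube_resourcequota{type="hard"} > 0)
      > 0.9
  for: 15m
  labels:
    severity: warning
    type: k8s
```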
rules/kubernetes_cluster_components.yml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| k8s-Version_Mismatch | 4h | warning | k8s | Kubernetes cluster "{{ $labels.cluster }}" has different versions of Kubernetes components running. Value: {{ $value }} |
| k8s-Client_Errors | 15m | warning | k8s | Kubernetes cluster "{{ $labels.cluster }}" API server client "{{ $labels.instance }}" job "{{ $labels.job }}" is experiencing errors. Value: {{ printf "%0.0f" $value }}% |
| k8s-Client_Certificate_Expiration | 5m | warning | k8s | A client certificate used to authenticate to Kubernetes apiserver is expiring in less than 7 days on cluster {{ $labels.cluster }}. |
| k8s-Client_Certificate_Expiration | 5m | critical | k8s | A client certificate used to authenticate to Kubernetes apiserver is expiring in less than 24h on cluster {{ $labels.cluster }}. |
| k8s-API_Global_Error_Rate_High | 15m | warning | k8s | Kubernetes cluster "{{ $labels.cluster }}" API server is returning errors for over 3% of requests. Value: {{ humanize $value }}% |
| k8s-API_Error_Rate_High | 15m | warning | k8s | Kubernetes cluster "{{ $labels.cluster }}" API server is returning errors for 10% of requests for {{ $labels.verb }} {{ $labels.resource }} {{ $labels.subresource }}. Value: {{ humanize $value }}% |
| k8s-Aggregated_API_Errors | 15m | warning | k8s | Kubernetes aggregated API {{ $labels.name }}/{{ $labels.namespace }} in cluster "{{ $labels.cluster }}" has reported errors. It has appeared unavailable {{ $value |
| k8s-Aggregated_API_Down | 15m | warning | k8s | Kubernetes aggregated API {{ $labels.name }}/{{ $labels.namespace }} in cluster "{{ $labels.cluster }}" has been only {{ $value |
| k8s-API_Endpoint_Down | 15m | error | k8s | Kubernetes API endpoint {{ $labels.instance }} in cluster {{ $labels.cluster }} is unreachable. |
| k8s-API_Down | 15m | critical | k8s | Kubernetes API in cluster {{ $labels.cluster }} is unreachable. |
| k8s-API_Terminated_Requests | 15m | warning | k8s | Kubernetes API in cluster {{ $labels.cluster }} has terminated {{ $value |
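For the API error-rate alerts above, a sketch of the usual approach (rate of 5xx responses over all apiserver requests) is shown below; the grouping and exact threshold handling are assumptions.

```yaml
# Excerpt: illustrative expression for the global API error-rate alert.
- alert: k8s-API_Global_Error_Rate_High
  expr: |
    sum by (cluster) (rate(apiserver_request_total{code=~"5.."}[5m]))
      / sum by (cluster) (rate(apiserver_request_total[5m])) * 100
      > 3
  for: 15m
  labels:
    severity: warning
    type: k8s
```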
rules/kubernetes_jobs.yml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| k8s-CronJob_Status_Failed | 5m | warning | k8s | CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} in cluster "{{ $labels.cluster }}" failed. Last job has failed multiple times. Value: {{ $value }} |
| k8s-CronJob_Taking_Too_Long | 0m | warning | k8s | CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} in cluster "{{ $labels.cluster }}" is taking too long to complete - it is over its deadline. Value: {{ humanizeDuration $value }} |
| k8s-Job_not_Completed | 15m | warning | k8s | Job {{ $labels.namespace }}/{{ $labels.job_name }} in cluster "{{ $labels.cluster }}" is taking more than 12h to complete. |
| k8s-Job_Failed | 15m | warning | k8s | Job {{ $labels.namespace }}/{{ $labels.job_name }} in cluster "{{ $labels.cluster }}" failed to complete. Removing failed job after investigation should clear this alert. |
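The job alerts above presumably rely on kube-state-metrics job series; for example, the failed-job alert could be as simple as the sketch below (an assumption, not the chart's exact expression).

```yaml
# Excerpt: hypothetical expression for the failed-job alert.
- alert: k8s-Job_Failed
  expr: kube_job_status_failed > 0
  for: 15m
  labels:
    severity: warning
    type: k8s
```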
rules/kubernetes_nodes.yml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| k8s-Node_Kubelet_Down | 5m | critical | k8s | Kubelet on {{ $labels.node }} in cluster "{{ $labels.cluster }}" is not reachable |
| k8s-Node_Status_OutOfDisk | 5m | warning | k8s | Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" is almost out of disk space |
| k8s-Node_Status_MemoryPressure | 5m | warning | k8s | Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" is under memory pressure. |
| k8s-Node_Status_DiskPressure | 5m | warning | k8s | Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" is under disk pressure |
| k8s-Node_Status_PIDPressure | 5m | warning | k8s | Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" is under PID pressure |
| k8s-Node_Status_NotReady | 5m | error | k8s | Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" has not been Ready for more than an hour |
| k8s-Node_Status_NetworkUnavailable | 5m | warning | k8s | Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" has NetworkUnavailable condition. |
| k8s-Node_Status_Ready_flapping | 5m | warning | k8s | Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" readiness status changed {{ $value }} times in the last 15 minutes. |
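Node condition alerts of this kind are usually driven by `kube_node_status_condition`; a hedged sketch for the NotReady case follows (the longer observation window implied by "more than an hour" is omitted here).

```yaml
# Excerpt: simplified sketch; the real rule likely also evaluates the condition
# over a longer window to match the "more than an hour" wording.
- alert: k8s-Node_Status_NotReady
  expr: kube_node_status_condition{condition="Ready", status="true"} == 0
  for: 5m
  labels:
    severity: error
    type: k8s
```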
rules/kubernetes_pods.yml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| k8s-Pod_Status_not_Ready | 15m | warning | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" on node "{{ $labels.node }}" has been in a non-ready state for longer than 15 minutes. |
| k8s-Pod_Status_OOMKilled | 0m | error | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster}}" on node "{{ $labels.node }}" has been restarted due to OOMKilled reason in the last hour. Value: {{ humanize $value }} |
| k8s-Pod_Status_Crashlooping | 15m | warning | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster}}" on node "{{ $labels.node }}" was restarted more than 5 times within the last hour. Value: {{ humanize $value }} |
| k8s-Pod_Init_Container_Status_Crashlooping | 15m | warning | k8s | Init Container from Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" on node "{{ $labels.node }}" was restarted more than 5 times within the last hour. Value: {{ humanize $value }} |
| k8s-Pod_Container_Status_Waiting | 1h | warning | k8s | Container "{{ $labels.container }}" from Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" on node "{{ $labels.node }}" has been in waiting state for longer than 1 hour. |
| k8s-Statefulset_Replicas_not_Ready | 15m | warning | k8s | Statefulset {{ $labels.namespace }}/{{ $labels.statefulset }} in cluster "{{ $labels.cluster }}" has replicas in "notReady" state. Value: {{ humanize $value }} |
| k8s-Statefulset_Generation_Mismatch | 15m | warning | k8s | Statefulset {{ $labels.namespace }}/{{ $labels.statefulset }} in cluster "{{ $labels.cluster }}" generation does not match. This indicates that the StatefulSet has failed but has not been rolled back |
| k8s-Statefulset_Update_not_Rolled_Out | 15m | warning | k8s | Statefulset {{ $labels.namespace }}/{{ $labels.statefulset }} in cluster "{{ $labels.cluster }}" update has not been rolled out |
| k8s-Statefulset_Replicas_Mismatch | 15m | warning | k8s | Statefulset {{ $labels.namespace }}/{{ $labels.statefulset }} in cluster "{{ $labels.cluster }}" has not matched the expected number of replicas. Value: {{ $value }} / {{ printf kube_statefulset_spec_replicas{statefulset="%s", cluster="%s"} $labels.statefulset $labels.cluster |
| k8s-Statefulset_Replicas_not_Updated | 15m | warning | k8s | Statefulset {{ $labels.namespace }}/{{ $labels.statefulset }} in cluster "{{ $labels.cluster }}" replicas are not updated and available for deployment |
| k8s-ReplicaSet_Replicas_Mismatch | 15m | warning | k8s | ReplicaSet {{ $labels.namespace }}/{{ $labels.replicaset }} in cluster "{{ $labels.cluster }}" has not matched the expected number of replicas |
| k8s-Deployment_Replicas_not_Ready | 15m | warning | k8s | Deployment {{ $labels.namespace }}/{{ $labels.deployment }} in cluster "{{ $labels.cluster }}" has replicas in "notReady" state. Value: {{ humanize $value }} |
| k8s-Deployment_Replicas_Mismatch | 15m | warning | k8s | Deployment {{ $labels.namespace }}/{{ $labels.deployment }} in cluster "{{ $labels.cluster }}" has not matched the expected number of replicas. Value: {{ $value }} / {{ printf kube_deployment_spec_replicas{deployment="%s", cluster="%s"} $labels.deployment $labels.cluster |
| k8s-Deployment_Generation_Mismatch | 15m | warning | k8s | Deployment {{ $labels.namespace }}/{{ $labels.deployment }} in cluster "{{ $labels.cluster }}" generation does not match expected one |
| k8s-Deployment_Replicas_not_Updated | 15m | warning | k8s | Deployment {{ $labels.namespace }}/{{ $labels.deployment }} in cluster "{{ $labels.cluster }}" replicas are not updated and available for deployment |
| k8s-Deployment_Rollout_Stuck | 15m | warning | k8s | Deployment {{ $labels.namespace }}/{{ $labels.deployment }} in cluster "{{ $labels.cluster }}" is not progressing for longer than 15 minutes. |
| k8s-Daemonset_Rollout_Stuck | 15m | warning | k8s | Daemonset {{ $labels.namespace }}/{{ $labels.daemonset }} in cluster "{{ $labels.cluster }}" has less than 100% of desired pods scheduled and ready. Value: {{ humanize $value }}% |
| k8s-Daemonset_not_Scheduled | 15m | warning | k8s | Daemonset {{ $labels.namespace }}/{{ $labels.daemonset }} in cluster "{{ $labels.cluster }}" has unscheduled pods. Value: {{ humanize $value }} |
| k8s-Daemonset_Misscheduled | 15m | warning | k8s | Daemonset pods {{ $labels.namespace }}/{{ $labels.daemonset }} in cluster "{{ $labels.cluster }}" are running where they are not supposed to. Value: {{ humanize $value }} |
| k8s-Daemonset_Generation_Mismatch | 15m | warning | k8s | Daemonset {{ $labels.namespace }}/{{ $labels.daemonset }} in cluster "{{ $labels.cluster }}" generation does not match expected one |
| k8s-Pod_Resource_Allocation_Requests_Too_Low_CPU | 15m | info | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" CPU utilization is greater than 80% the resource requests. Consider updating the allocated value. |
| k8s-Pod_Resource_Allocation_Requests_Too_Low_CPU | 15m | warning | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" CPU utilization is greater than 90% the resource requests. Consider updating the allocated value. |
| k8s-Pod_Resource_Allocation_Limits_Too_Low_CPU | 15m | info | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" CPU utilization is greater than 80% the resource limits. Consider updating the allocated value. |
| k8s-Pod_Resource_Allocation_Limits_Too_Low_CPU | 15m | warning | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" CPU utilization is greater than 90% the resource limits. Consider updating the allocated value. |
| k8s-Pod_Resource_Allocation_Requests_Too_Low_Memory | 15m | info | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" memory utilization is greater than 80% of the resource requests. Consider updating the allocated value. |
| k8s-Pod_Resource_Allocation_Requests_Too_Low_Memory | 15m | warning | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" memory utilization is greater than 90% of the resource requests. Consider updating the allocated value. |
| k8s-Pod_Resource_Allocation_Limits_Too_Low_Memory | 15m | info | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" memory utilization is greater than 80% of the resource limits. Consider updating the allocated value. |
| k8s-Pod_Resource_Allocation_Limits_Too_Low_Memory | 15m | warning | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" memory utilization is greater than 90% of the resource limits. Consider updating the allocated value. |
| k8s-Pod_Resource_Allocation_Requests_Too_High_CPU | 15m | info | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" CPU utilization is smaller than 20% the resource requests. Consider updating the allocated value. |
| k8s-Pod_Resource_Allocation_Requests_Too_High_CPU | 15m | warning | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" CPU utilization is smaller than 10% the resource requests. Consider updating the allocated value. |
| k8s-Pod_Resource_Allocation_Limits_Too_High_CPU | 15m | info | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" CPU utilization is smaller than 20% the resource limits. Consider updating the allocated value. |
| k8s-Pod_Resource_Allocation_Limits_Too_High_CPU | 15m | warning | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" CPU utilization is smaller than 10% the resource limits. Consider updating the allocated value. |
| k8s-Pod_Resource_Allocation_Requests_Too_High_Memory | 15m | info | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" memory utilization is smaller than 20% of the resource requests. Consider updating the allocated value. |
| k8s-Pod_Resource_Allocation_Requests_Too_High_Memory | 15m | warning | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" memory utilization is smaller than 10% of the resource requests. Consider updating the allocated value. |
| k8s-Pod_Resource_Allocation_Limits_Too_High_Memory | 15m | info | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" memory utilization is smaller than 20% of the resource limits. Consider updating the allocated value. |
| k8s-Pod_Resource_Allocation_Limits_Too_High_Memory | 15m | warning | k8s | Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" memory utilization is smaller than 10% of the resource limits. Consider updating the allocated value. |
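As one hedged example from the table above, the crash-looping alert ("restarted more than 5 times within the last hour") maps naturally onto the container restart counter:

```yaml
# Excerpt: illustrative expression; the label joins used to expose node/cluster
# in the description are omitted.
- alert: k8s-Pod_Status_Crashlooping
  expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
  for: 15m
  labels:
    severity: warning
    type: k8s
```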
rules/kubernetes_storage.yml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| k8s-Persistent_Volume_Disk_Space_Usage_High | 5m | info | k8s | PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} in cluster "{{ $labels.cluster }}" disk space usage is over 80% used. Value: {{ printf "%0.2f" $value }}% |
| k8s-Persistent_Volume_Disk_Space_Usage_High | 5m | warning | k8s | PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} in cluster "{{ $labels.cluster }}" disk space usage is over 90% used. Value: {{ printf "%0.2f" $value }}% |
| k8s-Persistent_Volume_Full_in_4_days | 5m | info | k8s | PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} in cluster "{{ $labels.cluster }}" will fill up in 4 days at the current rate of utilization. Value: {{ printf "%0.2f" $value }}% available |
| k8s-Persistent_Volume_Inodes_Usage_High | 5m | info | k8s | PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} in cluster "{{ $labels.cluster }}" inodes usage is over 80% used. Value: {{ printf "%0.2f" $value }}% |
| k8s-Persistent_Volume_Inodes_Usage_High | 5m | warning | k8s | PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} in cluster "{{ $labels.cluster }}" inodes usage is over 90% used. Value: {{ printf "%0.2f" $value }}% |
| k8s-Persistent_Volume_Errors | 5m | warning | k8s | PersistentVolume "{{ $labels.persistentvolume }}" in cluster "{{ $labels.cluster }}" has status "{{ $labels.phase }}" |
| k8s-Persistent_Volume_Claim_Orphan | 3h | info | k8s | PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} in cluster "{{ $labels.cluster }}" is not used by any pod |
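The PVC usage alerts above are presumably based on kubelet volume stats; a sketch for the 80% info-level threshold follows (metric choice and filtering are assumptions).

```yaml
# Excerpt: hypothetical expression for the 80% disk-space usage alert.
- alert: k8s-Persistent_Volume_Disk_Space_Usage_High
  expr: |
    kubelet_volume_stats_used_bytes
      / kubelet_volume_stats_capacity_bytes * 100
      > 80
  for: 5m
  labels:
    severity: info
    type: k8s
```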
rules/kubernetes_storage_ephemeral.yml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| k8s-Ephemeral_Storage_Container_Usage_at_Limit | 5m | warning | k8s | Ephemeral storage usage of pod/container "{{ $labels.pod_name }}"/"{{ $labels.exported_container }}" in namespace "{{ $labels.pod_namespace }}" on node "{{ $labels.node_name }}" in cluster "{{ $labels.cluster }}" is at {{ $value }}% of the limit. |
| k8s-Ephemeral_Storage_Container_Usage_Reaching_Limit | 15m | warning | k8s | Ephemeral storage limit of pod/container "{{ $labels.pod_name }}"/"{{ $labels.exported_container }}" in namespace "{{ $labels.pod_namespace }}" on node "{{ $labels.node_name }}" in cluster "{{ $labels.cluster }}" is expected to be reached within 12 hours. Currently, {{ $value }}% is used. |
| k8s-Ephemeral_Storage_Volume_Filled_Up | 5m | warning | k8s | Ephemeral storage volume "{{ $labels.volume_name }}" of pod "{{ $labels.pod_name }}" in namespace "{{ $labels.pod_namespace }}" in cluster "{{ $labels.cluster }}" high usage. Value: {{ $value }}% |
| k8s-Ephemeral_Storage_Volume_Filling_Up | 5m | warning | k8s | Ephemeral storage volume "{{ $labels.volume_name }}" of pod "{{ $labels.pod_name }}" in namespace "{{ $labels.pod_namespace }}" in cluster "{{ $labels.cluster }}" is expected to be filled up within 12 hours. Currently, {{ $value }}% is used |
| k8s-Ephemeral_Storage_on_Node_Filling_Up | 5m | warning | k8s | Ephemeral storage on node "{{ $labels.node_name }}" in cluster "{{ $labels.cluster }}" is greater than 75%. Value: {{ $value }}% |
| k8s-Ephemeral_Storage_on_Node_Filling_Up | 5m | warning | k8s | Ephemeral storage on node "{{ $labels.node_name }}" in cluster "{{ $labels.cluster }}" is greater than 90%. Value: {{ $value }}% |
rules/longhorn.yml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| Longhorn-Volume_Status_Critical | 5m | error | storage | Longhorn volume "{{ $labels.volume }} / {{ $labels.pvc }}" on node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" is in state "Faulted" for more than 5 minutes. Possible values are: 0=unknown, 1=healthy, 2=degraded, 3=faulted |
| Longhorn-Volume_Status_Warning | 5m | warning | storage | Longhorn volume "{{ $labels.volume }} / {{ $labels.pvc }}" on node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" is in state "Degraded" for more than 5 minutes. Possible values are: 0=unknown, 1=healthy, 2=degraded, 3=faulted |
| Longhorn-Volume_Status_Unknown | 5m | warning | storage | Longhorn volume "{{ $labels.volume }} / {{ $labels.pvc }}" on node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" is in state "Unknown" for more than 5 minutes. Possible values are: 0=unknown, 1=healthy, 2=degraded, 3=faulted |
| Longhorn-Node_Storage_Warning | 5m | warning | storage | The used storage of node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" is at "{{ $value |
| Longhorn-Disk_Storage_Warning | 5m | warning | storage | The used storage of disk "{{ $labels.disk }}" on node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" is at "{{ $value |
| Longhorn-Node_Down | 5m | error | storage | There are "{{ $value |
| Longhorn-Instance_Manager_CPU_Usage_Warning | 5m | info | storage | Longhorn instance manager "{{ $labels.instance_manager }}" on node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" has CPU Usage / CPU request at "{{ $value |
| Longhorn-Node_CPU_Usage_Warning | 5m | info | storage | Longhorn node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" has CPU Usage / CPU capacity at "{{ $value |
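The value mapping quoted in the descriptions above (0=unknown, 1=healthy, 2=degraded, 3=faulted) matches Longhorn's `longhorn_volume_robustness` gauge, so the critical alert is presumably something like:

```yaml
# Excerpt: sketch based on the value mapping quoted above; not necessarily
# the chart's exact expression.
- alert: Longhorn-Volume_Status_Critical
  expr: longhorn_volume_robustness == 3   # 3 = faulted
  for: 5m
  labels:
    severity: error
    type: storage
```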
rules/metallb.yml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| MetalLB-BGP_Session_Down | 5m | error | network | MetalLB speaker {{ $labels.instance }} in cluster "{{ $labels.cluster}}" has BGP session {{ $labels.peer }} down for more than 5 minutes. |
| MetalLB-BGP_All_Sessions_Down | 5m | critical | network | MetalLB in "{{ $labels.cluster}}" all {{ $value }} BGP sessions are down for more than 5 minutes. |
| MetalLB-Address_Pool_High_Usage | 5m | info | network | MetalLB pool "{{ $labels.pool }}" in cluster "{{ $labels.cluster}}" has more than 75% of the total addresses used. |
| MetalLB-Config_Stale | 5m | warning | network | MetalLB {{ $labels.instance }} container "{{ $labels.container }}" in cluster "{{ $labels.cluster}}" has a stale config. |
| MetalLB-Config_not_Loaded | 5m | warning | network | MetalLB {{ $labels.instance }} container "{{ $labels.container }}" in cluster "{{ $labels.cluster}}" config not loaded. |
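MetalLB's speaker exports a per-peer session gauge, so the BGP session alert is presumably of the form below (an assumption, shown for illustration only).

```yaml
# Excerpt: hypothetical expression for the per-peer BGP session alert.
- alert: MetalLB-BGP_Session_Down
  expr: metallb_bgp_session_up == 0
  for: 5m
  labels:
    severity: error
    type: network
```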
rules/monitoring_stack_components.yml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| Monitoring-Prometheus_Bad_Config | 10m | critical | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" has failed to load its configuration. |
| Monitoring-Prometheus_SD_Refresh_Failure | 20m | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" has failed to refresh SD with mechanism "{{ $labels.mechanism }}". |
| Monitoring-Prometheus_Kubernetes_List_Watch_Failures | 15m | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" service discovery is experiencing failures with LIST/WATCH requests to the Kubernetes API in the last 5 minutes. Value: {{ printf "%.0f" $value }} |
| Monitoring-Prometheus_Notification_Queue_Running_Full | 15m | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" alert notification queue predicted to run full in less than 30 minutes. |
| Monitoring-Prometheus_Error_Sending_Alerts_to_Some_Alertmanagers | 15m | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" has {{ printf "%.1f" $value }}% of alerts sent to Alertmanager "{{ $labels.alertmanager }}" affected by errors. |
| Monitoring-Prometheus_not_Connected_to_Alertmanagers | 10m | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" is not connected to any Alertmanagers. |
| Monitoring-Prometheus_TSDB_Reloads_Failing | 4h | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" experienced TSDB reload failures in the last 3h. Value: {{ $value }} |
| Monitoring-Prometheus_TSDB_Compactions_Failing | 4h | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" experienced compaction failures in the last 3h. Value: {{ $value }} |
| Monitoring-Prometheus_not_Ingesting_Samples | 10m | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" is not ingesting samples. |
| Monitoring-Prometheus_Duplicate_Timestamps | 10m | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" is dropping samples/s with different values but duplicated timestamps. Value: {{ printf "%.4g" $value }} |
| Monitoring-Prometheus_Out_of_Order_Timestamps | 10m | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" is dropping samples/s with timestamps arriving out of order. Value: {{ printf "%.4g" $value }} |
| Monitoring-Prometheus_Remote_Storage_Failures | 15m | critical | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" failed to send a high number of samples to "{{ $labels.remote_name }}:{{ $labels.url }}". Value: {{ printf "%.1f" $value }}% |
| Monitoring-Prometheus_Remote_Write_Behind | 15m | critical | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" remote write is behind for "{{ $labels.remote_name }}:{{ $labels.url }}". Value: {{ printf "%.1f" $value }}s |
| Monitoring-Prometheus_Remote_Write_Desired_Shards | 15m | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" remote write desired shards calculation wants to run {{ $value }} shards for queue "{{ $labels.remote_name}}:{{ $labels.url }}", which is more than the max of "{{ printf prometheus_remote_storage_shards_max{instance="%s"} $labels.instance |
| Monitoring-Prometheus_Rule_Failures | 15m | critical | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" failed to evaluate rules in the last 5m in rule group "{{ $labels.rule_group }}". Value: {{ printf "%.0f" $value }} |
| Monitoring-Prometheus_Missing_Rule_Evaluations | 15m | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" missed rule group evaluations in the last 5m in rule group "{{ $labels.rule_group }}". Value: {{ printf "%.0f" $value }} |
| Monitoring-Prometheus_Target_Limit_Hit | 15m | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" dropped targets because the number of targets exceeds the configured target_limit. Value: {{ printf "%.0f" $value }} |
| Monitoring-Prometheus_Label_Limit_Hit | 15m | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" dropped targets because some samples exceeded the configured label_limit, label_name_length_limit or label_value_length_limit. Value: {{ printf "%.0f" $value }} |
| Monitoring-Prometheus_Scrape_Body_Size_Limit_Hit | 15m | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" failed scrapes in the last 5m because some targets exceeded the configured body_size_limit. Value: {{ printf "%.0f" $value }} |
| Monitoring-Prometheus_Scrape_Sample_Limit_Hit | 15m | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" failed scrapes in the last 5m because some targets exceeded the configured sample_limit. Value: {{ printf "%.0f" $value }} |
| Monitoring-Prometheus_Target_Sync_Failure | 5m | critical | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" targets failed to sync because invalid configuration was supplied. Value: {{ printf "%.0f" $value }} |
| Monitoring-Prometheus_High_Query_Load | 15m | warning | monitoring | Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" query API has less than 20% available capacity in its query engine for the last 15 minutes. |
| Monitoring-PrometheusOperator_List_Errors | 15m | warning | monitoring | Controller "{{ $labels.controller }}" in cluster "{{ $labels.cluster }}" has errors while performing "List" operations. |
| Monitoring-PrometheusOperator_Watch_Errors | 15m | warning | monitoring | Controller "{{ $labels.controller }}" in cluster "{{ $labels.cluster }}" has errors while performing "Watch" operations. |
| Monitoring-PrometheusOperator_Sync_Failed | 10m | warning | monitoring | Controller "{{ $labels.controller }}" in cluster "{{ $labels.cluster }}" has failed sync operations for {{ $value }} objects. |
| Monitoring-PrometheusOperator_Reconcile_Errors | 10m | warning | monitoring | Controller "{{ $labels.controller }}" in cluster "{{ $labels.cluster }}" has failed reconciling operations. Value: {{ $value |
| Monitoring-PrometheusOperator_Status_Update_Errors | 10m | warning | monitoring | Controller "{{ $labels.controller }}" in cluster "{{ $labels.cluster }}" has failed status update operations. Value: {{ $value |
| Monitoring-PrometheusOperator_Node_Lookup_Errors | 10m | warning | monitoring | Controller "{{ $labels.controller }}" in cluster "{{ $labels.cluster }}" errors while reconciling Prometheus. |
| Monitoring-PrometheusOperator_Not_Ready | 5m | warning | monitoring | Controller "{{ $labels.controller }}" in cluster "{{ $labels.cluster }}" is not ready to reconcile resources. |
| Monitoring-PrometheusOperator_Rejected_Resources | 5m | warning | monitoring | Controller "{{ $labels.controller }}" in cluster "{{ $labels.cluster }}" rejected "{{ $labels.resource }}" resources. Value: {{ printf "%0.0f" $value }} |
| Monitoring-Alertmanager_Failed_Reload | 10m | critical | monitoring | Alertmanager "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" has failed to load its configuration. |
| Monitoring-Alertmanager_Members_Inconsistent | 15m | critical | monitoring | Alertmanager "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" has only found {{ $value }} members of the Alertmanager cluster. |
| Monitoring-Alertmanager_Failed_to_Send_Alerts | 5m | warning | monitoring | Alertmanager "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" failed to send {{ $value |
| Monitoring-Alertmanager_Cluster_Failed_to_Send_Alerts | 5m | critical | monitoring | Alertmanager in cluster "{{ $labels.cluster }}" has high notification failure rate to "{{ $labels.integration }}". Value: {{ $value |
| Monitoring-Alertmanager_Config_Inconsistent | 20m | critical | monitoring | Alertmanager instances in cluster "{{ $labels.cluster }}" have different configurations. |
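Most of these meta-monitoring alerts mirror the upstream prometheus and prometheus-operator mixins; for instance, the bad-config alert is conventionally written as below (shown as an assumption, not the chart's verbatim rule).

```yaml
# Excerpt: conventional expression from the upstream prometheus mixin.
- alert: Monitoring-Prometheus_Bad_Config
  expr: max_over_time(prometheus_config_last_reload_successful[5m]) == 0
  for: 10m
  labels:
    severity: critical
    type: monitoring
```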
rules/node_exporter.yml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| Node-Recently_Rebooted | 0m | info | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" has been rebooted in the last 30 minutes. Value: {{ humanizeDuration $value }} uptime. |
| Node-CPU_High_Usage | 30m | info | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - CPU usage has exceeded the threshold of 90% for more than 30 minutes. Value: {{ humanize $value }} |
| Node-CPU_High_steal | 15m | info | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - CPU %steal has exceeded the threshold of 10% for more than 15 minutes. Value: {{ humanize $value }} |
| Node-CPU_High_steal | 15m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - CPU %steal has exceeded the threshold of 30% for more than 15 minutes. Value: {{ humanize $value }} |
| Node-CPU_High_iowait | 15m | info | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - CPU %iowait has exceeded the threshold of 10% for more than 15 minutes. Value: {{ humanize $value }} |
| Node-CPU_High_iowait | 15m | error | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - CPU %iowait has exceeded the threshold of 30% for more than 15 minutes. Value: {{ humanize $value }} |
| Node-Memory_Major_Pages_Faults | 15m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - memory major page faults are occurring at very high rate. Value: {{ humanize $value }} |
| Node-Memory_High_Usage | 10m | info | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - more than 80% memory used for more than 10 minutes. Value: {{ humanize $value }} |
| Node-Memory_High_Usage | 10m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - more than 90% memory used for more than 10 minutes. Value: {{ humanize $value }} |
| Node-Disk_Space_High_Usage | 15m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - mountpoint "{{ $labels.mountpoint }}" has more than 80% used disk space for more than 15m. Value: {{ humanize $value }} |
| Node-Disk_Space_High_Usage | 15m | error | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - mountpoint "{{ $labels.mountpoint }}" has more than 90% used disk space for more than 15m. Value: {{ humanize $value }} |
| Node-Disk_Will_Fill_Up_In_4h | 5m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - mountpoint "{{ $labels.mountpoint }}" will fill up in 4 hours at the current rate of utilization. Value: {{ printf `node_filesystem_avail_bytes{fstype=~"ext.* |
| Node-High_Disk_Inodes_High_Usage | 15m | info | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - mountpoint "{{ $labels.mountpoint }}" has more than 60% used inodes for more than 15m. Value: {{ humanize $value }} |
| Node-High_Disk_Inodes_High_Usage | 15m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - mountpoint "{{ $labels.mountpoint }}" has more than 70% used inodes for more than 15m. Value: {{ humanize $value }} |
| Node-Load_High | 15m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - system load per core is above 2 for the last 15 minutes. This might indicate resource saturation on this instance and can cause it to become unresponsive. Value: {{ humanize $value }} |
| Node-fds_Near_Limit_Process | 5m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - "{{ $labels.job }}" has more than 80% used file descriptors for more than 5m. Value: {{ humanize $value }} |
| Node-fds_Near_Limit | 5m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - more than 80% used file descriptors for more than 5m. Value: {{ humanize $value }} |
| Node-Network_High_Receive_Drop | 0m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - interface "{{ $labels.device }}" has high drop in network reception. Value: {{ humanize $value }} |
| Node-Network_High_Transmit_Drop | 0m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - interface "{{ $labels.device }}" has high drop in network transmission. Value: {{ humanize $value }} |
| Node-Network_High_Receive_Errors | 30m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - interface "{{ $labels.device }}" has encountered receive errors. Value: {{ humanize $value }} |
| Node-Network_High_Transmit_Errors | 30m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - interface "{{ $labels.device }}" has encountered transmit errors. Value: {{ humanize $value }} |
| Node-Network_Interface_Flapping | 0m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - interface "{{ $labels.device }}" is changing its up status often. Value: {{ humanize $value }} |
| Node-Network_Bond_Interface_Misconfigured | 5m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - bond "{{ $labels.master }}" is misconfigured. Check bonding slaves configuration. |
| Node-Network_Bond_Interface_Down | 5m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - bond "{{ $labels.master }}" has interface(s) down. Value: {{ $value }} |
| Node-Too_Many_OOM_Kills | 0m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - several OOM kills detected in the past 1h. Value: {{ humanize $value }}. Find out which process by running `dmesg |
| Node-Clock_Not_Synchronising | 10m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - clock is not synchronising. Ensure NTP is configured on this host. |
| Node-Clock_Skew_Detected | 10m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - clock is out of sync by more than 0.05s. Ensure NTP is configured correctly on this host. |
| Node-Host_Conntrack_Limit | 10m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - more than 75% of conntrack entries are used. |
| Node-EDAC_Correctable_Errors_Detected | 0m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - EDAC Correctable Errors detected. |
| Node-EDAC_Uncorrectable_Errors_Detected | 0m | error | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - EDAC Uncorrectable Errors detected. |
| Node-Filesystem_Device_Error | 0m | warning | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - device "{{ $labels.device }}" filesystem error. |
| Node-Disk_Queue_Length_High | 10m | info | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - device "{{ $labels.device }}" has disk queue length greater than 1 for more than 10 minutes. Value: {{ humanize $value }} |
| Node-Disk_IO_Time_Weighted_Seconds | 10m | info | system | Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - device "{{ $labels.device }}" has high disk io queue (aqu-sq) for more than 10 minutes. Value: {{ humanize $value }} |
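These node alerts are built on node_exporter metrics; as one hedged example, the 80% used disk-space warning corresponds to less than 20% available space, roughly as sketched below (the filesystem filters are assumptions).

```yaml
# Excerpt: illustrative expression for the 80% used disk-space warning.
- alert: Node-Disk_Space_High_Usage
  expr: |
    node_filesystem_avail_bytes{fstype!~"tmpfs"}
      / node_filesystem_size_bytes{fstype!~"tmpfs"} * 100
      < 20
  for: 15m
  labels:
    severity: warning
    type: system
```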
Configuration
- [ ] If you want to rebase/retry this MR, check this box
This MR has been generated by the Sylva instance of Renovate Bot.
CI configuration can't be handled via the MR description. A dedicated comment has been posted to control it.
If no checkbox is checked, a default pipeline will be enabled (capm3, or capo if the capo label is set).