Update Sylva-elements (release-1.5) (patch)

This MR contains the following updates:

Package | Type | Update | Change
https://gitlab.com/sylva-projects/sylva-elements/helm-charts/sylva-thanos-rules.git | | patch | 0.2.4 -> 0.2.5
registry.gitlab.com/sylva-projects/sylva-elements/container-images/helm-toolbox | image | patch | 1.1.0 -> 1.1.2
registry.gitlab.com/sylva-projects/sylva-elements/container-images/openstack-client | | patch | v0.1.1 -> v0.1.4
sylva-library | | patch | 0.6.3 -> 0.6.4
sylva-projects/sylva-elements/ci-tooling/ci-templates | repository | patch | 1.0.50 -> 1.0.51
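
As a sanity check on the "patch" label, the version pairs in the table above can be compared component by component. The following is a minimal sketch, not part of the MR tooling: the helper names `parse_version` and `is_patch_bump` are hypothetical, and it assumes every version is a plain `major.minor.patch` triple with an optional leading `v`.

```python
def parse_version(text: str) -> tuple[int, int, int]:
    """Split 'v0.1.4' or '0.2.5' into (major, minor, patch) integers."""
    major, minor, patch = text.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def is_patch_bump(old: str, new: str) -> bool:
    """Return True when major and minor are unchanged and patch increases."""
    old_v, new_v = parse_version(old), parse_version(new)
    return old_v[:2] == new_v[:2] and new_v[2] > old_v[2]


# Version pairs copied from the table above; the short keys are
# abbreviations chosen for this sketch, not official package names.
changes = {
    "sylva-thanos-rules": ("0.2.4", "0.2.5"),
    "helm-toolbox": ("1.1.0", "1.1.2"),
    "openstack-client": ("v0.1.1", "v0.1.4"),
    "sylva-library": ("0.6.3", "0.6.4"),
    "ci-templates": ("1.0.50", "1.0.51"),
}

for name, (old, new) in changes.items():
    print(name, is_patch_bump(old, new))
```

Comparing integers rather than strings matters for the last row: as strings, "1.0.9" would sort after "1.0.51".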

⚠️ Warning

Some dependencies could not be looked up. Check the Dependency Dashboard for more information.


Release Notes

sylva-projects/sylva-elements/helm-charts/sylva-thanos-rules (https://gitlab.com/sylva-projects/sylva-elements/helm-charts/sylva-thanos-rules.git)

v0.2.5: sylva-thanos-rules: 0.2.5

Merge Requests integrated in this release

1 merge request was integrated in this repo between 0.2.4 and 0.2.5. These notes don't account for the MRs merged in secondary repos.

Other

  • [backport-0.2 / sylva-1.5] Update severity for node rebooted rule !126

Contributors

1 person contributed.

Alin H

sylva-thanos-rules

Generate ConfigMap object for consumption by Thanos Ruler
Details about rules
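
Each rules file is listed as a table with five columns: Alert Name, For, Severity, Type, and Description. A row can be split back into those fields with a few lines of Python. This is only a sketch for reading this listing, assuming alert names contain no spaces and the "For" column always uses a single unit (`s`, `m`, `h`, or `d`), as it does throughout this text. The names `AlertRow`, `parse_duration`, and `parse_row` are hypothetical.

```python
import re
from typing import NamedTuple


class AlertRow(NamedTuple):
    name: str
    for_seconds: int   # the "For" column converted to seconds
    severity: str
    alert_type: str    # the "Type" column
    description: str


_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(text: str) -> int:
    """Convert a single-unit duration such as '5m' or '4h' to seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def parse_row(line: str) -> AlertRow:
    """Split on the first four runs of whitespace; the rest is the description."""
    name, hold, severity, alert_type, description = line.split(maxsplit=4)
    return AlertRow(name, parse_duration(hold), severity, alert_type, description)


row = parse_row(
    'etcd-Members_Down 5m warning etcd '
    'etcd in cluster "{{ $labels.cluster }}" members are down.'
)
print(row.name, row.for_seconds, row.severity, row.alert_type)
```

Note that the same alert name can appear on several rows with different severities, so the name alone is not a unique key.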
rules/_helper_kubernetes_metadata.yml
Alert Name For Severity Type Description
k8s-Metamonitoring_configuration_error_kube_namespace_labels 45m error deployment Metric "kube_namespace_labels" from cluster "{{ $labels.capi_cluster_name }}" is not exposed by "kube-state-metrics".
k8s-Metamonitoring_configuration_error_rancher_project_info 45m error deployment Metric "rancher_project_info" from the management cluster is not exposed by "kube-state-metrics".
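
The descriptions use Go template placeholders such as `{{ $labels.capi_cluster_name }}` and `{{ $value }}`, which are filled in from the firing alert's labels and value. The sketch below only imitates that substitution for these two simple forms so the resulting message can be previewed. It is not the Go template engine: calls such as `{{ humanize $value }}` are left untouched, and rendering a missing label as an empty string is an assumption of this sketch. The function name `render` and the label value `workload-1` are hypothetical.

```python
import re

# Matches "{{ $labels.<name> }}" and "{{ $value }}", with or without
# spaces inside the braces. Function calls such as "{{ humanize $value }}"
# do not match because the "$" must directly follow the opening braces.
_PLACEHOLDER = re.compile(r"\{\{\s*\$(labels\.(\w+)|value)\s*\}\}")


def render(description: str, labels: dict[str, str], value: float) -> str:
    """Substitute simple label and value placeholders in a description."""

    def substitute(match: re.Match) -> str:
        if match.group(1) == "value":
            return str(value)
        # Assumption: a label that is not present renders as an empty string.
        return labels.get(match.group(2), "")

    return _PLACEHOLDER.sub(substitute, description)


message = render(
    'Metric "kube_namespace_labels" from cluster '
    '"{{ $labels.capi_cluster_name }}" is not exposed by "kube-state-metrics".',
    labels={"capi_cluster_name": "workload-1"},  # hypothetical label value
    value=0,
)
print(message)
```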
rules/clusters_state_rules.yml
Alert Name For Severity Type Description
Sylva_cluster_Prometheus_not_Sending_Data_management 45m critical deployment Prometheus server from the management cluster has not sent data in the last 45m.
Sylva_cluster_Prometheus_not_Sending_Data 45m critical deployment Prometheus server from cluster "{{ $labels.capi_cluster_name }}" in namespace "{{ $labels.capi_cluster_namespace }}" has not sent data in the last 45m.
Sylva_clusters_different_number 45m critical deployment At least one cluster is not properly provisioned in Rancher; check all clusters to see whether cattle-agent is properly deployed.
Sylva_clusters_metric_absent 45m error deployment Metric "capi_cluster_info" from the management cluster is not exposed by "kube-state-metrics".
rules/etcd.yml
Alert Name For Severity Type Description
etcd-Members_Down 5m warning etcd etcd in cluster "{{ $labels.cluster }}" members are down.
etcd-Members_Insufficient 5m warning etcd etcd in cluster "{{ $labels.cluster }}" has insufficient members. Value: {{ $value }}
etcd-Members_No_Leader 5m warning etcd etcd in cluster "{{ $labels.cluster }}" member {{ $labels.instance }} has no leader.
etcd-High_Number_of_Leader_Changes 5m warning etcd etcd in cluster "{{ $labels.cluster }}" {{ $value }} leader changes within the last 15 minutes. Frequent elections may be a sign of insufficient resources, high network latency, or disruptions by other components and should be investigated.
etcd-gRPC_High_Number_of_Failed_Requests 10m warning etcd etcd in cluster "{{ $labels.cluster }}" instance "{{ $labels.instance }}" gRPC requests failed for "{{ $labels.grpc_method }}". Value: {{ $value }}
etcd-gRPC_High_Number_of_Failed_Requests 5m critical etcd etcd in cluster "{{ $labels.cluster }}" instance "{{ $labels.instance }}" gRPC requests failed for "{{ $labels.grpc_method }}". Value: {{ $value }}
etcd-Members_Communication_Slow 10m warning etcd etcd in cluster "{{ $labels.cluster }}" "{{ $labels.instance }}" to "{{ $labels.To }}" member communication is taking too long. Value: {{ $value }}s.
etcd-High_Number_of_Failed_Proposals 15m warning etcd etcd in cluster "{{ $labels.cluster }}" instance "{{ $labels.instance }}" has proposal failures within the last 30 minutes. Value: {{ $value }}
etcd-High_Fsync_Duration 10m warning etcd etcd in cluster "{{ $labels.cluster }}" instance "{{ $labels.instance }}" has high 99th percentile fsync durations. Value: {{ $value }}s
etcd-High_Commit_Duration 10m warning etcd etcd in cluster "{{ $labels.cluster }}" instance "{{ $labels.instance }}" has high 99th percentile commit durations. Value: {{ $value }}s
etcd-HTTP_High_Number_of_Failed_Requests 10m warning etcd etcd in cluster "{{ $labels.cluster }}" instance "{{ $labels.instance }}" HTTP requests failed for "{{ $labels.method }}". Value: {{ $value }}
etcd-HTTP_High_Number_of_Failed_Requests 10m critical etcd etcd in cluster "{{ $labels.cluster }}" instance "{{ $labels.instance }}" HTTP requests failed for "{{ $labels.method }}". Value: {{ $value }}
etcd-HTTP_Requests_Slow 10m warning etcd etcd in cluster "{{ $labels.cluster }}" instance "{{ $labels.instance }}" HTTP requests for "{{ $labels.method }}" are slow. Value: {{ $value }}s
rules/kubernetes_capacity.yml
Alert Name For Severity Type Description
k8s-Cluster_CPU_Overcommitted 5m warning k8s Kubernetes cluster "{{ $labels.cluster }}" has allocated over 75% of the allocatable CPUs. Node failures may cause Pods to be unschedulable due to lack of resources
k8s-Cluster_Memory_Overcommitted 5m warning k8s Kubernetes cluster "{{ $labels.cluster }}" has allocated over 75% of the allocatable Memory. Node failures may cause Pods to be unschedulable due to lack of resources
k8s-Cluster_Too_Many_Pods 15m warning k8s Kubernetes cluster "{{ $labels.cluster }}" number of pods over 90% of Pod number limit. Value: {{ humanize $value }}%
k8s-Node_Too_Many_Pods 15m warning k8s Kubernetes cluster "{{ $labels.cluster }}" node {{ $labels.node }} number of pods over 90% of Pod number limit. Value: {{ humanize $value }}%
k8s-Kube_Quota_Almost_Full 15m warning k8s Namespace {{ $labels.namespace }} in cluster {{ $labels.cluster }} is using {{ $value
k8s-Kube_Quota_Exceeded 15m error k8s Namespace {{ $labels.namespace }} in cluster {{ $labels.cluster }} is using {{ $value
rules/kubernetes_cluster_components.yml
Alert Name For Severity Type Description
k8s-Version_Mismatch 4h warning k8s Kubernetes cluster "{{ $labels.cluster }}" has different versions of Kubernetes components running. Value: {{ $value }}
k8s-Client_Errors 15m warning k8s Kubernetes cluster "{{ $labels.cluster }}" API server client "{{ $labels.instance }}" job "{{ $labels.job }}" is experiencing errors. Value: {{ printf "%0.0f" $value }}%
k8s-Client_Certificate_Expiration 5m warning k8s A client certificate used to authenticate to Kubernetes apiserver is expiring in less than 7 days on cluster {{ $labels.cluster }}.
k8s-Client_Certificate_Expiration 5m critical k8s A client certificate used to authenticate to Kubernetes apiserver is expiring in less than 24h on cluster {{ $labels.cluster }}.
k8s-API_Global_Error_Rate_High 15m warning k8s Kubernetes cluster "{{ $labels.cluster }}" API server is returning errors for over 3% of requests. Value: {{ humanize $value }}%
k8s-API_Error_Rate_High 15m warning k8s Kubernetes cluster "{{ $labels.cluster }}" API server is returning errors for 10% of requests for {{ $labels.verb }} {{ $labels.resource }} {{ $labels.subresource }}. Value: {{ humanize $value }}%
k8s-Aggregated_API_Errors 15m warning k8s Kubernetes aggregated API {{ $labels.name }}/{{ $labels.namespace }} in cluster "{{ $labels.cluster }}" has reported errors. It has appeared unavailable {{ $value
k8s-Aggregated_API_Down 15m warning k8s Kubernetes aggregated API {{ $labels.name }}/{{ $labels.namespace }} in cluster "{{ $labels.cluster }}" has been only {{ $value
k8s-API_Endpoint_Down 15m error k8s Kubernetes API endpoint {{ $labels.instance }} in cluster {{ $labels.cluster }} is unreachable.
k8s-API_Down 15m critical k8s Kubernetes API in cluster {{ $labels.cluster }} is unreachable.
k8s-API_Terminated_Requests 15m warning k8s Kubernetes API in cluster {{ $labels.cluster }} has terminated {{ $value
rules/kubernetes_jobs.yml
Alert Name For Severity Type Description
k8s-CronJob_Status_Failed 5m warning k8s CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} in cluster "{{ $labels.cluster }}" failed. Last job has failed multiple times. Value: {{ $value }}
k8s-CronJob_Taking_Too_Long 0m warning k8s CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} in cluster "{{ $labels.cluster }}" is taking too long to complete and is over its deadline. Value: {{ humanizeDuration $value }}
k8s-Job_not_Completed 15m warning k8s Job {{ $labels.namespace }}/{{ $labels.job_name }} in cluster "{{ $labels.cluster }}" is taking more than 12h to complete.
k8s-Job_Failed 15m warning k8s Job {{ $labels.namespace }}/{{ $labels.job_name }} in cluster "{{ $labels.cluster }}" failed to complete. Removing failed job after investigation should clear this alert.
rules/kubernetes_nodes.yml
Alert Name For Severity Type Description
k8s-Node_Kubelet_Down 5m critical k8s Kubelet on {{ $labels.node }} in cluster "{{ $labels.cluster }}" is not reachable
k8s-Node_Status_OutOfDisk 5m warning k8s Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" is almost out of disk space
k8s-Node_Status_MemoryPressure 5m warning k8s Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" is under memory pressure.
k8s-Node_Status_DiskPressure 5m warning k8s Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" is under disk pressure
k8s-Node_Status_PIDPressure 5m warning k8s Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" is under PID pressure
k8s-Node_Status_NotReady 5m error k8s Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" has not been Ready for more than an hour
k8s-Node_Status_NetworkUnavailable 5m warning k8s Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" has NetworkUnavailable condition.
k8s-Node_Status_Ready_flapping 5m warning k8s Node {{ $labels.node }} in cluster "{{ $labels.cluster }}" readiness status changed {{ $value }} times in the last 15 minutes.
rules/kubernetes_pods.yml
Alert Name For Severity Type Description
k8s-Pod_Status_not_Ready 15m warning k8s Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" on node "{{ $labels.node }}" has been in a non-ready state for longer than 15 minutes.
k8s-Pod_Status_OOMKilled 0m error k8s Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster}}" on node "{{ $labels.node }}" has been restarted due to OOMKilled reason in the last hour. Value: {{ humanize $value }}
k8s-Pod_Status_Crashlooping 15m warning k8s Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster}}" on node "{{ $labels.node }}" was restarted more than 5 times within the last hour. Value: {{ humanize $value }}
k8s-Pod_Init_Container_Status_Crashlooping 15m warning k8s Init Container from Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" on node "{{ $labels.node }}" was restarted more than 5 times within the last hour. Value: {{ humanize $value }}
k8s-Pod_Container_Status_Waiting 1h warning k8s Container "{{ $labels.container }}" from Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" on node "{{ $labels.node }}" has been in waiting state for longer than 1 hour.
k8s-Statefulset_Replicas_not_Ready 15m warning k8s Statefulset {{ $labels.namespace }}/{{ $labels.statefulset }} in cluster "{{ $labels.cluster }}" has replicas in "notReady" state. Value: {{ humanize $value }}
k8s-Statefulset_Generation_Mismatch 15m warning k8s Statefulset {{ $labels.namespace }}/{{ $labels.statefulset }} in cluster "{{ $labels.cluster }}" generation does not match. This indicates that the StatefulSet has failed but has not been rolled back
k8s-Statefulset_Update_not_Rolled_Out 15m warning k8s Statefulset {{ $labels.namespace }}/{{ $labels.statefulset }} in cluster "{{ $labels.cluster }}" update has not been rolled out
k8s-Statefulset_Replicas_Mismatch 15m warning k8s Statefulset {{ $labels.namespace }}/{{ $labels.statefulset }} in cluster "{{ $labels.cluster }}" has not matched the expected number of replicas. Value: {{ $value }} / {{ printf kube_statefulset_spec_replicas{statefulset="%s", cluster="%s"} $labels.statefulset $labels.cluster
k8s-Statefulset_Replicas_not_Updated 15m warning k8s Statefulset {{ $labels.namespace }}/{{ $labels.statefulset }} in cluster "{{ $labels.cluster }}" replicas are not updated and available for deployment
k8s-ReplicaSet_Replicas_Mismatch 15m warning k8s ReplicaSet {{ $labels.namespace }}/{{ $labels.replicaset }} in cluster "{{ $labels.cluster }}" has not matched the expected number of replicas
k8s-Deployment_Replicas_not_Ready 15m warning k8s Deployment {{ $labels.namespace }}/{{ $labels.deployment }} in cluster "{{ $labels.cluster }}" has replicas in "notReady" state. Value: {{ humanize $value }}
k8s-Deployment_Replicas_Mismatch 15m warning k8s Deployment {{ $labels.namespace }}/{{ $labels.deployment }} in cluster "{{ $labels.cluster }}" has not matched the expected number of replicas. Value: {{ $value }} / {{ printf kube_deployment_spec_replicas{deployment="%s", cluster="%s"} $labels.deployment $labels.cluster
k8s-Deployment_Generation_Mismatch 15m warning k8s Deployment {{ $labels.namespace }}/{{ $labels.deployment }} in cluster "{{ $labels.cluster }}" generation does not match expected one
k8s-Deployment_Replicas_not_Updated 15m warning k8s Deployment {{ $labels.namespace }}/{{ $labels.deployment }} in cluster "{{ $labels.cluster }}" replicas are not updated and available for deployment
k8s-Deployment_Rollout_Stuck 15m warning k8s Deployment {{ $labels.namespace }}/{{ $labels.deployment }} in cluster "{{ $labels.cluster }}" is not progressing for longer than 15 minutes.
k8s-Daemonset_Rollout_Stuck 15m warning k8s Daemonset {{ $labels.namespace }}/{{ $labels.daemonset }} in cluster "{{ $labels.cluster }}" has less than 100% of desired pods scheduled and ready. Value: {{ humanize $value }}%
k8s-Daemonset_not_Scheduled 15m warning k8s Daemonset {{ $labels.namespace }}/{{ $labels.daemonset }} in cluster "{{ $labels.cluster }}" has unscheduled pods. Value: {{ humanize $value }}
k8s-Daemonset_Misscheduled 15m warning k8s Daemonset pods {{ $labels.namespace }}/{{ $labels.daemonset }} in cluster "{{ $labels.cluster }}" are running where they are not supposed to. Value: {{ humanize $value }}
k8s-Daemonset_Generation_Mismatch 15m warning k8s Daemonset {{ $labels.namespace }}/{{ $labels.daemonset }} in cluster "{{ $labels.cluster }}" generation does not match expected one
k8s-Pod_Resource_Allocation_Requests_Too_Low_CPU 15m info k8s Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" CPU utilization is greater than 80% of the resource requests. Consider updating the allocated value.
k8s-Pod_Resource_Allocation_Requests_Too_Low_CPU 15m warning k8s Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" CPU utilization is greater than 90% of the resource requests. Consider updating the allocated value.
k8s-Pod_Resource_Allocation_Limits_Too_Low_CPU 15m info k8s Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" CPU utilization is greater than 80% of the resource limits. Consider updating the allocated value.
k8s-Pod_Resource_Allocation_Limits_Too_Low_CPU 15m warning k8s Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" CPU utilization is greater than 90% of the resource limits. Consider updating the allocated value.
k8s-Pod_Resource_Allocation_Requests_Too_Low_Memory 15m info k8s Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" memory utilization is greater than 80% of the resource requests. Consider updating the allocated value.
k8s-Pod_Resource_Allocation_Requests_Too_Low_Memory 15m warning k8s Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" memory utilization is greater than 90% of the resource requests. Consider updating the allocated value.
k8s-Pod_Resource_Allocation_Limits_Too_Low_Memory 15m info k8s Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" memory utilization is greater than 80% of the resource limits. Consider updating the allocated value.
k8s-Pod_Resource_Allocation_Limits_Too_Low_Memory 15m warning k8s Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" memory utilization is greater than 90% of the resource limits. Consider updating the allocated value.
k8s-Pod_Resource_Allocation_Requests_Too_High_CPU 15m info k8s Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" CPU utilization is smaller than 20% of the resource requests. Consider updating the allocated value.
k8s-Pod_Resource_Allocation_Requests_Too_High_CPU 15m warning k8s Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" CPU utilization is smaller than 10% of the resource requests. Consider updating the allocated value.
k8s-Pod_Resource_Allocation_Limits_Too_High_CPU 15m info k8s Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" CPU utilization is smaller than 20% of the resource limits. Consider updating the allocated value.
k8s-Pod_Resource_Allocation_Limits_Too_High_CPU 15m warning k8s Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" CPU utilization is smaller than 10% of the resource limits. Consider updating the allocated value.
k8s-Pod_Resource_Allocation_Requests_Too_High_Memory 15m info k8s Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" memory utilization is smaller than 20% of the resource requests. Consider updating the allocated value.
k8s-Pod_Resource_Allocation_Requests_Too_High_Memory 15m warning k8s Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" memory utilization is smaller than 10% of the resource requests. Consider updating the allocated value.
k8s-Pod_Resource_Allocation_Limits_Too_High_Memory 15m info k8s Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" memory utilization is smaller than 20% of the resource limits. Consider updating the allocated value.
k8s-Pod_Resource_Allocation_Limits_Too_High_Memory 15m warning k8s Pod {{ $labels.namespace }}/{{ $labels.pod }} in cluster "{{ $labels.cluster }}" memory utilization is smaller than 10% of the resource limits. Consider updating the allocated value.
rules/kubernetes_storage.yml
Alert Name For Severity Type Description
k8s-Persistent_Volume_Disk_Space_Usage_High 5m info k8s PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} in cluster "{{ $labels.cluster }}" disk space usage is over 80% used. Value: {{ printf "%0.2f" $value }}%
k8s-Persistent_Volume_Disk_Space_Usage_High 5m warning k8s PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} in cluster "{{ $labels.cluster }}" disk space usage is over 90% used. Value: {{ printf "%0.2f" $value }}%
k8s-Persistent_Volume_Full_in_4_days 5m info k8s PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} in cluster "{{ $labels.cluster }}" will fill up in 4 days at the current rate of utilization. Value: {{ printf "%0.2f" $value }}% available
k8s-Persistent_Volume_Inodes_Usage_High 5m info k8s PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} in cluster "{{ $labels.cluster }}" inodes usage is over 80% used. Value: {{ printf "%0.2f" $value }}%
k8s-Persistent_Volume_Inodes_Usage_High 5m warning k8s PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} in cluster "{{ $labels.cluster }}" inodes usage is over 90% used. Value: {{ printf "%0.2f" $value }}%
k8s-Persistent_Volume_Errors 5m warning k8s PersistentVolume "{{ $labels.persistentvolume }}" in cluster "{{ $labels.cluster }}" has status "{{ $labels.phase }}"
k8s-Persistent_Volume_Claim_Orphan 3h info k8s PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} in cluster "{{ $labels.cluster }}" is not used by any pod
rules/kubernetes_storage_ephemeral.yml
Alert Name For Severity Type Description
k8s-Ephemeral_Storage_Container_Usage_at_Limit 5m warning k8s Ephemeral storage usage of pod/container "{{ $labels.pod_name }}"/"{{ $labels.exported_container }}" in namespace "{{ $labels.pod_namespace }}" on node "{{ $labels.node_name }}" in cluster "{{ $labels.cluster }}" is at {{ $value }}% of the limit.
k8s-Ephemeral_Storage_Container_Usage_Reaching_Limit 15m warning k8s Ephemeral storage limit of pod/container "{{ $labels.pod_name }}"/"{{ $labels.exported_container }}" in namespace "{{ $labels.pod_namespace }}" on node "{{ $labels.node_name }}" in cluster "{{ $labels.cluster }}" is expected to be reached within 12 hours. Currently, {{ $value }}% is used.
k8s-Ephemeral_Storage_Volume_Filled_Up 5m warning k8s Ephemeral storage volume "{{ $labels.volume_name }}" of pod "{{ $labels.pod_name }}" in namespace "{{ $labels.pod_namespace }}" in cluster "{{ $labels.cluster }}" high usage. Value: {{ $value }}%
k8s-Ephemeral_Storage_Volume_Filling_Up 5m warning k8s Ephemeral storage volume "{{ $labels.volume_name }}" of pod "{{ $labels.pod_name }}" in namespace "{{ $labels.pod_namespace }}" in cluster "{{ $labels.cluster }}" is expected to be filled up within 12 hours. Currently, {{ $value }}% is used
k8s-Ephemeral_Storage_on_Node_Filling_Up 5m warning k8s Ephemeral storage on node "{{ $labels.node_name }}" in cluster "{{ $labels.cluster }}" is greater than 75%. Value: {{ $value }}%
k8s-Ephemeral_Storage_on_Node_Filling_Up 5m warning k8s Ephemeral storage on node "{{ $labels.node_name }}" in cluster "{{ $labels.cluster }}" is greater than 90%. Value: {{ $value }}%
rules/longhorn.yml
Alert Name For Severity Type Description
Longhorn-Volume_Status_Critical 5m error storage Longhorn volume "{{ $labels.volume }} / {{ $labels.pvc }}" on node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" is in state "Faulted" for more than 5 minutes. Possible values are: 0=unknown, 1=healthy, 2=degraded, 3=faulted
Longhorn-Volume_Status_Warning 5m warning storage Longhorn volume "{{ $labels.volume }} / {{ $labels.pvc }}" on node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" is in state "Degraded" for more than 5 minutes. Possible values are: 0=unknown, 1=healthy, 2=degraded, 3=faulted
Longhorn-Volume_Status_Unknown 5m warning storage Longhorn volume "{{ $labels.volume }} / {{ $labels.pvc }}" on node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" is in state "Unknown" for more than 5 minutes. Possible values are: 0=unknown, 1=healthy, 2=degraded, 3=faulted
Longhorn-Node_Storage_Warning 5m warning storage The used storage of node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" is at "{{ $value
Longhorn-Disk_Storage_Warning 5m warning storage The used storage of disk "{{ $labels.disk }}" on node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" is at "{{ $value
Longhorn-Node_Down 5m error storage There are "{{ $value
Longhorn-Instance_Manager_CPU_Usage_Warning 5m info storage Longhorn instance manager "{{ $labels.instance_manager }}" on node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" has CPU Usage / CPU request at "{{ $value
Longhorn-Node_CPU_Usage_Warning 5m info storage Longhorn node "{{ $labels.node }}" in cluster "{{ $labels.cluster }}" has CPU Usage / CPU capacity at "{{ $value
rules/metallb.yml
Alert Name For Severity Type Description
MetalLB-BGP_Session_Down 5m error network MetalLB speaker {{ $labels.instance }} in cluster "{{ $labels.cluster}}" has BGP session {{ $labels.peer }} down for more than 5 minutes.
MetalLB-BGP_All_Sessions_Down 5m critical network MetalLB in "{{ $labels.cluster}}" all {{ $value }} BGP sessions are down for more than 5 minutes.
MetalLB-Address_Pool_High_Usage 5m info network MetalLB pool "{{ $labels.pool }}" in cluster "{{ $labels.cluster}}" has more than 75% of the total addresses used.
MetalLB-Config_Stale 5m warning network MetalLB {{ $labels.instance }} container "{{ $labels.container }}" in cluster "{{ $labels.cluster}}" has a stale config.
MetalLB-Config_not_Loaded 5m warning network MetalLB {{ $labels.instance }} container "{{ $labels.container }}" in cluster "{{ $labels.cluster}}" config not loaded.
rules/monitoring_stack_components.yml
Alert Name For Severity Type Description
Monitoring-Prometheus_Bad_Config 10m critical monitoring Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" has failed to load its configuration.
Monitoring-Prometheus_SD_Refresh_Failure 20m warning monitoring Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" has failed to refresh SD with mechanism "{{ $labels.mechanism }}".
Monitoring-Prometheus_Kubernetes_List_Watch_Failures 15m warning monitoring Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" service discovery is experiencing failures with LIST/WATCH requests to the Kubernetes API in the last 5 minutes. Value: {{ printf "%.0f" $value }}
Monitoring-Prometheus_Notification_Queue_Running_Full 15m warning monitoring Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" alert notification queue predicted to run full in less than 30 minutes.
Monitoring-Prometheus_Error_Sending_Alerts_to_Some_Alertmanagers 15m warning monitoring Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" has {{ printf "%.1f" $value }}% of alerts sent to Alertmanager "{{ $labels.alertmanager }}" affected by errors.
Monitoring-Prometheus_not_Connected_to_Alertmanagers 10m warning monitoring Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" is not connected to any Alertmanagers.
Monitoring-Prometheus_TSDB_Reloads_Failing 4h warning monitoring Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" experienced TSDB reload failures in the last 3h. Value: {{ $value }}
Monitoring-Prometheus_TSDB_Compactions_Failing 4h warning monitoring Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" experienced compaction failures in the last 3h. Value: {{ $value }}
Monitoring-Prometheus_not_Ingesting_Samples 10m warning monitoring Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" is not ingesting samples.
Monitoring-Prometheus_Duplicate_Timestamps 10m warning monitoring Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" is dropping samples/s with different values but duplicated timestamps. Value: {{ printf "%.4g" $value }}
Monitoring-Prometheus_Out_of_Order_Timestamps 10m warning monitoring Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" is dropping samples/s with timestamps arriving out of order. Value: {{ printf "%.4g" $value }}
Monitoring-Prometheus_Remote_Storage_Failures 15m critical monitoring Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" failed to send a high number of samples to "{{ $labels.remote_name }}:{{ $labels.url }}". Value: {{ printf "%.1f" $value }}%
Monitoring-Prometheus_Remote_Write_Behind 15m critical monitoring Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" remote write is behind for "{{ $labels.remote_name }}:{{ $labels.url }}". Value: {{ printf "%.1f" $value }}s
Monitoring-Prometheus_Remote_Write_Desired_Shards 15m warning monitoring Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" remote write desired shards calculation wants to run {{ $value }} shards for queue "{{ $labels.remote_name}}:{{ $labels.url }}", which is more than the max of "{{ printf prometheus_remote_storage_shards_max{instance="%s"} $labels.instance
Monitoring-Prometheus_Rule_Failures 15m critical monitoring Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" failed to evaluate rules in the last 5m in rule group "{{ $labels.rule_group }}". Value: {{ printf "%.0f" $value }}
Monitoring-Prometheus_Missing_Rule_Evaluations 15m warning monitoring Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" missed rule group evaluations in the last 5m in rule group "{{ $labels.rule_group }}". Value: {{ printf "%.0f" $value }}
Monitoring-Prometheus_Target_Limit_Hit 15m warning monitoring Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" dropped targets because the number of targets exceeds the configured target_limit. Value: {{ printf "%.0f" $value }}
Monitoring-Prometheus_Label_Limit_Hit 15m warning monitoring Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" dropped targets because some samples exceeded the configured label_limit, label_name_length_limit or label_value_length_limit. Value: {{ printf "%.0f" $value }}
Monitoring-Prometheus_Scrape_Body_Size_Limit_Hit 15m warning monitoring Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" failed scrapes in the last 5m because some targets exceeded the configured body_size_limit. Value: {{ printf "%.0f" $value }}
Monitoring-Prometheus_Scrape_Sample_Limit_Hit 15m warning monitoring Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" failed scrapes in the last 5m because some targets exceeded the configured sample_limit. Value: {{ printf "%.0f" $value }}
Monitoring-Prometheus_Target_Sync_Failure 5m critical monitoring Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" targets failed to sync because invalid configuration was supplied. Value: {{ printf "%.0f" $value }}
Monitoring-Prometheus_High_Query_Load 15m warning monitoring Prometheus "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" query API has less than 20% available capacity in its query engine for the last 15 minutes.
Monitoring-PrometheusOperator_List_Errors 15m warning monitoring Controller "{{ $labels.controller }}" in cluster "{{ $labels.cluster }}" has errors while performing "List" operations.
Monitoring-PrometheusOperator_Watch_Errors 15m warning monitoring Controller "{{ $labels.controller }}" in cluster "{{ $labels.cluster }}" has errors while performing "Watch" operations.
Monitoring-PrometheusOperator_Sync_Failed 10m warning monitoring Controller "{{ $labels.controller }}" in cluster "{{ $labels.cluster }}" has failed sync operations for {{ $value }} objects.
Monitoring-PrometheusOperator_Reconcile_Errors 10m warning monitoring Controller "{{ $labels.controller }}" in cluster "{{ $labels.cluster }}" has failed reconciling operations. Value: {{ $value
Monitoring-PrometheusOperator_Status_Update_Errors 10m warning monitoring Controller "{{ $labels.controller }}" in cluster "{{ $labels.cluster }}" has failed status update operations. Value: {{ $value
Monitoring-PrometheusOperator_Node_Lookup_Errors 10m warning monitoring Controller "{{ $labels.controller }}" in cluster "{{ $labels.cluster }}" has errors while reconciling Prometheus.
Monitoring-PrometheusOperator_Not_Ready 5m warning monitoring Controller "{{ $labels.controller }}" in cluster "{{ $labels.cluster }}" is not ready to reconcile resources.
Monitoring-PrometheusOperator_Rejected_Resources 5m warning monitoring Controller "{{ $labels.controller }}" in cluster "{{ $labels.cluster }}" rejected "{{ $labels.resource }}" resources. Value: {{ printf "%0.0f" $value }}
Monitoring-Alertmanager_Failed_Reload 10m critical monitoring Alertmanager "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" has failed to load its configuration.
Monitoring-Alertmanager_Members_Inconsistent 15m critical monitoring Alertmanager "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" has only found {{ $value }} members of the Alertmanager cluster.
Monitoring-Alertmanager_Failed_to_Send_Alerts 5m warning monitoring Alertmanager "{{ $labels.pod }}" in cluster "{{ $labels.cluster }}" failed to send {{ $value
Monitoring-Alertmanager_Cluster_Failed_to_Send_Alerts 5m critical monitoring Alertmanager in cluster "{{ $labels.cluster }}" has high notification failure rate to "{{ $labels.integration }}". Value: {{ $value
Monitoring-Alertmanager_Config_Inconsistent 20m critical monitoring Alertmanager instances in cluster "{{ $labels.cluster }}" have different configurations.
rules/node_exporter.yml
Alert Name For Severity Type Description
Node-Recently_Rebooted 0m warning system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" has been rebooted in the last 30 minutes. Value: {{ humanizeDuration $value }} uptime.
Node-CPU_High_Usage 30m info system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - CPU usage has exceeded the threshold of 90% for more than 30 minutes. Value: {{ humanize $value }}
Node-CPU_High_steal 15m info system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - CPU %steal has exceeded the threshold of 10% for more than 15 minutes. Value: {{ humanize $value }}
Node-CPU_High_steal 15m warning system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - CPU %steal has exceeded the threshold of 30% for more than 15 minutes. Value: {{ humanize $value }}
Node-CPU_High_iowait 15m info system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - CPU %iowait has exceeded the threshold of 10% for more than 15 minutes. Value: {{ humanize $value }}
Node-CPU_High_iowait 15m error system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - CPU %iowait has exceeded the threshold of 30% for more than 15 minutes. Value: {{ humanize $value }}
Node-Memory_Major_Pages_Faults 15m warning system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - memory major page faults are occurring at very high rate. Value: {{ humanize $value }}
Node-Memory_High_Usage 10m info system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - more than 80% memory used for more than 10 minutes. Value: {{ humanize $value }}
Node-Memory_High_Usage 10m warning system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - more than 90% memory used for more than 10 minutes. Value: {{ humanize $value }}
Node-Disk_Space_High_Usage 15m warning system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - mountpoint "{{ $labels.mountpoint }}" has more than 80% used disk space for more than 15m. Value: {{ humanize $value }}
Node-Disk_Space_High_Usage 15m error system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - mountpoint "{{ $labels.mountpoint }}" has more than 90% used disk space for more than 15m. Value: {{ humanize $value }}
Node-Disk_Will_Fill_Up_In_4h 5m warning system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - mountpoint "{{ $labels.mountpoint }}" will fill up in 4 hours at the current rate of utilization. Value: {{ printf `node_filesystem_avail_bytes{fstype=~"ext.*
Node-High_Disk_Inodes_High_Usage 15m info system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - mountpoint "{{ $labels.mountpoint }}" has more than 60% used inodes for more than 15m. Value: {{ humanize $value }}
Node-High_Disk_Inodes_High_Usage 15m warning system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - mountpoint "{{ $labels.mountpoint }}" has more than 70% used inodes for more than 15m. Value: {{ humanize $value }}
Node-Load_High 15m warning system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - system load per core has been above 2 for the last 15 minutes. This might indicate resource saturation on this instance and can cause it to become unresponsive. Value: {{ humanize $value }}
Node-fds_Near_Limit_Process 5m warning system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - "{{ $labels.job }}" has more than 80% used file descriptors for more than 5m. Value: {{ humanize $value }}
Node-fds_Near_Limit 5m warning system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - more than 80% used file descriptors for more than 5m. Value: {{ humanize $value }}
Node-Network_High_Receive_Drop 0m warning system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - interface "{{ $labels.device }}" has high drop in network reception. Value: {{ humanize $value }}
Node-Network_High_Transmit_Drop 0m warning system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - interface "{{ $labels.device }}" has high drop in network transmission. Value: {{ humanize $value }}
Node-Network_High_Receive_Errors 30m warning system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - interface "{{ $labels.device }}" has encountered receive errors. Value: {{ humanize $value }}
Node-Network_High_Transmit_Errors 30m warning system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - interface "{{ $labels.device }}" has encountered transmit errors. Value: {{ humanize $value }}
Node-Network_Interface_Flapping 0m warning system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - interface "{{ $labels.device }}" is changing its up status often. Value: {{ humanize $value }}
Node-Network_Bond_Interface_Misconfigured 5m warning system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - bond "{{ $labels.master }}" is misconfigured. Check bonding slaves configuration.
Node-Network_Bond_Interface_Down 5m warning system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - bond "{{ $labels.master }}" has interface(s) down. Value: {{ $value }}
Node-Too_Many_OOM_Kills 0m warning system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - several OOM kills detected in the past 1h. Value: {{ humanize $value }}. Find out which process by running `dmesg
Node-Clock_Not_Synchronising 10m warning system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - clock is not synchronising. Ensure NTP is configured on this host.
Node-Clock_Skew_Detected 10m warning system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - clock is out of sync by more than 0.05s. Ensure NTP is configured correctly on this host.
Node-Host_Conntrack_Limit 10m warning system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - more than 75% of conntrack entries are used.
Node-EDAC_Correctable_Errors_Detected 0m warning system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - EDAC Correctable Errors detected.
Node-EDAC_Uncorrectable_Errors_Detected 0m error system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - EDAC Uncorrectable Errors detected.
Node-Filesystem_Device_Error 0m warning system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - device "{{ $labels.device }}" filesystem error.
Node-Disk_Queue_Length_High 10m info system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - device "{{ $labels.device }}" has disk queue length greater than 1 for more than 10 minutes. Value: {{ humanize $value }}
Node-Disk_IO_Time_Weighted_Seconds 10m info system Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" - device "{{ $labels.device }}" has a high disk I/O queue (aqu-sz) for more than 10 minutes. Value: {{ humanize $value }}
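
Several alerts in the table above appear twice under the same name with different thresholds and severities (for example Node-Memory_High_Usage at 80% and 90%). A minimal sketch of how such a two-tier rule could be written in Prometheus rule syntax follows. The group name, the `expr` expressions, and the use of `type` as a label are assumptions made for illustration; they are not taken from the chart's rules/node_exporter.yml.

```yaml
# Hypothetical sketch only; not the actual content of rules/node_exporter.yml.
groups:
  - name: node-memory-example          # illustrative group name
    rules:
      # Lower tier: same alert name, lower threshold, lower severity.
      - alert: Node-Memory_High_Usage
        # Assumed expression: percentage of memory in use, from node_exporter metrics.
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 80
        for: 10m
        labels:
          severity: info
          type: system                 # assumed to map to the "Type" column
        annotations:
          description: >-
            Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" -
            more than 80% memory used for more than 10 minutes.
            Value: {{ humanize $value }}
      # Upper tier: only the threshold and severity change.
      - alert: Node-Memory_High_Usage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 10m
        labels:
          severity: warning
          type: system
        annotations:
          description: >-
            Node "{{ $labels.instance }}" in cluster "{{ $labels.cluster }}" -
            more than 90% memory used for more than 10 minutes.
            Value: {{ humanize $value }}
```

With rules shaped like this, a node above 90% matches both tiers at once, so routing or inhibition on the `severity` label is typically what decides which notification is delivered.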
sylva-projects/sylva-elements/container-images/helm-toolbox (registry.gitlab.com/sylva-projects/sylva-elements/container-images/helm-toolbox)

v1.1.2: helm-toolbox: 1.1.2

Compare Source

Merge Requests integrated in this release

8 merge requests were integrated in this repo between 1.1.1 and 1.1.2. These notes don't account for the MRs merged in secondary repos.

Other dependency upgrades

  • Update alpine container to v3.22.2 !145
  • Update dependency helm-unittest/helm-unittest to v1.0.3 !143

CI

Contributors

0 people contributed.

v1.1.1: helm-toolbox: 1.1.1

Compare Source

Merge Requests integrated in this release

4 merge requests were integrated in this repo between 1.1.0 and 1.1.1. These notes don't account for the MRs merged in secondary repos.

Other dependency upgrades

  • Update dependency helm-unittest/helm-unittest to v1.0.1 !141
  • Update dependency mikefarah/yq to v4.47.2 !134

CI

  • Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.43 renovate !140 !142

Contributors

0 people contributed.

sylva-projects/sylva-elements/container-images/openstack-client (registry.gitlab.com/sylva-projects/sylva-elements/container-images/openstack-client)

v0.1.4: openstack-client: v0.1.4

Compare Source

Merge Requests integrated in this release

7 merge requests were integrated in this repo between v0.1.3 and v0.1.4. These notes don't account for the MRs merged in secondary repos.

OpenStack capo

Other dependency upgrades

  • Update sylva-elements/container-images/sylva-toolbox container to v1.4.6 !208 !212
  • Update alpine container to v3.23.3 !214 !209

CI

Contributors

0 people contributed.

v0.1.3: openstack-client: v0.1.3

Compare Source

Merge Requests integrated in this release

3 merge requests were integrated in this repo between v0.1.2 and v0.1.3. These notes don't account for the MRs merged in secondary repos.

Other dependency upgrades

  • Update sylva-elements/container-images/sylva-toolbox container to v1.1.10 !206

CI

Contributors

0 people contributed.

v0.1.2: openstack-client: v0.1.2

Compare Source

Merge Requests integrated in this release

10 merge requests were integrated in this repo between v0.1.1 and v0.1.2. These notes don't account for the MRs merged in secondary repos.

Other dependency upgrades

  • Update sylva-elements/container-images/sylva-toolbox container to v1.1.7 !195 !198 !199 !204
  • Update alpine container to v3.22.2 !201

CI

Contributors

0 people contributed.

sylva-projects/sylva-elements/helm-charts/sylva-library (sylva-library)

v0.6.4: sylva-library: 0.6.4

Compare Source

Merge Requests integrated in this release

2 merge requests were integrated in this repo between 0.6.0 and 0.6.4. These notes don't account for the MRs merged in secondary repos.

Baremetal capm3

  • [backport sylva 1.5 / 0.6] Correct network field names !150 backport

CI

  • Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.43 !128 renovate

Contributors

1 person contributed.

Priya Goyal

sylva-projects/sylva-elements/ci-tooling/ci-templates (sylva-projects/sylva-elements/ci-tooling/ci-templates)

v1.0.51: ci-templates: 1.0.51

Compare Source

Merge Requests integrated in this release

9 merge requests were integrated in this repo between 1.0.50 and 1.0.51. These notes don't account for the MRs merged in secondary repos.

Other dependency upgrades

CI

  • Update dependency to-be-continuous/gitleaks to v2.9.1 !233 renovate

Contributors

0 people contributed.


Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

♻️ Rebasing: Whenever MR becomes conflicted, or you tick the rebase/retry checkbox.

👻 Immortal: This MR will be recreated if closed unmerged. Get config help if that's undesired.


  • If you want to rebase/retry this MR, check this box

This MR has been generated by Renovate Bot Sylva instance.

CI configuration couldn't be handled by the MR description. A dedicated comment has been posted to control it.

If no checkbox is checked, a default pipeline will be enabled (capm3, or capo if capo label is set)

Edited by Sylva Renovate bot
