Update dependency https://gitlab.com/sylva-projects/sylva-elements/helm-charts/sylva-prometheus-rules.git to v0.0.17 (release-1.3)
This MR contains the following updates:
| Package | Update | Change |
|---|---|---|
| https://gitlab.com/sylva-projects/sylva-elements/helm-charts/sylva-prometheus-rules.git | patch | 0.0.15 -> 0.0.17 |
Release Notes
sylva-projects/sylva-elements/helm-charts/sylva-prometheus-rules (https://gitlab.com/sylva-projects/sylva-elements/helm-charts/sylva-prometheus-rules.git)
v0.0.17: sylva-prometheus-rules: 0.0.17
Merge Requests integrated in this release
CI
- Update dependency renovate-bot/renovate-runner to v22 !96 renovate
- Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.40 !97 renovate
Other
- Add Harbor rule !98
Contributors
sylva-prometheus-rules
Generate PrometheusRule objects for consumption by Prometheus
Overview
There are two mechanisms that control which rules are deployed:
- createRules selects which directories are considered
- optional_rules selects which files in those directories are added to the ConfigMap
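For illustration, each selected rules file is rendered into a PrometheusRule object. A minimal sketch of what a generated object could look like is shown below; the metadata, labels and the expr value are assumptions for illustration only, while the alert name, duration, severity, type and description are taken from the health-alerts table further down.

```yaml
# Illustrative sketch only: metadata and expr are assumed, not taken from the chart templates.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: allclusters-health-alerts        # hypothetical name derived from the source file
  labels:
    app.kubernetes.io/part-of: sylva-prometheus-rules
spec:
  groups:
    - name: health-alerts
      rules:
        - alert: KubeJobFailedAllClusters
          expr: kube_job_status_failed > 0   # assumed expression, for illustration
          for: 15m
          labels:
            severity: warning
            type: k8s
          annotations:
            description: >-
              Job "{{ $labels.namespace }}" / "{{ $labels.job_name }}" failed to complete.
              Removing failed job after investigation should clear this alert.
```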
Rules overrides
.Values.createRules controls which cluster rules are checked; its keys represent the directories under alert-rules/.
If .Values.createRules.allclusters is true (the default), the alert-rules/allclusters/*.yaml rules are parsed last, regardless of which other clusters are specified.
This allows for rule overriding. For example, with the values:

createRules:
  allclusters: true
  management-cluster: true

and the following rule files in the chart:

alert-rules/allclusters/health-alerts.yaml
alert-rules/allclusters/dummy.yaml
alert-rules/management-cluster/flux.yaml
alert-rules/management-cluster/health-alerts.yaml
alert-rules/management-cluster/minio.yaml
- First, the PrometheusRule objects named flux, minio and health-alerts from management-cluster are created.
- Then health-alerts and dummy from allclusters are parsed. Since health-alerts has already been applied from management-cluster, it is not applied again; dummy is applied since it does not override anything.

In effect, this allows the user to override the health-alerts from allclusters with the health-alerts from management-cluster.
Rules activation
.Values.optional_rules controls which rules are enabled for optional components
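As a rough sketch of how the two values interact, an override in values could look like the following; the keys and shape under optional_rules are hypothetical assumptions (the real schema may differ), while createRules uses the boolean-per-directory form described above.

```yaml
# Hypothetical values sketch; the optional_rules keys and shape are assumptions, not a verified schema.
createRules:
  allclusters: true          # parse alert-rules/allclusters/ last (default)
  management-cluster: true   # also consider alert-rules/management-cluster/
optional_rules:
  harbor: true               # assumed key: enable the Harbor rule for an optional component
  thanos: false              # assumed key: keep an optional component's rules disabled
```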
Details about rules
alert-rules/allclusters/health-alerts.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| KubeJobFailedAllClusters | 15m | warning | k8s | Job "{{ $labels.namespace }}"/ "{{ $labels.job_name }}" failed to complete. Removing failed job after investigation should clear this alert. |
alert-rules/allclusters/snmp-dell-idrac.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| SNMP_DELL_iDRAC_globalSystemStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - globalSystemStatus is NOK. Current state is: {{ $labels.globalSystemStatus }} |
| SNMP_DELL_iDRAC_systemStateBatteryStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateBatteryStatus is NOK. Current state is: {{ $labels.systemStateBatteryStatusCombined }}. Check RAID Controller BBU or CMOS battery in iDRAC. |
| SNMP_DELL_iDRAC_systemStateCoolingDeviceStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateCoolingDeviceStatus is NOK. Current state is: {{ $labels.systemStateCoolingDeviceStatusCombined }}. Check system fans in iDRAC. |
| SNMP_DELL_iDRAC_systemStateCoolingUnitStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateCoolingDeviceStatus is NOK. Current state is: {{ $labels.systemStateCoolingUnitStatusCombined }}. Check system fans in iDRAC. |
| SNMP_DELL_iDRAC_systemStateMemoryDeviceStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateMemoryDeviceStatus is NOK. Current state is: {{ $labels.systemStateMemoryDeviceStatusCombined }}. Check system volatile memory in iDRAC. |
| SNMP_DELL_iDRAC_systemStatePowerSupplyStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStatePowerSupplyStatus is NOK. Current state is: {{ $labels.systemStatePowerSupplyStatusCombined }}. Check system power supply in iDRAC. |
| SNMP_DELL_iDRAC_systemStatePowerUnitStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStatePowerUnitStatus is NOK. Current state is: {{ $labels.systemStatePowerUnitStatusCombined }}. Check system power supply or external power delivery in iDRAC. |
| SNMP_DELL_iDRAC_systemStateProcessorDeviceStatusCombined_NOK | 5m | critical | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateProcessorDeviceStatus is NOK. Current state is: {{ $labels.systemStateProcessorDeviceStatusCombined }}. Check system processor in iDRAC. |
| SNMP_DELL_iDRAC_systemStateTemperatureStatisticsStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateTemperatureStatisticsStatus is NOK. Current state is: {{ $labels.systemStateTemperatureStatisticsStatusCombined }}. Check system temperatures in iDRAC. |
| SNMP_DELL_iDRAC_systemStateTemperatureStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateTemperatureStatus is NOK. Current state is: {{ $labels.systemStateTemperatureStatusCombined }}. Check system temperatures in iDRAC. |
| SNMP_DELL_iDRAC_systemStateVoltageStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateVoltageStatus is NOK. Current state is: {{ $labels.systemStateVoltageStatusCombined }}. Check system voltage in iDRAC. |
| SNMP_DELL_iDRAC_systemStateAmperageStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateAmperageStatus is NOK. Current state is: {{ $labels.systemStateAmperageStatusCombined }}. Check system voltage in iDRAC. |
| SNMP_DELL_iDRAC_controllerRollUpStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - controllerRollUpStatus is NOK for controllerNumber {{ $labels.controllerNumber }} ( {{ $labels.controllerName }}). Current state is: {{ $labels.controllerRollUpStatus }}. |
| SNMP_DELL_iDRAC_controllerComponentStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - controllerComponentStatus is NOK for controllerNumber {{ $labels.controllerNumber }} ( {{ $labels.controllerName }}). Current state is: {{ $labels.controllerComponentStatus }}. |
| SNMP_DELL_iDRAC_physicalDiskState_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - physicalDiskState is NOK for physicalDiskNumber {{ $labels.physicalDiskNumber }} ( {{ $labels.physicalDiskDisplayName }}). Current state is: {{ $labels.physicalDiskState }}. |
| SNMP_DELL_iDRAC_physicalDiskComponentStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - physicalDiskComponentStatus is NOK for physicalDiskNumber {{ $labels.physicalDiskNumber }} ( {{ $labels.physicalDiskDisplayName }}). Current state is: {{ $labels.physicalDiskComponentStatus }}. |
| SNMP_DELL_iDRAC_physicalDiskSmartAlertIndication_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - physicalDiskSmartAlertIndication is NOK for physicalDiskNumber {{ $labels.physicalDiskNumber }} ( {{ $labels.physicalDiskDisplayName }}). |
| SNMP_DELL_iDRAC_physicalDiskRemainingRatedWriteEndurance_WARNING | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - physicalDiskRemainingRatedWriteEndurance is less than 40 for physicalDiskNumber {{ $labels.physicalDiskNumber }} ( {{ $labels.physicalDiskDisplayName }}). Value: {{ humanize $value }} |
| SNMP_DELL_iDRAC_physicalDiskRemainingRatedWriteEndurance_CRITICAL | 5m | critical | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - physicalDiskRemainingRatedWriteEndurance is less than 20 for physicalDiskNumber {{ $labels.physicalDiskNumber }} ( {{ $labels.physicalDiskDisplayName }}). Value: {{ humanize $value }} |
| SNMP_DELL_iDRAC_virtualDiskState_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - virtualDiskState is NOK for virtualDiskNumber {{ $labels.virtualDiskNumber }} ( {{ $labels.virtualDiskDisplayName }}). Current state is: {{ $labels.virtualDiskState }}. |
| SNMP_DELL_iDRAC_virtualDiskComponentStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - virtualDiskComponentStatus is NOK for virtualDiskNumber {{ $labels.virtualDiskNumber }} ( {{ $labels.virtualDiskDisplayName }}). Current state is: {{ $labels.virtualDiskComponentStatus }}. |
| SNMP_DELL_iDRAC_virtualDiskBadBlocksDetected | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - virtualDiskBadBlocksDetected for virtualDiskNumber {{ $labels.virtualDiskNumber }} ( {{ $labels.virtualDiskDisplayName }}). |
alert-rules/allclusters/snmp-hp-cpq.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| SNMP_HP_CPQ_Overall_Health_NOK | 5m | critical | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - Overall health status is NOK. Value: "{{ $labels.cpqHeMibCondition }}" |
| SNMP_HP_CPQ_Event_Log_Condition_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - Event Log Condition is NOK. Value: "{{ $labels.cpqHeEventLogCondition }}"}} |
| SNMP_HP_CPQ_CPU_Health_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - CPU status is NOK. Value: "{{ $labels.cpqSeCpuCondition }}"}} |
| SNMP_HP_CPQ_Thermal_Condition_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - Thermal condition status is NOK. Value: "{{ $labels.cpqHeThermalCondition }}"}} |
| SNMP_HP_CPQ_Power_Supply_Condition_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ]- Power supply condition status is NOK. Value: "{{ $labels.cpqHeFltTolPwrSupplyCondition }}"}} |
| SNMP_HP_CPQ_Storage_Subsystem_Condition_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - Storage subsystem condition status is NOK. Value: "{{ $labels.cpqSsMibCondition }}"}} |
| SNMP_HP_CPQ_Controller_Overall_Condition_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - Controller "{{ $labels.cpqDaCntlrIndex }}"}} status is NOK. Value: "{{ $labels.cpqDaCntlrCondition }}"}}. This value represents the overall condition of this controller, and any associated logical drives, physical drives, and array accelerator. |
| SNMP_HP_CPQ_iLO_LicenseKey_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - HP iLO interface is missing its License activation. |
alert-rules/allclusters/snmp-lenovo-xcc.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| SNMP_Lenovo_XCC_systemHealthStat_NOK | 5m | critical | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemHealthStat is not "normal". Current state is: {{ $labels.systemHealthStat }} |
| SNMP_Lenovo_XCC_cpuVpdHealthStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - cpuVpdHealthStatus for CPU "{{ $labels.cpuVpdDescription }}" is not "normal". Current state is: {{ $labels.cpuVpdHealthStatus }} |
| SNMP_Lenovo_XCC_raidDriveHealthStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - raidDriveHealthStatus for "{{ $labels.raidDriveName }}" is not "Normal". Current state is: {{ $labels.raidDriveHealthStatus }} |
| SNMP_Lenovo_XCC_memoryHealthStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - memoryHealthStatus for DIMM "{{ $labels.memoryVpdDescription }}" is not "Normal". Current state is: {{ $labels.memoryHealthStatus }} |
| SNMP_Lenovo_XCC_fanHealthStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - fanHealthStatus for Fan "{{ $labels.fanDescr }}" is not "Normal". Current state is: {{ $labels.fanHealthStatus }} |
| SNMP_Lenovo_XCC_voltHealthStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - voltHealthStatus for System Component "{{ $labels.voltDescr }}" is not "Normal". Current state is: {{ $labels.voltHealthStatus }} |
alert-rules/management-cluster/flux.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| Flux_Kustomization_Failing | 15m | warning | deployment | Flux Kustomization "{{ $labels.name }}" in namespace "{{ $labels.exported_namespace }}" fails to reconcile. |
| Flux_Kustomization_Failing_Cluster | 60m | warning | deployment | Flux Kustomization "{{ $labels.name }}" in namespace "{{ $labels.exported_namespace }}" fails to reconcile. |
| Flux_HelmRelease_Failing | 15m | warning | deployment | Flux HelmRelease "{{ $labels.name }}" in namespace "{{ $labels.exported_namespace}}" fails to reconcile. |
| Flux_Source_Failing | 15m | warning | deployment | Flux Source "{{ $labels.name }}" in namespace "{{ $labels.exported_namespace}}" fails to reconcile. |
| Flux_Resource_Suspended | 2h | warning | deployment | Flux Resource "{{ $labels.name }}" in namespace "{{ $labels.exported_namespace }}" suspended. |
alert-rules/management-cluster/harbor.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| Harbor_Component_Status_NOK | 5m | warning | tools | Harbor component "{{ $labels.component }}" status is DOWN. |
alert-rules/management-cluster/health-alerts.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| KubeContainerWaitingManagement | 1h | critical | k8s | Pod "{{ $labels.namespace }}" / "{{ $labels.pod }}" has been in waiting state for more than 1 hour. |
alert-rules/management-cluster/minio.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| MinIO_Cluster_Health_Status_NOK | 5m | critical | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" health status not OK. |
| MinIO_Cluster_Health_Status_Unknown | 5m | critical | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" health status is Unknown. The cluster does not return cluster metrics. Check pods logs for error messages. |
| MinIO_Cluster_Disk_Offline | 5m | critical | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" disk offline. |
| MinIO_Cluster_Disk_Space_Usage | 5m | warning | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" available disk space is less than 30%. |
| MinIO_Cluster_Disk_Space_Usage | 5m | critical | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" available disk space is less than 10%. |
| MinIO_Cluster_Disk_Space_Will_Fill_Up_Soon | 5m | warning | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" at the current rate of utilization the available disk space will run out in the next 2 days. |
| MinIO_Cluster_Tolerance | 5m | critical | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" has lost quorum on pool "{{ $labels.pool }}" / set "{{ $labels.set }}" for more than 5 minutes. |
| MinIO_Nodes_Offline | 5m | warning | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" has offline nodes. |
alert-rules/management-cluster/thanos.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| ThanosQueryStoreEndpointsMissing | 5m | critical | monitoring | Thanos Query is missing "{{ $labels.store_type }}" store type. Metrics served by this store type will not be available which can lead to alerting rules not evaluating properly. |
| ThanosCompactMultipleRunning | 5m | warning | monitoring | More than one Thanos Compact instance is running. Current number of instances: {{ $value }}. |
| ThanosCompactHalted | 5m | warning | monitoring | Thanos Compact {{ $labels.job }} has failed to run and now is halted. |
| ThanosCompactHighCompactionFailures | 15m | warning | monitoring | Thanos Compact {{ $labels.job }} is failing to execute {{ $value |
| ThanosCompactBucketHighOperationFailures | 15m | warning | monitoring | Thanos Compact {{ $labels.job }} Bucket is failing to execute {{ $value |
| ThanosCompactHasNotRun | 5m | warning | monitoring | Thanos Compact {{ $labels.job }} has not uploaded anything for 24 hours. |
| ThanosQueryHttpRequestQueryErrorRateHigh | 5m | critical | monitoring | Thanos Query {{ $labels.job }} is failing to handle {{ $value |
| ThanosQueryHttpRequestQueryRangeErrorRateHigh | 5m | critical | monitoring | Thanos Query {{ $labels.job }} is failing to handle {{ $value |
| ThanosQueryGrpcServerErrorRate | 5m | warning | monitoring | Thanos Query {{ $labels.job }} is failing to handle {{ $value |
| ThanosQueryGrpcClientErrorRate | 5m | warning | monitoring | Thanos Query {{ $labels.job }} is failing to send {{ $value |
| ThanosQueryHighDNSFailures | 15m | warning | monitoring | Thanos Query {{ $labels.job }} have {{ $value |
| ThanosQueryInstantLatencyHigh | 10m | critical | monitoring | Thanos Query {{ $labels.job }} has a 99th percentile latency of {{ $value }} seconds for instant queries. |
| ThanosQueryRangeLatencyHigh | 10m | critical | monitoring | Thanos Query {{ $labels.job }} has a 99th percentile latency of {{ $value }} seconds for range queries. |
| ThanosQueryOverload | 15m | warning | monitoring | Thanos Query {{ $labels.job }} has been overloaded for more than 15 minutes. This may be a symptom of excessive simultanous complex requests, low performance of the Prometheus API, or failures within these components. Assess the health of the Thanos query instances, the connnected Prometheus instances, look for potential senders of these requests and then contact support. |
| ThanosReceiveHttpRequestErrorRateHigh | 5m | critical | monitoring | Thanos Receive {{ $labels.job }} is failing to handle {{ $value |
| ThanosReceiveHttpRequestLatencyHigh | 10m | critical | monitoring | Thanos Receive {{ $labels.job }} has a 99th percentile latency of {{ $value }} seconds for requests. |
| ThanosReceiveHighReplicationFailures | 5m | warning | monitoring | Thanos Receive {{ $labels.job }} is failing to replicate {{ $value |
| ThanosReceiveHighForwardRequestFailures | 5m | info | monitoring | Thanos Receive {{ $labels.job }} is failing to forward {{ $value |
| ThanosReceiveHighHashringFileRefreshFailures | 15m | warning | monitoring | Thanos Receive {{ $labels.job }} is failing to refresh hashring file, {{ $value |
| ThanosReceiveConfigReloadFailure | 5m | warning | monitoring | Thanos Receive {{ $labels.job }} has not been able to reload hashring configurations. |
| ThanosReceiveNoUpload | 3h | critical | monitoring | Thanos Receive {{ $labels.pod }} has not uploaded latest data to object storage. |
| ThanosReceiveLimitsConfigReloadFailure | 5m | warning | monitoring | Thanos Receive {{ $labels.job }} has not been able to reload the limits configuration. |
| ThanosReceiveLimitsHighMetaMonitoringQueriesFailureRate | 5m | warning | monitoring | Thanos Receive {{ $labels.job }} is failing for {{ $value |
| ThanosReceiveTenantLimitedByHeadSeries | 5m | warning | monitoring | Thanos Receive tenant {{ $labels.tenant }} is limited by head series. |
| ThanosStoreGrpcErrorRate | 5m | warning | monitoring | Thanos Store {{ $labels.job }} is failing to handle {{ $value |
| ThanosStoreSeriesGateLatencyHigh | 10m | warning | monitoring | Thanos Store {{ $labels.job }} has a 99th percentile latency of {{ $value }} seconds for store series gate requests. |
| ThanosStoreBucketHighOperationFailures | 15m | warning | monitoring | Thanos Store {{ $labels.job }} Bucket is failing to execute {{ $value |
| ThanosStoreObjstoreOperationLatencyHigh | 10m | warning | monitoring | Thanos Store {{ $labels.job }} Bucket has a 99th percentile latency of {{ $value }} seconds for the bucket operations. |
| ThanosRuleQueueIsDroppingAlerts | 5m | critical | monitoring | Thanos Rule {{ $labels.pod }} is failing to queue alerts. |
| ThanosRuleSenderIsFailingAlerts | 5m | critical | monitoring | Thanos Rule {{ $labels.pod }} is failing to send alerts to alertmanager. |
| ThanosRuleHighRuleEvaluationFailures | 5m | critical | monitoring | Thanos Rule {{ $labels.pod }} is failing to evaluate rules. |
| ThanosRuleHighRuleEvaluationWarnings | 15m | info | monitoring | Thanos Rule {{ $labels.pod }} has high number of evaluation warnings. |
| ThanosRuleRuleEvaluationLatencyHigh | 5m | warning | monitoring | Thanos Rule {{ labels.pod }} has higher evaluation latency than interval for {{labels.rule_group}}. |
| ThanosRuleGrpcErrorRate | 5m | warning | monitoring | Thanos Ruler {{ $labels.pod }} is failing to handle {{ $value |
| ThanosRuleConfigReloadFailure | 5m | info | monitoring | Thanos Ruler {{ $labels.pod }} has not been able to reload its configuration. |
| ThanosRuleQueryHighDNSFailures | 15m | warning | monitoring | Thanos Ruler {{ $labels.pod }} has {{ $value |
| ThanosRuleAlertmanagerHighDNSFailures | 15m | warning | monitoring | Thanos Rule {{ $labels.pod }} has {{ $value |
| ThanosRuleNoEvaluationFor10Intervals | 5m | info | monitoring | Thanos Ruler {{ $labels.pod }} has rule groups that did not evaluate for at least 10x of their expected interval. |
| ThanosNoRuleEvaluations | 5m | critical | monitoring | Thanos Ruler {{ $labels.pod }} did not perform any rule evaluations in the past 10 minutes. |
| ThanosBucketReplicateErrorRate | 5m | critical | monitoring | Thanos Replicate is failing to run, {{ $value |
| ThanosBucketReplicateRunLatency | 5m | critical | monitoring | Thanos Replicate {{ $labels.job }} has a 99th percentile latency of {{ $value }} seconds for the replicate operations. |
| ThanosCompactIsDown | 5m | critical | monitoring | Thanos Compact has disappeared. Prometheus target for the component cannot be discovered. |
| ThanosQueryIsDown | 5m | critical | monitoring | Thanos Query has disappeared. Prometheus target for the component cannot be discovered. |
| ThanosQueryFrontendIsDown | 5m | critical | monitoring | Thanos Query Frontend has disappeared. Prometheus target for the component cannot be discovered. |
| ThanosReceiveIsDown | 5m | critical | monitoring | Thanos Receive has disappeared. Prometheus target for the component cannot be discovered. |
| ThanosRuleIsDown | 5m | critical | monitoring | Thanos Ruler has disappeared. Prometheus target for the component cannot be discovered. |
| ThanosStoreIsDown | 5m | critical | monitoring | Thanos Store has disappeared. Prometheus target for the component cannot be discovered. |
alert-rules/my-workload-cluster/health-alerts.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| KubeJobFailedWorkload | 15m | warning | k8s | Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete. Removing failed job after investigation should clear this alert. |
v0.0.16: sylva-prometheus-rules: 0.0.16
Merge Requests integrated in this release
Monitoring & logging
- Add Thanos Query store type endpoint missing alert !91
CI
- Update dependency renovate-bot/renovate-runner to v21
  - Update dependency renovate-bot/renovate-runner to v18.89.2 !54 renovate
  - Update dependency renovate-bot/renovate-runner to v18.96.5 !57 renovate
  - Update dependency renovate-bot/renovate-runner to v18.96.7 !60 renovate
  - Update dependency renovate-bot/renovate-runner to v19 !61 renovate
  - Update dependency renovate-bot/renovate-runner to v19.10.1 !62 renovate
  - Update dependency renovate-bot/renovate-runner to v19.19.0 !63 renovate
  - Update dependency renovate-bot/renovate-runner to v19.28.1 !65 renovate
  - Update dependency renovate-bot/renovate-runner to v19.41.2 !67 renovate
  - Update dependency renovate-bot/renovate-runner to v19.49.2 !69 renovate
  - Update dependency renovate-bot/renovate-runner to v19.56.1 !71 renovate
  - Update dependency renovate-bot/renovate-runner to v19.60.0 !72 renovate
  - Update dependency renovate-bot/renovate-runner to v19.64.0 !73 renovate
  - Update dependency renovate-bot/renovate-runner to v19.77.0 !75 renovate
  - Update dependency renovate-bot/renovate-runner to v19.84.1 !76 renovate
  - Update dependency renovate-bot/renovate-runner to v19.94.0 !79 renovate
  - Update dependency renovate-bot/renovate-runner to v19.107.1 !82 renovate
  - Update dependency renovate-bot/renovate-runner to v19.111.4 !83 renovate
  - Update dependency renovate-bot/renovate-runner to v20 !84 renovate
  - Update dependency renovate-bot/renovate-runner to v20.1.0 !86 renovate
  - Update dependency renovate-bot/renovate-runner to v21 !94 renovate
- Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.39
  - Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.26 !55 renovate
  - Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.27 !56 renovate
  - Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.28 !58 renovate
  - Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.29 !59 renovate
  - Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.30 !64 renovate
  - Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.31 !66 renovate
  - Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.32 !70 renovate
  - Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.33 !74 renovate
  - Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.34 !77 renovate
  - Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.35 !78 renovate
  - Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.36 !80 renovate
  - Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.37 !85 renovate
  - Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.38 !87 renovate
  - Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.39 !89 renovate
Other
- Add Lenovo XCC rules to SNMP conditionals !68
- Add 'home' field to Chart.yaml !90
- Fix path in pre-commit-hook script !92
- Remove Longhorn rules !95
Contributors
Alin H, Bogdan Antohe, Stephen Oresanya
sylva-prometheus-rules
Generate PrometheusRule objects for consumption by Prometheus
Overview
There are two mechanisms that control which rules are deployed:
- createRules selects which directories are considered
- optional_rules selects which files in those directories are added to the ConfigMap
Rules overrides
.Values.createRules controls which cluster rules are checked; its keys represent the directories under alert-rules/.
If .Values.createRules.allclusters is true (the default), the alert-rules/allclusters/*.yaml rules are parsed last, regardless of which other clusters are specified.
This allows for rule overriding. For example, with the values:

createRules:
  allclusters: true
  management-cluster: true

and the following rule files in the chart:

alert-rules/allclusters/health-alerts.yaml
alert-rules/allclusters/dummy.yaml
alert-rules/management-cluster/flux.yaml
alert-rules/management-cluster/health-alerts.yaml
alert-rules/management-cluster/minio.yaml
- First, the PrometheusRule objects named flux, minio and health-alerts from management-cluster are created.
- Then health-alerts and dummy from allclusters are parsed. Since health-alerts has already been applied from management-cluster, it is not applied again; dummy is applied since it does not override anything.

In effect, this allows the user to override the health-alerts from allclusters with the health-alerts from management-cluster.
Rules activation
.Values.optional_rules controls which rules are enabled for optional components
Details about rules
alert-rules/allclusters/health-alerts.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| KubeJobFailedAllClusters | 15m | warning | k8s | Job "{{ $labels.namespace }}"/ "{{ $labels.job_name }}" failed to complete. Removing failed job after investigation should clear this alert. |
alert-rules/allclusters/snmp-dell-idrac.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| SNMP_DELL_iDRAC_globalSystemStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - globalSystemStatus is NOK. Current state is: {{ $labels.globalSystemStatus }} |
| SNMP_DELL_iDRAC_systemStateBatteryStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateBatteryStatus is NOK. Current state is: {{ $labels.systemStateBatteryStatusCombined }}. Check RAID Controller BBU or CMOS battery in iDRAC. |
| SNMP_DELL_iDRAC_systemStateCoolingDeviceStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateCoolingDeviceStatus is NOK. Current state is: {{ $labels.systemStateCoolingDeviceStatusCombined }}. Check system fans in iDRAC. |
| SNMP_DELL_iDRAC_systemStateCoolingUnitStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateCoolingDeviceStatus is NOK. Current state is: {{ $labels.systemStateCoolingUnitStatusCombined }}. Check system fans in iDRAC. |
| SNMP_DELL_iDRAC_systemStateMemoryDeviceStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateMemoryDeviceStatus is NOK. Current state is: {{ $labels.systemStateMemoryDeviceStatusCombined }}. Check system volatile memory in iDRAC. |
| SNMP_DELL_iDRAC_systemStatePowerSupplyStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStatePowerSupplyStatus is NOK. Current state is: {{ $labels.systemStatePowerSupplyStatusCombined }}. Check system power supply in iDRAC. |
| SNMP_DELL_iDRAC_systemStatePowerUnitStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStatePowerUnitStatus is NOK. Current state is: {{ $labels.systemStatePowerUnitStatusCombined }}. Check system power supply or external power delivery in iDRAC. |
| SNMP_DELL_iDRAC_systemStateProcessorDeviceStatusCombined_NOK | 5m | critical | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateProcessorDeviceStatus is NOK. Current state is: {{ $labels.systemStateProcessorDeviceStatusCombined }}. Check system processor in iDRAC. |
| SNMP_DELL_iDRAC_systemStateTemperatureStatisticsStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateTemperatureStatisticsStatus is NOK. Current state is: {{ $labels.systemStateTemperatureStatisticsStatusCombined }}. Check system temperatures in iDRAC. |
| SNMP_DELL_iDRAC_systemStateTemperatureStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateTemperatureStatus is NOK. Current state is: {{ $labels.systemStateTemperatureStatusCombined }}. Check system temperatures in iDRAC. |
| SNMP_DELL_iDRAC_systemStateVoltageStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateVoltageStatus is NOK. Current state is: {{ $labels.systemStateVoltageStatusCombined }}. Check system voltage in iDRAC. |
| SNMP_DELL_iDRAC_systemStateAmperageStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateAmperageStatus is NOK. Current state is: {{ $labels.systemStateAmperageStatusCombined }}. Check system voltage in iDRAC. |
| SNMP_DELL_iDRAC_controllerRollUpStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - controllerRollUpStatus is NOK for controllerNumber {{ $labels.controllerNumber }} ( {{ $labels.controllerName }}). Current state is: {{ $labels.controllerRollUpStatus }}. |
| SNMP_DELL_iDRAC_controllerComponentStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - controllerComponentStatus is NOK for controllerNumber {{ $labels.controllerNumber }} ( {{ $labels.controllerName }}). Current state is: {{ $labels.controllerComponentStatus }}. |
| SNMP_DELL_iDRAC_physicalDiskState_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - physicalDiskState is NOK for physicalDiskNumber {{ $labels.physicalDiskNumber }} ( {{ $labels.physicalDiskDisplayName }}). Current state is: {{ $labels.physicalDiskState }}. |
| SNMP_DELL_iDRAC_physicalDiskComponentStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - physicalDiskComponentStatus is NOK for physicalDiskNumber {{ $labels.physicalDiskNumber }} ( {{ $labels.physicalDiskDisplayName }}). Current state is: {{ $labels.physicalDiskComponentStatus }}. |
| SNMP_DELL_iDRAC_physicalDiskSmartAlertIndication_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - physicalDiskSmartAlertIndication is NOK for physicalDiskNumber {{ $labels.physicalDiskNumber }} ( {{ $labels.physicalDiskDisplayName }}). |
| SNMP_DELL_iDRAC_physicalDiskRemainingRatedWriteEndurance_WARNING | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - physicalDiskRemainingRatedWriteEndurance is less than 40 for physicalDiskNumber {{ $labels.physicalDiskNumber }} ( {{ $labels.physicalDiskDisplayName }}). Value: {{ humanize $value }} |
| SNMP_DELL_iDRAC_physicalDiskRemainingRatedWriteEndurance_CRITICAL | 5m | critical | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - physicalDiskRemainingRatedWriteEndurance is less than 20 for physicalDiskNumber {{ $labels.physicalDiskNumber }} ( {{ $labels.physicalDiskDisplayName }}). Value: {{ humanize $value }} |
| SNMP_DELL_iDRAC_virtualDiskState_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - virtualDiskState is NOK for virtualDiskNumber {{ $labels.virtualDiskNumber }} ( {{ $labels.virtualDiskDisplayName }}). Current state is: {{ $labels.virtualDiskState }}. |
| SNMP_DELL_iDRAC_virtualDiskComponentStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - virtualDiskComponentStatus is NOK for virtualDiskNumber {{ $labels.virtualDiskNumber }} ( {{ $labels.virtualDiskDisplayName }}). Current state is: {{ $labels.virtualDiskComponentStatus }}. |
| SNMP_DELL_iDRAC_virtualDiskBadBlocksDetected | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - virtualDiskBadBlocksDetected for virtualDiskNumber {{ $labels.virtualDiskNumber }} ( {{ $labels.virtualDiskDisplayName }}). |
alert-rules/allclusters/snmp-hp-cpq.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| SNMP_HP_CPQ_Overall_Health_NOK | 5m | critical | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - Overall health status is NOK. Value: "{{ $labels.cpqHeMibCondition }}" |
| SNMP_HP_CPQ_Event_Log_Condition_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - Event Log Condition is NOK. Value: "{{ $labels.cpqHeEventLogCondition }}"}} |
| SNMP_HP_CPQ_CPU_Health_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - CPU status is NOK. Value: "{{ $labels.cpqSeCpuCondition }}"}} |
| SNMP_HP_CPQ_Thermal_Condition_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - Thermal condition status is NOK. Value: "{{ $labels.cpqHeThermalCondition }}"}} |
| SNMP_HP_CPQ_Power_Supply_Condition_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ]- Power supply condition status is NOK. Value: "{{ $labels.cpqHeFltTolPwrSupplyCondition }}"}} |
| SNMP_HP_CPQ_Storage_Subsystem_Condition_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - Storage subsystem condition status is NOK. Value: "{{ $labels.cpqSsMibCondition }}"}} |
| SNMP_HP_CPQ_Controller_Overall_Condition_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - Controller "{{ $labels.cpqDaCntlrIndex }}"}} status is NOK. Value: "{{ $labels.cpqDaCntlrCondition }}"}}. This value represents the overall condition of this controller, and any associated logical drives, physical drives, and array accelerator. |
| SNMP_HP_CPQ_iLO_LicenseKey_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - HP iLO interface is missing its License activation. |
alert-rules/allclusters/snmp-lenovo-xcc.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| SNMP_Lenovo_XCC_systemHealthStat_NOK | 5m | critical | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemHealthStat is not "normal". Current state is: {{ $labels.systemHealthStat }} |
| SNMP_Lenovo_XCC_cpuVpdHealthStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - cpuVpdHealthStatus for CPU "{{ $labels.cpuVpdDescription }}" is not "normal". Current state is: {{ $labels.cpuVpdHealthStatus }} |
| SNMP_Lenovo_XCC_raidDriveHealthStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - raidDriveHealthStatus for "{{ $labels.raidDriveName }}" is not "Normal". Current state is: {{ $labels.raidDriveHealthStatus }} |
| SNMP_Lenovo_XCC_memoryHealthStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - memoryHealthStatus for DIMM "{{ $labels.memoryVpdDescription }}" is not "Normal". Current state is: {{ $labels.memoryHealthStatus }} |
| SNMP_Lenovo_XCC_fanHealthStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - fanHealthStatus for Fan "{{ $labels.fanDescr }}" is not "Normal". Current state is: {{ $labels.fanHealthStatus }} |
| SNMP_Lenovo_XCC_voltHealthStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - voltHealthStatus for System Component "{{ $labels.voltDescr }}" is not "Normal". Current state is: {{ $labels.voltHealthStatus }} |
alert-rules/management-cluster/flux.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| Flux_Kustomization_Failing | 15m | warning | deployment | Flux Kustomization "{{ $labels.name }}" in namespace "{{ $labels.exported_namespace }}" fails to reconcile. |
| Flux_Kustomization_Failing_Cluster | 60m | warning | deployment | Flux Kustomization "{{ $labels.name }}" in namespace "{{ $labels.exported_namespace }}" fails to reconcile. |
| Flux_HelmRelease_Failing | 15m | warning | deployment | Flux HelmRelease "{{ $labels.name }}" in namespace "{{ $labels.exported_namespace}}" fails to reconcile. |
| Flux_Source_Failing | 15m | warning | deployment | Flux Source "{{ $labels.name }}" in namespace "{{ $labels.exported_namespace}}" fails to reconcile. |
| Flux_Resource_Suspended | 2h | warning | deployment | Flux Resource "{{ $labels.name }}" in namespace "{{ $labels.exported_namespace }}" suspended. |
alert-rules/management-cluster/health-alerts.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| KubeContainerWaitingManagement | 1h | critical | k8s | Pod "{{ $labels.namespace }}" / "{{ $labels.pod }}" has been in waiting state for more than 1 hour. |
alert-rules/management-cluster/minio.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| MinIO_Cluster_Health_Status_NOK | 5m | critical | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" health status not OK. |
| MinIO_Cluster_Health_Status_Unknown | 5m | critical | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" health status is Unknown. The cluster does not return cluster metrics. Check pods logs for error messages. |
| MinIO_Cluster_Disk_Offline | 5m | critical | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" disk offline. |
| MinIO_Cluster_Disk_Space_Usage | 5m | warning | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" available disk space is less than 30%. |
| MinIO_Cluster_Disk_Space_Usage | 5m | critical | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" available disk space is less than 10%. |
| MinIO_Cluster_Disk_Space_Will_Fill_Up_Soon | 5m | warning | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" at the current rate of utilization the available disk space will run out in the next 2 days. |
| MinIO_Cluster_Tolerance | 5m | critical | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" has lost quorum on pool "{{ $labels.pool }}" / set "{{ $labels.set }}" for more than 5 minutes. |
| MinIO_Nodes_Offline | 5m | warning | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" has offline nodes. |
alert-rules/management-cluster/thanos.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| ThanosQueryStoreEndpointsMissing | 5m | critical | monitoring | Thanos Query is missing "{{ $labels.store_type }}" store type. Metrics served by this store type will not be available which can lead to alerting rules not evaluating properly. |
| ThanosCompactMultipleRunning | 5m | warning | monitoring | More than one Thanos Compact instance is running. Current number of instances: {{ $value }}. |
| ThanosCompactHalted | 5m | warning | monitoring | Thanos Compact {{ $labels.job }} has failed to run and now is halted. |
| ThanosCompactHighCompactionFailures | 15m | warning | monitoring | Thanos Compact {{ $labels.job }} is failing to execute {{ $value |
| ThanosCompactBucketHighOperationFailures | 15m | warning | monitoring | Thanos Compact {{ $labels.job }} Bucket is failing to execute {{ $value |
| ThanosCompactHasNotRun | 5m | warning | monitoring | Thanos Compact {{ $labels.job }} has not uploaded anything for 24 hours. |
| ThanosQueryHttpRequestQueryErrorRateHigh | 5m | critical | monitoring | Thanos Query {{ $labels.job }} is failing to handle {{ $value |
| ThanosQueryHttpRequestQueryRangeErrorRateHigh | 5m | critical | monitoring | Thanos Query {{ $labels.job }} is failing to handle {{ $value |
| ThanosQueryGrpcServerErrorRate | 5m | warning | monitoring | Thanos Query {{ $labels.job }} is failing to handle {{ $value |
| ThanosQueryGrpcClientErrorRate | 5m | warning | monitoring | Thanos Query {{ $labels.job }} is failing to send {{ $value |
| ThanosQueryHighDNSFailures | 15m | warning | monitoring | Thanos Query {{ $labels.job }} have {{ $value |
| ThanosQueryInstantLatencyHigh | 10m | critical | monitoring | Thanos Query {{ $labels.job }} has a 99th percentile latency of {{ $value }} seconds for instant queries. |
| ThanosQueryRangeLatencyHigh | 10m | critical | monitoring | Thanos Query {{ $labels.job }} has a 99th percentile latency of {{ $value }} seconds for range queries. |
| ThanosQueryOverload | 15m | warning | monitoring | Thanos Query {{ $labels.job }} has been overloaded for more than 15 minutes. This may be a symptom of excessive simultanous complex requests, low performance of the Prometheus API, or failures within these components. Assess the health of the Thanos query instances, the connnected Prometheus instances, look for potential senders of these requests and then contact support. |
| ThanosReceiveHttpRequestErrorRateHigh | 5m | critical | monitoring | Thanos Receive {{ $labels.job }} is failing to handle {{ $value |
| ThanosReceiveHttpRequestLatencyHigh | 10m | critical | monitoring | Thanos Receive {{ $labels.job }} has a 99th percentile latency of {{ $value }} seconds for requests. |
| ThanosReceiveHighReplicationFailures | 5m | warning | monitoring | Thanos Receive {{ $labels.job }} is failing to replicate {{ $value |
| ThanosReceiveHighForwardRequestFailures | 5m | info | monitoring | Thanos Receive {{ $labels.job }} is failing to forward {{ $value |
| ThanosReceiveHighHashringFileRefreshFailures | 15m | warning | monitoring | Thanos Receive {{ $labels.job }} is failing to refresh hashring file, {{ $value |
| ThanosReceiveConfigReloadFailure | 5m | warning | monitoring | Thanos Receive {{ $labels.job }} has not been able to reload hashring configurations. |
| ThanosReceiveNoUpload | 3h | critical | monitoring | Thanos Receive {{ $labels.pod }} has not uploaded latest data to object storage. |
| ThanosReceiveLimitsConfigReloadFailure | 5m | warning | monitoring | Thanos Receive {{ $labels.job }} has not been able to reload the limits configuration. |
| ThanosReceiveLimitsHighMetaMonitoringQueriesFailureRate | 5m | warning | monitoring | Thanos Receive {{ $labels.job }} is failing for {{ $value |
| ThanosReceiveTenantLimitedByHeadSeries | 5m | warning | monitoring | Thanos Receive tenant {{ $labels.tenant }} is limited by head series. |
| ThanosStoreGrpcErrorRate | 5m | warning | monitoring | Thanos Store {{ $labels.job }} is failing to handle {{ $value |
| ThanosStoreSeriesGateLatencyHigh | 10m | warning | monitoring | Thanos Store {{ $labels.job }} has a 99th percentile latency of {{ $value }} seconds for store series gate requests. |
| ThanosStoreBucketHighOperationFailures | 15m | warning | monitoring | Thanos Store {{ $labels.job }} Bucket is failing to execute {{ $value |
| ThanosStoreObjstoreOperationLatencyHigh | 10m | warning | monitoring | Thanos Store {{ $labels.job }} Bucket has a 99th percentile latency of {{ $value }} seconds for the bucket operations. |
| ThanosRuleQueueIsDroppingAlerts | 5m | critical | monitoring | Thanos Rule {{ $labels.pod }} is failing to queue alerts. |
| ThanosRuleSenderIsFailingAlerts | 5m | critical | monitoring | Thanos Rule {{ $labels.pod }} is failing to send alerts to alertmanager. |
| ThanosRuleHighRuleEvaluationFailures | 5m | critical | monitoring | Thanos Rule {{ $labels.pod }} is failing to evaluate rules. |
| ThanosRuleHighRuleEvaluationWarnings | 15m | info | monitoring | Thanos Rule {{ $labels.pod }} has high number of evaluation warnings. |
| ThanosRuleRuleEvaluationLatencyHigh | 5m | warning | monitoring | Thanos Rule {{ labels.pod }} has higher evaluation latency than interval for {{labels.rule_group}}. |
| ThanosRuleGrpcErrorRate | 5m | warning | monitoring | Thanos Ruler {{ $labels.pod }} is failing to handle {{ $value |
| ThanosRuleConfigReloadFailure | 5m | info | monitoring | Thanos Ruler {{ $labels.pod }} has not been able to reload its configuration. |
| ThanosRuleQueryHighDNSFailures | 15m | warning | monitoring | Thanos Ruler {{ $labels.pod }} has {{ $value |
| ThanosRuleAlertmanagerHighDNSFailures | 15m | warning | monitoring | Thanos Rule {{ $labels.pod }} has {{ $value |
| ThanosRuleNoEvaluationFor10Intervals | 5m | info | monitoring | Thanos Ruler {{ $labels.pod }} has rule groups that did not evaluate for at least 10x of their expected interval. |
| ThanosNoRuleEvaluations | 5m | critical | monitoring | Thanos Ruler {{ $labels.pod }} did not perform any rule evaluations in the past 10 minutes. |
| ThanosBucketReplicateErrorRate | 5m | critical | monitoring | Thanos Replicate is failing to run, {{ $value |
| ThanosBucketReplicateRunLatency | 5m | critical | monitoring | Thanos Replicate {{ $labels.job }} has a 99th percentile latency of {{ $value }} seconds for the replicate operations. |
| ThanosCompactIsDown | 5m | critical | monitoring | Thanos Compact has disappeared. Prometheus target for the component cannot be discovered. |
| ThanosQueryIsDown | 5m | critical | monitoring | Thanos Query has disappeared. Prometheus target for the component cannot be discovered. |
| ThanosQueryFrontendIsDown | 5m | critical | monitoring | Thanos Query Frontend has disappeared. Prometheus target for the component cannot be discovered. |
| ThanosReceiveIsDown | 5m | critical | monitoring | Thanos Receive has disappeared. Prometheus target for the component cannot be discovered. |
| ThanosRuleIsDown | 5m | critical | monitoring | Thanos Ruler has disappeared. Prometheus target for the component cannot be discovered. |
| ThanosStoreIsDown | 5m | critical | monitoring | Thanos Store has disappeared. Prometheus target for the component cannot be discovered. |
alert-rules/my-workload-cluster/health-alerts.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| KubeJobFailedWorkload | 15m | warning | k8s | Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete. Removing failed job after investigation should clear this alert. |
Configuration
- [ ] If you want to rebase/retry this MR, check this box
This MR has been generated by the Renovate Bot Sylva instance.
The CI configuration couldn't be handled in the MR description; a dedicated comment has been posted to control it.
If no checkbox is checked, a default pipeline will be enabled (capm3, or capo if the capo label is set).