Update dependency https://gitlab.com/sylva-projects/sylva-elements/helm-charts/sylva-prometheus-rules.git to v0.0.17 (release-1.3)
This MR contains the following updates:
| Package | Update | Change |
|---|---|---|
| https://gitlab.com/sylva-projects/sylva-elements/helm-charts/sylva-prometheus-rules.git | patch | 0.0.15 -> 0.0.17 |
Release Notes
sylva-projects/sylva-elements/helm-charts/sylva-prometheus-rules (https://gitlab.com/sylva-projects/sylva-elements/helm-charts/sylva-prometheus-rules.git)
v0.0.17: sylva-prometheus-rules: 0.0.17
Merge Requests integrated in this release
CI
- Update dependency renovate-bot/renovate-runner to v22 !96 renovate
- Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.40 !97 renovate
Other
- Add Harbor rule !98
Contributors
sylva-prometheus-rules
Generate PrometheusRule objects for consumption by Prometheus
Overview
There are two mechanisms that control which rules are deployed:
- createRules selects which directories are considered
- optional_rules selects which files in those directories are added to the ConfigMap
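For illustration, each selected rules file is rendered into a PrometheusRule object. A minimal sketch of what a generated object could look like is shown below; the metadata, labels and the expr value are assumptions for illustration only, while the alert name, duration, severity, type and description are taken from the health-alerts table further down.

```yaml
# Illustrative sketch only: metadata and expr are assumed, not taken from the chart templates.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: allclusters-health-alerts        # hypothetical name derived from the source file
  labels:
    app.kubernetes.io/part-of: sylva-prometheus-rules
spec:
  groups:
    - name: health-alerts
      rules:
        - alert: KubeJobFailedAllClusters
          expr: kube_job_status_failed > 0   # assumed expression, for illustration
          for: 15m
          labels:
            severity: warning
            type: k8s
          annotations:
            description: >-
              Job "{{ $labels.namespace }}" / "{{ $labels.job_name }}" failed to complete.
              Removing failed job after investigation should clear this alert.
```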
Rules overrides
.Values.createRules controls which cluster rules are checked; its keys represent the directories under alert-rules/.
If .Values.createRules.allclusters is true (the default), the alert-rules/allclusters/*.yaml rules are parsed last, regardless of which other clusters are specified.
This allows for rule overriding. For example, with the values:

createRules:
  allclusters: true
  management-cluster: true

and the following rule files in the chart:

alert-rules/allclusters/health-alerts.yaml
alert-rules/allclusters/dummy.yaml
alert-rules/management-cluster/flux.yaml
alert-rules/management-cluster/health-alerts.yaml
alert-rules/management-cluster/minio.yaml
- First, the PrometheusRule objects named flux, minio and health-alerts from management-cluster are created.
- Then health-alerts and dummy from allclusters are parsed. Since health-alerts has already been applied from management-cluster, it is not applied again; dummy is applied since it does not override anything.

In effect, this allows the user to override the health-alerts from allclusters with the health-alerts from management-cluster.
Rules activation
.Values.optional_rules controls which rules are enabled for optional components
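As a rough sketch of how the two values interact, an override in values could look like the following; the keys and shape under optional_rules are hypothetical assumptions (the real schema may differ), while createRules uses the boolean-per-directory form described above.

```yaml
# Hypothetical values sketch; the optional_rules keys and shape are assumptions, not a verified schema.
createRules:
  allclusters: true          # parse alert-rules/allclusters/ last (default)
  management-cluster: true   # also consider alert-rules/management-cluster/
optional_rules:
  harbor: true               # assumed key: enable the Harbor rule for an optional component
  thanos: false              # assumed key: keep an optional component's rules disabled
```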
Details about rules
alert-rules/allclusters/health-alerts.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| KubeJobFailedAllClusters | 15m | warning | k8s | Job "{{ $labels.namespace }}"/ "{{ $labels.job_name }}" failed to complete. Removing failed job after investigation should clear this alert. |
alert-rules/allclusters/snmp-dell-idrac.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| SNMP_DELL_iDRAC_globalSystemStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - globalSystemStatus is NOK. Current state is: {{ $labels.globalSystemStatus }} |
| SNMP_DELL_iDRAC_systemStateBatteryStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateBatteryStatus is NOK. Current state is: {{ $labels.systemStateBatteryStatusCombined }}. Check RAID Controller BBU or CMOS battery in iDRAC. |
| SNMP_DELL_iDRAC_systemStateCoolingDeviceStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateCoolingDeviceStatus is NOK. Current state is: {{ $labels.systemStateCoolingDeviceStatusCombined }}. Check system fans in iDRAC. |
| SNMP_DELL_iDRAC_systemStateCoolingUnitStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateCoolingDeviceStatus is NOK. Current state is: {{ $labels.systemStateCoolingUnitStatusCombined }}. Check system fans in iDRAC. |
| SNMP_DELL_iDRAC_systemStateMemoryDeviceStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateMemoryDeviceStatus is NOK. Current state is: {{ $labels.systemStateMemoryDeviceStatusCombined }}. Check system volatile memory in iDRAC. |
| SNMP_DELL_iDRAC_systemStatePowerSupplyStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStatePowerSupplyStatus is NOK. Current state is: {{ $labels.systemStatePowerSupplyStatusCombined }}. Check system power supply in iDRAC. |
| SNMP_DELL_iDRAC_systemStatePowerUnitStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStatePowerUnitStatus is NOK. Current state is: {{ $labels.systemStatePowerUnitStatusCombined }}. Check system power supply or external power delivery in iDRAC. |
| SNMP_DELL_iDRAC_systemStateProcessorDeviceStatusCombined_NOK | 5m | critical | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateProcessorDeviceStatus is NOK. Current state is: {{ $labels.systemStateProcessorDeviceStatusCombined }}. Check system processor in iDRAC. |
| SNMP_DELL_iDRAC_systemStateTemperatureStatisticsStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateTemperatureStatisticsStatus is NOK. Current state is: {{ $labels.systemStateTemperatureStatisticsStatusCombined }}. Check system temperatures in iDRAC. |
| SNMP_DELL_iDRAC_systemStateTemperatureStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateTemperatureStatus is NOK. Current state is: {{ $labels.systemStateTemperatureStatusCombined }}. Check system temperatures in iDRAC. |
| SNMP_DELL_iDRAC_systemStateVoltageStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateVoltageStatus is NOK. Current state is: {{ $labels.systemStateVoltageStatusCombined }}. Check system voltage in iDRAC. |
| SNMP_DELL_iDRAC_systemStateAmperageStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateAmperageStatus is NOK. Current state is: {{ $labels.systemStateAmperageStatusCombined }}. Check system voltage in iDRAC. |
| SNMP_DELL_iDRAC_controllerRollUpStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - controllerRollUpStatus is NOK for controllerNumber {{ $labels.controllerNumber }} ( {{ $labels.controllerName }}). Current state is: {{ $labels.controllerRollUpStatus }}. |
| SNMP_DELL_iDRAC_controllerComponentStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - controllerComponentStatus is NOK for controllerNumber {{ $labels.controllerNumber }} ( {{ $labels.controllerName }}). Current state is: {{ $labels.controllerComponentStatus }}. |
| SNMP_DELL_iDRAC_physicalDiskState_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - physicalDiskState is NOK for physicalDiskNumber {{ $labels.physicalDiskNumber }} ( {{ $labels.physicalDiskDisplayName }}). Current state is: {{ $labels.physicalDiskState }}. |
| SNMP_DELL_iDRAC_physicalDiskComponentStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - physicalDiskComponentStatus is NOK for physicalDiskNumber {{ $labels.physicalDiskNumber }} ( {{ $labels.physicalDiskDisplayName }}). Current state is: {{ $labels.physicalDiskComponentStatus }}. |
| SNMP_DELL_iDRAC_physicalDiskSmartAlertIndication_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - physicalDiskSmartAlertIndication is NOK for physicalDiskNumber {{ $labels.physicalDiskNumber }} ( {{ $labels.physicalDiskDisplayName }}). |
| SNMP_DELL_iDRAC_physicalDiskRemainingRatedWriteEndurance_WARNING | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - physicalDiskRemainingRatedWriteEndurance is less than 40 for physicalDiskNumber {{ $labels.physicalDiskNumber }} ( {{ $labels.physicalDiskDisplayName }}). Value: {{ humanize $value }} |
| SNMP_DELL_iDRAC_physicalDiskRemainingRatedWriteEndurance_CRITICAL | 5m | critical | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - physicalDiskRemainingRatedWriteEndurance is less than 20 for physicalDiskNumber {{ $labels.physicalDiskNumber }} ( {{ $labels.physicalDiskDisplayName }}). Value: {{ humanize $value }} |
| SNMP_DELL_iDRAC_virtualDiskState_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - virtualDiskState is NOK for virtualDiskNumber {{ $labels.virtualDiskNumber }} ( {{ $labels.virtualDiskDisplayName }}). Current state is: {{ $labels.virtualDiskState }}. |
| SNMP_DELL_iDRAC_virtualDiskComponentStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - virtualDiskComponentStatus is NOK for virtualDiskNumber {{ $labels.virtualDiskNumber }} ( {{ $labels.virtualDiskDisplayName }}). Current state is: {{ $labels.virtualDiskComponentStatus }}. |
| SNMP_DELL_iDRAC_virtualDiskBadBlocksDetected | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - virtualDiskBadBlocksDetected for virtualDiskNumber {{ $labels.virtualDiskNumber }} ( {{ $labels.virtualDiskDisplayName }}). |
alert-rules/allclusters/snmp-hp-cpq.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| SNMP_HP_CPQ_Overall_Health_NOK | 5m | critical | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - Overall health status is NOK. Value: "{{ $labels.cpqHeMibCondition }}" |
| SNMP_HP_CPQ_Event_Log_Condition_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - Event Log Condition is NOK. Value: "{{ $labels.cpqHeEventLogCondition }}"}} |
| SNMP_HP_CPQ_CPU_Health_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - CPU status is NOK. Value: "{{ $labels.cpqSeCpuCondition }}"}} |
| SNMP_HP_CPQ_Thermal_Condition_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - Thermal condition status is NOK. Value: "{{ $labels.cpqHeThermalCondition }}"}} |
| SNMP_HP_CPQ_Power_Supply_Condition_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ]- Power supply condition status is NOK. Value: "{{ $labels.cpqHeFltTolPwrSupplyCondition }}"}} |
| SNMP_HP_CPQ_Storage_Subsystem_Condition_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - Storage subsystem condition status is NOK. Value: "{{ $labels.cpqSsMibCondition }}"}} |
| SNMP_HP_CPQ_Controller_Overall_Condition_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - Controller "{{ $labels.cpqDaCntlrIndex }}"}} status is NOK. Value: "{{ $labels.cpqDaCntlrCondition }}"}}. This value represents the overall condition of this controller, and any associated logical drives, physical drives, and array accelerator. |
| SNMP_HP_CPQ_iLO_LicenseKey_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - HP iLO interface is missing its License activation. |
alert-rules/allclusters/snmp-lenovo-xcc.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| SNMP_Lenovo_XCC_systemHealthStat_NOK | 5m | critical | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemHealthStat is not "normal". Current state is: {{ $labels.systemHealthStat }} |
| SNMP_Lenovo_XCC_cpuVpdHealthStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - cpuVpdHealthStatus for CPU "{{ $labels.cpuVpdDescription }}" is not "normal". Current state is: {{ $labels.cpuVpdHealthStatus }} |
| SNMP_Lenovo_XCC_raidDriveHealthStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - raidDriveHealthStatus for "{{ $labels.raidDriveName }}" is not "Normal". Current state is: {{ $labels.raidDriveHealthStatus }} |
| SNMP_Lenovo_XCC_memoryHealthStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - memoryHealthStatus for DIMM "{{ $labels.memoryVpdDescription }}" is not "Normal". Current state is: {{ $labels.memoryHealthStatus }} |
| SNMP_Lenovo_XCC_fanHealthStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - fanHealthStatus for Fan "{{ $labels.fanDescr }}" is not "Normal". Current state is: {{ $labels.fanHealthStatus }} |
| SNMP_Lenovo_XCC_voltHealthStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - voltHealthStatus for System Component "{{ $labels.voltDescr }}" is not "Normal". Current state is: {{ $labels.voltHealthStatus }} |
alert-rules/management-cluster/flux.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| Flux_Kustomization_Failing | 15m | warning | deployment | Flux Kustomization "{{ $labels.name }}" in namespace "{{ $labels.exported_namespace }}" fails to reconcile. |
| Flux_Kustomization_Failing_Cluster | 60m | warning | deployment | Flux Kustomization "{{ $labels.name }}" in namespace "{{ $labels.exported_namespace }}" fails to reconcile. |
| Flux_HelmRelease_Failing | 15m | warning | deployment | Flux HelmRelease "{{ $labels.name }}" in namespace "{{ $labels.exported_namespace}}" fails to reconcile. |
| Flux_Source_Failing | 15m | warning | deployment | Flux Source "{{ $labels.name }}" in namespace "{{ $labels.exported_namespace}}" fails to reconcile. |
| Flux_Resource_Suspended | 2h | warning | deployment | Flux Resource "{{ $labels.name }}" in namespace "{{ $labels.exported_namespace }}" suspended. |
alert-rules/management-cluster/harbor.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| Harbor_Component_Status_NOK | 5m | warning | tools | Harbor component "{{ $labels.component }}" status is DOWN. |
alert-rules/management-cluster/health-alerts.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| KubeContainerWaitingManagement | 1h | critical | k8s | Pod "{{ $labels.namespace }}" / "{{ $labels.pod }}" has been in waiting state for more than 1 hour. |
alert-rules/management-cluster/minio.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| MinIO_Cluster_Health_Status_NOK | 5m | critical | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" health status not OK. |
| MinIO_Cluster_Health_Status_Unknown | 5m | critical | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" health status is Unknown. The cluster does not return cluster metrics. Check pods logs for error messages. |
| MinIO_Cluster_Disk_Offline | 5m | critical | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" disk offline. |
| MinIO_Cluster_Disk_Space_Usage | 5m | warning | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" available disk space is less than 30%. |
| MinIO_Cluster_Disk_Space_Usage | 5m | critical | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" available disk space is less than 10%. |
| MinIO_Cluster_Disk_Space_Will_Fill_Up_Soon | 5m | warning | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" at the current rate of utilization the available disk space will run out in the next 2 days. |
| MinIO_Cluster_Tolerance | 5m | critical | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" has lost quorum on pool "{{ $labels.pool }}" / set "{{ $labels.set }}" for more than 5 minutes. |
| MinIO_Nodes_Offline | 5m | warning | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" has offline nodes. |
alert-rules/management-cluster/thanos.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| ThanosQueryStoreEndpointsMissing | 5m | critical | monitoring | Thanos Query is missing "{{ $labels.store_type }}" store type. Metrics served by this store type will not be available which can lead to alerting rules not evaluating properly. |
| ThanosCompactMultipleRunning | 5m | warning | monitoring | More than one Thanos Compact instance is running. Current number of instances: {{ $value }}. |
| ThanosCompactHalted | 5m | warning | monitoring | Thanos Compact {{ $labels.job }} has failed to run and now is halted. |
| ThanosCompactHighCompactionFailures | 15m | warning | monitoring | Thanos Compact {{ $labels.job }} is failing to execute {{ $value |
| ThanosCompactBucketHighOperationFailures | 15m | warning | monitoring | Thanos Compact {{ $labels.job }} Bucket is failing to execute {{ $value |
| ThanosCompactHasNotRun | 5m | warning | monitoring | Thanos Compact {{ $labels.job }} has not uploaded anything for 24 hours. |
| ThanosQueryHttpRequestQueryErrorRateHigh | 5m | critical | monitoring | Thanos Query {{ $labels.job }} is failing to handle {{ $value |
| ThanosQueryHttpRequestQueryRangeErrorRateHigh | 5m | critical | monitoring | Thanos Query {{ $labels.job }} is failing to handle {{ $value |
| ThanosQueryGrpcServerErrorRate | 5m | warning | monitoring | Thanos Query {{ $labels.job }} is failing to handle {{ $value |
| ThanosQueryGrpcClientErrorRate | 5m | warning | monitoring | Thanos Query {{ $labels.job }} is failing to send {{ $value |
| ThanosQueryHighDNSFailures | 15m | warning | monitoring | Thanos Query {{ $labels.job }} have {{ $value |
| ThanosQueryInstantLatencyHigh | 10m | critical | monitoring | Thanos Query {{ $labels.job }} has a 99th percentile latency of {{ $value }} seconds for instant queries. |
| ThanosQueryRangeLatencyHigh | 10m | critical | monitoring | Thanos Query {{ $labels.job }} has a 99th percentile latency of {{ $value }} seconds for range queries. |
| ThanosQueryOverload | 15m | warning | monitoring | Thanos Query {{ $labels.job }} has been overloaded for more than 15 minutes. This may be a symptom of excessive simultanous complex requests, low performance of the Prometheus API, or failures within these components. Assess the health of the Thanos query instances, the connnected Prometheus instances, look for potential senders of these requests and then contact support. |
| ThanosReceiveHttpRequestErrorRateHigh | 5m | critical | monitoring | Thanos Receive {{ $labels.job }} is failing to handle {{ $value |
| ThanosReceiveHttpRequestLatencyHigh | 10m | critical | monitoring | Thanos Receive {{ $labels.job }} has a 99th percentile latency of {{ $value }} seconds for requests. |
| ThanosReceiveHighReplicationFailures | 5m | warning | monitoring | Thanos Receive {{ $labels.job }} is failing to replicate {{ $value |
| ThanosReceiveHighForwardRequestFailures | 5m | info | monitoring | Thanos Receive {{ $labels.job }} is failing to forward {{ $value |
| ThanosReceiveHighHashringFileRefreshFailures | 15m | warning | monitoring | Thanos Receive {{ $labels.job }} is failing to refresh hashring file, {{ $value |
| ThanosReceiveConfigReloadFailure | 5m | warning | monitoring | Thanos Receive {{ $labels.job }} has not been able to reload hashring configurations. |
| ThanosReceiveNoUpload | 3h | critical | monitoring | Thanos Receive {{ $labels.pod }} has not uploaded latest data to object storage. |
| ThanosReceiveLimitsConfigReloadFailure | 5m | warning | monitoring | Thanos Receive {{ $labels.job }} has not been able to reload the limits configuration. |
| ThanosReceiveLimitsHighMetaMonitoringQueriesFailureRate | 5m | warning | monitoring | Thanos Receive {{ $labels.job }} is failing for {{ $value |
| ThanosReceiveTenantLimitedByHeadSeries | 5m | warning | monitoring | Thanos Receive tenant {{ $labels.tenant }} is limited by head series. |
| ThanosStoreGrpcErrorRate | 5m | warning | monitoring | Thanos Store {{ $labels.job }} is failing to handle {{ $value |
| ThanosStoreSeriesGateLatencyHigh | 10m | warning | monitoring | Thanos Store {{ $labels.job }} has a 99th percentile latency of {{ $value }} seconds for store series gate requests. |
| ThanosStoreBucketHighOperationFailures | 15m | warning | monitoring | Thanos Store {{ $labels.job }} Bucket is failing to execute {{ $value |
| ThanosStoreObjstoreOperationLatencyHigh | 10m | warning | monitoring | Thanos Store {{ $labels.job }} Bucket has a 99th percentile latency of {{ $value }} seconds for the bucket operations. |
| ThanosRuleQueueIsDroppingAlerts | 5m | critical | monitoring | Thanos Rule {{ $labels.pod }} is failing to queue alerts. |
| ThanosRuleSenderIsFailingAlerts | 5m | critical | monitoring | Thanos Rule {{ $labels.pod }} is failing to send alerts to alertmanager. |
| ThanosRuleHighRuleEvaluationFailures | 5m | critical | monitoring | Thanos Rule {{ $labels.pod }} is failing to evaluate rules. |
| ThanosRuleHighRuleEvaluationWarnings | 15m | info | monitoring | Thanos Rule {{ $labels.pod }} has high number of evaluation warnings. |
| ThanosRuleRuleEvaluationLatencyHigh | 5m | warning | monitoring | Thanos Rule {{ labels.pod }} has higher evaluation latency than interval for {{labels.rule_group}}. |
| ThanosRuleGrpcErrorRate | 5m | warning | monitoring | Thanos Ruler {{ $labels.pod }} is failing to handle {{ $value |
| ThanosRuleConfigReloadFailure | 5m | info | monitoring | Thanos Ruler {{ $labels.pod }} has not been able to reload its configuration. |
| ThanosRuleQueryHighDNSFailures | 15m | warning | monitoring | Thanos Ruler {{ $labels.pod }} has {{ $value |
| ThanosRuleAlertmanagerHighDNSFailures | 15m | warning | monitoring | Thanos Rule {{ $labels.pod }} has {{ $value |
| ThanosRuleNoEvaluationFor10Intervals | 5m | info | monitoring | Thanos Ruler {{ $labels.pod }} has rule groups that did not evaluate for at least 10x of their expected interval. |
| ThanosNoRuleEvaluations | 5m | critical | monitoring | Thanos Ruler {{ $labels.pod }} did not perform any rule evaluations in the past 10 minutes. |
| ThanosBucketReplicateErrorRate | 5m | critical | monitoring | Thanos Replicate is failing to run, {{ $value |
| ThanosBucketReplicateRunLatency | 5m | critical | monitoring | Thanos Replicate {{ $labels.job }} has a 99th percentile latency of {{ $value }} seconds for the replicate operations. |
| ThanosCompactIsDown | 5m | critical | monitoring | Thanos Compact has disappeared. Prometheus target for the component cannot be discovered. |
| ThanosQueryIsDown | 5m | critical | monitoring | Thanos Query has disappeared. Prometheus target for the component cannot be discovered. |
| ThanosQueryFrontendIsDown | 5m | critical | monitoring | Thanos Query Frontend has disappeared. Prometheus target for the component cannot be discovered. |
| ThanosReceiveIsDown | 5m | critical | monitoring | Thanos Receive has disappeared. Prometheus target for the component cannot be discovered. |
| ThanosRuleIsDown | 5m | critical | monitoring | Thanos Ruler has disappeared. Prometheus target for the component cannot be discovered. |
| ThanosStoreIsDown | 5m | critical | monitoring | Thanos Store has disappeared. Prometheus target for the component cannot be discovered. |
alert-rules/my-workload-cluster/health-alerts.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| KubeJobFailedWorkload | 15m | warning | k8s | Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete. Removing failed job after investigation should clear this alert. |
v0.0.16: sylva-prometheus-rules: 0.0.16
Merge Requests integrated in this release
Monitoring & logging
- Add Thanos Query store type endpoint missing alert !91
CI
- Update dependency renovate-bot/renovate-runner to v21
  - Update dependency renovate-bot/renovate-runner to v18.89.2 !54 renovate
  - Update dependency renovate-bot/renovate-runner to v18.96.5 !57 renovate
  - Update dependency renovate-bot/renovate-runner to v18.96.7 !60 renovate
  - Update dependency renovate-bot/renovate-runner to v19 !61 renovate
  - Update dependency renovate-bot/renovate-runner to v19.10.1 !62 renovate
  - Update dependency renovate-bot/renovate-runner to v19.19.0 !63 renovate
  - Update dependency renovate-bot/renovate-runner to v19.28.1 !65 renovate
  - Update dependency renovate-bot/renovate-runner to v19.41.2 !67 renovate
  - Update dependency renovate-bot/renovate-runner to v19.49.2 !69 renovate
  - Update dependency renovate-bot/renovate-runner to v19.56.1 !71 renovate
  - Update dependency renovate-bot/renovate-runner to v19.60.0 !72 renovate
  - Update dependency renovate-bot/renovate-runner to v19.64.0 !73 renovate
  - Update dependency renovate-bot/renovate-runner to v19.77.0 !75 renovate
  - Update dependency renovate-bot/renovate-runner to v19.84.1 !76 renovate
  - Update dependency renovate-bot/renovate-runner to v19.94.0 !79 renovate
  - Update dependency renovate-bot/renovate-runner to v19.107.1 !82 renovate
  - Update dependency renovate-bot/renovate-runner to v19.111.4 !83 renovate
  - Update dependency renovate-bot/renovate-runner to v20 !84 renovate
  - Update dependency renovate-bot/renovate-runner to v20.1.0 !86 renovate
  - Update dependency renovate-bot/renovate-runner to v21 !94 renovate
- Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.39
  - Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.26 !55 renovate
  - Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.27 !56 renovate
  - Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.28 !58 renovate
  - Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.29 !59 renovate
  - Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.30 !64 renovate
  - Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.31 !66 renovate
  - Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.32 !70 renovate
  - Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.33 !74 renovate
  - Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.34 !77 renovate
  - Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.35 !78 renovate
  - Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.36 !80 renovate
  - Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.37 !85 renovate
  - Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.38 !87 renovate
  - Update dependency sylva-projects/sylva-elements/ci-tooling/ci-templates to v1.0.39 !89 renovate
Other
- Add Lenovo XCC rules to SNMP conditionals !68
- Add 'home' field to Chart.yaml !90
- Fix path in pre-commit-hook script !92
- Remove Longhorn rules !95
Contributors
Alin H, Bogdan Antohe, Stephen Oresanya
sylva-prometheus-rules
Generate PrometheusRule objects for consumption by Prometheus
Overview
There are two mechanisms that control which rules are deployed:
- createRules selects which directories are considered
- optional_rules selects which files in those directories are added to the ConfigMap
Rules overrides
.Values.createRules controls which cluster rules are checked; its keys represent the directories under alert-rules/.
If .Values.createRules.allclusters is true (the default), the alert-rules/allclusters/*.yaml rules are parsed last, regardless of which other clusters are specified.
This allows for rule overriding. For example, with the values:

createRules:
  allclusters: true
  management-cluster: true

and the following rule files in the chart:

alert-rules/allclusters/health-alerts.yaml
alert-rules/allclusters/dummy.yaml
alert-rules/management-cluster/flux.yaml
alert-rules/management-cluster/health-alerts.yaml
alert-rules/management-cluster/minio.yaml
- First, the PrometheusRule objects named flux, minio and health-alerts from management-cluster are created.
- Then health-alerts and dummy from allclusters are parsed. Since health-alerts has already been applied from management-cluster, it is not applied again; dummy is applied since it does not override anything.

In effect, this allows the user to override the health-alerts from allclusters with the health-alerts from management-cluster.
Rules activation
.Values.optional_rules controls which rules are enabled for optional components
Details about rules
alert-rules/allclusters/health-alerts.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| KubeJobFailedAllClusters | 15m | warning | k8s | Job "{{ $labels.namespace }}"/ "{{ $labels.job_name }}" failed to complete. Removing failed job after investigation should clear this alert. |
alert-rules/allclusters/snmp-dell-idrac.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| SNMP_DELL_iDRAC_globalSystemStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - globalSystemStatus is NOK. Current state is: {{ $labels.globalSystemStatus }} |
| SNMP_DELL_iDRAC_systemStateBatteryStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateBatteryStatus is NOK. Current state is: {{ $labels.systemStateBatteryStatusCombined }}. Check RAID Controller BBU or CMOS battery in iDRAC. |
| SNMP_DELL_iDRAC_systemStateCoolingDeviceStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateCoolingDeviceStatus is NOK. Current state is: {{ $labels.systemStateCoolingDeviceStatusCombined }}. Check system fans in iDRAC. |
| SNMP_DELL_iDRAC_systemStateCoolingUnitStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateCoolingDeviceStatus is NOK. Current state is: {{ $labels.systemStateCoolingUnitStatusCombined }}. Check system fans in iDRAC. |
| SNMP_DELL_iDRAC_systemStateMemoryDeviceStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateMemoryDeviceStatus is NOK. Current state is: {{ $labels.systemStateMemoryDeviceStatusCombined }}. Check system volatile memory in iDRAC. |
| SNMP_DELL_iDRAC_systemStatePowerSupplyStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStatePowerSupplyStatus is NOK. Current state is: {{ $labels.systemStatePowerSupplyStatusCombined }}. Check system power supply in iDRAC. |
| SNMP_DELL_iDRAC_systemStatePowerUnitStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStatePowerUnitStatus is NOK. Current state is: {{ $labels.systemStatePowerUnitStatusCombined }}. Check system power supply or external power delivery in iDRAC. |
| SNMP_DELL_iDRAC_systemStateProcessorDeviceStatusCombined_NOK | 5m | critical | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateProcessorDeviceStatus is NOK. Current state is: {{ $labels.systemStateProcessorDeviceStatusCombined }}. Check system processor in iDRAC. |
| SNMP_DELL_iDRAC_systemStateTemperatureStatisticsStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateTemperatureStatisticsStatus is NOK. Current state is: {{ $labels.systemStateTemperatureStatisticsStatusCombined }}. Check system temperatures in iDRAC. |
| SNMP_DELL_iDRAC_systemStateTemperatureStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateTemperatureStatus is NOK. Current state is: {{ $labels.systemStateTemperatureStatusCombined }}. Check system temperatures in iDRAC. |
| SNMP_DELL_iDRAC_systemStateVoltageStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateVoltageStatus is NOK. Current state is: {{ $labels.systemStateVoltageStatusCombined }}. Check system voltage in iDRAC. |
| SNMP_DELL_iDRAC_systemStateAmperageStatusCombined_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemStateAmperageStatus is NOK. Current state is: {{ $labels.systemStateAmperageStatusCombined }}. Check system voltage in iDRAC. |
| SNMP_DELL_iDRAC_controllerRollUpStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - controllerRollUpStatus is NOK for controllerNumber {{ $labels.controllerNumber }} ( {{ $labels.controllerName }}). Current state is: {{ $labels.controllerRollUpStatus }}. |
| SNMP_DELL_iDRAC_controllerComponentStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - controllerComponentStatus is NOK for controllerNumber {{ $labels.controllerNumber }} ( {{ $labels.controllerName }}). Current state is: {{ $labels.controllerComponentStatus }}. |
| SNMP_DELL_iDRAC_physicalDiskState_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - physicalDiskState is NOK for physicalDiskNumber {{ $labels.physicalDiskNumber }} ( {{ $labels.physicalDiskDisplayName }}). Current state is: {{ $labels.physicalDiskState }}. |
| SNMP_DELL_iDRAC_physicalDiskComponentStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - physicalDiskComponentStatus is NOK for physicalDiskNumber {{ $labels.physicalDiskNumber }} ( {{ $labels.physicalDiskDisplayName }}). Current state is: {{ $labels.physicalDiskComponentStatus }}. |
| SNMP_DELL_iDRAC_physicalDiskSmartAlertIndication_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - physicalDiskSmartAlertIndication is NOK for physicalDiskNumber {{ $labels.physicalDiskNumber }} ( {{ $labels.physicalDiskDisplayName }}). |
| SNMP_DELL_iDRAC_physicalDiskRemainingRatedWriteEndurance_WARNING | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - physicalDiskRemainingRatedWriteEndurance is less than 40 for physicalDiskNumber {{ $labels.physicalDiskNumber }} ( {{ $labels.physicalDiskDisplayName }}). Value: {{ humanize $value }} |
| SNMP_DELL_iDRAC_physicalDiskRemainingRatedWriteEndurance_CRITICAL | 5m | critical | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - physicalDiskRemainingRatedWriteEndurance is less than 20 for physicalDiskNumber {{ $labels.physicalDiskNumber }} ( {{ $labels.physicalDiskDisplayName }}). Value: {{ humanize $value }} |
| SNMP_DELL_iDRAC_virtualDiskState_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - virtualDiskState is NOK for virtualDiskNumber {{ $labels.virtualDiskNumber }} ( {{ $labels.virtualDiskDisplayName }}). Current state is: {{ $labels.virtualDiskState }}. |
| SNMP_DELL_iDRAC_virtualDiskComponentStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - virtualDiskComponentStatus is NOK for virtualDiskNumber {{ $labels.virtualDiskNumber }} ( {{ $labels.virtualDiskDisplayName }}). Current state is: {{ $labels.virtualDiskComponentStatus }}. |
| SNMP_DELL_iDRAC_virtualDiskBadBlocksDetected | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - virtualDiskBadBlocksDetected for virtualDiskNumber {{ $labels.virtualDiskNumber }} ( {{ $labels.virtualDiskDisplayName }}). |
alert-rules/allclusters/snmp-hp-cpq.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| SNMP_HP_CPQ_Overall_Health_NOK | 5m | critical | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - Overall health status is NOK. Value: "{{ $labels.cpqHeMibCondition }}" |
| SNMP_HP_CPQ_Event_Log_Condition_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - Event Log Condition is NOK. Value: "{{ $labels.cpqHeEventLogCondition }}"}} |
| SNMP_HP_CPQ_CPU_Health_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - CPU status is NOK. Value: "{{ $labels.cpqSeCpuCondition }}"}} |
| SNMP_HP_CPQ_Thermal_Condition_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - Thermal condition status is NOK. Value: "{{ $labels.cpqHeThermalCondition }}"}} |
| SNMP_HP_CPQ_Power_Supply_Condition_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ]- Power supply condition status is NOK. Value: "{{ $labels.cpqHeFltTolPwrSupplyCondition }}"}} |
| SNMP_HP_CPQ_Storage_Subsystem_Condition_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - Storage subsystem condition status is NOK. Value: "{{ $labels.cpqSsMibCondition }}"}} |
| SNMP_HP_CPQ_Controller_Overall_Condition_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - Controller "{{ $labels.cpqDaCntlrIndex }}"}} status is NOK. Value: "{{ $labels.cpqDaCntlrCondition }}"}}. This value represents the overall condition of this controller, and any associated logical drives, physical drives, and array accelerator. |
| SNMP_HP_CPQ_iLO_LicenseKey_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - HP iLO interface is missing its License activation. |
alert-rules/allclusters/snmp-lenovo-xcc.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| SNMP_Lenovo_XCC_systemHealthStat_NOK | 5m | critical | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - systemHealthStat is not "normal". Current state is: {{ $labels.systemHealthStat }} |
| SNMP_Lenovo_XCC_cpuVpdHealthStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - cpuVpdHealthStatus for CPU "{{ $labels.cpuVpdDescription }}" is not "normal". Current state is: {{ $labels.cpuVpdHealthStatus }} |
| SNMP_Lenovo_XCC_raidDriveHealthStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - raidDriveHealthStatus for "{{ $labels.raidDriveName }}" is not "Normal". Current state is: {{ $labels.raidDriveHealthStatus }} |
| SNMP_Lenovo_XCC_memoryHealthStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - memoryHealthStatus for DIMM "{{ $labels.memoryVpdDescription }}" is not "Normal". Current state is: {{ $labels.memoryHealthStatus }} |
| SNMP_Lenovo_XCC_fanHealthStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - fanHealthStatus for Fan "{{ $labels.fanDescr }}" is not "Normal". Current state is: {{ $labels.fanHealthStatus }} |
| SNMP_Lenovo_XCC_voltHealthStatus_NOK | 5m | warning | hardware | Target "{{ $labels.alias }}" [ cluster: "{{ $labels.cluster_name }}" / address: "{{ $labels.instance }}" ] - voltHealthStatus for System Component "{{ $labels.voltDescr }}" is not "Normal". Current state is: {{ $labels.voltHealthStatus }} |
alert-rules/management-cluster/flux.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| Flux_Kustomization_Failing | 15m | warning | deployment | Flux Kustomization "{{ $labels.name }}" in namespace "{{ $labels.exported_namespace }}" fails to reconcile. |
| Flux_Kustomization_Failing_Cluster | 60m | warning | deployment | Flux Kustomization "{{ $labels.name }}" in namespace "{{ $labels.exported_namespace }}" fails to reconcile. |
| Flux_HelmRelease_Failing | 15m | warning | deployment | Flux HelmRelease "{{ $labels.name }}" in namespace "{{ $labels.exported_namespace}}" fails to reconcile. |
| Flux_Source_Failing | 15m | warning | deployment | Flux Source "{{ $labels.name }}" in namespace "{{ $labels.exported_namespace}}" fails to reconcile. |
| Flux_Resource_Suspended | 2h | warning | deployment | Flux Resource "{{ $labels.name }}" in namespace "{{ $labels.exported_namespace }}" suspended. |
alert-rules/management-cluster/health-alerts.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| KubeContainerWaitingManagement | 1h | critical | k8s | Pod "{{ $labels.namespace }}" / "{{ $labels.pod }}" has been in waiting state for more than 1 hour. |
alert-rules/management-cluster/minio.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| MinIO_Cluster_Health_Status_NOK | 5m | critical | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" health status not OK. |
| MinIO_Cluster_Health_Status_Unknown | 5m | critical | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" health status is Unknown. The cluster does not return cluster metrics. Check pods logs for error messages. |
| MinIO_Cluster_Disk_Offline | 5m | critical | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" disk offline. |
| MinIO_Cluster_Disk_Space_Usage | 5m | warning | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" available disk space is less than 30%. |
| MinIO_Cluster_Disk_Space_Usage | 5m | critical | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" available disk space is less than 10%. |
| MinIO_Cluster_Disk_Space_Will_Fill_Up_Soon | 5m | warning | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" at the current rate of utilization the available disk space will run out in the next 2 days. |
| MinIO_Cluster_Tolerance | 5m | critical | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" has lost quorum on pool "{{ $labels.pool }}" / set "{{ $labels.set }}" for more than 5 minutes. |
| MinIO_Nodes_Offline | 5m | warning | storage | MinIO cluster "{{ $labels.minio_tenant }}" in namespace "{{ $labels.namespace }}" has offline nodes. |
alert-rules/management-cluster/thanos.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| ThanosQueryStoreEndpointsMissing | 5m | critical | monitoring | Thanos Query is missing "{{ $labels.store_type }}" store type. Metrics served by this store type will not be available which can lead to alerting rules not evaluating properly. |
| ThanosCompactMultipleRunning | 5m | warning | monitoring | More than one Thanos Compact instance is running. Current number of instances: {{ $value }}. |
| ThanosCompactHalted | 5m | warning | monitoring | Thanos Compact {{ $labels.job }} has failed to run and now is halted. |
| ThanosCompactHighCompactionFailures | 15m | warning | monitoring | Thanos Compact {{ $labels.job }} is failing to execute {{ $value |
| ThanosCompactBucketHighOperationFailures | 15m | warning | monitoring | Thanos Compact {{ $labels.job }} Bucket is failing to execute {{ $value |
| ThanosCompactHasNotRun | 5m | warning | monitoring | Thanos Compact {{ $labels.job }} has not uploaded anything for 24 hours. |
| ThanosQueryHttpRequestQueryErrorRateHigh | 5m | critical | monitoring | Thanos Query {{ $labels.job }} is failing to handle {{ $value |
| ThanosQueryHttpRequestQueryRangeErrorRateHigh | 5m | critical | monitoring | Thanos Query {{ $labels.job }} is failing to handle {{ $value |
| ThanosQueryGrpcServerErrorRate | 5m | warning | monitoring | Thanos Query {{ $labels.job }} is failing to handle {{ $value |
| ThanosQueryGrpcClientErrorRate | 5m | warning | monitoring | Thanos Query {{ $labels.job }} is failing to send {{ $value |
| ThanosQueryHighDNSFailures | 15m | warning | monitoring | Thanos Query {{ $labels.job }} have {{ $value |
| ThanosQueryInstantLatencyHigh | 10m | critical | monitoring | Thanos Query {{ $labels.job }} has a 99th percentile latency of {{ $value }} seconds for instant queries. |
| ThanosQueryRangeLatencyHigh | 10m | critical | monitoring | Thanos Query {{ $labels.job }} has a 99th percentile latency of {{ $value }} seconds for range queries. |
| ThanosQueryOverload | 15m | warning | monitoring | Thanos Query {{ $labels.job }} has been overloaded for more than 15 minutes. This may be a symptom of excessive simultanous complex requests, low performance of the Prometheus API, or failures within these components. Assess the health of the Thanos query instances, the connnected Prometheus instances, look for potential senders of these requests and then contact support. |
| ThanosReceiveHttpRequestErrorRateHigh | 5m | critical | monitoring | Thanos Receive {{ $labels.job }} is failing to handle {{ $value |
| ThanosReceiveHttpRequestLatencyHigh | 10m | critical | monitoring | Thanos Receive {{ $labels.job }} has a 99th percentile latency of {{ $value }} seconds for requests. |
| ThanosReceiveHighReplicationFailures | 5m | warning | monitoring | Thanos Receive {{ $labels.job }} is failing to replicate {{ $value |
| ThanosReceiveHighForwardRequestFailures | 5m | info | monitoring | Thanos Receive {{ $labels.job }} is failing to forward {{ $value |
| ThanosReceiveHighHashringFileRefreshFailures | 15m | warning | monitoring | Thanos Receive {{ $labels.job }} is failing to refresh hashring file, {{ $value |
| ThanosReceiveConfigReloadFailure | 5m | warning | monitoring | Thanos Receive {{ $labels.job }} has not been able to reload hashring configurations. |
| ThanosReceiveNoUpload | 3h | critical | monitoring | Thanos Receive {{ $labels.pod }} has not uploaded latest data to object storage. |
| ThanosReceiveLimitsConfigReloadFailure | 5m | warning | monitoring | Thanos Receive {{ $labels.job }} has not been able to reload the limits configuration. |
| ThanosReceiveLimitsHighMetaMonitoringQueriesFailureRate | 5m | warning | monitoring | Thanos Receive {{ $labels.job }} is failing for {{ $value |
| ThanosReceiveTenantLimitedByHeadSeries | 5m | warning | monitoring | Thanos Receive tenant {{ $labels.tenant }} is limited by head series. |
| ThanosStoreGrpcErrorRate | 5m | warning | monitoring | Thanos Store {{ $labels.job }} is failing to handle {{ $value |
| ThanosStoreSeriesGateLatencyHigh | 10m | warning | monitoring | Thanos Store {{ $labels.job }} has a 99th percentile latency of {{ $value }} seconds for store series gate requests. |
| ThanosStoreBucketHighOperationFailures | 15m | warning | monitoring | Thanos Store {{ $labels.job }} Bucket is failing to execute {{ $value |
| ThanosStoreObjstoreOperationLatencyHigh | 10m | warning | monitoring | Thanos Store {{ $labels.job }} Bucket has a 99th percentile latency of {{ $value }} seconds for the bucket operations. |
| ThanosRuleQueueIsDroppingAlerts | 5m | critical | monitoring | Thanos Rule {{ $labels.pod }} is failing to queue alerts. |
| ThanosRuleSenderIsFailingAlerts | 5m | critical | monitoring | Thanos Rule {{ $labels.pod }} is failing to send alerts to alertmanager. |
| ThanosRuleHighRuleEvaluationFailures | 5m | critical | monitoring | Thanos Rule {{ $labels.pod }} is failing to evaluate rules. |
| ThanosRuleHighRuleEvaluationWarnings | 15m | info | monitoring | Thanos Rule {{ $labels.pod }} has high number of evaluation warnings. |
| ThanosRuleRuleEvaluationLatencyHigh | 5m | warning | monitoring | Thanos Rule {{ labels.pod }} has higher evaluation latency than interval for {{labels.rule_group}}. |
| ThanosRuleGrpcErrorRate | 5m | warning | monitoring | Thanos Ruler {{ $labels.pod }} is failing to handle {{ $value |
| ThanosRuleConfigReloadFailure | 5m | info | monitoring | Thanos Ruler {{ $labels.pod }} has not been able to reload its configuration. |
| ThanosRuleQueryHighDNSFailures | 15m | warning | monitoring | Thanos Ruler {{ $labels.pod }} has {{ $value |
| ThanosRuleAlertmanagerHighDNSFailures | 15m | warning | monitoring | Thanos Rule {{ $labels.pod }} has {{ $value |
| ThanosRuleNoEvaluationFor10Intervals | 5m | info | monitoring | Thanos Ruler {{ $labels.pod }} has rule groups that did not evaluate for at least 10x of their expected interval. |
| ThanosNoRuleEvaluations | 5m | critical | monitoring | Thanos Ruler {{ $labels.pod }} did not perform any rule evaluations in the past 10 minutes. |
| ThanosBucketReplicateErrorRate | 5m | critical | monitoring | Thanos Replicate is failing to run, {{ $value |
| ThanosBucketReplicateRunLatency | 5m | critical | monitoring | Thanos Replicate {{ $labels.job }} has a 99th percentile latency of {{ $value }} seconds for the replicate operations. |
| ThanosCompactIsDown | 5m | critical | monitoring | Thanos Compact has disappeared. Prometheus target for the component cannot be discovered. |
| ThanosQueryIsDown | 5m | critical | monitoring | Thanos Query has disappeared. Prometheus target for the component cannot be discovered. |
| ThanosQueryFrontendIsDown | 5m | critical | monitoring | Thanos Query Frontend has disappeared. Prometheus target for the component cannot be discovered. |
| ThanosReceiveIsDown | 5m | critical | monitoring | Thanos Receive has disappeared. Prometheus target for the component cannot be discovered. |
| ThanosRuleIsDown | 5m | critical | monitoring | Thanos Ruler has disappeared. Prometheus target for the component cannot be discovered. |
| ThanosStoreIsDown | 5m | critical | monitoring | Thanos Store has disappeared. Prometheus target for the component cannot be discovered. |
alert-rules/my-workload-cluster/health-alerts.yaml
| Alert Name | For | Severity | Type | Description |
|---|---|---|---|---|
| KubeJobFailedWorkload | 15m | warning | k8s | Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete. Removing failed job after investigation should clear this alert. |
Configuration
- [ ] If you want to rebase/retry this MR, check this box
This MR has been generated by the Renovate Bot Sylva instance.
The CI configuration couldn't be handled in the MR description; a dedicated comment has been posted to control it.
If no checkbox is checked, a default pipeline will be enabled (capm3, or capo if the capo label is set).