CI: revisit job checking monitoring alerts
This MR refactors the mgmt-thanos-alert.
The main evolution is to have this job check alerts that fired during the whole run instead of only looking at alerts that are currently failing at the moment where that CI job runs.
Other evolutions:
-
use the
promqlCLI tool to give a nice output letting us easily see at which moment an alert fired ; example######################################################################################################################################## # TIME_RANGE: Jan 13 21:31:09 -> Jan 13 21:31:09 # # METRIC: ALERTS{alertname="Thanos-Ruler_Rule_Evaluation_Failures_Rate_High", alertstate="firing", cluster="mgmt-2261038095-kubeadm-capm3-virt", job="thanos-ruler", platform_tag="Sylva", pod="thanos # ######################################################################################################################################## -
list all alerts not only the ones that are
severity="critical" -
produce artifacts both with the text output (example above) and the JSON output
-
do a thanos sanity check by checking that the
Watchdogalert failed continuously in the last 30 minutes -- this is meant to avoid having this job gives a misleading "success" result in a case where Thanos would be misbehaving of misconfigured
What does not change (but might in some distant future): the criteria for failing is the same as before, and considers only the alerts that are failing at the end of a run. The job does not fail if some critical alert fired during the run but stopped firing. The reason for this is that we would have this job constantly fail if we changed the criteria to fail the job on any critical alert triggering during the run; we have too many such alerts that fire. We need to either fix them or silent them before we can make the criteria change.
Example run: https://gitlab.com/sylva-projects/sylva-core/-/jobs/12718213016
This MR depends on:
-
sylva-projects/sylva-elements/container-images/ci-image!360 (merged) to use
promqlCLI tool
This MR was initially done in the context of !6523.
CI configuration
Below you can choose test deployment variants to run in this MR's CI.
Click to open to CI configuration
Legend:
| Icon | Meaning | Available values |
|---|---|---|
| Infra Provider |
capd, capo, capm3
|
|
| Bootstrap Provider |
kubeadm (alias kadm), rke2, okd, ck8s
|
|
| Node OS |
ubuntu, suse, na, leapmicro
|
|
| Deployment Options |
light-deploy, dev-sources, ha, misc, maxsurge-0, logging, no-logging, cilium
|
|
| Pipeline Scenarios | Available scenario list and description | |
| Enabled units | Any available units name, by default apply to management and workload cluster. Can be prefixed by mgmt: or wkld: to be applied only to a specific cluster type |
|
| Target platform | Can be used to select specific deployment environment (i.e real-bmh for capm3 ) |
-
🎬 preview☁️ capd🚀 kadm🐧 ubuntu -
🎬 preview☁️ capo🚀 rke2🐧 suse -
🎬 preview☁️ capm3🚀 rke2🐧 ubuntu -
☁️ capd🚀 kadm🛠️ light-deploy🐧 ubuntu -
☁️ capd🚀 rke2🛠️ light-deploy🐧 suse -
☁️ capo🚀 rke2🐧 suse -
☁️ capo🚀 rke2🐧 leapmicro -
☁️ capo🚀 kadm🐧 ubuntu -
☁️ capo🚀 kadm🐧 ubuntu🟢 neuvector,mgmt:harbor -
☁️ capo🚀 rke2🎬 rolling-update🛠️ ha🐧 ubuntu -
☁️ capo🚀 kadm🎬 wkld-k8s-upgrade🐧 ubuntu -
☁️ capo🚀 rke2🎬 rolling-update-no-wkld🛠️ ha🐧 suse -
☁️ capo🚀 rke2🎬 sylva-upgrade🛠️ ha🐧 ubuntu -
☁️ capo🚀 rke2🎬 sylva-upgrade-from-1.6.x🛠️ ha,misc🐧 ubuntu -
☁️ capo🚀 rke2🛠️ ha,misc🐧 ubuntu -
☁️ capo🚀 rke2🛠️ ha,misc,openbao🐧 suse -
☁️ capo🚀 rke2🐧 suse🎬 upgrade-from-prev-tag -
☁️ capm3🚀 rke2🐧 suse -
☁️ capm3🚀 kadm🐧 ubuntu -
☁️ capm3🚀 ck8s🐧 ubuntu -
☁️ capm3🚀 kadm🎬 rolling-update-no-wkld🛠️ ha,misc🐧 ubuntu -
☁️ capm3🚀 rke2🎬 wkld-k8s-upgrade🛠️ ha🐧 suse -
☁️ capm3🚀 kadm🎬 rolling-update🛠️ ha🐧 ubuntu -
☁️ capm3🚀 rke2🎬 upgrade-from-prev-release-branch🛠️ ha🐧 suse -
☁️ capm3🚀 rke2🛠️ misc,ha🐧 suse -
☁️ capm3🚀 rke2🎬 sylva-upgrade🛠️ ha,misc🐧 suse -
☁️ capm3🚀 kadm🎬 rolling-update🛠️ ha🐧 suse -
☁️ capm3🚀 ck8s🎬 rolling-update🛠️ ha🐧 ubuntu -
☁️ capm3🚀 rke2|okd🎬 no-update🐧 ubuntu|na -
☁️ capm3🚀 rke2🐧 suse🎬 upgrade-from-release-1.5 -
☁️ capm3🚀 rke2🐧 suse🎬 upgrade-to-main
Global config for deployment pipelines
- autorun pipelines
- allow failure on pipelines
- record sylvactl events
Notes:
- Enabling
autorunwill make deployment pipelines to be run automatically without human interaction - Disabling
allow failurewill make deployment pipelines mandatory for pipeline success. - if both
autorunandallow failureare disabled, deployment pipelines will need manual triggering but will be blocking the pipeline
Be aware: after configuration change, pipeline is not triggered automatically.
Please run it manually (by clicking the run pipeline button in Pipelines tab) or push new code.