CI: revisit job checking monitoring alerts

This MR refactors the `mgmt-thanos-alert` CI job.

The main evolution is that this job now checks the alerts that fired at any point during the run, instead of only looking at the alerts that are firing at the moment the CI job runs.
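To illustrate the difference, here is a minimal sketch using the Prometheus HTTP API (which Thanos Query also exposes); `THANOS_URL`, `RUN_START` and `RUN_END` are hypothetical placeholders, not the variables actually used by the job:

```shell
# Hypothetical sketch -- THANOS_URL, RUN_START and RUN_END are placeholders.

# Before: instant query, only sees alerts firing right now.
curl -sG "${THANOS_URL}/api/v1/query" \
  --data-urlencode 'query=ALERTS{alertstate="firing"}'

# After: range query over the whole run, sees any alert that fired at any point.
curl -sG "${THANOS_URL}/api/v1/query_range" \
  --data-urlencode 'query=ALERTS{alertstate="firing"}' \
  --data-urlencode "start=${RUN_START}" \
  --data-urlencode "end=${RUN_END}" \
  --data-urlencode 'step=60s'
```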

Other evolutions:

  • use the promql CLI tool to produce a nice output that lets us easily see at which moment an alert fired; example below (a sketch deriving a similar summary from the raw JSON is given after this list):

    ########################################################################################################################################
    # TIME_RANGE: Jan 13 21:31:09 -> Jan 13 21:31:09                                                                                                                                                                 #
    # METRIC: ALERTS{alertname="Thanos-Ruler_Rule_Evaluation_Failures_Rate_High", alertstate="firing", cluster="mgmt-2261038095-kubeadm-capm3-virt", job="thanos-ruler", platform_tag="Sylva", pod="thanos #
    ########################################################################################################################################
  • list all alerts, not only those with severity="critical"

  • produce artifacts with both the text output (example above) and the JSON output

  • do a Thanos sanity check by verifying that the Watchdog alert fired continuously over the last 30 minutes -- this is meant to avoid having this job give a misleading "success" result in a case where Thanos would be misbehaving or misconfigured (see the sketch after this list)
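The boxed output above is rendered by the promql CLI itself; purely as an illustration of what it summarizes, here is a hedged jq sketch that derives the same first-seen/last-seen range from a raw query_range response (`alerts.json` is a hypothetical file name):

```shell
# Hypothetical sketch: compute a TIME_RANGE-style summary from raw
# query_range JSON (alerts.json is a placeholder file name).
jq -r '
  .data.result[]
  # .values is a list of [timestamp, value] samples for each alert series
  | "\(.metric.alertname): \(.values | first | .[0] | floor | todate)" +
    " -> \(.values | last | .[0] | floor | todate)"
' alerts.json
```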
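For the Watchdog sanity check, something along the following lines would work; this is a sketch under the assumption of a 1-minute evaluation interval (hence roughly 30 expected samples in 30 minutes), not necessarily the job's actual implementation:

```shell
# Hypothetical sketch: the Watchdog alert is designed to fire permanently,
# so a gap in its samples over the last 30 minutes hints at a broken
# Thanos/Prometheus pipeline. Assumes a 1m evaluation interval (~30 samples).
samples=$(curl -sG "${THANOS_URL}/api/v1/query" \
  --data-urlencode 'query=count_over_time(ALERTS{alertname="Watchdog"}[30m])' \
  | jq -r '.data.result[0].value[1] // "0"')
if [ "${samples}" -lt 30 ]; then
  echo "Watchdog not firing continuously: only ${samples} samples in 30m"
  exit 1
fi
```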

What does not change (but might in some distant future): the failure criterion is the same as before, and only considers the alerts that are still firing at the end of a run. The job does not fail if a critical alert fired during the run but then stopped firing. The reason is that, if we changed the criterion to fail the job on any critical alert triggering during the run, the job would fail constantly: too many such alerts currently fire. We need to either fix or silence them before we can change the criterion.
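In other words, the failure criterion stays an instant query at the end of the run, restricted to critical alerts; a hedged sketch of that logic (again with hypothetical variable names):

```shell
# Hypothetical sketch of the unchanged failure criterion: fail only on
# critical alerts still firing at the end of the run.
critical=$(curl -sG "${THANOS_URL}/api/v1/query" \
  --data-urlencode 'query=ALERTS{alertstate="firing", severity="critical"}' \
  | jq '.data.result | length')
if [ "${critical}" -gt 0 ]; then
  echo "${critical} critical alert(s) still firing at end of run"
  exit 1
fi
```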

Example run: https://gitlab.com/sylva-projects/sylva-core/-/jobs/12718213016

This MR depends on:

This MR was initially done in the context of !6523.

CI configuration

Below you can choose test deployment variants to run in this MR's CI.

Click to open the CI configuration

Legend:

| Icon | Meaning | Available values |
| --- | --- | --- |
| ☁️ | Infra Provider | capd, capo, capm3 |
| 🚀 | Bootstrap Provider | kubeadm (alias kadm), rke2, okd, ck8s |
| 🐧 | Node OS | ubuntu, suse, na, leapmicro |
| 🛠️ | Deployment Options | light-deploy, dev-sources, ha, misc, maxsurge-0, logging, no-logging, cilium |
| 🎬 | Pipeline Scenarios | Available scenario list and description |
| 🟢 | Enabled units | Any available unit name; applies by default to both the management and workload clusters. Can be prefixed with mgmt: or wkld: to apply only to a specific cluster type |
| 🏗️ | Target platform | Can be used to select a specific deployment environment (e.g. real-bmh for capm3) |
  • 🎬 preview ☁️ capd 🚀 kadm 🐧 ubuntu

  • 🎬 preview ☁️ capo 🚀 rke2 🐧 suse

  • 🎬 preview ☁️ capm3 🚀 rke2 🐧 ubuntu

  • ☁️ capd 🚀 kadm 🛠️ light-deploy 🐧 ubuntu

  • ☁️ capd 🚀 rke2 🛠️ light-deploy 🐧 suse

  • ☁️ capo 🚀 rke2 🐧 suse

  • ☁️ capo 🚀 rke2 🐧 leapmicro

  • ☁️ capo 🚀 kadm 🐧 ubuntu

  • ☁️ capo 🚀 kadm 🐧 ubuntu 🟢 neuvector,mgmt:harbor

  • ☁️ capo 🚀 rke2 🎬 rolling-update 🛠️ ha 🐧 ubuntu

  • ☁️ capo 🚀 kadm 🎬 wkld-k8s-upgrade 🐧 ubuntu

  • ☁️ capo 🚀 rke2 🎬 rolling-update-no-wkld 🛠️ ha 🐧 suse

  • ☁️ capo 🚀 rke2 🎬 sylva-upgrade 🛠️ ha 🐧 ubuntu

  • ☁️ capo 🚀 rke2 🎬 sylva-upgrade-from-1.6.x 🛠️ ha,misc 🐧 ubuntu

  • ☁️ capo 🚀 rke2 🛠️ ha,misc 🐧 ubuntu

  • ☁️ capo 🚀 rke2 🛠️ ha,misc,openbao 🐧 suse

  • ☁️ capo 🚀 rke2 🐧 suse 🎬 upgrade-from-prev-tag

  • ☁️ capm3 🚀 rke2 🐧 suse

  • ☁️ capm3 🚀 kadm 🐧 ubuntu

  • ☁️ capm3 🚀 ck8s 🐧 ubuntu

  • ☁️ capm3 🚀 kadm 🎬 rolling-update-no-wkld 🛠️ ha,misc 🐧 ubuntu

  • ☁️ capm3 🚀 rke2 🎬 wkld-k8s-upgrade 🛠️ ha 🐧 suse

  • ☁️ capm3 🚀 kadm 🎬 rolling-update 🛠️ ha 🐧 ubuntu

  • ☁️ capm3 🚀 rke2 🎬 upgrade-from-prev-release-branch 🛠️ ha 🐧 suse

  • ☁️ capm3 🚀 rke2 🛠️ misc,ha 🐧 suse

  • ☁️ capm3 🚀 rke2 🎬 sylva-upgrade 🛠️ ha,misc 🐧 suse

  • ☁️ capm3 🚀 kadm 🎬 rolling-update 🛠️ ha 🐧 suse

  • ☁️ capm3 🚀 ck8s 🎬 rolling-update 🛠️ ha 🐧 ubuntu

  • ☁️ capm3 🚀 rke2|okd 🎬 no-update 🐧 ubuntu|na

  • ☁️ capm3 🚀 rke2 🐧 suse 🎬 upgrade-from-release-1.5

  • ☁️ capm3 🚀 rke2 🐧 suse 🎬 upgrade-to-main

Global config for deployment pipelines

  • autorun pipelines
  • allow failure on pipelines
  • record sylvactl events

Notes:

  • Enabling autorun makes deployment pipelines run automatically, without human interaction.
  • Disabling allow failure makes deployment pipelines mandatory for pipeline success.
  • If both autorun and allow failure are disabled, deployment pipelines need manual triggering but block the pipeline until they succeed.

Be aware: after a configuration change, the pipeline is not triggered automatically. Please run it manually (by clicking the "Run pipeline" button in the Pipelines tab) or push new code.
