Enable monitoring and Grafana BGP dashboard for MetalLB frr-k8s mode
What does this MR do and why?
Context
Currently, MetalLB monitoring and alerting are only fully supported in native BGP mode. When MetalLB runs in frr-k8s mode, BGP metrics are exposed with a different prefix (frrk8s_bgp_*), so the existing dashboards and alert rules miss the relevant data. Additionally, ServiceMonitor resources can cause deployment errors if the required CRDs are not present.
Goals
- Ensure that Prometheus can scrape BGP metrics from frr-k8s pods.
- Adapt the MetalLB Grafana dashboard to visualize BGP metrics from both native and frr-k8s modes.
- Update Thanos/Prometheus alert rules to support both metric types.
- Prevent ServiceMonitor deployment errors in clusters without the required CRDs.
Tasks
- Prometheus Integration: Configure Prometheus to scrape BGP-related metrics from frr-k8s pods by setting the appropriate Helm values.
- Grafana Dashboard: Update the MetalLB Grafana dashboard to support both metallb_bgp_session_up and frrk8s_bgp_session_up metrics.
- Alert Rules: Patch the MetalLB BGP alert rules in Thanos/Prometheus to include both metric types, ensuring alerts fire in both modes.
- ServiceMonitor Handling: Adapt Helm values and deployment logic to disable ServiceMonitor creation when the CRD is not present, avoiding installation errors.
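As an illustration of the alert-rule task, here is a minimal sketch of a rule that covers both metric names. It is not the rule shipped in sylva-thanos-rules: the resource name, namespace, alert name, duration, severity, and the `peer` label used in the annotation are all assumptions made for the example. Only the two metric names come from this MR.

```yaml
# Hypothetical PrometheusRule: names, namespace, duration and labels are
# illustrative and are not taken from sylva-thanos-rules.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: metallb-bgp-session-example
  namespace: metallb-system
spec:
  groups:
    - name: metallb-bgp.example
      rules:
        - alert: MetalLBBGPSessionDownExample
          # PromQL "or" returns the union of both result vectors, so the
          # alert matches native-mode series (metallb_*) as well as
          # frr-k8s-mode series (frrk8s_*).
          expr: |
            (metallb_bgp_session_up == 0)
            or
            (frrk8s_bgp_session_up == 0)
          for: 5m
          labels:
            severity: warning
          annotations:
            # Assumes both metrics carry a "peer" label.
            summary: "BGP session to {{ $labels.peer }} is down"
```

The same union without the `== 0` comparison (`metallb_bgp_session_up or frrk8s_bgp_session_up`) is one way a dashboard panel could show session state in either mode.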
Acceptance Criteria
- BGP metrics from frr-k8s are visible in Prometheus and Grafana.
- The MetalLB dashboard displays BGP session status for both native and frr-k8s modes.
- BGP alert rules trigger correctly for both metric types.
- No ServiceMonitor-related errors occur during deployment in clusters without the CRD.
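For the last criterion, a sketch of what the resolved MetalLB chart values might look like on a cluster without the Prometheus Operator CRDs. The key path mirrors the frr-k8s.prometheus block shown in the test output further down. How sylva-core decides when to flip this value is not shown here and is not assumed.

```yaml
# Illustrative values fragment for the metallb Helm release.
# Assumption: on a cluster without the monitoring.coreos.com CRDs,
# ServiceMonitor creation is turned off through this key.
frr-k8s:
  prometheus:
    serviceMonitor:
      enabled: false
```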
Related reference(s)
Closes #3896 (closed)
Related to:
- sylva-projects/sylva-elements/helm-charts/sylva-dashboards!146 (merged)
- sylva-projects/sylva-elements/helm-charts/sylva-thanos-rules!134 (merged)
Test coverage
Tested through the Grafana, Prometheus, and Thanos UIs from a Windows VM, and in the capo CI pipeline. A fresh deployment with these changes applied no longer gets stuck at the first node being deployed.
Example of values to enable frr-k8s:

    metallb:
      bgp_lbs:
        l3_options:
          bgp_peers:
            ext-router1:
              local_asn: 64513
              peer_asn: 64513
              peer_address: 172.20.219.241
              advertised_pools:
                - pool1
              receive_routes:
                mode: all
        address_pools:
          pool1:
            addresses:
              - 192.168.1.1-192.168.1.2

Testing in the CI with the misc deployment option for the metallb values:

    crustgather-job-14266725387 ~> flux debug hr metallb --show-values | yq .frr-k8s.prometheus
    namespace: cattle-monitoring-system
    rbacPrometheus: true
    rbacProxy:
      repository: quay.io/brancz/kube-rbac-proxy
      tag: v0.18.1
    serviceAccount: rancher-monitoring-prometheus
    serviceMonitor:
      enabled: true

CI configuration
Below you can choose test deployment variants to run in this MR's CI.
Legend:
| Icon | Meaning | Available values |
|---|---|---|
| ☁️ | Infra Provider | capd, capo, capm3 |
| 🚀 | Bootstrap Provider | kubeadm (alias kadm), rke2, okd, ck8s |
| 🐧 | Node OS | ubuntu, suse, na, leapmicro |
| 🛠️ | Deployment Options | Deployment option list and description |
| 🎬 | Pipeline Scenarios | Available scenario list and description |
| 🟢 | Enabled units | Any available unit name; by default applies to both management and workload clusters. Can be prefixed by mgmt: or wkld: to apply only to a specific cluster type |
| 🔴 | Disabled units | Any available unit name; by default applies to both management and workload clusters. Can be prefixed by mgmt: or wkld: to apply only to a specific cluster type |
| | Target platform | Can be used to select a specific deployment environment. Available platform list and description |
- 🎬 preview ☁️ capd 🚀 kadm 🐧 ubuntu
- 🎬 preview ☁️ capo 🚀 rke2 🐧 suse
- 🎬 preview ☁️ capm3 🚀 rke2 🐧 ubuntu
- ☁️ capd 🚀 kadm 🛠️ light-deploy 🐧 ubuntu
- ☁️ capd 🚀 rke2 🛠️ light-deploy 🐧 suse
- ☁️ capo 🚀 rke2 🐧 suse 🛠️ misc
- ☁️ capo 🚀 rke2 🐧 leapmicro
- ☁️ capo 🚀 kadm 🐧 ubuntu
- ☁️ capo 🚀 kadm 🐧 ubuntu 🟢 neuvector,mgmt:harbor
- ☁️ capo 🚀 rke2 🎬 rolling-update 🛠️ ha,misc 🐧 ubuntu
- ☁️ capo 🚀 kadm 🎬 wkld-k8s-upgrade 🐧 ubuntu
- ☁️ capo 🚀 rke2 🎬 rolling-update-no-wkld 🛠️ ha 🐧 suse
- ☁️ capo 🚀 rke2 🎬 sylva-upgrade 🛠️ ha 🐧 ubuntu
- ☁️ capo 🚀 rke2 🎬 sylva-upgrade-from-1.6.x 🛠️ ha,misc 🐧 ubuntu
- ☁️ capo 🚀 rke2 🛠️ ha,misc 🐧 ubuntu
- ☁️ capo 🚀 rke2 🛠️ misc 🐧 ubuntu 🟢 mgmt:harbor 🔴 neuvector
- ☁️ capo 🚀 rke2 🛠️ ha,misc,openbao 🐧 suse
- ☁️ capo 🚀 rke2 🐧 suse 🎬 upgrade-from-prev-tag
- ☁️ capm3 🚀 rke2 🐧 suse
- ☁️ capm3 🚀 kadm 🐧 ubuntu
- ☁️ capm3 🚀 ck8s 🐧 ubuntu
- ☁️ capm3 🚀 kadm 🎬 rolling-update-no-wkld 🛠️ ha,misc 🐧 ubuntu
- ☁️ capm3 🚀 rke2 🎬 wkld-k8s-upgrade 🛠️ ha 🐧 suse
- ☁️ capm3 🚀 kadm 🎬 rolling-update 🛠️ ha 🐧 ubuntu
- ☁️ capm3 🚀 rke2 🎬 upgrade-from-prev-release-branch 🛠️ ha 🐧 suse
- ☁️ capm3 🚀 rke2 🛠️ misc,ha 🐧 suse
- ☁️ capm3 🚀 rke2 🎬 sylva-upgrade 🛠️ ha,misc 🐧 suse
- ☁️ capm3 🚀 kadm 🎬 rolling-update 🛠️ ha 🐧 suse
- ☁️ capm3 🚀 ck8s 🎬 rolling-update 🛠️ ha 🐧 ubuntu
- ☁️ capm3 🚀 rke2|okd 🎬 no-update 🐧 ubuntu|na
- ☁️ capm3 🚀 rke2 🐧 suse 🎬 upgrade-from-release-1.5
- ☁️ capm3 🚀 rke2 🐧 suse 🎬 upgrade-to-main
Global config for deployment pipelines
- autorun pipelines
- allow failure on pipelines
- record sylvactl events
Notes:
- Enabling autorun makes deployment pipelines run automatically, without human interaction.
- Disabling allow failure makes deployment pipelines mandatory for pipeline success.
- If both autorun and allow failure are disabled, deployment pipelines need manual triggering but block the pipeline.
Be aware: after configuration change, pipeline is not triggered automatically.
Please run it manually (by clicking the run pipeline button in Pipelines tab) or push new code.