Enable monitoring and Grafana BGP dashboard for MetalLB frr-k8s mode

What does this MR do and why?

Context

Currently, MetalLB monitoring and alerting are only fully supported in native BGP mode. When MetalLB runs in frr-k8s mode, BGP metrics are exposed with a different prefix (frrk8s_bgp_*), so the existing dashboards and alert rules miss the relevant data. Additionally, ServiceMonitor resources can cause deployment errors if the required CRDs are not present.

Goals

  • Ensure that Prometheus can scrape BGP metrics from frr-k8s pods.
  • Adapt the MetalLB Grafana dashboard to visualize BGP metrics from both native and frr-k8s modes.
  • Update Thanos/Prometheus alert rules to support both metric types.
  • Prevent ServiceMonitor deployment errors in clusters without the required CRDs.

Tasks

  1. Prometheus Integration:
    Configure Prometheus to scrape BGP-related metrics from frr-k8s pods by setting the appropriate Helm values.
  2. Grafana Dashboard:
    Update the MetalLB Grafana dashboard to support both metallb_bgp_session_up and frrk8s_bgp_session_up metrics.
  3. Alert Rules:
    Patch the MetalLB BGP alert rules in Thanos/Prometheus to include both metric types, ensuring alerts fire in both modes.
  4. ServiceMonitor Handling:
    Adapt Helm values and deployment logic to disable ServiceMonitor creation when the CRD is not present, avoiding installation errors.
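
Task 3 can be illustrated with a rule that accepts either metric as the session signal. This is a minimal sketch, not the rule shipped in this MR: the resource name, rule namespace, group name, alert name, `for` duration, and severity label are all hypothetical, and only the two metric names come from the tasks above.

```yaml
# Hypothetical PrometheusRule; every name and threshold here is an assumption.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: metallb-bgp-sessions            # hypothetical resource name
  namespace: cattle-monitoring-system   # assumed; use the namespace your Prometheus watches
spec:
  groups:
    - name: metallb-bgp                 # hypothetical group name
      rules:
        - alert: MetalLBBGPSessionDown  # hypothetical alert name
          # Matches a session reporting 0 under either metric prefix,
          # so the alert can fire in native and in frr-k8s mode.
          expr: (metallb_bgp_session_up == 0) or (frrk8s_bgp_session_up == 0)
          for: 5m                       # assumed grace period
          labels:
            severity: warning           # assumed severity
          annotations:
            summary: A MetalLB BGP session is down
```

The same `or` expression without the `== 0` filter could plausibly back a Grafana panel that shows session state in either mode.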

Acceptance Criteria

  • BGP metrics from frr-k8s are visible in Prometheus and Grafana.
  • The MetalLB dashboard displays BGP session status for both native and frr-k8s modes.
  • BGP alert rules trigger correctly for both metric types.
  • No ServiceMonitor-related errors occur during deployment in clusters without the CRD.

Closes #3896

Related to:

Test coverage

Tested manually in the Grafana, Prometheus, and Thanos UIs (accessed from a Windows VM) and in the capo CI pipeline. A fresh deployment with these changes applied no longer gets stuck on the first node being deployed.

Example of values to enable frr-k8s:

metallb:
  bgp_lbs:
    l3_options:
      bgp_peers:
        ext-router1:
          local_asn: 64513
          peer_asn: 64513
          peer_address: 172.20.219.241
          advertised_pools:
            - pool1
          receive_routes:
            mode: all
    address_pools:
      pool1:
        addresses:
          - 192.168.1.1-192.168.1.2

Tested in CI with the misc deployment option, which supplies the metallb values:

crustgather-job-14266725387 ~> flux debug hr metallb --show-values | yq .frr-k8s.prometheus
namespace: cattle-monitoring-system
rbacPrometheus: true
rbacProxy:
  repository: quay.io/brancz/kube-rbac-proxy
  tag: v0.18.1
serviceAccount: rancher-monitoring-prometheus
serviceMonitor:
  enabled: true
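
The key path printed above suggests how the ServiceMonitor could be switched off on a cluster that lacks the CRD. A sketch, assuming the frr-k8s subchart values sit under `frr-k8s.prometheus` exactly as in that output; whether this MR sets the flag this way or derives it automatically is not shown here.

```yaml
# Assumed override for clusters without the ServiceMonitor CRD.
# The key path mirrors the `flux debug hr metallb --show-values` output.
frr-k8s:
  prometheus:
    serviceMonitor:
      enabled: false   # skip ServiceMonitor creation
```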

CI configuration

Below you can choose test deployment variants to run in this MR's CI.

Legend:

Icon | Meaning | Available values
--- | --- | ---
☁️ | Infra Provider | capd, capo, capm3
🚀 | Bootstrap Provider | kubeadm (alias kadm), rke2, okd, ck8s
🐧 | Node OS | ubuntu, suse, na, leapmicro
🛠️ | Deployment Options | Deployment option list and description
🎬 | Pipeline Scenarios | Available scenario list and description
🟢 | Enabled units | Any available unit name; applies to both management and workload clusters by default. Can be prefixed with mgmt: or wkld: to apply only to a specific cluster type
🔴 | Disabled units | Any available unit name; applies to both management and workload clusters by default. Can be prefixed with mgmt: or wkld: to apply only to a specific cluster type
🏗️ | Target platform | Can be used to select a specific deployment environment. Available platform list and description

  • 🎬 preview ☁️ capd 🚀 kadm 🐧 ubuntu

  • 🎬 preview ☁️ capo 🚀 rke2 🐧 suse

  • 🎬 preview ☁️ capm3 🚀 rke2 🐧 ubuntu

  • ☁️ capd 🚀 kadm 🛠️ light-deploy 🐧 ubuntu

  • ☁️ capd 🚀 rke2 🛠️ light-deploy 🐧 suse

  • ☁️ capo 🚀 rke2 🐧 suse 🛠️ misc

  • ☁️ capo 🚀 rke2 🐧 leapmicro

  • ☁️ capo 🚀 kadm 🐧 ubuntu

  • ☁️ capo 🚀 kadm 🐧 ubuntu 🟢 neuvector,mgmt:harbor

  • ☁️ capo 🚀 rke2 🎬 rolling-update 🛠️ ha,misc 🐧 ubuntu

  • ☁️ capo 🚀 kadm 🎬 wkld-k8s-upgrade 🐧 ubuntu

  • ☁️ capo 🚀 rke2 🎬 rolling-update-no-wkld 🛠️ ha 🐧 suse

  • ☁️ capo 🚀 rke2 🎬 sylva-upgrade 🛠️ ha 🐧 ubuntu

  • ☁️ capo 🚀 rke2 🎬 sylva-upgrade-from-1.6.x 🛠️ ha,misc 🐧 ubuntu

  • ☁️ capo 🚀 rke2 🛠️ ha,misc 🐧 ubuntu

  • ☁️ capo 🚀 rke2 🛠️ misc 🐧 ubuntu 🟢 mgmt:harbor 🔴 neuvector

  • ☁️ capo 🚀 rke2 🛠️ ha,misc,openbao 🐧 suse

  • ☁️ capo 🚀 rke2 🐧 suse 🎬 upgrade-from-prev-tag

  • ☁️ capm3 🚀 rke2 🐧 suse

  • ☁️ capm3 🚀 kadm 🐧 ubuntu

  • ☁️ capm3 🚀 ck8s 🐧 ubuntu

  • ☁️ capm3 🚀 kadm 🎬 rolling-update-no-wkld 🛠️ ha,misc 🐧 ubuntu

  • ☁️ capm3 🚀 rke2 🎬 wkld-k8s-upgrade 🛠️ ha 🐧 suse

  • ☁️ capm3 🚀 kadm 🎬 rolling-update 🛠️ ha 🐧 ubuntu

  • ☁️ capm3 🚀 rke2 🎬 upgrade-from-prev-release-branch 🛠️ ha 🐧 suse

  • ☁️ capm3 🚀 rke2 🛠️ misc,ha 🐧 suse

  • ☁️ capm3 🚀 rke2 🎬 sylva-upgrade 🛠️ ha,misc 🐧 suse

  • ☁️ capm3 🚀 kadm 🎬 rolling-update 🛠️ ha 🐧 suse

  • ☁️ capm3 🚀 ck8s 🎬 rolling-update 🛠️ ha 🐧 ubuntu

  • ☁️ capm3 🚀 rke2|okd 🎬 no-update 🐧 ubuntu|na

  • ☁️ capm3 🚀 rke2 🐧 suse 🎬 upgrade-from-release-1.5

  • ☁️ capm3 🚀 rke2 🐧 suse 🎬 upgrade-to-main

Global config for deployment pipelines

  • autorun pipelines
  • allow failure on pipelines
  • record sylvactl events

Notes:

  • Enabling autorun makes deployment pipelines run automatically, without human interaction.
  • Disabling allow failure makes deployment pipelines mandatory for pipeline success.
  • If both autorun and allow failure are disabled, deployment pipelines need manual triggering but still block the pipeline.

Be aware: after a configuration change, the pipeline is not triggered automatically. Run it manually (by clicking the run pipeline button in the Pipelines tab) or push new code.

Edited by Andra-Simona Delicostea
