
Revert "Merge branch '13973-upgrade-monitoring-helm-charts-non-gprd' into 'master'"

Steve Xuereb requested to merge revert-af9344dd into master

Background

In gitlab-com/gl-infra/production#5711 (closed) we upgraded the Prometheus operator from v0.42.1 to v0.50.0, which is almost 1 year of changes! We upgraded pre first, where everything seemed fine, then moved on to the rest of the environments, excluding gprd for now. The upgrades were successful apart from the ops environment, because Alertmanager is present in that environment (we only have 1 cluster of Alertmanagers).

Problem

When we looked at the operator logs with kubectl -n monitoring logs gitlab-monitoring-promethe-operator-7646fbfc78-qkwqr while connected to the ops-gitlab-gke cluster, we started seeing the following errors:

{"caller":"klog.go:116","component":"k8s_client_runtime","func":"ErrorDepth","level":"error","msg":"Sync \"monitoring/gitlab-monitoring-promethe-alertmanager\" failed: provision alertmanager configuration: base config from Secret could not be parsed: yaml: unmarshal errors:\n  line 441: field matchers not found in type alertmanager.route\n  line 450: field matchers not found in type alertmanager.route\n  line 459: field matchers not found in type alertmanager.route\n  line 468: field matchers not found in type alertmanager.route\n  line 477: field matchers not found in type alertmanager.route\n  line 486: field matchers not found in type alertmanager.route\n  line 495: field matchers not found in type alertmanager.route\n  line 504: field matchers not found in type alertmanager.route\n  line 513: field matchers not found in type alertmanager.route\n  line 522: field matchers not found in type alertmanager.route\n  line 531: field matchers not found in type alertmanager.route\n  line 540: field matchers not found in type alertmanager.route\n  line 549: field matchers not found in type alertmanager.route\n  line 558: field matchers not found in type alertmanager.route\n  line 567: field matchers not found in type alertmanager.route\n  line 576: field matchers not found in type alertmanager.route\n  line 584: field matchers not found in type alertmanager.route\n  line 592: field matchers not found in type alertmanager.route\n  line 600: field matchers not found in type alertmanager.route\n  line 608: field matchers not found in type alertmanager.route\n  line 616: field matchers not found in type alertmanager.route\n  line 622: field matchers not found in type alertmanager.route\n  line 629: field matchers not found in type alertmanager.route\n  line 635: field matchers not found in type alertmanager.route\n  line 641: field matchers not found in type alertmanager.route\n  line 649: field matchers not found in type alertmanager.route\n  line 656: field matchers not 
found in type alertmanager.route\n  line 663: field matchers not found in type alertmanager.route\n  line 670: field matchers not found in type alertmanager.route\n  line 675: field matchers not found in type alertmanager.route\n  line 680: field matchers not found in type alertmanager.route\n  line 685: field matchers not found in type alertmanager.route\n  line 690: field matchers not found in type alertmanager.route\n  line 695: field matchers not found in type alertmanager.route\n  line 700: field matchers not found in type alertmanager.route\n  line 705: field matchers not found in type alertmanager.route\n  line 710: field matchers not found in type alertmanager.route\n  line 715: field matchers not found in type alertmanager.route\n  line 720: field matchers not found in type alertmanager.route\n  line 725: field matchers not found in type alertmanager.route\n  line 730: field matchers not found in type alertmanager.route\n  line 735: field matchers not found in type alertmanager.route\n  line 740: field matchers not found in type alertmanager.route\n  line 745: field matchers not found in type alertmanager.route\n  line 750: field matchers not found in type alertmanager.route\n  line 755: field matchers not found in type alertmanager.route\n  line 760: field matchers not found in type alertmanager.route\n  line 765: field matchers not found in type alertmanager.route\n  line 770: field matchers not found in type alertmanager.route\n  line 775: field matchers not found in type alertmanager.route\n  line 780: field matchers not found in type alertmanager.route\n  line 785: field matchers not found in type alertmanager.route\n  line 790: field matchers not found in type alertmanager.route\n  line 795: field matchers not found in type alertmanager.route\n  line 800: field matchers not found in type alertmanager.route\n  line 804: field matchers not found in type alertmanager.route\n  line 808: field matchers not found in type alertmanager.route\n  line 812: 
field matchers not found in type alertmanager.route","ts":"2021-10-13T13:03:35.909536166Z"}
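Since the error lists one YAML line number per rejected route, it can help to pull those numbers out of the log entry and compare them against the decoded alertmanager.yaml. The following is a minimal sketch, not something we ran during the incident: the logEntry type and the offendingLines helper are hypothetical names, and the sample entry in main is abbreviated to the first two errors from the log above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strconv"
)

// logEntry models only the fields of the operator's JSON log line that we need.
// Fields not declared here (caller, component, ...) are ignored by json.Unmarshal.
type logEntry struct {
	Level string `json:"level"`
	Msg   string `json:"msg"`
	TS    string `json:"ts"`
}

// lineRe matches one "line N: field matchers not found" fragment of the message.
var lineRe = regexp.MustCompile(`line (\d+): field matchers not found`)

// offendingLines (hypothetical helper) returns the YAML line numbers that the
// operator rejected, in the order they appear in the log message.
func offendingLines(raw string) ([]int, error) {
	var e logEntry
	if err := json.Unmarshal([]byte(raw), &e); err != nil {
		return nil, err
	}
	var lines []int
	for _, m := range lineRe.FindAllStringSubmatch(e.Msg, -1) {
		n, err := strconv.Atoi(m[1])
		if err != nil {
			return nil, err
		}
		lines = append(lines, n)
	}
	return lines, nil
}

func main() {
	// Abbreviated copy of the log entry above (first two errors only).
	raw := `{"level":"error","msg":"yaml: unmarshal errors:\n  line 441: field matchers not found in type alertmanager.route\n  line 450: field matchers not found in type alertmanager.route","ts":"2021-10-13T13:03:35.909536166Z"}`

	lines, err := offendingLines(raw)
	if err != nil {
		fmt.Println("could not parse log entry:", err)
		return
	}
	fmt.Println("rejected YAML lines:", lines)
}
```

Every reported line should point at a matchers key inside a route, which is what the error text suggests.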

What is happening

@mwasilewski-gitlab and I ended up reading the operator's source code to understand what was happening, and we followed this call path:

  1. provision alertmanager configuration error.
  2. This calls loadCfg
  3. loadCfg does two things: it validates the Alertmanager configuration using the upstream code, and then tries to unmarshal it into its own data structure.

Now if we look at the error unmarshal errors:\n line 441: field matchers not found in type alertmanager.route, it means that Go tried to unmarshal the YAML string into a Go structure and hit a matchers field that the structure does not define.

  1. We checked whether the configuration is valid by getting the secret with kubectl -n monitoring get secrets alertmanager-config -o json | jq '.data["alertmanager.yaml"]' -r | base64 --decode and then running it through amtool, and it's valid.
  2. We looked at the Alertmanager Go struct and it seems to match the matchers inside our configuration.
  3. We looked at the operator's Alertmanager Go struct config and we don't see the matchers field there (BINGO). For some reason the operator has its own struct for the Alertmanager configuration.

So the problem here is that the Prometheus operator has its own understanding of what the Alertmanager configuration should look like, and it's actually different from what Alertmanager itself considers a valid configuration.
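The mismatch can be reproduced in a few lines of Go. This is only a sketch of the failure mode under stated assumptions, not the operator's actual code: it uses encoding/json with DisallowUnknownFields from the standard library as a stand-in for strict YAML unmarshalling, and the upstreamRoute and operatorRoute types, the strictDecode helper, and the sample receiver and matcher are all hypothetical.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// upstreamRoute stands in for Alertmanager's own route type, which declares "matchers".
type upstreamRoute struct {
	Receiver string   `json:"receiver"`
	Matchers []string `json:"matchers"`
}

// operatorRoute stands in for the operator's duplicated route type,
// which was written before "matchers" existed and does not declare it.
type operatorRoute struct {
	Receiver string `json:"receiver"`
}

// strictDecode (hypothetical helper) rejects any input field that the
// target struct does not declare, similar in spirit to a strict YAML unmarshal.
func strictDecode(data []byte, out interface{}) error {
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	return dec.Decode(out)
}

func main() {
	// Hypothetical route; the real configuration lives in the alertmanager-config secret.
	cfg := []byte(`{"receiver":"team-a","matchers":["severity=\"s1\""]}`)

	// The upstream-shaped struct accepts the route.
	var up upstreamRoute
	fmt.Println("upstream struct:", strictDecode(cfg, &up))

	// The duplicated struct rejects the same bytes with an unknown-field error.
	var op operatorRoute
	fmt.Println("operator struct:", strictDecode(cfg, &op))
}
```

The same bytes are valid for one struct and invalid for the other, which is why amtool accepts the configuration while the operator refuses to provision it.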

Why didn't this fail before

Looking at 0.42.1, this validation didn't exist at all for Alertmanager, so it never hit the problem of the operator having a different understanding of what a valid configuration looks like; it just provisioned whatever it was given.

Waiting for upstream patches

Pull Requests

Issues

Action Items

I would like to find out why the Prometheus operator is duplicating the config structure. If we can upstream a patch to use Alertmanager's config struct directly (it's already a dependency), this would solve future problems and make our upgrades more reliable. This is going to be done in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13973#note_703689478

Edited by Steve Xuereb
