fix(prdsub): global env labels

Steve Xuereb requested to merge fix/prdsub-labels into master

Background

Currently we have the following prometheus configuration:

global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    cluster: prdsub-customers-gke
    env: prdsub
    environment: stgsub
    monitor: default
    prometheus: monitoring/gitlab-monitoring-promethe-prometheus
    prometheus_replica: prometheus-gitlab-monitoring-promethe-prometheus-0
    provider: gcp
    region: us-east1

Notice how env="prdsub" and environment="prdsub". This doesn't follow our label taxonomy in https://gitlab.com/gitlab-com/runbooks/-/tree/master/libsonnet/label-taxonomy, which assumes the value is going to be gstg or gprd.

This results in thanos-store and thanos-sidecar metrics being exposed with the wrong labels, as we see in gitlab-com/gl-infra&696 (comment 875284133).
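To make the mismatch concrete, here is a minimal sketch of the kind of check the taxonomy implies. The allowed set is an assumption taken from the sentence above (gstg or gprd); the actual taxonomy in the runbooks repository may accept other values, and `invalid_env_labels` is a hypothetical helper, not something that exists in runbooks.

```python
# Hypothetical check. ALLOWED_ENVS is assumed from the description above,
# not read from the runbooks label taxonomy.
ALLOWED_ENVS = {"gstg", "gprd"}


def invalid_env_labels(external_labels):
    """Return the env-related labels whose values are outside ALLOWED_ENVS."""
    return {
        key: value
        for key, value in external_labels.items()
        if key in ("env", "environment") and value not in ALLOWED_ENVS
    }


# Subset of the external_labels shown above, before and after the fix.
current = {"cluster": "prdsub-customers-gke", "env": "prdsub"}
fixed = {"cluster": "prdsub-customers-gke", "env": "gprd"}

print(invalid_env_labels(current))  # {'env': 'prdsub'}
print(invalid_env_labels(fixed))    # {}
```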

Solution

Specify monitoring_env explicitly for prdsub so that it is set to gprd, one of the values the label taxonomy expects.

This was tested in stgsub first in !629 (merged), where it worked as expected.

ref: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15442

Rollout

We can't simply merge this, since we have metamonitoring around the labels, as we learned in gitlab-com/gl-infra/production#6611 (closed). Follow these steps instead:

  1. Pause the deadmansnitch with the name Prometheus - GKE prdsub-customers-gke so we don't page the EOC.

  2. Merge this.

  3. Update the ALERTMANAGER_SECRETS_FILE variable in https://ops.gitlab.net/gitlab-com/runbooks/-/settings/ci_cd from

    { name: 'prdsub', apiKey: 'xxxx', cluster: 'prdsub-customers-gke'},

    to

    { name: 'gprd', apiKey: 'xxxx', cluster: 'prdsub-customers-gke'},

  4. Run a pipeline on the master branch in https://ops.gitlab.net/gitlab-com/runbooks/-/pipelines/new to update the Alertmanager configuration.

  5. Confirm that the Alertmanager configuration at https://alerts.gitlab.net/#/status is updated as shown below:

    - receiver: dead_mans_snitch_prdsub_prdsub-customers-gke
      matchers:
      - alertname="SnitchHeartBeat"
      - cluster="prdsub-customers-gke"
      - env="gprd"
      continue: false
      group_wait: 1m
      group_interval: 5m
      repeat_interval: 5m

  6. Confirm that a heartbeat was sent to deadmansnitch.

  7. Unpause the deadmansnitch.
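The reason the Alertmanager configuration has to change together with the label can be sketched as follows. This is a simplified model that handles equality matchers only (real Alertmanager also supports `!=`, `=~` and `!~`), and `route_matches` is a hypothetical helper written for illustration. It shows that a heartbeat still labelled env="prdsub" would not hit the updated route from step 5, while one labelled env="gprd" would.

```python
# Simplified model of Alertmanager equality matchers (name="value" only).
# route_matches is a hypothetical helper, not an Alertmanager API.
def route_matches(matchers, labels):
    """True if every equality matcher is satisfied by the alert's labels."""
    return all(labels.get(name) == value for name, value in matchers.items())


# Matchers from the updated route shown in step 5.
snitch_route = {
    "alertname": "SnitchHeartBeat",
    "cluster": "prdsub-customers-gke",
    "env": "gprd",
}

# Heartbeat alert labels before and after this change is merged.
before_merge = {
    "alertname": "SnitchHeartBeat",
    "cluster": "prdsub-customers-gke",
    "env": "prdsub",
}
after_merge = dict(before_merge, env="gprd")

print(route_matches(snitch_route, before_merge))  # False
print(route_matches(snitch_route, after_merge))   # True
```

While the labels and the route disagree, no heartbeat reaches the snitch, which is why it is paused in step 1 and only unpaused after step 6.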

Signed-off-by: Steve Azzopardi <sazzopardi@gitlab.com>

Edited by Steve Xuereb
