2021-10-18: Update Prometheus Operator in pre

Production Change

Change Summary

Update the Prometheus Operator in the pre environment, as part of https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13973.

Change Details

  1. Services Impacted - Service::Prometheus
  2. Change Technician - @steveazz
  3. Change Reviewer - @igorwwwwwwwwwwwwwwwwwwww
  4. Time tracking - unknown
  5. Downtime Component - none

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 1

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 10

  • Upgrade the CRDs for the pre environment, inside pre-gitlab-gke:

    kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_alertmanagerconfigs.yaml
    kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_alertmanagers.yaml
    kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_podmonitors.yaml
    kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_probes.yaml
    kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_prometheuses.yaml
    kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_prometheusrules.yaml
    kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml
    kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/0a38647379a5e93f639bf8e634deabcc32e01fb6/example/prometheus-operator-crd/monitoring.coreos.com_thanosrulers.yaml
  • Re-run ops pipeline for gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!499 (merged)

  • Merge gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!499 (merged)

  • Wait for apply to finish. 👉 https://ops.gitlab.net/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles/-/pipelines/846860
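
The eight CRD manifests above differ only in the resource name, so the apply step could also be written as a loop. The following is a minimal sketch, not the procedure that was actually run: the helper name crd_urls and the APPLY switch are hypothetical, while the commit hash and CRD names are copied from the commands above. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch: generate the CRD manifest URLs from the pinned commit.
set -eu

# Commit hash and path copied from the change steps.
commit="0a38647379a5e93f639bf8e634deabcc32e01fb6"
base="https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/${commit}/example/prometheus-operator-crd"

# crd_urls (hypothetical helper) prints one manifest URL per line.
crd_urls() {
  for crd in alertmanagerconfigs alertmanagers podmonitors probes \
             prometheuses prometheusrules servicemonitors thanosrulers; do
    printf '%s/monitoring.coreos.com_%s.yaml\n' "$base" "$crd"
  done
}

# Dry run by default: print each command. Set APPLY=1 to run kubectl.
crd_urls | while read -r url; do
  if [ "${APPLY:-0}" = "1" ]; then
    kubectl apply -f "$url"
  else
    echo "kubectl apply -f $url"
  fi
done
```

Running it without APPLY set gives a list that can be compared against the commands in this issue before anything is applied to the cluster.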

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 2

  • Verify that all pods have restarted, apart from thanos, memcached, and gitaly-exporter: kubectl -n monitoring get po
  • Verify that the new operator version is running: kubectl -n monitoring get po gitlab-monitoring-promethe-operator-7dc8f7b879-4dk88 -o json | jq '.spec.containers[0].image' (expected value: ghcr.io/prometheus-operator/prometheus-operator:master@sha256:bb79240165868c7d73d3db2b45bd065bf2b3050729aa4809f6de79cace232feb)
  • Check the operator logs for error-level entries: kubectl -n monitoring logs gitlab-monitoring-promethe-operator-7dc8f7b879-4dk88 --since=5m. If there are many log lines, filter for errors: kubectl -n monitoring logs gitlab-monitoring-promethe-operator-7dc8f7b879-4dk88 --since=5m | grep 'err'
  • Verify that service discovery is working
    • Get DNS: prom=$(kubectl -n monitoring get svc prometheus-headless -o json | jq '.metadata.annotations["external-dns.alpha.kubernetes.io/hostname"]' -r)
    • Get metric: curl -s -L "$prom:9090/metrics" | grep 'scrape_pool_targets'
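
The service discovery check above leaves the grep output to be read by eye. One possible way to narrow it down is to print only the scrape pools that report zero targets. This is a sketch under assumptions: the helper name empty_pools is hypothetical, and it assumes each sample line ends with its value (no trailing timestamp), as in the plain-text output of the /metrics endpoint. A pool with zero targets is not necessarily a failure, but it is worth a look after an operator upgrade.

```shell
#!/bin/sh
# Sketch: list scrape pools with zero targets from Prometheus text-format metrics.
set -eu

# empty_pools (hypothetical helper) reads metrics on stdin and prints every
# non-comment scrape_pool_targets sample whose last field (the value) is 0.
empty_pools() {
  awk '!/^#/ && /scrape_pool_targets/ && $NF == 0 { print }'
}

# Only query Prometheus when $prom has been set, as in the "Get DNS" step.
if [ -n "${prom:-}" ]; then
  curl -s -L "$prom:9090/metrics" | empty_pools
fi
```

Empty output means that every scrape pool found in the metrics has at least one target.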

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 1

Monitoring

Key metrics to observe

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc.?

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • This Change Issue is linked to the appropriate Issue and/or Epic
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
  • There are currently no active incidents.