Install v1 BackendConfig CRD into the ops Kubernetes cluster

Production Change - Criticality 4 (C4)

Change Component | Description
Change Objective | Describe the objective of the change
Change Type | Operation
Services Impacted | List services
Change Team Members | @craigf, @mwasilewski-gitlab
Change Criticality | C4
Change Reviewer or tested in staging | @jarv to review
Dry-run output | If the change is done through a script, it is mandatory to have a dry-run capability in the script, run the change in dry-run mode and output the result
Due Date | Date and time in UTC timezone for the execution of the change, if possible add the local timezone of the engineer executing the change
Time tracking | To estimate and record times associated with changes (including a possible rollback)

Context

https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10731

Detailed steps for the change

  • Remove the monitoring releases from ops to remove instances of BackendConfig:
    • sshuttle -r console-01-sv-gstg.c.gitlab-staging-1.internal '104.196.154.62/32'
    • In another terminal, change your kube target to ops: kctx ops
    • helm tiller run helm delete gitlab-monitoring --purge
    • helm tiller run helm delete gitlab-monitoring-secrets --purge
  • Download the v1 CRD manifest from a healthy cluster (pre): kubectl --context pre get crd backendconfigs.cloud.google.com -o yaml
  • Tidy the manifest, removing server-populated fields that are disallowed on input (status, metadata.uid, etc.), and save it as tidied-crd-manifest.yaml
  • Targeting the ops cluster, apply the CRD:
    • sshuttle -r console-01-sv-gstg.c.gitlab-staging-1.internal '104.196.154.62/32'
    • In another terminal, change your kube target to ops: kctx ops
    • kubectl apply -f tidied-crd-manifest.yaml
  • Verify that the CRD now exists on the ops cluster: kubectl --context ops get crd backendconfigs.cloud.google.com -o yaml
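
The tidy step above could be scripted rather than done by hand. The following is a minimal sketch, not the procedure that was actually used. It assumes the manifest is exported with -o json instead of -o yaml, so that only the Python standard library is needed (kubectl apply also accepts JSON files). The list of fields to drop is an assumption based on "status, metadata.uid, etc"; the function names and file names are hypothetical.

```python
import copy
import json

# Server-populated metadata fields to strip before re-applying.
# This list is an assumption; extend it if the API server rejects other fields.
SERVER_SET_METADATA = (
    "uid",
    "resourceVersion",
    "creationTimestamp",
    "generation",
    "selfLink",
    "managedFields",
)


def tidy_manifest(manifest: dict) -> dict:
    """Return a copy of the manifest without status and server-set metadata."""
    tidied = copy.deepcopy(manifest)
    tidied.pop("status", None)
    metadata = tidied.get("metadata", {})
    for field in SERVER_SET_METADATA:
        metadata.pop(field, None)
    return tidied


def tidy_file(src: str, dst: str) -> None:
    """Read a JSON manifest from src and write the tidied version to dst."""
    with open(src) as f:
        manifest = json.load(f)
    with open(dst, "w") as f:
        json.dump(tidy_manifest(manifest), f, indent=2)
```

For example, after saving the output of kubectl --context pre get crd backendconfigs.cloud.google.com -o json to a hypothetical crd-from-pre.json, calling tidy_file("crd-from-pre.json", "tidied-crd-manifest.json") would produce a file that can be passed to kubectl apply -f. Reviewing the tidied file by eye before applying it is still advisable.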

Rollback steps

We could delete the CRD (kubectl delete crd backendconfigs.cloud.google.com), but I'm not sure under what circumstances that would be desirable. Given that this change issue is a rescue mission for the ops cluster, the next logical step might be to rebuild that cluster.

Monitoring

Key metrics to observe

  • Metric: Metric Name
    • Location: Dashboard URL
    • What changes to this metric should prompt a rollback: Describe Changes

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Changes checklist

  • Detailed steps and rollback steps have been filled prior to commencing work
  • SRE on-call has been informed prior to change being rolled out
  • There are currently no active incidents
Edited by Craig Furman