# Install v1 BackendConfig CRD into the ops Kubernetes cluster

Production Change - Criticality 4 (C4)
| Change Component | Description |
|---|---|
| Change Objective | Install the v1 `BackendConfig` CRD into the ops Kubernetes cluster |
| Change Type | Operation |
| Services Impacted | Monitoring on the ops cluster (the `gitlab-monitoring` Helm releases) |
| Change Team Members | @craigf, @mwasilewski-gitlab |
| Change Criticality | C4 |
| Change Reviewer or tested in staging | @jarv to review |
| Dry-run output | If the change is done through a script, it is mandatory to have a dry-run capability in the script, run the change in dry-run mode and output the result |
| Due Date | Date and time in UTC timezone for the execution of the change, if possible add the local timezone of the engineer executing the change |
| Time tracking | To estimate and record times associated with changes (including a possible rollback) |
## Context
https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10731
## Detailed steps for the change
- Remove the monitoring releases from ops to remove instances of `BackendConfig`:
  - `sshuttle -r console-01-sv-gstg.c.gitlab-staging-1.internal '104.196.154.62/32'`
  - In another terminal, change your kube target to ops: `kctx ops`
  - `helm tiller run helm delete gitlab-monitoring --purge`
  - `helm tiller run helm delete gitlab-monitoring-secrets --purge`
- Download the v1 CRD manifest from a healthy cluster (pre): `kubectl --context pre get crd backendconfigs.cloud.google.com -o yaml`
- Tidy the manifest, removing fields that are disallowed on input (`status`, `metadata.uid`, etc.)
- Targeting the ops cluster, apply the CRD:
  - `sshuttle -r console-01-sv-gstg.c.gitlab-staging-1.internal '104.196.154.62/32'`
  - In another terminal, change your kube target to ops: `kctx ops`
  - `kubectl apply -f tidied-crd-manifest.yaml`
- Verify the CRD now exists on ops: `kubectl --context ops get crd backendconfigs.cloud.google.com -o yaml`
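The "tidy the manifest" step above can be scripted rather than done by hand. A minimal sketch in Python, operating on JSON output (`kubectl get crd ... -o json`) so that only the standard library is needed; the exact set of server-populated fields to strip is an assumption based on common Kubernetes object metadata, so double-check it against the actual manifest:

```python
import json

# Cluster-managed fields that are disallowed (or meaningless) on input.
STRIP_TOP_LEVEL = ("status",)
STRIP_METADATA = ("uid", "resourceVersion", "creationTimestamp",
                  "selfLink", "generation")


def tidy_manifest(manifest: dict) -> dict:
    """Return a copy of a manifest with server-populated fields removed."""
    tidied = {k: v for k, v in manifest.items() if k not in STRIP_TOP_LEVEL}
    metadata = dict(tidied.get("metadata", {}))
    for field in STRIP_METADATA:
        metadata.pop(field, None)
    # Annotation written by `kubectl apply` on the source cluster.
    metadata.get("annotations", {}).pop(
        "kubectl.kubernetes.io/last-applied-configuration", None)
    tidied["metadata"] = metadata
    return tidied


if __name__ == "__main__":
    # Example: a heavily trimmed CRD manifest as fetched from the pre cluster.
    raw = {
        "apiVersion": "apiextensions.k8s.io/v1",
        "kind": "CustomResourceDefinition",
        "metadata": {"name": "backendconfigs.cloud.google.com",
                     "uid": "abc-123", "resourceVersion": "42"},
        "status": {"conditions": []},
    }
    print(json.dumps(tidy_manifest(raw), indent=2))
```

The output can then be fed to `kubectl apply -f -`; `kubectl apply` accepts JSON as well as YAML.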
## Rollback steps
We could delete the CRD (`kubectl delete crd backendconfigs.cloud.google.com`), but I'm not sure under what circumstances that would be desirable. Given that this change issue is a rescue mission for the ops cluster, the next logical step might be to rebuild that cluster.
## Monitoring

### Key metrics to observe
- Metric: Metric Name
- Location: Dashboard URL
- What changes to this metric should prompt a rollback: Describe Changes
## Summary of infrastructure changes

- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
## Changes checklist

- [ ] Detailed steps and rollback steps have been filled prior to commencing work
- [ ] SRE on-call has been informed prior to change being rolled out
- [ ] There are currently no active incidents