2022-11-08: Add manual cert for dev.gitlab.org registry
Production Change
Change Summary
This change is a part of https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/14554#note_1163175782 and is related to the cert expiry alert here: #7992 (closed).
As part of https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/14554 we're looking to automate the rotation of certs via the certbot found in the omnibus-gitlab practice.
However, with the certificate expiring in a few days (#7992 (closed)), we have decided to manually generate a cert via Let's Encrypt, which gives us a 3-month extension to get automation in place. Since our Kubernetes clusters rely on pulling images from this registry, we can't have any lapse in SSL for the pulls.
So, to be on the safe side, we're adding a manual cert in this change until automation that renews certs is in place. This cert will be a DV cert issued by Let's Encrypt and uploaded to the chef-vault.
Cert snippet:
Certificate:
Data:
Version: 3 (0x2)
Serial Number:
04:9d:4e:74:e1:5f:97:0e:0f:32:97:c5:e2:17:1f:30:de:a3
Signature Algorithm: sha256WithRSAEncryption
Issuer: C=US, O=Let's Encrypt, CN=R3
Validity
Not Before: Nov 8 02:20:58 2022 GMT
Not After : Feb 6 02:20:57 2023 GMT
Subject: CN=dev.gitlab.org
Subject Public Key Info:
Public Key Algorithm: rsaEncryption
Public-Key: (2048 bit)
Modulus:
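As a rough cross-check of the 3 month extension, the validity window in the snippet above can be turned into a lifetime with a little date arithmetic. This is only a sketch: it assumes GNU date (for the -d flag), and the variable names are illustrative rather than part of the change plan.

```shell
# Validity timestamps copied from the cert snippet above.
not_before='Nov 8 02:20:58 2022 GMT'
not_after='Feb 6 02:20:57 2023 GMT'

# Convert both timestamps to epoch seconds (GNU date assumed).
start=$(date -u -d "$not_before" +%s)
end=$(date -u -d "$not_after" +%s)

# The validity period includes both endpoints, so add one second
# before converting the difference to whole days.
lifetime_seconds=$((end - start))
lifetime_days=$(( (lifetime_seconds + 1) / 86400 ))
echo "cert lifetime: ${lifetime_days} days"
```

With the certificate file on disk, openssl x509 -noout -dates -in cert.pem prints the same notBefore/notAfter pair, which could feed the variables above instead of hard-coded values.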
Why is this a C2 change?
Despite its name, we actually use dev.gitlab.org's registry for all our production container pulls in Kubernetes (and elsewhere). If there is an issue with swapping out the certificates:
- Pods will fail to restart/redeploy due to failed container image pulls, leading to an incremental service outage across all pods in all clusters. Update: we have ImagePullPolicy: IfNotPresent, so the issue will be seen when we deploy new versions of the image to the registry and the containers fail trying to pull the updated images.
Here is an example of the webservice-web deployment object in our Production cluster:
kubectl -n gitlab describe deployment gitlab-webservice-web
dependencies:
Image: dev.gitlab.org:5005/gitlab/charts/components/images/gitlab-webservice-ee:15-6-202211010921-e6fe5b10503
Port: <none>
Host Port: <none>
Args:
/scripts/wait-for-deps
You can see we are using dev.gitlab.org for its registry functionality (on port 5005) for our production containers. Therefore, this fits criterion C2.8 in the GitLab Handbook: https://about.gitlab.com/handbook/engineering/infrastructure/change-management/
Any procedure involving changes of certificate authorities.
Change Details
- Services Impacted - Service::Container Registry
- Change Technician - @sgeorge
- Change Reviewer - @ggillies
- Time tracking - 5 minutes
- Downtime Component - none
Detailed steps for the change
Pre Change Steps - steps to execute prior to implementation of change
- Install certbot locally (Mac: brew install certbot)
- Generate a signed Let's Encrypt cert: certbot certonly --preferred-challenges=dns --manual --config-dir ~/lets-encrypt --work-dir ~/lets-encrypt --logs-dir ~/lets-encrypt
- Fill out the prompts. Email: infrastructure@gitlab.com. Domains: dev.gitlab.org, registry.dev.gitlab.org. (Note: the registry is not served from registry.dev.gitlab.org but from dev.gitlab.org:5005; registry is only included as a domain name to get past Let's Encrypt rate limits.)
- Log in to AWS Route 53, find the gitlab.org zone, and add the TXT challenge records (shown in the certbot prompt) to complete domain validation.
- Upload the cert bundle and private key files to a Linux server (this is needed so the secrets encode properly for the vault).
- Run sed -E ':a;N;$!ba;s/\r{0,1}\n/\\n/g' <certbundle.pem>
- Run sed -E ':a;N;$!ba;s/\r{0,1}\n/\\n/g' <privatekey.pem>
- You will now have encoded files to upload to the chef vault.
- Make a copy of the current chef vault item (dev-gitlab-org _default) for rollback.
- Verify the openssl chain (all the way to the self-signed root cert): openssl verify -CAfile combined_chain1.pem cert.pem. Expected output: cert.pem: OK
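The sed encoding step above can be tried out on a throwaway file before touching the real cert bundle or key. The sketch below uses a stand-in file (not a real certificate) and a hypothetical helper name, encode_pem; it assumes GNU sed, which is consistent with running this step on a Linux server.

```shell
# Stand-in PEM-shaped file for illustration only; not a real certificate.
workdir=$(mktemp -d)
printf '%s\n' '-----BEGIN CERTIFICATE-----' 'MIIBexample' \
  '-----END CERTIFICATE-----' > "$workdir/certbundle.pem"

# Same sed expression as in the pre-change steps: read the whole file into
# the pattern space, then replace each line break (LF or CRLF) with the
# two literal characters "\n".
encode_pem() {
  sed -E ':a;N;$!ba;s/\r{0,1}\n/\\n/g' "$1"
}

encoded=$(encode_pem "$workdir/certbundle.pem")
printf '%s\n' "$encoded"

# Round-trip check: expanding the escapes again should give back the
# original file contents.
decoded=$(printf '%b' "$encoded")
original=$(cat "$workdir/certbundle.pem")
if [ "$decoded" = "$original" ]; then
  echo "round trip OK"
fi
```

The single-line output is the form that gets pasted into the vault fields; the round-trip comparison is just a cheap guard against a mangled paste.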
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- Set label change::in-progress: /label ~change::in-progress
- Notify @sre-oncall and @release-managers in the Slack #production channel.
- Run knife vault edit dev-gitlab-org _default
- Remove the cert bundle and private key under omnibus-gitlab.ssl.registry_certificate and omnibus-gitlab.ssl.registry_private_key
- Copy the encoded cert bundle and private key from the pre-change steps into omnibus-gitlab.ssl.registry_certificate and omnibus-gitlab.ssl.registry_private_key, respectively.
- Run sudo chef-client on dev.gitlab.org.
- Log into dev.gitlab.org and run sudo openssl x509 -in /etc/gitlab/ssl/registry.crt -text -noout to verify the new certificate has been uploaded to the correct place.
- Search for an old webservice container image here: https://dev.gitlab.org/groups/gitlab/charts/components/-/container_registries/302
- Log into the staging cluster: glsh kube use-cluster gstg
- Start a container with an old docker image hosted in the dev.gitlab.org repo: kubectl run app-test-dev-gitlab-registry --image=dev.gitlab.org:5005/gitlab/charts/components/images/gitlab-webservice-ee:005efc9e1f93e6a3d2f69c04835223e080532632
- The container will crash loop, but check the pod events to see successful image pulls (kubectl describe deployment)
- If the image pull did not succeed, proceed to the rollback steps.
- Set label change::complete: /label ~change::complete
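The pod-event check in the steps above could be scripted along the following lines. This is a hedged sketch, not part of the approved plan: image_pull_status is a hypothetical helper, the sample events are illustrative rather than captured output, and the matched strings are the usual kubelet event messages. Because of the IfNotPresent pull policy, an image that is already cached on the node proves nothing about the new certificate, so that case is reported separately.

```shell
# Classify `kubectl describe pod` output read from stdin.
# Prints one of: failed, pulled, cached, unknown.
image_pull_status() {
  events=$(cat)
  if printf '%s\n' "$events" | grep -Eq 'ErrImagePull|ImagePullBackOff|Failed to pull image'; then
    echo failed
  elif printf '%s\n' "$events" | grep -q 'Successfully pulled image'; then
    echo pulled
  elif printf '%s\n' "$events" | grep -q 'already present on machine'; then
    # No pull happened, so the registry certificate was not exercised.
    echo cached
  else
    echo unknown
  fi
}

# Intended use, with the pod name from the step above:
#   kubectl describe pod app-test-dev-gitlab-registry | image_pull_status

# Illustrative sample events, not real cluster output.
sample='  Normal  Pulling  5s  kubelet  Pulling image "dev.gitlab.org:5005/some-repo/some-image:v1"
  Normal  Pulled   2s  kubelet  Successfully pulled image "dev.gitlab.org:5005/some-repo/some-image:v1"'
printf '%s\n' "$sample" | image_pull_status
```

A result of cached or unknown would suggest retrying with an image tag that is not yet present on the node before deciding whether to roll back.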
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- Use the copy of the chef vault made in the pre-change steps.
- Replace the secrets omnibus-gitlab.ssl.registry_certificate and omnibus-gitlab.ssl.registry_private_key with the cert/private key from the pre-change copy.
- Run sudo chef-client
- Log into dev.gitlab.org and run sudo openssl x509 -in /etc/gitlab/ssl/registry.crt -text -noout to verify the old certificate has been restored to the correct place.
- Log into the staging cluster: glsh kube use-cluster gstg
- Start a container with an old docker image hosted in the dev.gitlab.org repo: kubectl run <sgeorge-app-test-dev-gitlab-registry> --image=dev.gitlab.org:5005/some-repo/some-image:v1
- The container will crash loop, but check the pod events to see successful image pulls (kubectl describe pod <pod>)
- Set label change::aborted: /label ~change::aborted
Monitoring
Key metrics to observe
- Metric: Metric Name
- Location: Dashboard URL
- What changes to this metric should prompt a rollback: Describe Changes
Change Reviewer checklist
- Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
C2 C1:
- Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The labels blocks deployments and/or blocks feature-flags are applied as necessary.
Change Technician checklist
- Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
  - Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
  - There are currently no active incidents that are severity::1 or severity::2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.