Controlled disposal of custom certificate in Cloudflare [gprd]
Production Change
Change Summary
This is the gprd step of "ACM Certificate is in use" (https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/14165). The same change has previously been executed on the gitlab.net and staging.gitlab.com zones.
The replacement certificate is already issued and active in Cloudflare. This Change Request covers decommissioning the manually maintained certificate in Cloudflare by hand. The new certificate will take over seamlessly, without user impact.*
* GitLab CI runners, due to the way they handle TLS certificates, may experience a very brief period of certificate mismatches during the switchover. A retry of the job will resolve the issue. The switchover period is less than 30 seconds.
Usually a certificate replacement is done without a change request, but since we are switching the certificate provider here, I believe a CR is appropriate to track this.
Change Details
- Services Impacted - ~Service::API ~Service::Web
- Change Technician - @T4cC0re
- Change Reviewer - @steveazz
- Time tracking - 10m
- Downtime Component - Highly unlikely. If the targeted managed certificate does not take the traffic, the fallback certificate (Universal SSL) will take the traffic, as it is a wildcard certificate. In this case the missing SAN will be added to the certificate and rolled forward. A rollback is not anticipated.
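The fallback described above depends on wildcard matching. As a minimal sketch, a hostname can be checked against a list of SAN entries as shown below. It assumes the common rule that `*` matches exactly one leftmost DNS label and that names are already lowercase. The function name `san_covers` and the SAN list in the usage comment are hypothetical and not taken from the actual Universal SSL certificate.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if hostname $1 is covered by any of the
# SAN entries given as the remaining arguments, 1 otherwise.
# Assumption: "*" stands for exactly one leftmost DNS label.
san_covers() {
  host=$1
  shift
  for san in "$@"; do
    case $san in
      \*.*)
        suffix=${san#\*}         # "*.gitlab.com" -> ".gitlab.com"
        label=${host%"$suffix"}  # what is left must be a single label
        if [ "$label" != "$host" ] && [ -n "$label" ]; then
          case $label in
            *.*) ;;              # more than one label: not covered
            *) return 0 ;;
          esac
        fi
        ;;
      *)
        # Non-wildcard entries must match the hostname exactly.
        [ "$san" = "$host" ] && return 0
        ;;
    esac
  done
  return 1
}

# Usage (the SAN list here is illustrative only):
#   san_covers registry.gitlab.com 'gitlab.com' '*.gitlab.com' && echo covered
```

Any hostname for which this check fails against the fallback certificate's SAN list would need to be served by the managed certificate.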
Detailed steps for the change
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 10 minutes
- Set label ~change::in-progress: /label ~change::in-progress
- Navigate to the [SSL/TLS section of the gitlab.com zone](https://dash.cloudflare.com/852e9d53d0f8adbd9205389356f2303d/gitlab.com/ssl-tls/edge-certificates).
- Find the Custom Modern (default) certificate expiring on 2022-05-12 and delete it.
- Check the output of openssl s_client -connect gitlab.com:443 -servername gitlab.com </dev/null | openssl x509 -text
  - Subject should match C = US, ST = California, L = San Francisco, O = "Cloudflare, Inc.", CN = gitlab.com
  - DNS Subject Alternative Names should match DNS:gitlab.com, DNS:customers.gitlab.com, DNS:email.customers.gitlab.com, DNS:chef.gitlab.com, DNS:registry.gitlab.com, DNS:kas.gitlab.com, DNS:packages.gitlab.com
  - If the SANs mismatch, add the missing SANs to the terraform configuration.
- Set label ~change::complete: /label ~change::complete
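The SAN comparison in the verification step can be scripted rather than done by eye. The following is a minimal sketch, not existing tooling: the helper names `extract_sans` and `missing_sans` are hypothetical, and the expected list is copied from the step above. It reads `openssl x509 -text` output on stdin and prints every expected SAN that is absent.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative only.

# Expected SANs, copied from the change steps.
EXPECTED_SANS="gitlab.com customers.gitlab.com email.customers.gitlab.com chef.gitlab.com registry.gitlab.com kas.gitlab.com packages.gitlab.com"

# Read `openssl x509 -text` output on stdin and print one DNS SAN per line.
extract_sans() {
  tr ',' '\n' \
    | sed -n 's/^[[:space:]]*DNS:\([^[:space:]]*\).*$/\1/p' \
    | sort -u
}

# Print each expected SAN that does not appear in the certificate text on stdin.
missing_sans() {
  present=$(extract_sans)
  for name in $EXPECTED_SANS; do
    printf '%s\n' "$present" | grep -qxF "$name" || printf '%s\n' "$name"
  done
}

# Usage (requires network access to gitlab.com):
#   openssl s_client -connect gitlab.com:443 -servername gitlab.com </dev/null 2>/dev/null \
#     | openssl x509 -noout -text | missing_sans
```

Empty output means every expected SAN is present. Any name that is printed is a candidate to add to the terraform configuration. The Subject line still needs to be compared separately.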
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- N/A - rolling forward on issues
- Set label ~change::aborted: /label ~change::aborted
Monitoring
Key metrics to observe
- Metric: Metric Name
- Location: Dashboard URL
- What changes to this metric should prompt a rollback: Describe Changes
Change Reviewer checklist
- Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
Change Technician checklist
- Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgement.)
  - There are currently no active incidents that are ~severity::1 or ~severity::2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.