Update int.gprd.gitlab.net SSL certificate in production
Production Change
Change Summary
The internal interface for haproxy nodes has an SSL certificate that will expire on May 14th.
Change Details
- Services Impacted - GPRD Haproxy
- Change Technician - @cmcfarland
- Change Criticality - C3
- Change Type - change::unscheduled
- Change Reviewer - @nhoppe1
- Due Date - Date and time (in UTC) for the execution of the change
- Time tracking - 55
- Downtime Component - N/A
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 20
- Create a backup of the existing key and certificate: ./bin/gkms-vault-show frontend-loadbalancer gprd | grep internal > internal.backup
- Create a JSON-ified version of the new chained certificate: awk 'NF {sub(/\r/, ""); printf "%s\\n",$0;}' \*.gprd.gitlab.net.chained.crt > \*.gprd.gitlab.net.json.chained.crt
- Verify the current certificate expiration time and date: knife ssh roles:gprd-base-lb-fe "echo -n | openssl s_client -showcerts -servername int.gprd.gitlab.net -connect localhost:443 2>/dev/null | openssl x509 -inform pem -noout -text | grep 'Not After'"
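The JSON-ifying step can be sketched as a small shell function, which makes it easy to check the result before pasting it into the vault. This is a minimal sketch: the function name `jsonify_pem` and the file names in the usage comment are illustrative assumptions, while the awk program is the one from the step above.

```shell
#!/bin/sh
# Collapse a PEM file into one line, replacing each line break with a
# literal backslash-n so the value can be stored as a JSON string.
# Blank lines are skipped (NF) and carriage returns are stripped.
jsonify_pem() {
    awk 'NF {sub(/\r/, ""); printf "%s\\n",$0;}' "$1"
}

# Usage (hypothetical file names):
#   jsonify_pem chained.crt > chained.json.crt
```

The output should be a single line that starts with `-----BEGIN CERTIFICATE-----\n` and contains no real newlines.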
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 35
- Set label change::in-progress on this issue
- Replace the certificate in the frontend-loadbalancer gprd gkms vault. The key for the cert is internal_crt.
- Verify the private key matches in the frontend-loadbalancer gprd vault. The key for the private key is internal_key.
- Run chef locally on a single front end node: ssh fe-01-lb-gprd.c.gitlab-production.internal "sudo chef-client"
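One way to check that a private key matches a certificate is to compare the public key embedded in each. The sketch below assumes both values have already been saved as ordinary PEM files. Extracting them from the vault is not shown, and the function name `cert_matches_key` and the file names are illustrative.

```shell
#!/bin/sh
# Succeeds only if the certificate ($1) and the private key ($2)
# carry the same public key. Works for RSA and EC keys.
cert_matches_key() {
    cert_pub=$(openssl x509 -in "$1" -noout -pubkey 2>/dev/null) || return 1
    key_pub=$(openssl pkey -in "$2" -pubout 2>/dev/null) || return 1
    # Guard against two empty strings comparing equal.
    [ -n "$cert_pub" ] && [ "$cert_pub" = "$key_pub" ]
}

# Usage (hypothetical file names):
#   cert_matches_key internal.crt internal.key && echo "key matches certificate"
```

Comparing the public keys directly avoids the RSA-only modulus comparison, so the same check applies if the certificate is ever reissued with an EC key.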
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 5
- Verify the new certificate expiration time and date: knife ssh roles:gprd-base-lb-fe "echo -n | openssl s_client -showcerts -servername int.gprd.gitlab.net -connect localhost:443 2>/dev/null | openssl x509 -inform pem -noout -text | grep 'Not After'"
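Reading the 'Not After' line by eye works, but the check can also be made pass/fail with the `-checkend` option of `openssl x509`. A minimal sketch, assuming the certificate arrives as PEM on standard input; the function name `cert_valid_for_days` is illustrative, and the 14-day figure in the usage comment is borrowed from the monitoring query in this issue.

```shell
#!/bin/sh
# Reads a PEM certificate on stdin. Succeeds if the certificate is still
# valid $1 days from now, fails if it expires before then.
cert_valid_for_days() {
    openssl x509 -noout -checkend "$(( $1 * 86400 ))" >/dev/null
}

# Usage against a live endpoint, reusing the s_client call from the step above:
#   echo -n | openssl s_client -servername int.gprd.gitlab.net -connect localhost:443 2>/dev/null \
#     | cert_valid_for_days 14 && echo "certificate OK"
```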
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 15
- Edit the frontend-loadbalancer gprd gkms vault and replace the values with the old certificate and key.
- Force a chef run on the front end nodes.
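Forcing the chef run could be sketched as below. This assumes the same `roles:gprd-base-lb-fe` search used in the verification steps also selects the nodes to roll back, which this issue does not state. The function name `force_chef_run` and the `DRY_RUN` switch are hypothetical; by default the function only prints the command it would run.

```shell
#!/bin/sh
# Prints the knife command by default; set DRY_RUN=0 to actually run it.
# The role name is an assumption taken from the verification steps.
force_chef_run() {
    role=${1:-gprd-base-lb-fe}
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf 'knife ssh roles:%s "sudo chef-client"\n' "$role"
    else
        knife ssh "roles:$role" "sudo chef-client"
    fi
}

# Usage:
#   force_chef_run              # print the command only
#   DRY_RUN=0 force_chef_run    # run chef-client on every matching node
```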
Monitoring
Key metrics to observe
- Metric: SSL Cert Expiration
- Location: https://thanos-query.ops.gitlab.net/graph?g0.range_input=15m&g0.max_source_resolution=0s&g0.expr=probe_ssl_earliest_cert_expiry%7Benvironment%3D%22gprd%22%7D%20-%20time()%20%3C%2014%20*%2086400&g0.tab=0
- What changes to this metric should prompt a rollback: Describe Changes
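URL-decoded, the query in the link above is `probe_ssl_earliest_cert_expiry{environment="gprd"} - time() < 14 * 86400`, so it returns only probes whose earliest certificate expires in less than 14 days. The same arithmetic can be sketched in shell. The function name `expires_within_days` and its arguments are illustrative and not part of any tooling referenced in this issue.

```shell
#!/bin/sh
# $1 = certificate expiry as a Unix timestamp (what the metric reports)
# $2 = threshold in days
# $3 = current time as a Unix timestamp (defaults to now)
# Succeeds if the certificate expires in less than $2 days.
expires_within_days() {
    expiry_epoch=$1
    days=$2
    now=${3:-$(date +%s)}
    [ $(( expiry_epoch - now )) -lt $(( days * 86400 )) ]
}

# Usage:
#   expires_within_days "$expiry_epoch" 14 && echo "renewal needed"
```

After a successful change, the int.gprd.gitlab.net probe should no longer appear in the query results.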
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.
Edited by Cameron McFarland