2022-12-23 (C2): Automate dev.gitlab.org and registry to use Let's Encrypt certificate via certbot/omnibus-gitlab

Production Change

Change Summary

This issue is related to yesterday's aborted change: #7976 (closed). I have identified the issue and fixed it for today's attempt.

As part of https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/14554 we will use the Let's Encrypt certbot integration in omnibus-gitlab to issue certificates that auto-renew, replacing the current manual process of uploading and rotating certs.

The registry is currently presenting a manual certificate hosted in the Chef vault, which was uploaded as part of this Change: #8019 (closed) to avoid SSL expiry.

Current Manual cert:

shimrangeorge in ~/Desktop/gitlab-repos/cookbook-omnibus-gitlab/files on branch cmcfarland/attempt1 > echo | openssl s_client -showcerts -servername dev.gitlab.org -connect 34.139.135.192:5005 2>/dev/null | openssl x509 -inform pem -noout -text
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            04:9d:4e:74:e1:5f:97:0e:0f:32:97:c5:e2:17:1f:30:de:a3
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: C=US, O=Let's Encrypt, CN=R3
        Validity
            Not Before: Nov  8 02:20:58 2022 GMT
            Not After : Feb  6 02:20:57 2023 GMT
        Subject: CN=dev.gitlab.org
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
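The same `openssl x509` flags can answer the operational question directly: how long until this cert expires? A minimal local demo, using a throwaway self-signed cert as a stand-in for the live registry cert (the `/tmp/demo.*` paths and the 30-day lifetime are illustrative assumptions):

```shell
# Generate a short-lived self-signed cert as a stand-in for the registry cert.
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/demo.key \
  -out /tmp/demo.crt -days 30 -subj '/CN=dev.gitlab.org' 2>/dev/null

# Print the expiry date -- the field the renewal automation watches:
openssl x509 -in /tmp/demo.crt -noout -enddate

# -checkend SECONDS exits 0 if the cert is still valid that far ahead;
# handy as a one-line "is renewal overdue?" probe.
openssl x509 -in /tmp/demo.crt -noout -checkend 86400 \
  && echo 'cert valid for at least another 24h'
```

Against the live endpoint, the `echo | openssl s_client ... | openssl x509 ...` pipeline above feeds the same `-enddate`/`-checkend` checks.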

We are removing the nginx-registry attributes in gitlab.rb to enable the omnibus automation for the registry: https://docs.gitlab.com/omnibus/settings/ssl/index.html

Per documentation:

#registry_nginx['ssl_certificate'] = "path/to/cert"      # Must be absent or commented out

Therefore, the omnibus-gitlab letsencrypt module, despite being activated, cannot take over the certificates because they are being overridden in gitlab.rb. The certificate currently encrypting registry traffic is the manual certificate installed in the vault.
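For context, a sketch of the relevant gitlab.rb settings per the omnibus SSL documentation linked above; the actual values live in the chef-repo MR and are not reproduced here:

```ruby
# gitlab.rb sketch (illustrative, per the omnibus SSL docs; not the MR contents).
# The registry URL must be https on the desired port:
registry_external_url 'https://dev.gitlab.org:5005'

# Let's Encrypt integration handles issuance and renewal:
letsencrypt['enable'] = true
letsencrypt['auto_renew'] = true

# These must stay absent/commented so the omnibus-managed cert is not overridden:
# registry_nginx['ssl_certificate'] = "path/to/cert"
# registry_nginx['ssl_certificate_key'] = "path/to/key"
```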

Benefits:

  • Let's Encrypt is free, and our certbot automation in our omnibus-gitlab package makes enablement easy and renewal automated.

Why is this a C2 change?

Despite its name, we actually use dev.gitlab.org's registry for all our production container pulls in Kubernetes (and elsewhere).

If there is an issue with swapping out the certificates:

  • Pods will fail to restart/redeploy due to failed container image pulls, leading to an incremental service outage across all clusters, and all pods.

Here is an example of webservice-web deployment object in our Production cluster:

kubectl -n gitlab describe deployment gitlab-webservice-web

   dependencies:
    Image:      dev.gitlab.org:5005/gitlab/charts/components/images/gitlab-webservice-ee:15-6-202211010921-e6fe5b10503
    Port:       <none>
    Host Port:  <none>
    Args:
      /scripts/wait-for-deps

You can see we are using dev.gitlab.org for its registry functionality (on port 5005) for our production containers.
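To gauge the blast radius, one can list every image a cluster pulls from this registry. A hedged sketch: the `images` variable below is sample data standing in for live `kubectl get pods -A -o jsonpath='{..image}'` output (assumption), so only the filtering pipeline is demonstrated:

```shell
# Sample image list standing in for live kubectl jsonpath output (assumption).
images='dev.gitlab.org:5005/gitlab/charts/components/images/gitlab-webservice-ee:15-6
registry.gitlab.com/gitlab-org/build/cng/gitlab-shell:v14
dev.gitlab.org:5005/gitlab/charts/components/images/gitlab-sidekiq-ee:15-6'

# Keep only images served by the dev.gitlab.org registry on port 5005:
echo "$images" | grep '^dev\.gitlab\.org:5005' | sort -u
```

Run against a real cluster, every line this prints is a workload that cannot pull its image if the registry cert swap goes wrong.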

Therefore this fits criterion (C2.8) in the GitLab Handbook:

https://about.gitlab.com/handbook/engineering/infrastructure/change-management/

Any procedure involving changes of certificate authorities.

Change Details

  1. Services Impacted - All Kubernetes pods
  2. Change Technician - @sgeorge
  3. Change Reviewer - @f_santos
  4. Time tracking - 10-15 minutes of implementation (Including pipeline). Roughly 10 minutes rollback, and 5 minutes of verification of rollback.
  5. Downtime Component - No anticipated downtime, if everything goes smoothly.

Detailed steps for the change

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 10 minutes

  • Set label change::in-progress /label ~change::in-progress
  • Announce change in Slack and cc: @release-managers for visibility
  • Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2531
  • Run sudo chef-client on dev.gitlab.org and, if necessary, gitlab-ctl reconfigure
  • Run locally echo | openssl s_client -showcerts -servername dev.gitlab.org -connect 34.139.135.192:5005 2>/dev/null | openssl x509 -inform pem -noout -text to confirm registry is being managed by omnibus-letsencrypt.
  • Search for an old webservice container image here: https://dev.gitlab.org/groups/gitlab/charts/components/-/container_registries/302
  • Log into staging cluster glsh kube use-cluster gstg
  • Start a container with an old Docker image hosted in the dev.gitlab.org registry: kubectl run app-test-dev-gitlab-registry --image=dev.gitlab.org:5005/gitlab/charts/components/images/gitlab-webservice-ee:00a62fb1ec52f0e0750c7360d4fcf66b8b3aaaac
  • Container will crash loop, but check pod events to confirm successful image pulls (kubectl describe pod app-test-dev-gitlab-registry). If this fails, proceed to Rollback Steps.
  • CLEANUP: Remove the cert bundle and private key stored under omnibus-gitlab.ssl.registry_certificate and omnibus-gitlab.ssl.registry_private_key
  • Set label change::complete /label ~change::complete
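The crash-loop verification step can be sketched as a simple grep over the pod's event log: a "Successfully pulled" event proves the kubelet trusted the new registry cert, even though the container itself fails for unrelated reasons. The `events` variable below is sample text standing in for real `kubectl describe pod` output (assumption):

```shell
# Sample event text standing in for 'kubectl describe pod
# app-test-dev-gitlab-registry' output (assumption).
events='Normal   Scheduled  41s  default-scheduler  Successfully assigned default/app-test-dev-gitlab-registry
Normal   Pulling    40s  kubelet            Pulling image "dev.gitlab.org:5005/gitlab/charts/components/images/gitlab-webservice-ee:00a62fb1ec52f0e0750c7360d4fcf66b8b3aaaac"
Normal   Pulled     12s  kubelet            Successfully pulled image
Warning  BackOff     3s  kubelet            Back-off restarting failed container'

# The pull succeeding is the pass/fail signal, not the container status:
if echo "$events" | grep -q 'Successfully pulled'; then
  echo 'image pull OK - registry cert trusted by kubelet'
else
  echo 'image pull FAILED - proceed to Rollback Steps' >&2
fi
```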

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 5 minutes

Monitoring

Key metrics to observe

  • Metric: Metric Name
    • Location: Dashboard URL
    • What changes to this metric should prompt a rollback: Describe Changes

Change Reviewer checklist

C4 C3 C2 C1:

  • Check if the following applies:
    • The scheduled day and time of execution of the change is appropriate.
    • The change plan is technically accurate.
    • The change plan includes estimated timing values based on previous testing.
    • The change plan includes a viable rollback plan.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

  • Check if the following applies:
    • The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
    • The change plan includes success measures for all steps/milestones during the execution.
    • The change adequately minimizes risk within the environment/service.
    • The performance implications of executing the change are well-understood and documented.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.
      • If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
    • The change has a primary and secondary SRE with knowledge of the details available during the change window.
    • The labels blocks deployments and/or blocks feature-flags are applied as necessary

Change Technician checklist

  • Check if all items below are complete:
    • The change plan is technically accurate.
    • This Change Issue is linked to the appropriate Issue and/or Epic
    • Change has been tested in staging and results noted in a comment on this issue.
    • A dry-run has been conducted and results noted in a comment on this issue.
    • The change execution window respects the Production Change Lock periods.
    • For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
    • For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
    • For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
    • For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
    • Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
    • There are currently no active incidents that are severity1 or severity2
    • If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.
Edited by Shimran George