Rotate JWT signing keys in CDot non-production environments

Production Change

Change Summary

Yearly rotation of the following JWT signing keys used in CustomersDot non-production environments (dev, test, stg, stg-ref):

  • cdot_gitlab_internal_jwt_signing_key
  • cdot_cloud_connector_jwt_signing_key

We'll open a separate change management issue for rotating these keys in production after the non-production environments are completed successfully.

Change Details

  1. Services Impacted - ~Service::CustomersDot ~Service::AIGateway
  2. Change Technician - @tyleramos
  3. Change Reviewer - DRI for the review of this change
  4. Scheduled Date and Time (UTC in format YYYY-MM-DD HH:MM) - Start date and time planned to execute change steps YYYY-MM-DD HH:MM
  5. Time tracking - This change should only involve code changes in CDot. No manual work should be required.
  6. Downtime Component - none

Set Maintenance Mode in GitLab

If your change involves scheduled maintenance, add a step to set and unset maintenance mode per our runbooks. This will make sure SLA calculations adjust for the maintenance period.

Detailed steps for the change

Change steps - steps to take to execute the change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

  • Set label ~change::in-progress: /label ~change::in-progress

  • Generate new RSA private keys by running bin/generate_rsa_key_pair for each non-production env.

  • Open an MR to update the *_jwt_validation_key value in credentials.yml.enc for each environment with the keys generated in the previous step. This ensures that validators have access to the new key before we use it to sign tokens.

  • Merge the MR to deploy this change. The JWKS endpoint should now serve the new key.

    • Check that the JWKS discovery endpoint returns both keys.
    • Monitor the service logs of the validating service to ensure there is no increase in 401s.
      • Refer to the respective service runbook for how to do this.
      • For CDot internal API calls to GitLab, check the Kibana logs for anomalies.
  • Wait at least 24 hours so that the key caches in validators expire. All services that validate tokens should then have refreshed their key sets and be able to validate tokens signed with either the old or the new key.

  • Open an MR to swap *_jwt_signing_key and *_jwt_validation_key. After this change, all new tokens are signed with the key generated in the key generation step, and tokens signed with the old *_jwt_signing_key can still be validated because the old key is now the *_jwt_validation_key.

  • Merge MR to deploy this change.

    • Monitor the service logs of the validating service to ensure there is no increase in 401s.
      • Refer to the respective service runbook for how to do this.
      • For CDot internal API calls to GitLab, check the Kibana logs for anomalies.
  • After 3 days, open an MR to remove the value of *_jwt_validation_key in credentials.yml.enc.

  • Merge MR to deploy this change.

    • Verify that the JWKS endpoint no longer includes this key.
  • Set label ~change::complete: /label ~change::complete
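
The ordering of the steps above can be sanity-checked with a small simulation. This is a minimal sketch, not CustomersDot code: the `Env` class, the `accepts` helper, and the key names `old` and `new` are hypothetical, and it assumes that the JWKS endpoint publishes the signing key plus the validation key whenever one is set, which is inferred from the step descriptions rather than confirmed.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Env:
    """Hypothetical model of one environment's JWT key configuration."""
    signing_key: str
    validation_key: Optional[str] = None

    def jwks(self) -> Set[str]:
        # Assumption: the JWKS endpoint serves the signing key and, when
        # set, the validation key.
        return {k for k in (self.signing_key, self.validation_key) if k}


def accepts(cached_jwks: Set[str], token_key: str) -> bool:
    """A validator accepts a token only if its cached key set has the signer."""
    return token_key in cached_jwks


env = Env(signing_key="old")
validator_cache = env.jwks()  # validators start out knowing only "old"

# Publish the new key as the validation key.
env.validation_key = "new"
assert env.jwks() == {"old", "new"}
# A validator with a stale cache would still reject tokens signed with "new",
# which is why the plan waits before swapping.
assert not accepts(validator_cache, "new")

# Wait for validator caches to refresh (at least 24 hours in the plan).
validator_cache = env.jwks()

# Swap the signing and validation keys.
env.signing_key, env.validation_key = env.validation_key, env.signing_key
assert env.signing_key == "new"
assert accepts(validator_cache, "new")  # newly issued tokens validate
assert accepts(validator_cache, "old")  # tokens issued before the swap too

# After 3 days, remove the old key.
env.validation_key = None
assert env.jwks() == {"new"}
```

The first negative assertion illustrates why the swap must not happen before validators have refreshed their key sets.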

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

  • Rollback Step 1
  • Rollback Step 2
  • Set label ~change::aborted: /label ~change::aborted

Monitoring

Key metrics to observe

  • Check that the JWKS discovery endpoint returns both keys.
  • Monitor the service logs of the validating service to ensure there is no increase in 401s.
    • Refer to the respective service runbook for how to do this.
    • For CDot internal API calls to GitLab, check the Kibana logs for anomalies.
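
The two checks above could be scripted roughly as follows. This is a sketch under stated assumptions: the function names, the sample key IDs, and the 1% tolerance are hypothetical, and it assumes each key in the JWKS document carries a `kid` member (which the JWK format allows but does not require). Fetching the JWKS document and extracting status codes from logs are left out, since the endpoint URL and log queries are specific to each service runbook.

```python
import json
from typing import Iterable, List


def missing_kids(jwks_json: str, expected_kids: Iterable[str]) -> List[str]:
    """Return the expected key IDs that the JWKS document does not contain."""
    served = {key.get("kid") for key in json.loads(jwks_json).get("keys", [])}
    return sorted(set(expected_kids) - served)


def rate_401(status_codes: Iterable[int]) -> float:
    """Fraction of responses with status 401 (0.0 for an empty sample)."""
    codes = list(status_codes)
    return sum(1 for c in codes if c == 401) / len(codes) if codes else 0.0


def increased_401s(before: Iterable[int], after: Iterable[int],
                   tolerance: float = 0.01) -> bool:
    """True if the 401 rate rose by more than the (hypothetical) tolerance."""
    return rate_401(after) > rate_401(before) + tolerance


# Sample data for illustration only; the key IDs are made up.
sample = '{"keys": [{"kty": "RSA", "kid": "old-key"}, {"kty": "RSA", "kid": "new-key"}]}'
print(missing_kids(sample, ["old-key", "new-key"]))  # [] means both keys are served
print(increased_401s([200] * 99 + [401], [200] * 90 + [401] * 10))
```

An empty list from `missing_kids` corresponds to the "returns both keys" check; a `True` result from `increased_401s` would be a signal to consult the runbook and consider rolling back.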

Change Reviewer checklist

C4 C3 C2 C1:

  • Check if the following applies:
    • The scheduled day and time of execution of the change is appropriate.
    • The change plan is technically accurate.
    • The change plan includes estimated timing values based on previous testing.
    • The change plan includes a viable rollback plan.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

  • Check if the following applies:
    • The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
    • The change plan includes success measures for all steps/milestones during the execution.
    • The change adequately minimizes risk within the environment/service.
    • The performance implications of executing the change are well-understood and documented.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.
      • If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
    • The change has a primary and secondary SRE with knowledge of the details available during the change window.
    • The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
    • The labels blocks deployments and/or blocks feature-flags are applied as necessary.

Change Technician checklist

  • The change plan is technically accurate.
  • This Change Issue is linked to the appropriate Issue and/or Epic
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • The change execution window respects the Production Change Lock periods.
  • For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  • For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue. Mention @gitlab-org/saas-platforms/inframanagers in this issue to request approval and provide visibility to all infrastructure managers.
  • For C1, C2, or blocks deployments change issues, confirm with Release managers that the change does not overlap or hinder any release process (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
Edited by Tyler Amos