
Change: enable several Gitaly feature flags in production

PLANNING THE CHANGE

For more background on when this template should be used, see the infrastructure handbook.

  • Context: What is the background of the change? Relevant links?

    • Enable the following Gitaly feature flags in production: GITALY_BRANCH_NAMES, GITALY_TAG_NAMES, GITALY_FIND_REF_NAME, GITALY_IS_ANCESTOR, GITALY_ROOT_REF (these are environment variables; see the environment-variable sketch at the end of this section)
    • These flags are related to the following migrations:
    • They have already been tested individually in production. Evidence of this testing is available in the above issues.
    • Since the testing was successful, we plan to enable them permanently while monitoring for issues.
  • Downtime: Will the change introduce downtime, and if so, how much?

    • There may be some intermittent 502 errors while the Unicorn processes restart
    • We consider this change to have a low risk of causing downtime
    • What options were considered to avoid downtime?
      • These features are enabled in staging
      • We will test in the canary environment (MR)
      • We have tested in production individually
    • What is the downtime estimate based on? Can it be tested in some way?
      • See above
  • People:

  • Pre-checks: What should we check before starting with the change? Consider dashboards, metrics, limits of current infrastructure, etc.

  • Change Procedure:

  • Preparatory Steps: What can be done ahead of time? How far ahead?

  • Post-checks: What should we check after the change has been applied?

    • We will monitor dashboards and ensure that no alerts fire (see the Prometheus query sketch at the end of this section)
    • How do we know the change worked well? How do we know the change did not work well? What monitoring do we need to pay attention to? How do we verify data integrity?
      • We will monitor dashboards and alerts
    • Should any alerts be modified as a consequence of this change?
      • Alerts have already been configured.
  • Rollback procedure: In case things go wrong, what do we need to do to recover? Also consider rolling back from an intermediate step: does the procedure change depending on how far along the procedure is?

    • Roll back by reverting the change and unsetting the feature flag environment variables, then restarting the Unicorn processes (see the environment-variable sketch at the end of this section)
  • Create an invite using a 4 hr block of time on the "GitLab Production" calendar (link in handbook), inviting the ops-contact group. Include a link to the issue. (Often you will not expect to need, or actually need, all 4 hrs, but past experience has shown that delays and unexpected events are more likely than things going faster than expected.)

  • Ping the Production Lead in this issue to coordinate who should be present from the Production team, and to confirm scheduling.

  • When will this occur? (leave blank until scheduled)

  • Communication plan:
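
Since the flags above are plain environment variables read by the application at boot, enabling them requires restarting the Unicorn processes, and rollback is simply unsetting them and restarting again. A minimal sketch of how such a gate might work, written in Go purely for illustration (the helper name, the accepted values, and the fallback path are assumptions, not the actual GitLab implementation):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// featureEnabled reports whether a Gitaly feature flag has been switched on
// via an environment variable such as GITALY_BRANCH_NAMES=1. The accepted
// values are an assumption for illustration, not GitLab's actual parsing.
func featureEnabled(name string) bool {
	switch strings.ToLower(os.Getenv(name)) {
	case "1", "true", "enabled":
		return true
	default:
		return false
	}
}

func main() {
	for _, flag := range []string{
		"GITALY_BRANCH_NAMES",
		"GITALY_TAG_NAMES",
		"GITALY_FIND_REF_NAME",
		"GITALY_IS_ANCESTOR",
		"GITALY_ROOT_REF",
	} {
		if featureEnabled(flag) {
			fmt.Printf("%s: using the Gitaly code path\n", flag)
		} else {
			fmt.Printf("%s: using the pre-Gitaly code path\n", flag)
		}
	}
}
```

Rollback is the inverse: unset the variables (or set them to a disabled value) and restart the workers so that the pre-Gitaly path is used again.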
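As an automated complement to watching the dashboards, the post-check could also query the monitoring stack directly. The sketch below uses the standard Prometheus HTTP API (/api/v1/query); the Prometheus address and the metric name gitaly_rpc_errors_total are hypothetical placeholders, so substitute whatever the production dashboards and alerts are actually built on:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Hypothetical Prometheus address and metric name; replace both with
	// whatever the production Gitaly dashboards are actually built on.
	base := "http://prometheus.example.internal:9090"
	query := `sum(rate(gitaly_rpc_errors_total[5m]))`

	resp, err := http.Get(base + "/api/v1/query?query=" + url.QueryEscape(query))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}

	// Inspect (or parse) the JSON result to confirm the error rate stays at
	// its pre-change baseline after the flags are enabled.
	fmt.Println(resp.Status)
	fmt.Println(string(body))
}
```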

DOING THE CHANGE

Preparatory steps

  • Copy/paste items here from the Preparatory Steps listed above.
  • Perform any pre-checks that are necessary before executing the change.

Initial Tasks

  • Create a Google Doc to track progress. In the event of an outage, Google Docs allows real-time collaboration and does not depend on GitLab.com being available.
    • Add a link to the originating issue, and copy in the issue description and the steps to follow.
    • Title the steps section "Timeline". Use UTC without daylight saving time so that everyone works in the same timezone.
    • Link the document in the on-call log so it's easy to find later.
    • Right before starting the change, paste the link to the Google Doc in the #production chat channel and "pin" it.
  • Discuss the plan with the person introducing the change and fill any gaps in understanding before starting.
  • Final check of the rollback plan and communication plan.
  • Set a PagerDuty maintenance window before starting the change (see the API sketch below).
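
The maintenance window can be opened through the PagerDuty REST API rather than the web UI. A rough sketch, assuming the v2 POST /maintenance_windows endpoint; the API token, the requester email, and the service ID are placeholders, and the payload fields should be checked against the current PagerDuty API documentation:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

func main() {
	start := time.Now().UTC()
	end := start.Add(4 * time.Hour) // matches the 4 hr calendar block

	// Placeholder service ID and description; the payload shape follows the
	// PagerDuty REST API v2 maintenance_windows resource.
	payload := fmt.Sprintf(`{
	  "maintenance_window": {
	    "type": "maintenance_window",
	    "start_time": %q,
	    "end_time": %q,
	    "description": "Enable Gitaly feature flags in production",
	    "services": [{"id": "PXXXXXX", "type": "service_reference"}]
	  }
	}`, start.Format(time.RFC3339), end.Format(time.RFC3339))

	req, err := http.NewRequest("POST", "https://api.pagerduty.com/maintenance_windows",
		bytes.NewBufferString(payload))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Accept", "application/vnd.pagerduty+json;version=2")
	req.Header.Set("Authorization", "Token token=REDACTED") // placeholder API token
	req.Header.Set("From", "oncall@example.com")            // placeholder requester

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("PagerDuty responded with", resp.Status)
}
```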

The Change

  • Before starting the Change
    • Tweet to publicly notify that you are performing a change in production following the guidelines.
  • Start running the changes. One person makes the change while the other takes notes on when each step happens. Make it explicit beforehand who will do what.
  • When the change is finished, whether successful or not:
    • Tweet again to notify that the change is finished and point to the change issue.
    • Copy the content of the document back into the issue, redacting any data necessary to keep it blameless, then deprecate the doc.
    • In a new issue, perform a quick postmortem following the Blameless Postmortem guideline in the infrastructure handbook.
    • If the issue caused an outage or service degradation, label the issue as "outage".