Change: enable several Gitaly feature flags in production
PLANNING THE CHANGE
For more background on when this template should be used, see the infrastructure handbook.
- Context: What is the background of the change? Relevant links?
  - Enable the following Gitaly feature flags in production: GITALY_BRANCH_NAMES, GITALY_TAG_NAMES, GITALY_FIND_REF_NAME, GITALY_IS_ANCESTOR, GITALY_ROOT_REF (see the sketch after this block)
  - These flags are related to the following migrations: gitlab-org/gitaly#211 (closed), gitlab-org/gitaly#210 (closed), gitlab-org/gitaly#215 (closed)
  - They have already been tested individually in production. Evidence of this testing is available in the above issues.
  - Since the testing was successful, we plan to enable them permanently while monitoring for issues.
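For reference, these flags are plain environment variables read by the web workers. Below is a minimal sketch of what "enabled" means on a single host, assuming a flag counts as enabled simply by being set to a truthy value; the authoritative values and mechanism are whatever the Chef-managed MR applies.

```shell
# Illustrative only: in the real change these variables are laid down by Chef,
# not exported by hand, and the expected value is whatever the MR specifies.
export GITALY_BRANCH_NAMES=true
export GITALY_TAG_NAMES=true
export GITALY_FIND_REF_NAME=true
export GITALY_IS_ANCESTOR=true
export GITALY_ROOT_REF=true
```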
- Downtime: Will the change introduce downtime, and if so, how much?
  - There may be some intermittent 502 errors due to restarting Unicorns.
  - We consider this change to have a low risk of causing downtime.
  - What options were considered to avoid downtime?
    - These features are enabled in staging.
    - We will test in the canary environment (MR).
    - We have tested in production individually.
  - What is the downtime estimate based on? Can it be tested in some way?
    - See above.
- People:
  - @ahmadsherif will carry out the change.
  - @andrewn will monitor and handle communications.
- Pre-checks: What should we check before starting with the change? Consider dashboards, metrics, limits of current infrastructure, etc.
  - Alerting and dashboards are configured.
  - Check that you have all the correct versions of the required software installed on the affected hosts.
    - We will ensure that the correct version of Gitaly is running on all hosts before performing the change (see the version-check sketch after this block).
  - Check that you have the right access level to the required resources.
    - Since @ahmadsherif has already carried out this change, this should not be a problem.
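As a sketch of that version pre-check, assuming the affected hosts run omnibus GitLab and are reachable through Chef via a hypothetical `roles:gitaly` search (the real role names and paths live in the Chef repo):

```shell
# Hypothetical pre-check; the role name and manifest path are assumptions.
# Show the Gitaly component version shipped with the installed omnibus package on each host.
knife ssh 'roles:gitaly' "grep -i '^gitaly ' /opt/gitlab/version-manifest.txt"

# Confirm the gitaly service is actually running on each host.
knife ssh 'roles:gitaly' 'sudo gitlab-ctl status gitaly'
```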
- Change Procedure:
  - List the steps that are needed for the change; be as granular as possible.
    - @ahmadsherif to create MR for change
    - @ahmadsherif to have MR reviewed and approved by a Senior Production Engineer
    - @ahmadsherif to merge change into production
    - @ahmadsherif to use knife to apply this change to the web worker fleet (see the sketch after this block)
    - @ahmadsherif to systematically restart workers to apply the change, one by one
    - @andrewn to monitor dashboards and check for alerts
Did you do a dry run to test / measure performance and timings? - Yes, we have tested these changes. See gitlab-org/gitaly#211 (closed), gitlab-org/gitaly#210 (closed), gitlab-org/gitaly#215 (closed)
-
-
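A sketch of the knife steps above, assuming the web workers are omnibus-based Chef nodes matched by a hypothetical `roles:gitlab-web` search, and with a placeholder host list standing in for the real fleet (in practice the list would come from Chef): converge so the new GITALY_* variables are written out, then restart Unicorn one host at a time while watching dashboards.

```shell
# Hypothetical sketch: the role name and host list are placeholders for the real fleet.

# Converge the web workers so Chef writes out the new GITALY_* environment variables.
knife ssh 'roles:gitlab-web' 'sudo chef-client'

# Restart Unicorn one host at a time so only a small slice of capacity is out at once.
WEB_WORKERS="web01.example.com web02.example.com web03.example.com"
for host in ${WEB_WORKERS}; do
  echo "Restarting unicorn on ${host}"
  knife ssh "name:${host}" 'sudo gitlab-ctl restart unicorn'
  sleep 60  # pause to check dashboards and alerts before moving to the next host
done
```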
- Preparatory Steps: What can be done ahead of time? How far ahead?
  - We have tested each change individually: gitlab-org/gitaly#211 (closed), gitlab-org/gitaly#210 (closed), gitlab-org/gitaly#215 (closed).
  - Once you have determined what these may be, add them to the "Preparatory Steps" section below under "Doing the Change".
- Post-checks: What should we check after the change has been applied?
  - We will monitor dashboards and ensure that no alerts are generated.
  - How do we know the change worked well? How do we know it did not work well? What monitoring do we need to pay attention to? How do we verify data integrity?
    - We will monitor dashboards and alerts.
  - Should any alerts be modified as a consequence of this change?
    - Alerts have already been configured.
- Rollback procedure: In case things go wrong, what do we need to do to recover? Also consider rolling back from an intermediate step: does the procedure change depending on how far along the procedure is?
  - Roll back by reverting the change and disabling the feature flag environment variables (see the sketch after this block).
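A sketch of that rollback, under the same assumptions as the procedure sketch above (hypothetical `roles:gitlab-web` role, omnibus-managed workers): revert the MR so Chef stops setting the flags, re-converge, and restart the workers so the processes come back without the GITALY_* variables in their environment.

```shell
# Hypothetical rollback sketch; revert the merge request in the Chef repo first.

# Re-converge so the GITALY_* variables are removed from the workers' environment.
knife ssh 'roles:gitlab-web' 'sudo chef-client'

# Restart Unicorn with a concurrency of one so capacity drains gradually, as in the rollout.
knife ssh -C 1 'roles:gitlab-web' 'sudo gitlab-ctl restart unicorn'
```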
- Create an invite using a 4 hr block of time on the "GitLab Production" calendar (link in handbook), inviting the ops-contact group. Include a link to the issue. (Many times you will not expect to need - or actually need - all 4 hrs, but past experience has shown that delays and unexpected events are more likely than having things go faster than expected.)
- Ping the Production Lead in this issue to coordinate who should be present from the Production team, and to confirm scheduling.
- When will this occur? Leave blank until scheduled.
- Communication plan:
  - Tweet: immediately before the change, tweet: "We're switching several git endpoints across to Gitaly. No downtime is expected. See https://gitlab.com/gitlab-com/infrastructure/issues/1951 for details."
  - Deploy banner: display a warning 1 hr before.
  - Tweet: after the change: "We've successfully completed our Gitaly change. Everything is working as expected. See https://gitlab.com/gitlab-com/infrastructure/issues/1951 for details."
DOING THE CHANGE
Preparatory steps
- Copy/paste items here from the Preparatory Steps listed above.
- Perform any pre-check that is necessary before executing the change.
Initial Tasks
- Create a Google doc to track the progress. In the event of an outage, Google docs allow for real-time collaboration and don't depend on GitLab.com being available.
  - Add a link to the issue it comes from, and copy and paste the content of the issue, the description, and the steps to follow.
  - Title the steps as "timeline". Use UTC time without daylight saving; we are all in the same timezone in UTC.
  - Link the document in the on-call log so it's easy to find later.
  - Right before starting the change, paste the link to the Google doc in the #production chat channel and "pin" it.
- Discuss with the person who is introducing the change, and go through the plan to fill any gaps in understanding before starting.
- Do a final check of the rollback plan and communication plan.
- Set a PagerDuty maintenance window before starting the change.
The Change
- Before starting the change
  - Tweet to publicly notify that you are performing a change in production, following the guidelines.
- Start running the changes. When this happens, one person makes the change while the other person takes notes of when the different steps happen. Make it explicit who will do what.
- When the change is done and finished, whether successfully or not
  - Tweet again to notify that the change is finished, and point to the change issue.
  - Copy the content of the document back into the issue, redacting any data necessary to keep it blameless, and deprecate the doc.
  - Perform a quick postmortem in a new issue, following the Blameless Postmortem guideline in the infrastructure handbook.
  - If the issue caused an outage or service degradation, label the issue as "outage".