FY22 Q1 Delivery OKR: Reduce Mean Time To Mitigation via rollbacks for software-change related incidents => 30%
Working epic - &282 (closed)
Starting situation
A production incident caused by a software change is mitigated by a fix or revert MR being merged and deployed through to production.
With the current pipeline this takes around 6 hours, covering the following stages: tag, deploy to staging, deploy to canary, canary baking time, deploy to production.
Hot patches exist and can be used to mitigate severe incidents more quickly but introduce significant risk by bypassing the deployment pipeline.
Desired situation
Enable rollbacks to a previous deployment to allow software change incidents to be mitigated quickly without the need for a hot patch.
Key results
- MTTM reduced by 20%, from 5.6 hours down to 3.8 hours, for incidents labelled ~RootCause::Software-Change
- The number of hot patches applied to production reduced to 0
- Frequency of deploys is unaffected by rollbacks, measured through no negative impact on MTTP
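As a rough illustration of how the MTTM key result could be tracked, the sketch below averages time to mitigation over incidents carrying the ~RootCause::Software-Change label. The `Incident` record, its field names, the `RootCause::Config-Change` label, and the sample timestamps are all hypothetical and made up for the example; they are not taken from Delivery's actual tooling or incident data.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Incident:
    # Hypothetical record shape; the field names are assumptions.
    started_at: datetime
    mitigated_at: datetime
    labels: frozenset


def mean_time_to_mitigation_hours(incidents, label="RootCause::Software-Change"):
    """Return the mean hours from start to mitigation for incidents with `label`.

    Returns None when no incident carries the label.
    """
    durations = [
        (i.mitigated_at - i.started_at).total_seconds() / 3600
        for i in incidents
        if label in i.labels
    ]
    if not durations:
        return None
    return sum(durations) / len(durations)


# Made-up sample data, for illustration only.
incidents = [
    Incident(datetime(2021, 2, 1, 9, 0), datetime(2021, 2, 1, 15, 0),
             frozenset({"RootCause::Software-Change"})),  # 6 hours
    Incident(datetime(2021, 2, 8, 10, 0), datetime(2021, 2, 8, 12, 0),
             frozenset({"RootCause::Software-Change"})),  # 2 hours
    Incident(datetime(2021, 2, 9, 10, 0), datetime(2021, 2, 9, 11, 0),
             frozenset({"RootCause::Config-Change"})),    # other label, excluded
]

print(mean_time_to_mitigation_hours(incidents))  # 4.0
```

Computing the same figure before and after rollbacks are in regular use would give the comparison the key result asks for, assuming incident start and mitigation times are recorded consistently.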
Status
2021-04-30 - Delivery have been running rollbacks each week on staging to test the rollback pipeline and make sure we know how to recover from pipeline failures. Additional tools and documentation have been created to help check whether a deployment can be rolled back safely. The rollback pipeline has been shortened by removing the assets step and making the Gitaly and Praefect steps optional for rollback. At the end of the quarter a successful production rollback in dry-run mode was completed.
We didn't manage to run a real rollback on production, both because of the number of test cases we chose to run on staging and because scheduling on production was delayed to avoid impacting incidents or releases. Because of this we can't yet measure an improvement in MTTM. We don't expect rollbacks to have a negative impact on deployments, but this too will need to be measured over a longer period of time.
Throughout Q1 only 1 hot patch was applied to production. That issue would not have been suitable for a rollback because the change was merged a number of days before it was detected.