Plan for post-deploy migrations through to August 2023
Following last week's incident, we need to help protect Availability for the remainder of July.
One way we can help with this is by reviewing the time and frequency of PDM execution, to give us a better chance of being able to roll back if needed. We normally try to run the PDM once per day, usually around 1300 UTC. Perhaps running the PDM only on Tuesday and Thursday, or even only on Tuesday, at the very start of the EMEA shift would make things safer? This change would mean:
- We run the PDM once or twice a week instead of potentially 5 times. This impacts Development, but it could be an acceptable adjustment if it makes things safer.
- We would deploy at the end of the AMER day, have the build sit for ~8 hours (for APAC), and then run PDM. In theory, this gives enough time for us to decide whether the previous deploy is stable before moving on.
We could also consider always rolling back production if we suspect a software change is the cause of an incident. In our normal approach, we try to confirm this before rolling back, which can add 30-90 minutes to the incident duration if the deployment does turn out to be the cause.
@ahyield @skarbek @mbursi - what are your thoughts on these ideas? Are there any other ways we could help protect Availability?
Post Deployment Migration Plan until 31st of July 2023
The PDM pipeline will be executed twice a week, on the following dates:
- 2023-07-13 (delayed to 2023-07-14 due to incidents)
- 2023-07-18 - to be run twice due to the need to reduce risk on the upcoming release - #19475 (comment 1475274593)
- 2023-07-19 (exception: run the PDM for tagging the RC)
- 2023-07-20
- 2023-07-25
- 2023-07-27
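
As a rough cross-check of the regular dates above, here is a minimal sketch that lists every Tuesday and Thursday in the plan window. It assumes the Tuesday/Thursday cadence suggested in the discussion and a window of 2023-07-13 to 2023-07-31. The function name `pdm_dates` is hypothetical and not part of any release tooling. Delays and exceptions (such as the 2023-07-19 run) are not modelled.

```python
from datetime import date, timedelta

# Assumption: the regular PDM runs happen on Tuesdays and Thursdays only.
PDM_WEEKDAYS = {1, 3}  # date.weekday(): Monday is 0, so Tuesday=1, Thursday=3


def pdm_dates(start: date, end: date) -> list[date]:
    """Return every Tuesday and Thursday from start to end, inclusive."""
    days = (end - start).days
    candidates = (start + timedelta(days=i) for i in range(days + 1))
    return [d for d in candidates if d.weekday() in PDM_WEEKDAYS]


schedule = pdm_dates(date(2023, 7, 13), date(2023, 7, 31))
print([d.isoformat() for d in schedule])
```

The result covers only the regular cadence, so any one-off runs still need to be added by hand.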
For any questions, please mention @release-managers in Slack or use the GitLab handle @gitlab-org/release/managers in issues.