Production Readiness Review of Packer-based OS upgrade process
We are switching to a new process that brings some risks with it. One function of a readiness review is to document those risks and the considerations around the new process.
We should document:
- Motivation for image-based process (reducing unavailability during maintenance).
- High-level design of the new process. This can also link to https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/blob/master/packer/README.md and/or other docs.
- Risks this new process introduces, and how we are mitigating them.
  - Residue left on Packer images during the image building process, mitigated by overriding the hostname.
  - Interference with the deploy process, mitigated by coordinating with delivery to halt deploys.
  - Stale images, mitigated by building a fresh image right before deploying it.
  - Unexpected failures, mitigated by snapshotting the old disks, a packertest canary, and the ability to fall back to the legacy full-rebuild process.
  - Operator error, mitigated by strong automation and idempotent design.
Those are the main ones that come to mind, but perhaps you have some more ideas.
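
As a rough illustration of the stale-image mitigation, the automation could refuse to deploy an image that is older than some threshold. This is only a sketch: the function name `is_image_fresh`, the one-hour `MAX_IMAGE_AGE`, and the idea of passing in the image build timestamp are all assumptions for illustration, not part of the existing tooling.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold; the actual acceptable image age is not defined here.
MAX_IMAGE_AGE = timedelta(hours=1)


def is_image_fresh(built_at, now=None, max_age=MAX_IMAGE_AGE):
    """Return True if the image was built within max_age before `now`.

    `built_at` and `now` are timezone-aware datetimes. A build time in the
    future is treated as not fresh, since it suggests bad metadata.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    age = now - built_at
    return timedelta(0) <= age <= max_age
```

A guard like this is safe to re-run, which fits the idempotent design goal: calling it repeatedly changes nothing, and a failed check simply means "rebuild the image and try again".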
Edited by Igor