Develop Post Deploy Gates for Container Registry Releases
Context
In gitlab-com/gl-infra/production#14263 (comment 1390610864), it was identified that there is no enforced check between deploying a new version of the container registry to staging and promoting it to production. We should develop a procedure that must be performed against the new registry version before moving on to the next phase in the release workflow, e.g., from staging to production canary, and from production canary to production main stage.
Problem
We'll need to account for what is already automated so as not to duplicate effort, but we'll need to ensure the following are covered:
- Push/Pull tests
- Pulling a known good image
- Ensuring container registry pages list images as expected, e.g., images present and not showing "published just now"
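As a rough illustration of how these checks could be wired together, the sketch below runs a set of named checks and reports any that found problems. Everything in it is hypothetical: `ImageEntry`, `listing_problems`, `run_gate`, and the five-minute freshness window are assumptions made for the example, not existing tooling, and the push/pull checks are left as placeholders to be backed by whatever automation already exists.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional


@dataclass
class ImageEntry:
    """One row from a registry listing page (hypothetical shape)."""
    name: str
    published_at: datetime


def listing_problems(
    entries: list[ImageEntry],
    expected_names: set[str],
    now: Optional[datetime] = None,
    fresh_window: timedelta = timedelta(minutes=5),  # assumed threshold
) -> list[str]:
    """Return human-readable problems found in a listing; empty means OK."""
    now = now or datetime.now(timezone.utc)
    problems = []
    present = {e.name for e in entries}
    for missing in sorted(expected_names - present):
        problems.append(f"missing image: {missing}")
    for e in entries:
        # A long-lived image that looks freshly published suggests its
        # timestamp was lost, which a UI might render as "published just now".
        if e.name in expected_names and now - e.published_at < fresh_window:
            problems.append(f"unexpectedly fresh timestamp: {e.name}")
    return problems


def run_gate(checks: dict[str, Callable[[], list[str]]]) -> dict[str, list[str]]:
    """Run every check and return only the ones that reported problems."""
    failures = {}
    for name, check in checks.items():
        try:
            problems = check()
        except Exception as exc:  # a crashing check counts as a failure
            problems = [f"check raised {exc!r}"]
        if problems:
            failures[name] = problems
    return failures
```

A gate would pass only when `run_gate` returns an empty dict. Push/pull and known-good-image checks would be added as further entries in the `checks` mapping, each returning a list of problems.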
In addition to those hard and fast rules, we should also consider creating heuristics that could indicate a problem. The issue with this is that heuristics based on looking for differences in behavior between registry versions are going to catch both expected and unexpected behavior. A container registry maintainer should be able to quickly determine the difference; however, this starts to tie the deployment process to the few registry maintainers even more than it already is. Ideally, we have checks and alerts that are completely automated, and if that's not possible, we have processes that any SRE can use to determine if there is an issue in the registry release.
Also, we should determine what phases a release moves through and what checks "gate" the progress to the next step. For example, should we separate out pre and staging and run different checks on these environments? Or should we continue to release these together and perform our checks against staging, which is the most similar to production, having been gradually migrated, unlike pre, which was migrated in place?
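One way to picture the phase and gate relationship is an ordered list of phases with a set of checks attached to each. The sketch below is only an assumption about how this might look: the phase names, their order, and the `next_phase` helper are hypothetical, and treating pre and staging as separate phases is one of the open options above, not a decision.

```python
from typing import Callable, Optional

# Assumed ordering; whether "pre" and "staging" are separate phases is open.
PHASES = ["pre", "staging", "production-canary", "production-main"]

# A check takes the phase name and returns True when it passes.
Check = Callable[[str], bool]


def next_phase(current: str, gates: dict[str, list[Check]]) -> Optional[str]:
    """Return the phase to promote to, or None if promotion is blocked
    or the release is already in the final phase."""
    idx = PHASES.index(current)
    if idx == len(PHASES) - 1:
        return None  # nothing left to promote to
    checks = gates.get(current)
    if not checks:
        # No checks registered means nothing is enforced, so block.
        return None
    if not all(check(current) for check in checks):
        return None
    return PHASES[idx + 1]
```

Blocking when a phase has no registered checks is a deliberate choice in this sketch: it keeps a missing gate from silently behaving like a passing one.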