Review apps hitting deployment limit
Incident description
On May 22, the Handbook and internal Handbook projects hit the error "Namespace reached its allowed limit of 500 extra deployments" in the pages_deploy job. This caused all MR pipelines on these projects to fail.
The incident was related to the use of the experimental Pages Multiple deployments feature. Using this feature, the projects were configured so that every MR gets a separate, "versioned" (aka "prefixed") Pages deployment that functions as a review app.
Cause
The number of versioned/prefixed Pages deployments is limited to 500 per Namespace, in this case the gitlab-com/content-sites group.
The investigation found that the Pages deployments of MRs were not deleted when the MR was closed or merged, so they accumulated until the limit was reached.
Further investigation by @janis determined that the root cause was the call to Pages::DeactivateMrDeploymentsWorker in the MR model's base_service, which passed the MR instance instead of the MR ID that the worker expected.
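The root cause above can be illustrated with a minimal Python sketch. It is not the GitLab code: the MergeRequest class, the DEPLOYMENTS store, and the enqueue and deactivate_mr_deployments functions are hypothetical stand-ins, and the failure mode shown (the worker quietly finding nothing to clean up) is an assumption made for illustration. It assumes a background queue that serializes job arguments as JSON and a worker that looks deployments up by integer MR ID.

```python
import json
from dataclasses import dataclass


@dataclass
class MergeRequest:
    """Hypothetical stand-in for an MR record."""
    id: int
    title: str


# Hypothetical in-memory store: MR ID -> path prefixes of its Pages deployments.
DEPLOYMENTS = {42: ["mr-42"]}


def enqueue(arg):
    """Mimic a job queue that serializes arguments as JSON.

    Objects that are not JSON-serializable are stringified (an assumption).
    """
    return json.dumps(arg, default=str)


def deactivate_mr_deployments(payload):
    """Worker body: expects an integer MR ID and removes that MR's deployments."""
    mr_id = json.loads(payload)
    if isinstance(mr_id, bool) or not isinstance(mr_id, int):
        return []  # not an ID, so nothing is found and nothing is cleaned up
    return DEPLOYMENTS.pop(mr_id, [])


mr = MergeRequest(id=42, title="Update handbook page")

# Buggy call: the whole MR object is enqueued, so the worker never sees an ID.
removed_buggy = deactivate_mr_deployments(enqueue(mr))

# Fixed call: only the MR ID is enqueued.
removed_fixed = deactivate_mr_deployments(enqueue(mr.id))
```

In this sketch the buggy call removes nothing and raises no error, which is one way deployments could pile up unnoticed until a limit is reached. Passing plain identifiers rather than objects to background jobs avoids this class of problem.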
Resolution
@janis created gitlab-org/gitlab!153965 (merged) to fix the underlying issue.
Deleting the excess Pages deployments whose MRs were already merged required a one-time manual intervention, because the worker would not catch MRs that were already closed. To facilitate this, @janis prioritised an already planned MR that adds a mutation allowing users to delete Pages deployments: gitlab-org/gitlab!153981 (merged). @leetickett-gitlab subsequently used this mutation to delete the orphaned Pages deployments.
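The one-time cleanup amounts to finding deployments whose MR is no longer open. A minimal sketch of that selection step follows, assuming the deployments and MR states have already been fetched. The Deployment class, the state names, and the sample data are hypothetical, and the deletion itself (done through the new mutation) is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """Hypothetical record of one prefixed Pages deployment."""
    path_prefix: str
    mr_id: int


def orphaned(deployments, mr_states):
    """Return deployments whose MR is merged or closed.

    Deployments whose MR state is unknown are kept, to err on the safe side.
    """
    finished = {"merged", "closed"}
    return [d for d in deployments if mr_states.get(d.mr_id) in finished]


# Illustrative data only.
deployments = [
    Deployment("mr-101", 101),
    Deployment("mr-102", 102),
    Deployment("mr-103", 103),
    Deployment("mr-104", 104),
]
mr_states = {101: "merged", 102: "opened", 103: "closed"}  # 104 is unknown

to_delete = orphaned(deployments, mr_states)
```

Here to_delete contains the deployments for MRs 101 and 103 only. The deployment for the open MR and the one with no known state are left alone.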