Review Apps have a lot of dependencies that make them brittle and slow to deploy
Currently, before we can deploy a Review App (`review-deploy` in the `test` stage), we need to:
- Wait for the `gitlab:assets:compile` job (`test` stage) to finish; it takes between 16 and 19 minutes. If this job fails, we continue the deploy nonetheless (the only downside is that if the MR changes assets, they won't be reflected in the deployed Review App).
- Trigger a build in `CNG-mirror` to build the CNG images; it takes between 3 and 16 minutes depending on the MR changes and the CNG cache. If this job fails, the deploy fails, because we won't have the necessary CNG images to deploy the Review App.
- Start the deploy using Helm; this takes between 3 and 12 minutes.
- The deploy consists of several steps, each of which can fail for various reasons (see a non-exhaustive list in https://gitlab.com/gitlab-org/gitlab-ce/issues/53713). Most of them are transient, and the deploy usually succeeds after a few retries.
- Even if the Helm deploy fails (usually because it hits the 10-minute timeout), the Review App can actually become accessible a few minutes later: pod scheduling can take more than the allowed 10 minutes, again for various reasons depending on many Kubernetes problems.
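Since most of the failures above are transient, the deploy script could wrap its flaky steps in a small retry helper (GitLab CI also has a job-level `retry:` keyword, but that re-runs the whole job rather than one step). A minimal sketch; the `flaky` demo function and the attempt counts are purely illustrative:

```shell
#!/bin/sh
# retry: run a command up to $1 times, pausing between attempts.
retry() {
  attempts=$1
  shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    echo "Attempt $i/$attempts failed, retrying..." >&2
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Demo with a command that fails twice, then succeeds. In the deploy
# script, this would instead wrap e.g. the `helm upgrade --install` call.
count=0
flaky() {
  count=$((count + 1))
  [ "$count" -ge 3 ]
}
retry 5 flaky && echo "succeeded after $count attempts"
```

Wrapping only the individually flaky steps keeps a single slow step from forcing a full redeploy.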
The critical step is really the last one, and I feel that most of the "Kubernetes problems" are related to memory pressure; see https://gitlab.com/gitlab-org/gitlab-ce/issues/53713#note_132724313.
- We should make the `gitlab:assets:compile` job faster (ideally under 10 minutes), so that we can run it in the `prepare` stage and stop using the `wait_for_job_to_be_done "gitlab:assets:compile"` hack (used to depend on a job in the same stage), which adds a point of failure. => https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/24542
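If the compile job gets fast enough to run in the `prepare` stage, the dependency can be expressed by plain stage ordering plus artifacts instead of polling. A hypothetical `.gitlab-ci.yml` shape (the `script:` lines and artifact paths are illustrative, not the actual configuration):

```yaml
stages:
  - prepare
  - test

gitlab:assets:compile:
  stage: prepare              # moved out of the test stage
  script:
    - bundle exec rake gitlab:assets:compile
  artifacts:
    paths:
      - public/assets/        # compiled assets handed to later stages

review-deploy:
  stage: test
  dependencies:
    - gitlab:assets:compile   # plain stage ordering, no polling hack
  script:
    - scripts/review_apps/review-apps.sh deploy
```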
- We should be able to move the `CNG-mirror` triggering to the `prepare` stage too, but if the MR updates any assets, that would mean the CNG images wouldn't use these ones unless we make the `CNG-mirror` triggering job dependent on `gitlab:assets:compile` (but then we'd still need to use the `wait_for_job_to_be_done` hack).
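Triggering `CNG-mirror` from `prepare` could go through the pipeline triggers API; a hypothetical job sketch, where the variable names and the passed-through variables are placeholders rather than the actual trigger contract:

```yaml
cng-mirror-trigger:
  stage: prepare
  script:
    - >
      curl --fail --request POST
      --form "token=${CNG_MIRROR_TRIGGER_TOKEN}"
      --form "ref=${CI_COMMIT_REF_NAME}"
      --form "variables[GITLAB_VERSION]=${CI_COMMIT_SHA}"
      "https://gitlab.com/api/v4/projects/${CNG_MIRROR_PROJECT_ID}/trigger/pipeline"
```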
- We could use one node per Review App to ensure that no node is under memory pressure; how would we do that with Helm / the chart, though?
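In Kubernetes terms, "one Review App per node" could be expressed with required pod anti-affinity against pods from other Helm releases, assuming the chart labels pods with their release name (whether the chart exposes such a knob is exactly the open question here). A purely illustrative pod spec fragment:

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: release                    # assumes pods carry a `release` label
              operator: NotIn
              values: ["{{ .Release.Name }}"]
        topologyKey: kubernetes.io/hostname   # i.e. never share a node with other releases
```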
- It could happen that `tiller` sometimes fails, potentially because of the memory pressure issues, so it could be beneficial to put the `tiller` pods on a dedicated node.
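Helm 2's `helm init` has a `--node-selectors` flag, which would constrain the `tiller-deploy` Deployment to labelled nodes and produce something like the following spec (the `dedicated=tiller` label is made up, and would first have to be applied to the chosen node with `kubectl label node`):

```yaml
# kube-system/tiller-deploy, after `helm init --upgrade --node-selectors "dedicated=tiller"`
spec:
  template:
    spec:
      nodeSelector:
        dedicated: tiller   # hypothetical label on the dedicated node
```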