Review Apps have a lot of dependencies that make them brittle and slow to deploy
## Problems
Currently, before we can deploy a Review App (`review-deploy` in the `test` stage), we need to:
- Wait for the `gitlab:assets:compile` job (`test` stage) to finish; it takes between 16 and 19 minutes. If this job fails, we continue the deploy nonetheless (the only downside is that if the MR changes assets, those changes won't be reflected in the deployed Review App).
- Trigger a build in `CNG-mirror` to build the CNG images; it takes between 3 and 16 minutes depending on the MR changes and the CNG cache. If this job fails, the deploy fails because we won't have the necessary CNG images to deploy the Review App.
- Start the deploy using Helm, which takes between 3 and 12 minutes.
- The deploy consists of several steps, which can individually fail for various reasons (see a non-exhaustive list in https://gitlab.com/gitlab-org/gitlab-ce/issues/53713); most of them are transient, and the deploy usually succeeds after a few retries.
- Even if the Helm deploy fails (usually because it hits the 10-minute timeout), the Review App can actually become accessible after a few minutes, since pod scheduling can take more than the allowed 10 minutes, again for various reasons tied to Kubernetes problems (see the sketch below).
The critical step is really the last one (the Helm deploy), and I feel that most of the "Kubernetes problems" are related to memory pressure; see https://gitlab.com/gitlab-org/gitlab-ce/issues/53713#note_132724313.
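For illustration, here's a minimal bash sketch of the retry-and-poll behavior described above. The release name, namespace, chart reference, and app URL are assumptions made up for the example; the `helm upgrade` flags are standard Helm 2 options.

```shell
#!/usr/bin/env bash
# Sketch only: retry the Helm deploy, and keep polling the Review App URL
# even after a Helm timeout, since pods may still get scheduled eventually.
# Release name, namespace, chart, and URL below are illustrative assumptions.

RELEASE="review-${CI_COMMIT_REF_SLUG}"
NAMESPACE="review-apps"
APP_URL="https://${CI_ENVIRONMENT_SLUG}.example.com"

deploy() {
  # Helm 2: --timeout is in seconds; 600 matches the 10-minute timeout
  # mentioned above, and --wait makes Helm wait for the pods to be ready.
  helm upgrade --install --wait --timeout 600 \
    --namespace "${NAMESPACE}" "${RELEASE}" gitlab/gitlab
}

# Most deploy failures are transient, so retry a few times.
for attempt in 1 2 3; do
  echo "Deploy attempt ${attempt}..."
  deploy && break
done

# Even if Helm timed out, the app may become reachable a few minutes later
# once the pods are finally scheduled, so poll before declaring failure.
for _ in $(seq 1 20); do
  if curl --silent --fail --output /dev/null "${APP_URL}"; then
    echo "Review App is reachable."
    exit 0
  fi
  sleep 30
done

echo "Review App never became reachable." >&2
exit 1
```

Polling after a failed `helm upgrade --wait` is what makes the 10-minute timeout less of a hard failure: the release objects are already created, so the pods can still come up afterwards.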
## Potential solutions
- We should make the `gitlab:assets:compile` job faster (ideally under 10 minutes), so that we can run it in the `prepare` stage and stop using the `wait_for_job_to_be_done "gitlab:assets:compile"` hack (used to depend on a job in the same stage), which adds a point of failure (see the sketch after this list). => https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/24542
- We should be able to move the `CNG-mirror` triggering to the `prepare` stage too, but if the MR updates any assets, the CNG images wouldn't use them unless we make the `CNG-mirror` triggering job dependent on `gitlab:assets:compile` (but then we'd still need to use `wait_for_job_to_be_done`...).
- We could use 1 node per Review App to ensure that each node isn't under memory pressure. How would we do that with Helm / the chart, though?
- It could be that `tiller` sometimes fails because of the memory pressure issues, so it could be beneficial to put the `tiller` pods on a dedicated node (see the second sketch after this list).
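As context for the first bullet, the `wait_for_job_to_be_done` hack is essentially a polling loop against the current pipeline's jobs API, since a job can't formally depend on another job in the same stage. A hypothetical reconstruction (assuming an `API_TOKEN` variable with API access and `jq` in the image; the real helper may differ):

```shell
#!/usr/bin/env bash
# Hypothetical reconstruction of the wait_for_job_to_be_done hack: poll the
# current pipeline's jobs API until the named job (in the same stage, so not
# expressible as a regular stage dependency) reaches a terminal status.

function wait_for_job_to_be_done() {
  local job_name="${1}"
  local jobs_url="${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/pipelines/${CI_PIPELINE_ID}/jobs"
  local status

  while true; do
    status=$(curl --silent --header "PRIVATE-TOKEN: ${API_TOKEN}" \
      "${jobs_url}?per_page=100" |
      jq --raw-output ".[] | select(.name == \"${job_name}\") | .status")

    case "${status}" in
      success) return 0 ;;
      failed | canceled) return 1 ;;
      *) echo "Waiting for ${job_name} (status: ${status:-unknown})..."; sleep 30 ;;
    esac
  done
}

# Example: block until the assets job has finished, but don't fail the deploy
# if it didn't succeed (as described in the Problems section above).
wait_for_job_to_be_done "gitlab:assets:compile" || true
```

Moving `gitlab:assets:compile` to the `prepare` stage turns this into a normal stage dependency and removes this whole class of failure, which is what the MR linked above is about.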
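On the last bullet, Helm 2's `helm init` has a `--node-selectors` flag, so dedicating a node to `tiller` could look like the sketch below. The node name and label are hypothetical, and keeping Review App pods *off* that node would additionally require a taint plus a matching toleration on the Tiller deployment (omitted here):

```shell
# Label a node reserved for Tiller (node name and label are hypothetical).
kubectl label nodes gke-review-apps-default-pool-12345 dedicated=tiller

# Re-deploy Tiller with a node selector so its pods land on that node.
# --node-selectors is a standard helm init (Helm 2) flag.
helm init --upgrade --service-account tiller --node-selectors "dedicated=tiller"
```

The 1-node-per-Review-App idea is harder: node affinity would presumably need to be set per release through the chart's values, so it depends on what the chart exposes.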