Review Apps have a lot of dependencies that make them brittle and slow to deploy

Problems

Currently, before we can deploy a Review App (review-deploy in the test stage), we need to:

Wait for the gitlab:assets:compile (test stage) job to finish, which takes between 16 and 19 minutes. If this job fails, we continue the deploy nonetheless (the only downside is that if the MR changes assets, those changes won't be reflected in the deployed Review App).
Trigger a build in CNG-mirror to build the CNG images, which takes between 3 and 16 minutes depending on the MR changes and the CNG cache. If this job fails, the deploy fails because we won't have the necessary CNG images to deploy the Review App.
Start the deploy using Helm, which takes between 3 and 12 minutes.
- The deploy consists of several steps, each of which can fail for various reasons (see a non-exhaustive list in https://gitlab.com/gitlab-org/gitlab-ce/issues/53713). Most of these failures are transient, and the deploy usually succeeds after a few retries.
- Even if the Helm deploy fails (usually because it hits the 10-minute timeout), the Review App can actually become accessible a few minutes later (pod scheduling can take more than the allowed 10 minutes, again for various reasons depending on many Kubernetes problems).
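
To make the cost of this chain concrete, here is a minimal sketch that adds up the durations listed above. It assumes the three steps run strictly one after the other and ignores retries and queueing time, so it is only a rough estimate; the names `STEPS` and `total_wait` are made up for illustration.

```python
# Durations in minutes as (min, max), taken from the list above.
STEPS = {
    "gitlab:assets:compile": (16, 19),
    "CNG-mirror build": (3, 16),
    "Helm deploy": (3, 12),
}


def total_wait(steps):
    """Return (best, worst) total minutes, assuming the steps run sequentially."""
    best = sum(lo for lo, _ in steps.values())
    worst = sum(hi for _, hi in steps.values())
    return best, worst


best, worst = total_wait(STEPS)
print(f"best case: {best} min, worst case: {worst} min")
# prints "best case: 22 min, worst case: 47 min"
```

Under these assumptions, a Review App is available somewhere between 22 and 47 minutes after the test stage starts, before counting any retry of the Helm deploy.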

The critical step is really the last one, and I feel that most of the "Kubernetes problems" are related to memory pressure (see https://gitlab.com/gitlab-org/gitlab-ce/issues/53713#note_132724313).

We should make the gitlab:assets:compile job faster (ideally under 10 minutes), so that we can run it in the prepare stage and stop using the wait_for_job_to_be_done "gitlab:assets:compile" hack (used to depend on a job in the same stage), which adds a point of failure. => https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/24542
We should be able to move the CNG-mirror triggering to the prepare stage too, but if the MR updates any assets, the CNG images wouldn't include the updated assets unless we make the CNG-mirror triggering job dependent on gitlab:assets:compile (but then we'd still need to use wait_for_job_to_be_done...).
We could use one node per Review App to ensure that no node is under memory pressure. How would we do that with Helm / the chart, though?
Tiller may sometimes fail, potentially because of the memory pressure issues, so it could be beneficial to put the Tiller pods on a dedicated node.
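
For readers unfamiliar with the wait_for_job_to_be_done hack mentioned above, the following is a rough sketch of what polling for a same-stage job looks like in general. It is not the actual implementation from the review apps script: `wait_for_job`, `fetch_status`, and the interval and timeout values are hypothetical, and the status lookup is injected so the sketch runs without calling any API.

```python
import time


def wait_for_job(fetch_status, interval_s=30, timeout_s=1800, sleep=time.sleep):
    """Poll fetch_status() until the job reaches a terminal state or time runs out.

    fetch_status is a hypothetical callable returning the job status as a string.
    Returns the final status, or "timeout" if timeout_s elapses first.
    """
    terminal = {"success", "failed", "canceled", "skipped"}
    waited = 0
    while waited < timeout_s:
        status = fetch_status()
        if status in terminal:
            return status
        sleep(interval_s)
        waited += interval_s
    return "timeout"


# Usage with canned statuses instead of a real API call.
statuses = iter(["pending", "running", "running", "success"])
result = wait_for_job(lambda: next(statuses), sleep=lambda _: None)
print(result)  # prints "success"
```

Every poll is a place where the deploy job can break (a failed status lookup, or a timeout while the other job is still queued), which is the extra point of failure that moving gitlab:assets:compile to an earlier stage would remove.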

Edited Feb 01, 2019 by Rémy Coutable