Improve reliability of CI/CD
We recently experienced some performance problems related to infrastructure. They revealed issues that it would be great to discuss and find solutions for.
I believe that we should increase CI fault tolerance.
---
Pipeline Unlock Worker
We started working on the Pipeline Unlock Worker (https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/6988), which is a cron job meant to unlock pipelines that got stuck due to Sidekiq problems.
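For illustration, such a worker could be scheduled the same way existing cleanup workers like `stuck_ci_builds_worker` are, through the `cron_jobs` section of `gitlab.yml`. This is only a sketch; the entry name and interval below are assumptions on my part, not what the merge request actually implements:

```yaml
# Hypothetical gitlab.yml entry; the key name and schedule are
# assumptions, mirroring how stuck_ci_builds_worker is configured.
cron_jobs:
  pipeline_unlock_worker:
    cron: "*/30 * * * *"  # look for stuck pipelines every 30 minutes
```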
---
Build retry
When there are infrastructure problems, builds are likely to fail because of them. Typically, a build can fail when `apt-get install` is not able to reach package servers, when it times out while trying to clone dependencies, or when GitLab responds with a 502 while the runner wants to fetch the project. Making build retry a first-class configuration entry may help with that and save a lot of time for endbosses/minibosses, who usually retry such builds manually:
```yaml
test:
  script: bundle exec spinach
  retry: 3
```
Specifying retry for particular commands in the build may be even better, but it is significantly more difficult. See https://gitlab.com/gitlab-org/gitlab-ce/issues/3442 for more details.
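Until something like that exists, a per-command retry can be approximated with a shell one-liner in `script:`. This is just a workaround sketch; `apt-get update` stands in for any command that fails transiently:

```yaml
test:
  script:
    # Retry a transiently failing command twice before letting the
    # build fail; apt-get update is only an example of such a command.
    - apt-get update || (sleep 5; apt-get update) || (sleep 5; apt-get update)
    - bundle exec spinach
```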
---
Circuit breakers in GitLab Runner
- Should we fail the build when the Runner is not able to upload the cache? (A hypothetical configuration knob for this is sketched after this list.)
- Should we fail the build when the Runner is not able to download the cache?
- Should we fail the build when the Runner is not able to download artifacts for implicit dependencies?
- Should we allow a build to be stuck for 24 hours when the Runners Manager crashed and the build is not being updated for a long time?
- Should we monitor Runners somehow, to cancel and retry builds that got stuck when a particular Runner crashed?
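To make the cache questions concrete, one possible answer is to let users decide per job whether cache problems are fatal. The `allow_failure` key under `cache` below is purely hypothetical syntax, not something `.gitlab-ci.yml` supports today:

```yaml
test:
  cache:
    paths:
      - vendor/ruby
    # Hypothetical flag: a failed cache upload or download would be
    # reported as a warning instead of failing the build.
    allow_failure: true
  script: bundle exec spinach
```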
This issue is merely an encouragement to discuss CI reliability a little more. This is a great time to do that, because we have all experienced some problems recently, which gives us a fresh point of view.
This is a meta issue that should be closed as soon as we identify the problems we want to fix and create a separate issue for each of them.
What do you think @ayufan @markpundsack @stanhu?