Improve reliability of CI/CD
We recently experienced some performance problems related to infrastructure. They revealed issues that it would be great to discuss and find solutions for.
I believe that we should increase CI fault tolerance.
---
Pipeline Unlock Worker
We started working on the Pipeline Unlock Worker (https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/6988), which is a cron job meant to unlock pipelines that got stuck due to Sidekiq problems.
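For illustration, such a worker could be scheduled the same way existing cleanup workers like `stuck_ci_builds_worker` are, through the `cron_jobs` section of `gitlab.yml`. This is only a sketch; the entry name and interval below are assumptions on my part, not what the merge request actually implements:

```yaml
# Hypothetical gitlab.yml entry; the key name and schedule are
# assumptions, mirroring how stuck_ci_builds_worker is configured.
cron_jobs:
  pipeline_unlock_worker:
    cron: "*/30 * * * *"  # look for stuck pipelines every 30 minutes
```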
---
Build retry
When there are infrastructure problems, builds are likely to fail because of them. Typically, a build can fail when `apt-get install` is not able to reach package servers, when it times out while trying to clone dependencies, or when GitLab responds with a 502 while the runner wants to fetch the project. Making build retry a first-class configuration entry may help with that and save a lot of time for endbosses/minibosses, who usually retry such builds manually:
```yaml
test:
  script: bundle exec spinach
  retry: 3
```
Specifying retry for particular commands in the build may be even better, but it is significantly more difficult. See https://gitlab.com/gitlab-org/gitlab-ce/issues/3442 for more details.
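Until something like that exists, a per-command retry can be approximated with a shell one-liner in `script:`. This is just a workaround sketch; `apt-get update` stands in for any command that fails transiently:

```yaml
test:
  script:
    # Retry a transiently failing command twice before letting the
    # build fail; apt-get update is only an example of such a command.
    - apt-get update || (sleep 5; apt-get update) || (sleep 5; apt-get update)
    - bundle exec spinach
```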
---
Circuit breakers in GitLab Runner
- Should we fail the build when the Runner is not able to upload the cache? (A hypothetical configuration knob for this is sketched after this list.)
- Should we fail the build when the Runner is not able to download the cache?
- Should we fail the build when the Runner is not able to download artifacts for implicit dependencies?
- Should we allow a build to be stuck for 24 hours when the Runners Manager crashed and the build is not being updated for a long time?
- Should we monitor Runners somehow, to cancel and retry builds that got stuck when a particular Runner crashed?
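To make the cache questions concrete, one possible answer is to let users decide per job whether cache problems are fatal. The `allow_failure` key under `cache` below is purely hypothetical syntax, not something `.gitlab-ci.yml` supports today:

```yaml
test:
  cache:
    paths:
      - vendor/ruby
    # Hypothetical flag: a failed cache upload or download would be
    # reported as a warning instead of failing the build.
    allow_failure: true
  script: bundle exec spinach
```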
This issue is merely an encouragement to discuss CI reliability a little more. This is a great time to do that, because we have all experienced some problems recently, which gives us a fresh point of view.
This is a meta issue that should be closed as soon as we identify the problems we want to fix and create a separate issue for each of them.
What do you think @ayufan @markpundsack @stanhu?