DB load balancing and its effect on CI processing
I think I know why we have recently seen a lot of reports of CI problems:
- specific runners not picking up builds,
- builds failing or dying mid-run,
- pipelines getting stuck between stages: the first stage succeeds, but builds in the second stage stay in the created state instead of moving to pending.
Following my comment from https://gitlab.com/gitlab-com/support-forum/issues/1622#note_25988719,
I have a solid explanation for at least the first problem, and judging from system behavior the third one also plays out as described. What happens: we have DB load balancing where one database is read-write and the second one is read-only: basically a master and a slave.
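For context, here is a minimal sketch of how such routing behaves; the class and method names are illustrative, not GitLab's actual implementation:

```python
import random

class LoadBalancer:
    """Illustrative primary/replica router; not GitLab's actual code."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def read(self, sql, *params):
        # Reads are served by a read-only replica, which may lag
        # behind the primary by the replication delay.
        return random.choice(self.replicas).execute(sql, *params)

    def write(self, sql, *params):
        # Writes always go to the read-write primary.
        return self.primary.execute(sql, *params)
```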
When a runner asks for new builds, it hits a specific endpoint that may use either of those databases, and this is where the race happens (a runnable sketch of the full sequence follows this list):
- We push a new build to the read-write master,
- We change the notification key in Redis and notify all runners to retry picking up builds,
- Meanwhile there is replication delay,
- A runner connects to builds/register, and the endpoint checks the list of builds,
- The endpoint queries the read-only slave, sees no new build there yet, and returns 204 together with the changed notification key,
- The runner stores the notification key and asks again; because the stored key equals the current one, the endpoint takes the fast path and answers 204 without touching the database,
- Replication catches up and the build is now pending on the slave, but the runner's stored notification key still claims everything is up to date,
- The runner therefore never picks up the new build.
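Here is a self-contained simulation of that sequence; the Redis key name, endpoint behavior and return codes are modeled after the description above, everything else is hypothetical:

```python
import uuid

class Replica:
    """Read-only copy that sees a write only after apply_replication()."""

    def __init__(self):
        self.builds = []
        self._lagging = []

    def replicate(self, build):
        self._lagging.append(build)  # row received but not yet applied

    def apply_replication(self):
        self.builds.extend(self._lagging)
        self._lagging.clear()

# Server-side state.
primary_builds = []
replica = Replica()
redis = {"ci/runners/notification": "initial"}

def create_build(build):
    # Push a new build to the read-write master...
    primary_builds.append(build)
    replica.replicate(build)
    # ...and change the Redis key BEFORE the slave has applied the row.
    redis["ci/runners/notification"] = uuid.uuid4().hex

def register_build(last_update):
    current = redis["ci/runners/notification"]
    # Fast path: the runner's stored key matches, skip the DB entirely.
    if last_update == current:
        return 204, None, current
    # Slow path: look for a pending build on the read-only slave.
    if replica.builds:
        return 201, replica.builds.pop(0), current
    return 204, None, current

# The race:
create_build("build-1")
status, build, key = register_build(last_update=None)  # slave empty: 204 + new key
replica.apply_replication()                            # replication catches up, too late
status, build, key = register_build(last_update=key)   # key unchanged: fast path, 204
print(status, build)  # -> 204 None; the pending build is never handed out
```

Every subsequent poll repeats the fast path, so until something else rotates the key the pending build stays invisible to runners.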
What we need: after enqueuing a build (changing its status from created to pending) there should be a barrier, before creating the notification or enqueuing Sidekiq jobs, that ensures all slaves are up to date.
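One possible shape for such a barrier, sketched under the assumption of PostgreSQL 9.x streaming replication and psycopg2; the DSNs, key name and the wait_for_replicas helper are assumptions, not GitLab's implementation:

```python
import time
import psycopg2

def current_wal_location(primary_dsn):
    """Read the primary's current WAL write location."""
    conn = psycopg2.connect(primary_dsn)
    try:
        with conn.cursor() as cur:
            # PostgreSQL 9.x name; on 10+ use pg_current_wal_lsn().
            cur.execute("SELECT pg_current_xlog_location()")
            return cur.fetchone()[0]
    finally:
        conn.close()

def wait_for_replicas(primary_dsn, replica_dsns, timeout=5.0):
    """Barrier: block until every replica has replayed past the
    primary's current WAL location. A sketch, not GitLab's code."""
    target = current_wal_location(primary_dsn)
    deadline = time.monotonic() + timeout
    for dsn in replica_dsns:
        conn = psycopg2.connect(dsn)
        try:
            with conn.cursor() as cur:
                while True:
                    # On 10+ use pg_last_wal_replay_lsn()/pg_wal_lsn_diff().
                    cur.execute(
                        "SELECT pg_xlog_location_diff("
                        "pg_last_xlog_replay_location(), %s) >= 0",
                        (target,),
                    )
                    if cur.fetchone()[0]:
                        break
                    if time.monotonic() > deadline:
                        raise TimeoutError("replica %s did not catch up" % dsn)
                    time.sleep(0.05)
        finally:
            conn.close()

# Intended ordering in the enqueue path (names hypothetical):
#   enqueue_build(build)                  # COMMIT on the master
#   wait_for_replicas(PRIMARY, REPLICAS)  # barrier: slaves caught up
#   redis.set(NOTIFICATION_KEY, new_key)  # only now notify runners
```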
Otherwise we get a split-brain problem: we expect the information stored in Redis to be an indication of something that is already committed to the DB. With the current configuration that is not true, as the commit may become visible on the replicas only after the Redis value has been saved.
This can also be seen here: https://gitlab.com/gitlab-com/www-gitlab-com/pipelines/7178220