Database Replication Lag from 10.2.0-RC2-ee Deploy
Context
During the deployment of 10.2.0-RC2-ee to production, a massive post-migration job attempted to update all 4.7M rows of the merge_requests table. The job was batched, but it was not backgrounded. As a result, the follower nodes were unable to keep up with the replication stream from the primary node, and users received page loads with stale or missing data.
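To make the failure mode concrete, below is a minimal, hypothetical sketch of a batched but non-backgrounded backfill over the merge_requests table. The batch size, connection string, and backfill expression are assumptions for illustration, not the actual migration. The point is that every batch commits on the primary and produces WAL that every follower must replay, so a tight loop over millions of rows can outpace the replication stream.

```python
# Hypothetical sketch only -- not the actual GitLab migration code.
# It illustrates the pattern described above: a batched but non-backgrounded
# update over a large table. Every batch commits on the primary and generates
# WAL that each follower must replay, so a tight loop over ~4.7M rows can
# produce WAL faster than the replicas can apply it.
import psycopg2

BATCH_SIZE = 10_000  # assumed batch size

conn = psycopg2.connect("dbname=gitlabhq_production")  # assumed DSN
conn.autocommit = True  # each batch commits on its own

with conn.cursor() as cur:
    cur.execute("SELECT min(id), max(id) FROM merge_requests")
    start_id, max_id = cur.fetchone()

    batch_start = start_id if start_id is not None else 0
    while max_id is not None and batch_start <= max_id:
        # A placeholder backfill expression; the exact statement in the real
        # migration is not reproduced here.
        cur.execute(
            """
            UPDATE merge_requests
               SET latest_merge_request_diff_id = (
                     SELECT max(id)
                       FROM merge_request_diffs
                      WHERE merge_request_id = merge_requests.id)
             WHERE id >= %s AND id < %s
               AND latest_merge_request_diff_id IS NULL
            """,
            (batch_start, batch_start + BATCH_SIZE),
        )
        batch_start += BATCH_SIZE
```

A backgrounded variant would instead enqueue each batch as a delayed job (GitLab's background migrations do this via Sidekiq), spreading the writes, and therefore the WAL volume, over hours instead of minutes.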
Timeline
Date: 2017-11-16
- 19:15 UTC - @mkozono started deploying 10.2.0-RC2-ee to production.
- 20:49 UTC - Prometheus triggers PagerDuty for 'Postgres Replication Lag over 2 Minutes'.
- 20:56 UTC - @stanhu notices 404 errors on GitLab and @eReGeBe points out tweets from users reporting errors.
- 21:08 UTC - 📤 Replication lag is at 27GB to catch up - ⛰ peak delay.
- 21:10 UTC - @stanhu announces that @mkozono has stopped the post-migration job.
- 21:14 UTC - 📤 Replication lag is at 25GB to catch up.
- 21:17 UTC - @stanhu does back-of-napkin math noting that it will take roughly 45 minutes for replication to catch up (a rough reconstruction of that estimate follows the timeline).
- 21:30 UTC - @mkozono announces via tweet that the deployment is finished and that we are aware of and watching the replication lag.
- 21:33 UTC - 📤 Replication lag is at 15GB to catch up.
- 21:42 UTC - @northrup announces via tweet that users may experience 404s due to replication lag, and that no data has been lost.
- 21:44 UTC - 📤 Replication lag is at 6.8GB to catch up.
- 21:59 UTC - 📤 Replication lag is at 50MB to catch up.
- 22:14 UTC - @northrup announces via Twitter that replication lag is resolved and GitLab is fully operational.
Replication lag: https://performance.gitlab.net/dashboard/db/postgres-stats?panelId=11&fullscreen&orgId=1&from=1510860519180&to=1510875497294
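For context on the roughly-45-minute estimate at 21:17, here is a small after-the-fact sketch using only the lag figures recorded in the timeline above; it shows why the estimate held up, not how it was derived at the time.

```python
# Rough reconstruction of the 21:17 back-of-napkin estimate, using only the
# lag figures recorded in the timeline above.
backlog_gb = 25.0                # lag at 21:14, shortly after the job was stopped
drained_gb = 25.0 - 6.8          # WAL replayed between 21:14 and 21:44
elapsed_min = 30                 # 21:14 -> 21:44
rate_gb_per_min = drained_gb / elapsed_min   # ~0.6 GB/min of catch-up

eta_min = backlog_gb / rate_gb_per_min       # ~41 minutes
print(f"~{rate_gb_per_min:.2f} GB/min, ~{eta_min:.0f} min to drain the backlog")
```

At roughly 0.6GB/min of catch-up, a 25GB backlog drains in about 41 minutes, which is consistent with the lag falling to ~50MB by 21:59.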
Incident Analysis
- How was the incident detected?
- PagerDuty pages the on-call (@northrup) that replication lag has exceeded limits (a sketch of this kind of lag check appears after this list).
- Is there anything that could have been done to improve the time to detection?
- The incident was detected within an acceptable amount of time.
- How was the root cause discovered?
- It was apparent during the deployment that the post-migration job was going to take a long time.
- Was this incident triggered by a change?
- Yes, this incident was triggered by the deployment of 10.2.0-RC2-ee.
- Was there an existing issue that would have either prevented this incident or reduced the impact?
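As a companion to the detection answer above, the sketch below shows one way time-based replication lag can be read from a PostgreSQL replica. It is an illustration of the kind of check behind the alert, not the actual Prometheus exporter or alerting rule used in production; the DSN and threshold are assumptions.

```python
# Hypothetical sketch -- not the actual monitoring code behind the
# 'Postgres Replication Lag over 2 Minutes' alert.
# On a streaming replica, pg_last_xact_replay_timestamp() reports the commit
# timestamp of the last transaction replayed from the primary's WAL, so the
# difference from now() approximates the replica's time-based lag.
import psycopg2

LAG_THRESHOLD_SECONDS = 120  # the alert in this incident fired at 2 minutes

conn = psycopg2.connect("dbname=gitlabhq_production")  # assumed DSN (replica)
with conn.cursor() as cur:
    cur.execute(
        "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))"
    )
    lag_seconds = cur.fetchone()[0]

if lag_seconds is not None and lag_seconds > LAG_THRESHOLD_SECONDS:
    print(f"replication lag {lag_seconds:.0f}s exceeds threshold")
```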
Root Cause Analysis
Follow the 5 whys in a blameless manner as the core of the post-mortem.
For this it is necessary to start with the production incident and question why it happened; once there is an explanation, keep iterating, asking why again until we reach 5 whys.
It's not a hard rule that it has to be 5 times, but it helps to keep questioning to dig deeper and find the actual root cause. Additionally, one why may produce more than one answer; consider following the different branches.
A root cause can never be a person; the write-up has to refer to the system and the context rather than the specific actors.
For example:
At 00:00 UTC something happened that led to downtime.
- Why did X cause downtime?
...
What went well
- Identify the things that worked well
What can be improved
- Using the root cause analysis, explain what things can be improved.
Corrective actions
- Issue labeled as corrective action