GitLab.com was down briefly on February 25th
Context
The primary Redis cache node (redis-cache-03) was restarted at 13:47:06 UTC; it took 4 minutes to become fully operational again. The site was down for the duration of the recovery.
Timeline
Date: 2018-02-25
- 13:47 UTC - We were paged about the site being down
- 13:47 UTC - Sentry reported multiple Redis errors
- 13:50 UTC - We logged in to the primary Redis machine to investigate
- 13:51 UTC - Redis finished loading its DB from disk and started accepting connections (see the sketch after this timeline)
- 13:51 UTC - The site was back up
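The "loading DB from disk" phase can be watched directly: while Redis is replaying its dump it answers most commands with a LOADING error, but INFO still works. A minimal sketch of such a check using the redis-py client (the hostname and port are hypothetical, not the production values):

```python
# Sketch: poll a Redis node until it has finished loading its dataset
# from disk. Host/port are hypothetical; requires the redis-py package.
import time

import redis

r = redis.Redis(host="redis-cache-03.example.com", port=6379)

while True:
    try:
        # INFO is one of the few commands Redis still answers while
        # it is loading the dataset from disk.
        persistence = r.info("persistence")
    except redis.ConnectionError:
        print("Redis not reachable yet...")
        time.sleep(5)
        continue
    if persistence["loading"] == 0:
        print("Redis finished loading and is accepting connections")
        break
    print("Still loading dataset from disk...")
    time.sleep(5)
```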
Incident Analysis
- How was the incident detected? We were paged about the site being down, and Sentry reported multiple Redis errors at the same time.
- Is there anything that could have been done to improve the time to detection?
- How was the root cause discovered? The kernel log on the primary machine showed that the redis-server process had been killed by the OOM killer.
- Was this incident triggered by a change?
- Was there an existing issue that would have either prevented this incident or reduced the impact?
Root Cause Analysis
The redis-server process was killed by the kernel's OOM killer:

```
[4510234.008404] Out of memory: Kill process 322 (redis-server) score 959 or sacrifice child
[4510234.015235] Killed process 322 (redis-server) total-vm:117758640kB, anon-rss:110614540kB, file-rss:0kB, shmem-rss:0kB
```

At the time of the kill, redis-server had roughly 105 GB resident (the anon-rss value above).
Apparently we had been accumulating a large number of expiring Redis keys since February 8th (first figure below), which led to increasing memory consumption (second figure below):

(figure: count of keys with an expiry, growing since Feb 8th)
(figure: Redis memory consumption over the same period)
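The metrics behind both figures can also be sampled from Redis itself: `INFO keyspace` exposes the per-database count of keys carrying a TTL, and `INFO memory` the total memory used. A minimal sketch using redis-py (hostname and port are hypothetical):

```python
# Sketch: sample the two metrics shown in the figures above: the number
# of keys carrying an expiry, and total memory used by Redis.
# Host/port are hypothetical; requires the redis-py package.
import redis

r = redis.Redis(host="redis-cache-03.example.com", port=6379)

keyspace = r.info("keyspace")  # e.g. {'db0': {'keys': ..., 'expires': ..., 'avg_ttl': ...}}
memory = r.info("memory")

for db, stats in keyspace.items():
    print(f"{db}: {stats['keys']} keys, {stats['expires']} with a TTL")
print(f"used_memory_human: {memory['used_memory_human']}")
```

Graphing these two values over time would have made the growth since February 8th visible well before the OOM kill.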
What went well
- Alerting paged us within a minute of the site going down.
- Redis recovered on its own by reloading its DB from disk; total downtime was about 4 minutes.
What can be improved
- Nothing alerted us on the growth in expiring keys or in Redis memory consumption, so the problem built up unnoticed from February 8th until the OOM kill; trend alerting on either metric would have given more than two weeks of warning.
- Bounding the cache's memory (Redis `maxmemory` plus an eviction policy) would make it shed keys under pressure instead of being killed by the kernel; see the sketch below.
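One possible mitigation along those lines, offered as a sketch rather than what was actually deployed: cap Redis memory and enable LRU eviction. The host and the 100 GB limit are illustrative assumptions, and in practice this belongs in redis.conf rather than a live call:

```python
# Sketch: cap Redis memory and enable LRU eviction so the cache sheds
# keys under pressure instead of being OOM-killed. The 100 GB limit is
# illustrative; requires the redis-py package.
import redis

r = redis.Redis(host="redis-cache-03.example.com", port=6379)

r.config_set("maxmemory", str(100 * 1024**3))    # 100 GB cap
r.config_set("maxmemory-policy", "allkeys-lru")  # evict least-recently-used keys

print(r.config_get("maxmemory*"))                # verify both settings
```

Since the accumulating keys here all carried TTLs, `volatile-lru` (evict only keys with an expiry set) would be an equally reasonable policy choice.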
Corrective actions
- Issues labeled as corrective action