GitLab.com was down briefly on February 25th
Context
The primary Redis cache node (redis-cache-03) was restarted at 13:47:06 UTC; it took 4 minutes to become fully operational again. The site was down for the duration of the recovery.
Timeline
Date: 2018-02-25
- 13:47 UTC - We were paged about the site being down
- 13:47 UTC - Sentry reported multiple Redis errors
- 13:50 UTC - We logged in to the primary Redis machine to investigate
- 13:51 UTC - Redis finished loading its DB from disk and started accepting connections (see the sketch after this timeline)
- 13:51 UTC - The site was back up
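The "loading DB from disk" phase can be watched directly: while Redis is replaying its dump it answers most commands with a LOADING error, but INFO still works. A minimal sketch of such a check using the redis-py client (the hostname and port are hypothetical, not the production values):

```python
# Sketch: poll a Redis node until it has finished loading its dataset
# from disk. Host/port are hypothetical; requires the redis-py package.
import time

import redis

r = redis.Redis(host="redis-cache-03.example.com", port=6379)

while True:
    try:
        # INFO is one of the few commands Redis still answers while
        # it is loading the dataset from disk.
        persistence = r.info("persistence")
    except redis.ConnectionError:
        print("Redis not reachable yet...")
        time.sleep(5)
        continue
    if persistence["loading"] == 0:
        print("Redis finished loading and is accepting connections")
        break
    print("Still loading dataset from disk...")
    time.sleep(5)
```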
Incident Analysis
- How was the incident detected? We were paged about the site being down, and Sentry reported multiple Redis errors at the same time.
- Is there anything that could have been done to improve the time to detection?
- How was the root cause discovered? The kernel log on the primary machine showed that the redis-server process had been killed by the OOM killer.
- Was this incident triggered by a change?
- Was there an existing issue that would have either prevented this incident or reduced the impact?
Root Cause Analysis
The redis-server process was killed by the kernel's OOM killer:

```
[4510234.008404] Out of memory: Kill process 322 (redis-server) score 959 or sacrifice child
[4510234.015235] Killed process 322 (redis-server) total-vm:117758640kB, anon-rss:110614540kB, file-rss:0kB, shmem-rss:0kB
```

At the time of the kill, redis-server had roughly 105 GB resident (the anon-rss value above).
Apparently we had been accumulating a large number of expiring Redis keys since February 8th (first figure below), which led to increasing memory consumption (second figure below):

(figure: count of keys with an expiry, growing since Feb 8th)
(figure: Redis memory consumption over the same period)
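The metrics behind both figures can also be sampled from Redis itself: `INFO keyspace` exposes the per-database count of keys carrying a TTL, and `INFO memory` the total memory used. A minimal sketch using redis-py (hostname and port are hypothetical):

```python
# Sketch: sample the two metrics shown in the figures above: the number
# of keys carrying an expiry, and total memory used by Redis.
# Host/port are hypothetical; requires the redis-py package.
import redis

r = redis.Redis(host="redis-cache-03.example.com", port=6379)

keyspace = r.info("keyspace")  # e.g. {'db0': {'keys': ..., 'expires': ..., 'avg_ttl': ...}}
memory = r.info("memory")

for db, stats in keyspace.items():
    print(f"{db}: {stats['keys']} keys, {stats['expires']} with a TTL")
print(f"used_memory_human: {memory['used_memory_human']}")
```

Graphing these two values over time would have made the growth since February 8th visible well before the OOM kill.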
What went well
- Alerting paged us within a minute of the site going down.
- Redis recovered on its own by reloading its DB from disk; total downtime was about 4 minutes.
What can be improved
- Nothing alerted us on the growth in expiring keys or in Redis memory consumption, so the problem built up unnoticed from February 8th until the OOM kill; trend alerting on either metric would have given more than two weeks of warning.
- Bounding the cache's memory (Redis `maxmemory` plus an eviction policy) would make it shed keys under pressure instead of being killed by the kernel; see the sketch below.
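One possible mitigation along those lines, offered as a sketch rather than what was actually deployed: cap Redis memory and enable LRU eviction. The host and the 100 GB limit are illustrative assumptions, and in practice this belongs in redis.conf rather than a live call:

```python
# Sketch: cap Redis memory and enable LRU eviction so the cache sheds
# keys under pressure instead of being OOM-killed. The 100 GB limit is
# illustrative; requires the redis-py package.
import redis

r = redis.Redis(host="redis-cache-03.example.com", port=6379)

r.config_set("maxmemory", str(100 * 1024**3))    # 100 GB cap
r.config_set("maxmemory-policy", "allkeys-lru")  # evict least-recently-used keys

print(r.config_get("maxmemory*"))                # verify both settings
```

Since the accumulating keys here all carried TTLs, `volatile-lru` (evict only keys with an expiry set) would be an equally reasonable policy choice.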
Corrective actions
- Issues labeled as corrective action