GitLab.com was down briefly on February 25th

Context

The Redis primary cache (redis-cache-03) was restarted at 13:47:06 UTC and took 4 minutes to become fully operational again. The site was down for the duration of the recovery.
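
For reference, the 4-minute recovery window corresponds to Redis reloading its dataset from disk before it serves normal traffic again. Below is a minimal sketch of how that state can be polled via INFO; the redis-py client and the port are assumptions, not what was used during the incident.

import time

import redis  # redis-py client; host/port below are assumptions

r = redis.Redis(host="redis-cache-03", port=6379)

while True:
    try:
        persistence = r.info("persistence")  # INFO works even while loading
        if persistence.get("loading", 0) == 0:
            print("dataset loaded; Redis is serving traffic again")
            break
        print("still loading RDB, ETA (s):", persistence.get("loading_eta_seconds"))
    except redis.exceptions.ConnectionError:
        # the process may not be listening yet right after the restart
        print("Redis is not reachable yet")
    time.sleep(5)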

Timeline

On date: 2018-02-25

  • 13:47 UTC - We got a page about the site being down
  • 13:47 UTC - Sentry reported multiple Redis errors
  • 13:50 UTC - We logged into the primary machine to investigate
  • 13:51 UTC - Redis finished loading the DB from disk and started accepting connections
  • 13:51 UTC - The site was back up

Incident Analysis

  • How was the incident detected?
  • Is there anything that could have been done to improve the time to detection?
  • How was the root cause discovered?
  • Was this incident triggered by a change?
  • Was there an existing issue that would have either prevented this incident or reduced the impact?

Root Cause Analysis

The process was killed due to an OOM:

[4510234.008404] Out of memory: Kill process 322 (redis-server) score 959 or sacrifice child
[4510234.015235] Killed process 322 (redis-server) total-vm:117758640kB, anon-rss:110614540kB, file-rss:0kB, shmem-rss:0kB

We had apparently been accumulating a large number of expiring Redis keys since Feb 8th (first figure: redis-expiring), which led to increased memory consumption (second figure: redis-mem).
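
Both signals shown in the figures (the count of keys carrying a TTL and overall memory use) are exposed by INFO, which is where alerting on this kind of growth could hook in. A hedged sketch, again assuming the redis-py client and connection details for illustration:

import redis  # redis-py client; connection details are assumptions

r = redis.Redis(host="redis-cache-03", port=6379)

# Per-database key counts, including how many carry a TTL ("expires")
for db, stats in r.info("keyspace").items():
    print(f"{db}: keys={stats['keys']} expiring={stats['expires']} avg_ttl_ms={stats['avg_ttl']}")

# Overall memory use of the instance
memory = r.info("memory")
print("used_memory:", memory["used_memory_human"],
      "peak:", memory["used_memory_peak_human"])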

What went well

  • Identify the things that worked well

What can be improved

  • Using the root cause analysis, explain what things can be improved.

Corrective actions

  • Issue labeled as corrective action

Guidelines

  • Blameless Postmortems Guideline
  • 5 whys