2018-06-18: Spike in 500 errors across web, git and api nodes
Summary
For short intervals, gitlab.com has been receiving alerts that page loads are returning 500 errors. The root cause appears to be that Redis is running out of memory, getting killed by the OOM killer, and failing over, which results in 500s until recovery completes. Investigation has shown that on June 10th something changed that caused a sharp increase in the number of keys in Redis. At this time we believe the increase is due to failed login attempts, each of which creates a session key that persists for ~5 days.
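To see why a multi-day TTL on sessions created by failed logins can matter, a rough steady-state estimate helps: if keys are created at a constant rate and each lives for a fixed TTL, the number of live keys levels off at rate × TTL. The sketch below is illustrative only. The function names, the failed-login rate, and the per-key size are hypothetical values chosen for the arithmetic, not measurements from this incident.

```python
# Rough steady-state estimate of session keys held in Redis.
# Every input value used below is hypothetical, not measured.

SECONDS_PER_DAY = 24 * 60 * 60


def steady_state_keys(creations_per_second: float, ttl_seconds: float) -> float:
    """Live keys once creations and expirations balance: rate * TTL."""
    return creations_per_second * ttl_seconds


def estimated_memory_bytes(key_count: float, bytes_per_key: float) -> float:
    """Approximate memory held by those keys, ignoring Redis overhead."""
    return key_count * bytes_per_key


if __name__ == "__main__":
    # Hypothetical: 50 failed logins per second, each leaving a 5-day key.
    keys = steady_state_keys(50, 5 * SECONDS_PER_DAY)
    print(f"{keys:,.0f} live session keys")
    # Hypothetical: 500 bytes per session key.
    print(f"{estimated_memory_bytes(keys, 500) / 1e9:.1f} GB")
```

Under these assumptions the steady-state key count scales linearly with the TTL, so shortening the TTL for unauthenticated sessions would reduce it proportionally.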
The 500 errors generated after the failover are:
Redis::CommandError: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.
Metrics for the first set of errors:
Corrective actions
- consider using different session TTL values for authenticated and non-authenticated sessions
- alarming sooner for increasing cache usage
- alarming sooner on memory usage
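The corrective actions above could be sketched roughly as follows. This is a minimal illustration, not the GitLab implementation: the function names, the TTL values, and the alarm thresholds are all assumptions made for the example.

```python
# Sketch of two corrective actions: per-session-type TTLs and an
# earlier memory alarm. All names and values here are hypothetical.

AUTHENTICATED_TTL_SECONDS = 5 * 24 * 60 * 60  # assumed: keep the ~5 day TTL
UNAUTHENTICATED_TTL_SECONDS = 60 * 60         # assumed: 1 hour


def session_ttl_seconds(authenticated: bool) -> int:
    """Pick a shorter TTL for sessions that never authenticated."""
    if authenticated:
        return AUTHENTICATED_TTL_SECONDS
    return UNAUTHENTICATED_TTL_SECONDS


def memory_alarm(used_bytes: int, max_bytes: int,
                 warn_ratio: float = 0.7, page_ratio: float = 0.9) -> str:
    """Return "ok", "warn", or "page" based on the fraction of memory used."""
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    ratio = used_bytes / max_bytes
    if ratio >= page_ratio:
        return "page"
    if ratio >= warn_ratio:
        return "warn"
    return "ok"
```

A two-level threshold is used here so that a warning fires well before memory is exhausted, leaving time to react before the OOM killer does.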
Edited by John Jarvis