GitLab outage due to redis-cache memory usage
Summary
GitLab.com had a 2-minute outage caused by emergency maintenance on the redis-cache cluster. Performance degraded immediately after the maxmemory setting was lowered on the production cluster, and the degradation lasted about two minutes. The maintenance was necessary to avoid a redis-cache failover like the one that previously happened in production#467 (closed).
Timeline
- 14:23 - Redis memory alert indicating our memory usage was approaching the limit for our redis cache cluster
- 14:30 - After a discussion with @andrewn, we decided to set the maxmemory of the redis cluster. The configuration change was made in Chef with MR https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/2562
- 14:45 - The change was validated on staging with no visible impact, although it was noted that memory was not released as expected. It was decided to move forward and apply the change on production.
- 15:30 - The maxmemory limit was reduced on production, which caused performance degradation and an increase in errors on GitLab.com.
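
The 14:23 alert and the later maxmemory change both come down to comparing Redis's `used_memory` against `maxmemory`, two fields reported by `INFO memory`. The following is a minimal sketch of such a check, not the alerting rule actually used here: the function names and the 0.9 threshold are hypothetical, and the input is assumed to be the raw text of an `INFO memory` reply.

```python
from typing import Dict, Optional


def parse_info(text: str) -> Dict[str, str]:
    """Parse the `key:value` lines of a Redis INFO reply into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Memory".
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def memory_usage_ratio(info_text: str) -> Optional[float]:
    """Return used_memory / maxmemory, or None when maxmemory is 0 (no limit)."""
    fields = parse_info(info_text)
    used = int(fields["used_memory"])
    limit = int(fields["maxmemory"])
    if limit == 0:
        return None
    return used / limit


def should_alert(info_text: str, threshold: float = 0.9) -> bool:
    """Hypothetical alert rule: fire when usage reaches `threshold` of the limit."""
    ratio = memory_usage_ratio(info_text)
    return ratio is not None and ratio >= threshold
```

Running a check like this against staging and production before lowering the limit would show how far `used_memory` already sits above the proposed `maxmemory`, which is the situation in which Redis has to evict keys or reject writes as soon as the new limit applies.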