2018-09-14: Triple Redis Cache OOM Killer Termination
Between 08:40 and 09:40 UTC this morning we suffered a triple Redis cache failure, caused by the Linux OOM killer terminating the Redis processes.
https://dashboards.gitlab.net/d/WOtyonOiz/general-triage-service?from=1536914072181&to=1536918240835&orgId=1&var-prometheus_ds=Global&var-environment=gprd&var-type=redis&var-sigma=2&var-component_availability=All&var-component_ops=All&var-component_apdex=All&var-component_errors=All
This eventually led to a slowdown in web and API requests, and a spike in errors.
The only alerts we received were the new general alerts (Redis availability out of normal range), but these are still a work in progress, so they are currently silenced.
What can be improved in future
- Lower the Redis maxmemory setting to ~60% of available memory (currently it's 80%)
- Add alerting for when Redis memory usage exceeds a certain percentage of available memory (say 80%?)
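
As a rough illustration of the two proposals above, the arithmetic could be sketched as follows. This is only a sketch under stated assumptions: the function names, the 64 GiB host size, and the idea of feeding in byte counts by hand are hypothetical and not taken from our actual configuration or alerting rules. The 60% and 80% defaults are the values suggested in the list.

```python
def maxmemory_bytes(total_bytes: int, percent: int = 60) -> int:
    """Return a candidate Redis maxmemory value as a percentage of host memory.

    Integer arithmetic is used so the result is exact. `percent` defaults to
    the ~60% proposed above (hypothetical helper, not an existing tool).
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    return total_bytes * percent // 100


def memory_alert(used_bytes: int, total_bytes: int, threshold_percent: int = 80) -> bool:
    """Return True when used memory exceeds the threshold share of host memory.

    The 80% default mirrors the tentative figure in the list above.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Compare cross-multiplied integers to avoid floating-point rounding.
    return used_bytes * 100 > total_bytes * threshold_percent


if __name__ == "__main__":
    total = 64 * 1024**3  # assumption: a 64 GiB host, for illustration only
    print("proposed maxmemory (bytes):", maxmemory_bytes(total))
    print("alert at 85% usage:", memory_alert(total * 85 // 100, total))
```

In practice the inputs would presumably come from the `used_memory` and `total_system_memory` fields of Redis `INFO memory`, the computed limit could be applied with `CONFIG SET maxmemory <bytes>`, and the alert itself would more likely live in our Prometheus rules than in a script.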