Investigate the spikiness of redis-cache latency apdex
Spun off from https://gitlab.com/gitlab-com/gl-infra/production/issues/1722
From 2020-03-03 10:15 UTC to 2020-03-05 19:30 UTC we observed regularly spaced spikes in the primary redis-cache latency apdex, often crossing the degradation SLO:
This is usually accompanied by regular spikes in client connections, but it could be unrelated:
(graph zoomed in to properly show the spikes)
Some network-level investigation has been done in https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1722#note_298560627, but it is still inconclusive.
Summary of findings (added by @igorwwwwwwwwwwwwwwwwwwww):
- We are seeing short bursts of traffic on Redis every minute (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9420#note_310871213); see the sampling sketch after this list.
- Those bursts correspond to the `/api/v4/groups/:id/projects` API endpoint (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9420#note_311620344, https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9420#note_312274480).
- These short bursts cause CPU saturation; added pressure from evictions creates a feedback loop that drives further evictions.
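A quick way to make the once-a-minute bursts and the eviction pressure visible is to sample `INFO` and print per-second deltas. This is a minimal sketch, assuming redis-py and direct access to the primary; the hostname is a placeholder, not the real production endpoint.

```python
# Sample Redis INFO once per second and print deltas, to surface the
# per-minute command bursts and eviction pressure described above.
# Host/port are placeholders.
import time

import redis

r = redis.Redis(host="redis-cache-primary.internal", port=6379)

prev = r.info("stats")
while True:
    time.sleep(1)
    cur = r.info("stats")
    cmds = cur["total_commands_processed"] - prev["total_commands_processed"]
    evicted = cur["evicted_keys"] - prev["evicted_keys"]
    clients = r.info("clients")["connected_clients"]
    print(f"{time.strftime('%H:%M:%S')} cmds/s={cmds} evicted/s={evicted} clients={clients}")
    prev = cur
```

At one-second resolution, the bursts should show up as sharp jumps in `cmds/s` at the top of each minute, with eviction pressure visible in `evicted/s`.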
Actions taken:
- Increased the server-side idle timeout => less connection churn, which buys us some CPU: production#1874 (closed). See the config sketch after this list.
- Upgraded instance types to C2 => this buys us quite a bit of CPU, and evictions have been much smoother since: production#1871 (closed)
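For reference, the server-side idle timeout corresponds to Redis's `timeout` directive, which closes client connections idle for more than N seconds; raising it means fewer close/reconnect cycles. Here is a minimal sketch of inspecting and changing it at runtime with redis-py; the 300-second value is illustrative, not necessarily what production#1874 applied.

```python
# Inspect and raise the server-side idle timeout at runtime.
# `timeout` closes clients idle for more than N seconds; 0 disables it.
# The 300s value here is illustrative only.
import redis

r = redis.Redis(host="redis-cache-primary.internal", port=6379)
print(r.config_get("timeout"))  # e.g. {'timeout': '60'}
r.config_set("timeout", 300)    # keep idle clients for up to 5 minutes
```

Note that `CONFIG SET` changes the running server only; the value also needs to land in the config file (or via `CONFIG REWRITE`) to survive a restart.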
Actions pending:
- Apply a rate limit to `/api/v4/groups/:id/projects`: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1906 (a hypothetical limiter sketch follows this list)
- Fix the N+1 on `/api/v4/groups/:id/projects` (gitlab-org/gitlab#213797; caching issue gitlab-org/gitlab#214510 (closed), @engwan)
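The limiter itself belongs in the Rails application, but its core is just a per-client counter in Redis. Below is a hypothetical fixed-window sketch; the key name, limit, and window are invented for illustration and are not GitLab's actual implementation.

```python
# Hypothetical fixed-window rate limiter for the
# /api/v4/groups/:id/projects endpoint. Key names and limits are
# invented; GitLab's real limiter lives in the Rails application.
import time

import redis

r = redis.Redis(host="redis-cache-primary.internal", port=6379)

def allow_request(client_id: str, limit: int = 60, window_s: int = 60) -> bool:
    """Allow at most `limit` requests per client per `window_s` seconds."""
    window = int(time.time() // window_s)
    key = f"rate:groups_projects:{client_id}:{window}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, window_s)  # window key cleans itself up
    return count <= limit
```

A caller would check `allow_request("user:123")` before serving the request and return 429 when it is False.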
Long-term:
- Scale out Redis: &80 (an illustrative sharding sketch follows)
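Purely as an illustration of one scale-out shape (not necessarily what &80 describes), client-side sharding spreads cache keys across several independent Redis primaries. All hostnames here are hypothetical.

```python
# Hypothetical client-side sharding across multiple Redis primaries.
# Hostnames are placeholders. Real deployments usually prefer
# consistent hashing so adding a shard relocates fewer keys.
import hashlib

import redis

shards = [
    redis.Redis(host="redis-cache-1.internal"),
    redis.Redis(host="redis-cache-2.internal"),
    redis.Redis(host="redis-cache-3.internal"),
]

def shard_for(key: str) -> redis.Redis:
    # Stable hash of the key selects a shard.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return shards[h % len(shards)]

shard_for("cache:gitlab:example").set("cache:gitlab:example", "value")
```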