Investigate the spikiness of redis-cache latency apdex
Spun off from https://gitlab.com/gitlab-com/gl-infra/production/issues/1722
From 2020-03-03 10:15 UTC to 2020-03-05 19:30 UTC we observed regularly spaced spikes in the primary redis-cache latency apdex, often crossing the degradation SLO:
This is usually accompanied by regular spikes in client connections, but it could be unrelated:
(graph zoomed in to properly show the spikes)
Some network-level investigation has been done in https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1722#note_298560627, but it is still inconclusive.
Summary of findings (added by @igorwwwwwwwwwwwwwwwwwwww):
- We are seeing short bursts of traffic on Redis every minute (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9420#note_310871213); see the sampling sketch after this list.
- Those bursts correspond to the `/api/v4/groups/:id/projects` API endpoint (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9420#note_311620344, https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9420#note_312274480).
- These short bursts cause CPU saturation; added pressure from evictions creates a feedback loop that drives further evictions.
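A quick way to make the once-a-minute bursts and the eviction pressure visible is to sample `INFO` and print per-second deltas. This is a minimal sketch, assuming redis-py and direct access to the primary; the hostname is a placeholder, not the real production endpoint.

```python
# Sample Redis INFO once per second and print deltas, to surface the
# per-minute command bursts and eviction pressure described above.
# Host/port are placeholders.
import time

import redis

r = redis.Redis(host="redis-cache-primary.internal", port=6379)

prev = r.info("stats")
while True:
    time.sleep(1)
    cur = r.info("stats")
    cmds = cur["total_commands_processed"] - prev["total_commands_processed"]
    evicted = cur["evicted_keys"] - prev["evicted_keys"]
    clients = r.info("clients")["connected_clients"]
    print(f"{time.strftime('%H:%M:%S')} cmds/s={cmds} evicted/s={evicted} clients={clients}")
    prev = cur
```

At one-second resolution, the bursts should show up as sharp jumps in `cmds/s` at the top of each minute, with eviction pressure visible in `evicted/s`.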
Actions taken:
- Increased the server-side idle timeout => less connection churn, which buys us some CPU: production#1874 (closed). See the config sketch after this list.
- Upgraded instance types to C2 => this buys us quite a bit of CPU, and evictions have been much smoother since: production#1871 (closed)
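For reference, the server-side idle timeout corresponds to Redis's `timeout` directive, which closes client connections idle for more than N seconds; raising it means fewer close/reconnect cycles. Here is a minimal sketch of inspecting and changing it at runtime with redis-py; the 300-second value is illustrative, not necessarily what production#1874 applied.

```python
# Inspect and raise the server-side idle timeout at runtime.
# `timeout` closes clients idle for more than N seconds; 0 disables it.
# The 300s value here is illustrative only.
import redis

r = redis.Redis(host="redis-cache-primary.internal", port=6379)
print(r.config_get("timeout"))  # e.g. {'timeout': '60'}
r.config_set("timeout", 300)    # keep idle clients for up to 5 minutes
```

Note that `CONFIG SET` changes the running server only; the value also needs to land in the config file (or via `CONFIG REWRITE`) to survive a restart.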
Actions pending:
- Apply a rate limit to `/api/v4/groups/:id/projects`: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1906 (a hypothetical limiter sketch follows this list)
- Fix the N+1 on `/api/v4/groups/:id/projects` (gitlab-org/gitlab#213797; caching issue gitlab-org/gitlab#214510 (closed), @engwan)
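The limiter itself belongs in the Rails application, but its core is just a per-client counter in Redis. Below is a hypothetical fixed-window sketch; the key name, limit, and window are invented for illustration and are not GitLab's actual implementation.

```python
# Hypothetical fixed-window rate limiter for the
# /api/v4/groups/:id/projects endpoint. Key names and limits are
# invented; GitLab's real limiter lives in the Rails application.
import time

import redis

r = redis.Redis(host="redis-cache-primary.internal", port=6379)

def allow_request(client_id: str, limit: int = 60, window_s: int = 60) -> bool:
    """Allow at most `limit` requests per client per `window_s` seconds."""
    window = int(time.time() // window_s)
    key = f"rate:groups_projects:{client_id}:{window}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, window_s)  # window key cleans itself up
    return count <= limit
```

A caller would check `allow_request("user:123")` before serving the request and return 429 when it is False.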
Long-term:
- Scale out Redis: &80 (an illustrative sharding sketch follows)
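Purely as an illustration of one scale-out shape (not necessarily what &80 describes), client-side sharding spreads cache keys across several independent Redis primaries. All hostnames here are hypothetical.

```python
# Hypothetical client-side sharding across multiple Redis primaries.
# Hostnames are placeholders. Real deployments usually prefer
# consistent hashing so adding a shard relocates fewer keys.
import hashlib

import redis

shards = [
    redis.Redis(host="redis-cache-1.internal"),
    redis.Redis(host="redis-cache-2.internal"),
    redis.Redis(host="redis-cache-3.internal"),
]

def shard_for(key: str) -> redis.Redis:
    # Stable hash of the key selects a shard.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return shards[h % len(shards)]

shard_for("cache:gitlab:example").set("cache:gitlab:example", "value")
```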