Cache commit stats in order to avoid expensive Gitaly calls

Background

We are dealing with an active incident where a single user can trigger a large number of expensive Gitaly RPCs: gitlab-com/gl-infra/production#5229 (closed).

These calls degrade the experience for that user but, more importantly, can consume significant CPU resources on the Gitaly server. Once resource utilization reaches a tipping point, performance degrades for every repository hosted on that Gitaly server.

As a mitigation, we are manually applying stricter rate limits to specific paths.

Impact

This behaviour has resulted in 27 alerts over the last 48 hours:

[Screenshot: alert volume over the last 48 hours]

https://nonprod-log.gitlab.net/goto/ff7dd80b5054aa922b4474293264ec0f

Proposal

In order to mitigate the impact of these slow calls, we could cache the output of the CommitStats RPC. This would protect us against repeated calls for the same commit: because the stats for a given commit never change, cached entries never need to be invalidated.

The endpoint in question is the Commits API (which defaults to including commit stats).
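For illustration, a minimal sketch of what such a cache could look like in front of the RPC, assuming a caller-supplied fetch function that performs the actual CommitStats call. All names here (`CommitStatsCache`, the fetch callback, the field set) are hypothetical and not the actual GitLab implementation; the key property the sketch relies on, that stats are immutable for a given commit OID, comes from the proposal above.

```go
package main

import (
	"fmt"
	"sync"
)

// CommitStats holds the per-commit numbers we want to memoize.
// (Illustrative field set, not the real RPC response type.)
type CommitStats struct {
	Additions int
	Deletions int
}

// statsKey uniquely identifies a commit. Stats for a given OID never
// change, so cached entries never need invalidation.
type statsKey struct {
	repo string
	oid  string
}

// CommitStatsCache memoizes CommitStats lookups. maxEntries bounds
// memory; when the cap is hit, the map is simply reset (a real
// implementation would use LRU eviction instead).
type CommitStatsCache struct {
	mu         sync.Mutex
	entries    map[statsKey]CommitStats
	maxEntries int
	fetch      func(repo, oid string) (CommitStats, error) // the expensive RPC
}

func NewCommitStatsCache(maxEntries int, fetch func(repo, oid string) (CommitStats, error)) *CommitStatsCache {
	return &CommitStatsCache{
		entries:    make(map[statsKey]CommitStats),
		maxEntries: maxEntries,
		fetch:      fetch,
	}
}

// Get returns cached stats when available; otherwise it performs the
// fetch once and stores the result.
func (c *CommitStatsCache) Get(repo, oid string) (CommitStats, error) {
	key := statsKey{repo, oid}

	c.mu.Lock()
	if stats, ok := c.entries[key]; ok {
		c.mu.Unlock()
		return stats, nil
	}
	c.mu.Unlock()

	stats, err := c.fetch(repo, oid)
	if err != nil {
		return CommitStats{}, err
	}

	c.mu.Lock()
	if len(c.entries) >= c.maxEntries {
		c.entries = make(map[statsKey]CommitStats) // crude eviction
	}
	c.entries[key] = stats
	c.mu.Unlock()
	return stats, nil
}

func main() {
	calls := 0
	cache := NewCommitStatsCache(10_000, func(repo, oid string) (CommitStats, error) {
		calls++ // stand-in for the expensive CommitStats RPC
		return CommitStats{Additions: 10, Deletions: 2}, nil
	})

	cache.Get("group/project", "deadbeef")
	cache.Get("group/project", "deadbeef") // served from cache
	fmt.Println("RPC calls:", calls)       // prints 1
}
```

A production version would additionally deduplicate concurrent fetches for the same key (e.g. with golang.org/x/sync/singleflight) so that a burst of identical requests still results in a single RPC.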

Verification

The rate of CommitStats RPCs taking longer than 10 seconds should consistently stay below 10 per hour.

[Screenshot: rate of CommitStats RPCs taking longer than 10 seconds]

https://log.gprd.gitlab.net/goto/1af19691a77914b95cbd8f6112a7df9d

Please note that this metric may be artificially lowered by the SRE team applying rate limiting as a mitigation: https://gitlab.com/gitlab-com/gl-infra/cloudflare-firewall/-/issues/89.