Investigate alerting for gitaly server apdex
https://nonprod-log.gitlab.net/goto/9100ee775c34ed6fbdddc19019a5b73a
At the time this issue was opened, the goserver apdex alert had paged the on-call 26 times in the preceding 10 days, across 11 different file servers.
Incidents and Corresponding Infradev Issues
| Incident | Root cause | InfraDev |
|---|---|---|
| production#5037 (closed) | FindCommit and TreeEntry calls by a single user are causing degradation on the node. | TODO |
| production#4944 (closed) | A user is pushing to that repo at 30-80 QPS. | No rate limiting for Git over SSH; also related to supporting the proxy protocol for Git SSH. TODO |
| production#4909 (closed) (and many more) | PostUploadPack traffic to the www-gitlab-com repository (our own runners) | gitlab-org/gitaly#3670 (closed) |
| https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5031 | A low rate of requests to the search API (5/min) to a single process causes degradation | https://gitlab.com/gitlab-org/gitlab/-/issues/334803 |
| production#5037 (closed) | One account making numerous requests to GET /api/:version/projects/:id/repository/files/:file_path | https://gitlab.com/gitlab-org/gitlab/-/issues/335075 |
| production#5109 (closed) & production#5330 (closed) | One account/project cloning; the resulting concurrency limiting in Gitaly then counts against apdex | |
| production#5229 (closed) | One account/project looking up a single commit, which results in a slow CommitStats RPC | gitlab-org/gitlab#337080 (closed) |
| production#5261 (closed) | Slow aggregated latency of the UserMergeToRef RPC. This RPC is now excluded from the Apdex calculations, so we're not expecting to see new incidents about this. | gitlab-org/gitlab#336979 (closed) |
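
To illustrate why a single slow RPC or a single noisy account can trip the alert, and what excluding an RPC such as UserMergeToRef does to the score, here is a minimal sketch. It uses the standard Apdex formula (satisfied plus half of tolerating, divided by total), which is not necessarily the exact expression in our alerting rules. The thresholds, the `Request` type, and the sample durations are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    rpc: str           # RPC name, e.g. "FindCommit"
    duration_s: float  # observed latency in seconds


def apdex(requests, satisfied_s, tolerated_s, excluded_rpcs=frozenset()):
    """Standard Apdex: (satisfied + tolerating / 2) / total.

    Requests whose RPC is in excluded_rpcs are ignored entirely.
    Returns None when no requests remain after exclusion.
    """
    counted = [r for r in requests if r.rpc not in excluded_rpcs]
    if not counted:
        return None
    satisfied = sum(1 for r in counted if r.duration_s <= satisfied_s)
    tolerating = sum(
        1 for r in counted if satisfied_s < r.duration_s <= tolerated_s
    )
    return (satisfied + tolerating / 2) / len(counted)


# Hypothetical sample: two fast calls, one slow call, one very slow call.
sample = [
    Request("FindCommit", 0.1),
    Request("TreeEntry", 0.2),
    Request("CommitStats", 3.0),
    Request("UserMergeToRef", 12.0),
]

# Illustrative thresholds only: 1 s satisfied, 4 s tolerated.
print(apdex(sample, 1.0, 4.0))
print(apdex(sample, 1.0, 4.0, excluded_rpcs={"UserMergeToRef"}))
```

With these made-up numbers, dropping the one very slow RPC from the calculation raises the score noticeably. That is the intended effect of the exclusion, but it also means slowness in an excluded RPC no longer shows up in this signal.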
Previous efforts to contain resource utilization of Gitaly
Note: These efforts are currently stalled and are not being actively worked on.
- Investigate and mitigate GitLab CI impact on Gitaly performance gitlab-org&4324
- Implement cgroups for Gitaly &344 (closed)
Edited by Craig Miskell