Investigate how https://gitlab.my.salesforce.com/00161000005h0vo caused so much damage to a single Gitaly server

Marquee customer https://gitlab.my.salesforce.com/00161000005h0vo is running some PoCs on GitLab.com.

During one of their tests on 2020-01-23, between 22h05 and 22h19, they, as @cmiskell put it, "melted the Gitaly node down for slag".

In particular, CPU and memory were completely saturated.

Gitaly stats during this period

https://dashboards.gitlab.net/d/gitaly-host-detail/gitaly-host-detail?orgId=1&from=1579815260366&to=1579820326164&var-PROMETHEUS_DS=Global&var-environment=gprd&var-fqdn=file-marquee-02-stor-gprd.c.gitlab-production.internal

Gitaly logs during that period

https://log.gitlab.net/goto/b2f26d25f8169d5ce591dd618cfeda6c

What were they doing?

Mostly PostUploadPack operations. The slowest took over 2 hours, but most were in the region of 5 to 10 minutes.

Did the concurrency limiter kick in?

Yes, up to about 15% of requests were rate-limited.
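For anyone following along who has not looked at how a per-repository concurrency limiter behaves, here is a minimal sketch in Python. It is not Gitaly's actual implementation (Gitaly is written in Go, and its limiter may queue requests rather than reject them). The class name `ConcurrencyLimiter`, the `max_in_flight` parameter and the `limited_ratio` helper are all hypothetical, made up for illustration. It only shows the general idea: cap the in-flight requests per key, and track what fraction of requests hit the cap.

```python
import threading
from collections import defaultdict


class ConcurrencyLimiter:
    """Caps in-flight requests per key (hypothetical sketch, not Gitaly's code)."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self._lock = threading.Lock()
        self._in_flight = defaultdict(int)  # key -> requests currently running
        self.admitted = 0                   # requests that got a slot
        self.limited = 0                    # requests that hit the cap

    def try_acquire(self, key):
        """Return True if a slot was taken for key, False if the cap was hit."""
        with self._lock:
            if self._in_flight[key] >= self.max_in_flight:
                self.limited += 1
                return False
            self._in_flight[key] += 1
            self.admitted += 1
            return True

    def release(self, key):
        """Free one slot for key once its request has finished."""
        with self._lock:
            if self._in_flight[key] > 0:
                self._in_flight[key] -= 1

    def limited_ratio(self):
        """Fraction of all seen requests that were limited."""
        total = self.admitted + self.limited
        return self.limited / total if total else 0.0
```

With `max_in_flight=2`, a third simultaneous `try_acquire("repo-a")` returns `False` while requests for a different key are still admitted. A figure like the 15% above would correspond to `limited_ratio()` over the incident window, assuming the dashboard counts limited requests against all requests.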


cc @francispotter
