Saturated disk write throughput on Gitaly VMs
While evaluating alternate disk types for Gitaly, we are looking closely at both the IOPS and throughput capacity of SSDs. During this investigation we found that we are clearly hitting the write throughput limits on the more active Gitaly disks, which results in throttling.
Note: this saturation is not visible in our Prometheus metrics, nor will it show up in Tamland, since only Stackdriver gives us these lower-level IO metrics for the disk device.
The instance throughput limit for SSDs is 1200 MiB/s, and we can clearly see that a number of Gitaly VMs are pegged at this limit.
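As a rough sketch of how to confirm this on a given VM (independent of Stackdriver), we can sample `/proc/diskstats` directly; the device-name pattern and 5-second interval below are assumptions, not values from this investigation:

```shell
#!/bin/sh
# Sketch: estimate sustained write throughput for a block device by
# sampling /proc/diskstats twice. The device default and interval are
# placeholders; pass the Gitaly data disk explicitly, e.g. "sdb".
DEV="${1:-$(awk '$3 ~ /^(sd[a-z]+|vd[a-z]+|nvme[0-9]+n[0-9]+)$/ { print $3; exit }' /proc/diskstats)}"
INTERVAL="${2:-5}"

# Field 10 of /proc/diskstats is the count of 512-byte sectors written.
sectors_written() {
  awk -v d="$DEV" '$3 == d { print $10 }' /proc/diskstats
}

s1=$(sectors_written)
sleep "$INTERVAL"
s2=$(sectors_written)

echo "$DEV wrote $(( (s2 - s1) * 512 / INTERVAL / 1048576 )) MiB/s"
```

If the reported figure sits at roughly 1200 MiB/s for sustained periods, the disk is pegged at the instance limit and writes are being throttled.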
Notes from our discussion in Slack:
- @jacobvosmaer-gitlab thinks this might be the pack-objects write-through cache, which is disk-backed.
- We talked about relocating this cache to a local ephemeral SSD for better performance, but we are not sure whether losing the cache on live migrations would cause issues. Although local SSD data is preserved in most cases, the cache is time-constrained rather than size-constrained, so moving it to a separate volume of fixed size would not work well.
- @jacobvosmaer-gitlab comments that there may be a simple way to reduce the volume of writes, which would help everyone, not just SaaS: the cache optimistically stores everything, including unique data that gets served only once. We could add "cache data if requested more than N times" logic; currently N is hard-coded to 0.
- It was noted that our disk write volume is significantly higher than our read volume; this is because most of our reads are served from memory.
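The "cache data if requested more than N times" idea could be sketched as a request-count gate in front of the cache. This is a hypothetical illustration, not Gitaly's actual pack-objects cache API; the names `countingCache` and `maybePut` are made up for this sketch:

```go
package main

import (
	"fmt"
	"sync"
)

// countingCache only writes an entry to the cache once the key has
// been requested more than minRequests times. With minRequests == 0
// (the current hard-coded behavior) everything is cached on first
// request, including unique data that is served only once.
type countingCache struct {
	mu          sync.Mutex
	minRequests int
	requests    map[string]int
	entries     map[string][]byte
}

func newCountingCache(minRequests int) *countingCache {
	return &countingCache{
		minRequests: minRequests,
		requests:    make(map[string]int),
		entries:     make(map[string][]byte),
	}
}

// get returns the cached value and whether it was a hit.
func (c *countingCache) get(key string) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.entries[key]
	return v, ok
}

// maybePut records the request and reports whether the entry was
// actually written to the cache.
func (c *countingCache) maybePut(key string, value []byte) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.requests[key]++
	if c.requests[key] <= c.minRequests {
		return false // unique or rare data: skip the disk write
	}
	c.entries[key] = value
	return true
}

func main() {
	c := newCountingCache(1) // N=1: only cache on the second request
	fmt.Println(c.maybePut("pack", []byte("data"))) // false
	fmt.Println(c.maybePut("pack", []byte("data"))) // true
}
```

Even N=1 would eliminate the disk write for every one-off fetch, at the cost of serving the second request uncached; picking N is a trade-off between write volume and cache hit rate.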
I think it is worth understanding this issue a bit more:
- Are we seeing a measurable performance degradation due to this throttling?
- Are we open to improving the caching logic?