How to address the filesystem slowdown
We have been hitting IOPS limits with CephFS, causing Azure to throttle us:
I checked all of your VMs in resource group Ceph-Prod; they are all experiencing heavy throttling. I see an average throttle delay of 40 msec, but it spikes up to 100 msec. I can see you are already using the highest disk tier (P30) and VM size (DS14). Most of these throttled IOs are writes, and I can also see that the cache hit ratio for reads is really high, so you are not reading much information from the disks themselves, as it appears most of it is cached. With all the information that I have, the only thing I can think of is to add more disks to the array. DS14 supports up to 32 disks, so based on the level of throttling I am seeing, you will need to have the 32 disks in the array.
I have a number of questions:
- Do we have any idea where our I/O is going? What tools can we use to help trace what is going on? When the filesystem was slow, we did observe that `ProjectCacheWorker` dominated the number of outstanding Sidekiq tasks (https://gitlab.com/gitlab-org/gitlab-ce/issues/23550). But this is still circumstantial; is there a way to correlate peak I/O loads with which directories/files are being written? (A rough sampling sketch follows this list.)
- Is it possible that CephFS is scrubbing, rebalancing, or doing something else bad that's causing all this I/O?
- What are the limits we are hitting? https://azure.microsoft.com/en-us/documentation/articles/azure-subscription-service-limits/#storage-limits seems to suggest 20,000 IOPS as the limit. Are we truly using that much?
- How can we better monitor whether we are hitting these limits?
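As a starting point for the first and last questions, here is a minimal sampling sketch. It assumes Python 3 on the VM, that the Azure data disks show up as `sd*` devices, and that we can read `/proc/diskstats` and `/proc/<pid>/io` (the latter generally needs root); the device prefix and interval are placeholders. It only approximates the numbers Azure throttles on, but it should show whether we are anywhere near 20,000 IOPS and which processes are doing the writing.

```python
#!/usr/bin/env python3
"""Rough per-device IOPS and per-process write sampler (sketch, not tooling).

Assumptions not taken from this issue: Python 3 on the VM, data disks named
sd*, and permission to read /proc/<pid>/io. Adjust INTERVAL and the device
prefix as needed.
"""
import os
import time

INTERVAL = 5  # seconds between the two samples


def disk_ops():
    """Return {device: (reads_completed, writes_completed)} from /proc/diskstats."""
    ops = {}
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            name = fields[2]
            if name.startswith("sd"):  # assumed naming for the Azure data disks
                ops[name] = (int(fields[3]), int(fields[7]))
    return ops


def write_bytes_by_pid():
    """Return {pid: (comm, write_bytes)} for every process we are allowed to read."""
    result = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/comm") as f:
                comm = f.read().strip()
            with open(f"/proc/{pid}/io") as f:
                io = dict(line.split(": ") for line in f.read().splitlines())
            result[pid] = (comm, int(io["write_bytes"]))
        except (OSError, KeyError, ValueError):
            continue  # process exited or /proc/<pid>/io not readable
    return result


disk_before, proc_before = disk_ops(), write_bytes_by_pid()
time.sleep(INTERVAL)
disk_after, proc_after = disk_ops(), write_bytes_by_pid()

print(f"Per-device IOPS over {INTERVAL}s:")
total_iops = 0.0
for dev, (r1, w1) in sorted(disk_after.items()):
    r0, w0 = disk_before.get(dev, (r1, w1))
    riops, wiops = (r1 - r0) / INTERVAL, (w1 - w0) / INTERVAL
    total_iops += riops + wiops
    print(f"  {dev}: {riops:8.0f} read IOPS  {wiops:8.0f} write IOPS")
print(f"  total: {total_iops:.0f} IOPS (compare against the documented 20,000)")

print("\nTop writers during the interval:")
deltas = []
for pid, (comm, wb1) in proc_after.items():
    wb0 = proc_before.get(pid, (comm, wb1))[1]
    deltas.append((wb1 - wb0, comm, pid))
for delta, comm, pid in sorted(deltas, reverse=True)[:10]:
    print(f"  {comm} (pid {pid}): {delta} bytes written")
```

Running this once during a quiet period and once during a slowdown would at least give us a baseline to compare against, and tell us whether the writes are coming from the Ceph OSD processes or from something on the client side.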
We discussed a number of options today:
- Add more disks, as Azure recommends, to buy us time (downside: $$$)
- Attempt to put in kernel-level rate limiting to prevent Azure throttling (see the cgroup sketch after this list)
- Limit user bandwidth at the HAProxy level (gitlab-com/infrastructure#609)
- Switch to another provider that doesn't have this issue
- Other ideas?
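For the kernel-level rate-limiting option, a minimal sketch of what that could look like with the cgroup v1 blkio controller is below. Every concrete value in it is an assumption for illustration: the cgroup name, the `/dev/sdc` device, the 400 write-IOPS cap, and the OSD pid list are placeholders, and it requires root. One caveat worth noting is that cgroup v1 blkio throttling applies most reliably to direct/synchronous I/O; buffered writeback may not be attributed to the cgroup, so this would need testing against Ceph's actual write pattern before we rely on it.

```python
#!/usr/bin/env python3
"""Sketch: cap write IOPS for the ceph-osd processes via cgroup v1 blkio.

All concrete values below (cgroup name, device, limit, pids) are placeholders,
not recommendations. Requires root and a cgroup v1 blkio hierarchy mounted at
/sys/fs/cgroup/blkio.
"""
import os

CGROUP = "/sys/fs/cgroup/blkio/ceph-osd-throttle"  # hypothetical cgroup
DEVICE = "/dev/sdc"                                # hypothetical OSD data disk
WRITE_IOPS_LIMIT = 400                             # placeholder cap
OSD_PIDS = [1234, 5678]                            # hypothetical ceph-osd pids

# blkio throttle files are keyed by the block device's major:minor numbers.
st = os.stat(DEVICE)
dev_id = f"{os.major(st.st_rdev)}:{os.minor(st.st_rdev)}"

# Create the cgroup and set a write IOPS ceiling for that device.
os.makedirs(CGROUP, exist_ok=True)
with open(os.path.join(CGROUP, "blkio.throttle.write_iops_device"), "w") as f:
    f.write(f"{dev_id} {WRITE_IOPS_LIMIT}\n")

# Move each OSD process into the throttled cgroup (one pid per write).
for pid in OSD_PIDS:
    with open(os.path.join(CGROUP, "tasks"), "w") as f:
        f.write(f"{pid}\n")

print(f"Writes to {DEVICE} ({dev_id}) capped at {WRITE_IOPS_LIMIT} IOPS "
      f"for pids {OSD_PIDS}")
```

The appeal of shaping I/O ourselves rather than letting Azure throttle is that we control which workloads slow down; the risk is that we just move the queueing from Azure into the kernel, so it only helps if combined with reducing the write volume in the first place.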