Walking Away from CephFS on Azure
Over the past month, as we have loaded more projects, users, and CI artifacts onto CephFS, one thing has become painfully clear: Microsoft Azure is not built to provide the IOPS that a high-performance distributed filesystem needs in order to be successful. When we first started having major issues with Azure (as noted in gitlab-com/infrastructure#520), we reached out for technical support. The end result was the Azure technicians informing us that they were rate limiting our disk IO:
Above (provided by Microsoft Azure) is the throttling of one host for one hour during an average day of utilization for GitLab. In response, we added three more OSD nodes and 60 new OSD disk targets, then rebalanced the Ceph cluster to try to ensure we weren't bumping up against Azure's IO limits. We also took our most IO-intensive hot spots and spread them over a 12-stripe Ceph configuration to decrease the IO load on any single disk, as sketched below.
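For context, CephFS controls striping through virtual extended attributes on directories, and files created afterwards inherit the layout. Here is a minimal sketch of that kind of change, using Python's `os.setxattr` on a hypothetical hot-spot directory; the path and values are illustrative, not our exact configuration:

```python
import os

# Hypothetical hot-spot directory on a CephFS mount. Raising
# stripe_count spreads each new file across more RADOS objects,
# so its IO lands on more OSDs (and thus more physical disks).
HOT_SPOT = "/ceph/repositories"  # illustrative path

os.setxattr(HOT_SPOT, "ceph.dir.layout.stripe_count", b"12")

# Verify the effective layout.
print(os.getxattr(HOT_SPOT, "ceph.dir.layout").decode())
```

Note that only files created after the change pick up the new layout; existing data keeps its original striping, so hot spots have to be rewritten to benefit.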
This past week we again saw performance continue to decline. Digging into the metrics, we began looking at the native IO latency of the mounted disks that Azure provides on the OSD servers. This is as close to the metal as we can measure IO latency, and it's abysmal:

This graph shows that `/dev/sdn1`, which is presented as `osd-184` on `ceph-osd9`, had a 20,000 ms (20 s) delayed IO access to the attached disk, for NO apparent reason.
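The graphs here come from our monitoring, but for readers who want to reproduce this kind of measurement, here is a rough sketch of how an iostat-style `await` figure can be derived from the kernel's `/proc/diskstats` counters. The device name and sampling interval are placeholders:

```python
import time

def diskstats(device):
    """Read cumulative (IOs completed, ms spent on IO) for one block device."""
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                reads, read_ms = int(fields[3]), int(fields[6])
                writes, write_ms = int(fields[7]), int(fields[10])
                return reads + writes, read_ms + write_ms
    raise ValueError(f"device {device!r} not in /proc/diskstats")

def average_await_ms(device, interval=10.0):
    """Average per-IO latency over `interval` seconds, like iostat's await."""
    ios_before, ms_before = diskstats(device)
    time.sleep(interval)
    ios_after, ms_after = diskstats(device)
    completed = ios_after - ios_before
    return (ms_after - ms_before) / completed if completed else 0.0

print(average_await_ms("sdn"))  # device name is a placeholder
```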
When we look at a random disk over a larger span of time, we see this is far from a one-off incident:

This graph shows that `/dev/sdaa` on `ceph-osd9` has multiple IO delays of tens of seconds throughout a given 24-hour period.
The last graph is doubly significant because the `/dev/sdaa` drive on each of the `ceph-osd[1-13]` servers serves as the journal disk for the rest of the attached disks. This disk is used to journal all Ceph transactions, so when it has an IO problem, ALL of the disks on that host that Ceph is using have an issue, as they can no longer journal data. A crude way to probe that journal write path directly is sketched below.
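Ceph's FileStore backend acknowledges a write only after it has been committed to the journal, so the journal device's synchronous-write latency puts a floor under every OSD on the host. Here is a minimal, hypothetical probe that mimics that pattern with small `O_DSYNC` writes; the path is a placeholder, and since a real journal often lives on a raw partition rather than a file, treat the numbers as an approximation:

```python
import os
import time

def sync_write_latency_ms(path, size=4096, count=100):
    """Time small O_DSYNC writes to approximate a journal's write pattern.

    Returns the worst single-write latency in milliseconds. The file at
    `path` should live on the device under test; it is created by the
    probe and removed afterwards.
    """
    buf = os.urandom(size)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
    worst = 0.0
    try:
        for _ in range(count):
            start = time.monotonic()
            os.write(fd, buf)  # O_DSYNC: returns only once data is on stable storage
            worst = max(worst, (time.monotonic() - start) * 1000.0)
    finally:
        os.close(fd)
        os.unlink(path)
    return worst

# Hypothetical scratch file on the filesystem backed by the journal device.
print(sync_write_latency_ms("/var/lib/ceph/journal-probe.tmp"))
```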
When we look at the composite of the `/dev/sdaa` devices across the entire Ceph fleet for a 24-hour period, we see a disturbing trend:
This graph demonstrates that Azure had a network 'pause' that affected all direct-attached disk IO for 40 seconds, an event we were never informed about by Azure or made aware of through their service notification tools.
When we pull back further to look at a composite of the `/dev/sdaa` drives over a multi-day period, the results are disastrous:
This graph demonstrates that on several days we saw IO latency of almost a full minute on the journal disks that gate performance for the entire Ceph cluster.