Investigate frequent bursts of stalled disk IO on Gitaly nodes
Some VMs have frequent bursts of disk I/O operations being aborted on one or more mounted Persistent Disks (PDs), often followed by the kernel issuing a SCSI bus reset on that block device to try to clear the stall.
- This appears to predominantly affect our Gitaly nodes and shared NFS server, all of which do lots of physical disk I/O.
- This behavior is probably not new, but my initial survey only covers the last 3 days, with spot-checking of individual hosts for longer timespans to confirm it is indeed a standing problem.
- I suspect this is an effect of PDs being biased for consistency over availability.
- Further, I suspect this may share a root cause with the soft lockup kernel panics, which distinctively and consistently show kernel stack traces of synchronous writes stalling for over 22 seconds (i.e. `vfs_write` leading to `new_sync_write`). The difference in outcome (a kernel panic versus just aborted I/O requests and a bus reset) may come down to whether the stalled operation was running synchronously or asynchronously.
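
To make "bursts" concrete when surveying hosts, one option is to group the timestamps of abort and reset events by the gap between them. The following is a minimal sketch only, not the method used for the survey above: the function name `group_into_bursts` and the 60-second gap threshold are assumptions chosen for illustration, and extracting the timestamps from kernel logs is left out because the exact log format is not covered here.

```python
from datetime import datetime, timedelta


def group_into_bursts(timestamps, max_gap=timedelta(seconds=60)):
    """Group event timestamps into bursts.

    A new burst starts whenever the gap since the previous event
    exceeds max_gap. The 60-second default is an arbitrary,
    hypothetical threshold; tune it to the observed behavior.
    """
    bursts = []
    for ts in sorted(timestamps):
        if bursts and ts - bursts[-1][-1] <= max_gap:
            # Close enough to the previous event: same burst.
            bursts[-1].append(ts)
        else:
            # First event, or the gap was too large: start a new burst.
            bursts.append([ts])
    return bursts


def summarize(bursts):
    """Return (start, end, event_count) for each burst."""
    return [(b[0], b[-1], len(b)) for b in bursts]


if __name__ == "__main__":
    # Hypothetical event times, for illustration only (not real data).
    base = datetime(2020, 1, 1, 12, 0, 0)
    events = [base + timedelta(seconds=s) for s in (0, 5, 20, 600, 610)]
    for start, end, count in summarize(group_into_bursts(events)):
        print(start.isoformat(), end.isoformat(), count)
```

Running the same grouping per host and per block device would make it easier to compare burst frequency across the Gitaly nodes and the NFS server.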