# Disk writes stalled on Gitaly node `file-38`
## Incident

For roughly 12 minutes, GCP Instance `file-38`'s GCP Persistent Disk `sdb` stopped accepting write requests. This caused Gitaly threads to stall in iowait, leading to timeouts for clients' gRPC calls.
## Background

Each Persistent Disk (PD) is presented to the VM as a SCSI block device. The Linux kernel on `file-38` maintains counters for the read and write operations completed by each block device, and since we poll those counters, we can see that for `sdb`, the write counter completely stopped incrementing for over 11 minutes. The read counter for `sdb` did not show a stall, so only write requests were failing to complete. Throughout this timespan, the other 2 PDs attached to this VM remained responsive, continuing to complete both read and write requests.
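For context, here is a minimal sketch of that stall check, assuming the counters are read from `/proc/diskstats` (whose 8th field is the device's "writes completed" count, per the kernel's iostats documentation) and polling `sdb` directly on the host:

```
# Sketch: poll the "writes completed" counter for sdb and flag whenever it
# fails to increment between polls. Device name and 15s interval are assumptions.
prev=""
while sleep 15; do
  cur=$(awk '$3 == "sdb" {print $8}' /proc/diskstats)
  if [ -n "$prev" ] && [ "$cur" -eq "$prev" ]; then
    echo "$(date -u '+%F %T UTC') sdb writes-completed counter stuck at $cur"
  fi
  prev="$cur"
done
```

In practice we get the same signal from the polled metrics shown in the dashboards below; the script just makes the detection logic explicit.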
For the record, the stalled PD (`sdb`) is an SSD-backed volume. (As for the other two PDs that did not stall, `sda` is also an SSD, whereas `sdc` is a "Standard" non-SSD PD.) The I/O rate for `sdb` was nowhere near its IOPS quota. Likewise, the VM's network egress throughput was nowhere near its quota (even accounting for the fact that disk write throughput to PDs counts triple towards the network egress quota).
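To make that egress accounting concrete, here is an illustrative calculation with invented numbers (the real throughput figures and the VM's egress cap are on the dashboards, not reproduced here):

```
# Illustrative only: every byte written to a PD counts 3x against the VM's
# network egress cap, on top of ordinary network egress. All values assumed.
disk_write_gbps="0.4"   # assumed PD write throughput
net_egress_gbps="1.2"   # assumed network egress
egress_cap_gbps="16"    # assumed cap; the real value depends on machine type
echo "$disk_write_gbps $net_egress_gbps $egress_cap_gbps" | \
  awk '{ printf "effective egress: %.1f of %s Gbit/s\n", 3 * $1 + $2, $3 }'
```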
## Potentially related previous incident

The above symptoms are similar to another incident affecting a Gitaly node (`file-16`) on 2019-12-10. In that incident, the root cause was that the PD became unwritable because of an edge case in the Persistent Disk infrastructure: the 3 backing storage nodes could not reach consensus on the metadata for one of the segments of that PD, so without quorum they could not complete any write ops. GCP Support says a fix for that edge case is planned for this quarter (2020 Q1). Today's outage may or may not be another instance of that edge case.
## Details

- Start of stall: 2020-01-30 06:50 UTC
- End of stall: 2020-01-30 07:01 UTC
- GCP Instance (`file-38`): https://console.cloud.google.com/compute/instancesDetail/zones/us-east1-d/instances/file-38-stor-gprd?project=gitlab-production&q=search
- GCP Persistent Disk affected (`sdb` from the kernel's perspective): https://console.cloud.google.com/compute/disksDetail/zones/us-east1-d/disks/file-38-stor-gprd-data?project=gitlab-production
- Public dashboard link: https://dashboards.gitlab.com/d/bd2Kl9Imk/host-stats?orgId=1&from=1580365800000&to=1580369400000&var-environment=gprd&var-node=file-38-stor-gprd.c.gitlab-production.internal&var-promethus=prometheus-01-inf-gprd
## Dashboard screenshots
CPU usage is dominated by iowait, and load average climbs steadily as more threads become blocked waiting for disk I/O
`sdb`: Writes stall but reads do not

Disk throughput for Device `sdb` (data volume):
`sda` and `sdc`: No impact; normal read and write throughput pattern throughout the incident

Disk throughput for Device `sda` (root volume):

Device `sdc` (log volume):
Network usage (implicitly excludes storage traffic, which also counts against the VM's egress quota)
Network connectivity remained available throughout the incident. Network throughput remained normal. The count of TCP connections rose during the incident, presumably due to clients timing out and attempting to recover by reconnecting.
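For reference, the connection count can also be checked by hand on the host with standard `ss` options:

```
# Count established TCP connections; tail drops ss's header line.
ss -tn state established | tail -n +2 | wc -l
```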
## Kernel logs
Near the start of the incident, the kernel's hung-task detector logged that several processes had been stalled for over 120 seconds waiting on syscalls that needed to synchronously write to an ext4 filesystem. Examples (a sketch for pulling these reports out of the kernel log follows the list):

- `git` called `vfs_mkdir` and stalled waiting to create an inode.
- `jbd2/sdb` (the kernel thread for writing to the ext4 journal) stalled while trying to commit a transaction.
- `gitaly` called `vfs_fsync_range` on an ext4 inode and stalled waiting to synchronously flush cached writes.
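The 120-second threshold is the hung-task detector's `kernel.hung_task_timeout_secs` sysctl (120 is the usual kernel default; whether this host overrides it was not verified), and the reports can be pulled out of the kernel ring buffer like so:

```
# Show the hung-task detector's stall threshold (commonly 120 seconds).
sysctl kernel.hung_task_timeout_secs

# Pull the "blocked for more than N seconds" reports plus nearby stack traces.
dmesg -T | grep -B1 -A12 'blocked for more than'
```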
```
msmiley@file-38-stor-gprd.c.gitlab-production.internal:~$ sudo cat /var/log/syslog | grep 'kernel' | grep -v 'audit:' | gzip -c > $( hostname -s ).syslog.kernel.20200130.log.gz
msmiley@file-38-stor-gprd.c.gitlab-production.internal:~$ zcat file-38-stor-gprd.syslog.kernel.20200130.log.gz | wc -l
357
```