# Disk writes stalled on Gitaly node `file-38`
## Incident

For roughly 12 minutes, GCP Instance `file-38`'s GCP Persistent Disk `sdb` stopped accepting write requests. This caused Gitaly threads to stall in iowait, leading to timeouts for clients' gRPC calls.
## Background

Each Persistent Disk (PD) is presented to the VM as a SCSI block device. The Linux kernel on `file-38` maintains counters for the read and write operations completed by each block device, and since we poll those counters, we can see that for `sdb`, the write counter completely stopped incrementing for over 11 minutes. The read counter for `sdb` did not show a stall, so only write requests were failing to complete. Throughout this timespan, the other 2 PDs attached to this VM remained responsive, continuing to complete both read and write requests.
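For context, here is a minimal sketch of that stall check, assuming the counters are read from `/proc/diskstats` (whose 8th field is the device's "writes completed" count, per the kernel's iostats documentation) and polling `sdb` directly on the host:

```
# Sketch: poll the "writes completed" counter for sdb and flag whenever it
# fails to increment between polls. Device name and 15s interval are assumptions.
prev=""
while sleep 15; do
  cur=$(awk '$3 == "sdb" {print $8}' /proc/diskstats)
  if [ -n "$prev" ] && [ "$cur" -eq "$prev" ]; then
    echo "$(date -u '+%F %T UTC') sdb writes-completed counter stuck at $cur"
  fi
  prev="$cur"
done
```

In practice we get the same signal from the polled metrics shown in the dashboards below; the script just makes the detection logic explicit.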
For the record, the stalled PD (`sdb`) is an SSD-backed volume. (As for the other two PDs that did not stall, `sda` is also an SSD, whereas `sdc` is a "Standard" non-SSD PD.) The I/O rate for `sdb` was nowhere near its IOPS quota. Likewise, the VM's network egress throughput was nowhere near its quota (even accounting for the fact that disk write throughput to PDs counts triple towards the network egress quota).
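To make that egress accounting concrete, here is an illustrative calculation with invented numbers (the real throughput figures and the VM's egress cap are on the dashboards, not reproduced here):

```
# Illustrative only: every byte written to a PD counts 3x against the VM's
# network egress cap, on top of ordinary network egress. All values assumed.
disk_write_gbps="0.4"   # assumed PD write throughput
net_egress_gbps="1.2"   # assumed network egress
egress_cap_gbps="16"    # assumed cap; the real value depends on machine type
echo "$disk_write_gbps $net_egress_gbps $egress_cap_gbps" | \
  awk '{ printf "effective egress: %.1f of %s Gbit/s\n", 3 * $1 + $2, $3 }'
```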
## Potentially related previous incident

The above symptoms are similar to another incident affecting a Gitaly node (`file-16`) on 2019-12-10. In that incident, the root cause was that the PD became unwritable because of an edge case in the Persistent Disk infrastructure: the 3 backing storage nodes could not reach consensus on the metadata for one of the segments of that PD, so without quorum they could not complete any write ops. GCP Support says a fix for that edge case is planned for this quarter (2020 Q1). Today's outage may or may not be another instance of that edge case.
## Details

- Start of stall: 2020-01-30 06:50 UTC
- End of stall: 2020-01-30 07:01 UTC
- GCP Instance (`file-38`): https://console.cloud.google.com/compute/instancesDetail/zones/us-east1-d/instances/file-38-stor-gprd?project=gitlab-production&q=search
- GCP Persistent Disk affected (`sdb` from the kernel's perspective): https://console.cloud.google.com/compute/disksDetail/zones/us-east1-d/disks/file-38-stor-gprd-data?project=gitlab-production
- Public dashboard link: https://dashboards.gitlab.com/d/bd2Kl9Imk/host-stats?orgId=1&from=1580365800000&to=1580369400000&var-environment=gprd&var-node=file-38-stor-gprd.c.gitlab-production.internal&var-promethus=prometheus-01-inf-gprd
## Dashboard screenshots
CPU usage is dominated by iowait, and load average climbs steadily as more threads become blocked waiting for disk I/O
`sdb`: Writes stall but reads do not

Disk throughput for Device `sdb` (data volume):
`sda` and `sdc`: No impact; normal read and write throughput pattern throughout the incident

Disk throughput for Device `sda` (root volume):

Device `sdc` (log volume):
Network usage (implicitly excludes storage traffic, which also counts against the VM's egress quota)
Network connectivity remained available throughout the incident. Network throughput remained normal. The count of TCP connections rose during the incident, presumably due to clients timing out and attempting to recover by reconnecting.
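For reference, the connection count can also be checked by hand on the host with standard `ss` options:

```
# Count established TCP connections; tail drops ss's header line.
ss -tn state established | tail -n +2 | wc -l
```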
## Kernel logs
Near the start of the incident, the kernel's hung-task detector logged that several processes had been stalled for over 120 seconds waiting on syscalls that needed to synchronously write to an ext4 filesystem. Examples (a sketch for pulling these reports out of the kernel log follows the list):

- `git` called `vfs_mkdir` and stalled waiting to create an inode.
- `jbd2/sdb` (the kernel thread for writing to the ext4 journal) stalled while trying to commit a transaction.
- `gitaly` called `vfs_fsync_range` on an ext4 inode and stalled waiting to synchronously flush cached writes.
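The 120-second threshold is the hung-task detector's `kernel.hung_task_timeout_secs` sysctl (120 is the usual kernel default; whether this host overrides it was not verified), and the reports can be pulled out of the kernel ring buffer like so:

```
# Show the hung-task detector's stall threshold (commonly 120 seconds).
sysctl kernel.hung_task_timeout_secs

# Pull the "blocked for more than N seconds" reports plus nearby stack traces.
dmesg -T | grep -B1 -A12 'blocked for more than'
```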
```
msmiley@file-38-stor-gprd.c.gitlab-production.internal:~$ sudo cat /var/log/syslog | grep 'kernel' | grep -v 'audit:' | gzip -c > $( hostname -s ).syslog.kernel.20200130.log.gz
msmiley@file-38-stor-gprd.c.gitlab-production.internal:~$ zcat file-38-stor-gprd.syslog.kernel.20200130.log.gz | wc -l
357
```