High read IOPS on nfs-file-10

Context

An abnormal rate of read IOPS caused nfs-file-10 to consume all the available IOPS on its disks. This resulted in GitLab.com returning errors and being generally very slow.

A sharp rise in read IOPS is visible just before the outage, pointing to a sudden increase in read traffic.
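
For reference, below is a minimal sketch of how such a rise could be pulled out of Prometheus after the fact. It assumes the node exporter metric node_disk_reads_completed, an instance label matching nfs-file-10, and a placeholder Prometheus address; the exact metric and label names in our setup may differ.

```python
# Minimal sketch: query Prometheus for the read IOPS rate on nfs-file-10
# around the incident window. The server URL, metric name and labels are
# assumptions and may not match the production setup exactly.
import requests

PROMETHEUS = "http://prometheus.example.com:9090"  # assumed address
QUERY = 'rate(node_disk_reads_completed{instance=~"nfs-file-10.*"}[1m])'

resp = requests.get(
    f"{PROMETHEUS}/api/v1/query_range",
    params={
        "query": QUERY,
        "start": "2017-10-26T14:00:00Z",
        "end": "2017-10-26T14:30:00Z",
        "step": "15s",
    },
    timeout=10,
)
resp.raise_for_status()

# Print the peak read IOPS seen per disk device in the window.
for series in resp.json()["data"]["result"]:
    device = series["metric"].get("device", "?")
    peak = max(float(value) for _, value in series["values"])
    print(f"{device}: peak read IOPS ~ {peak:.0f}")
```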

Timeline

Date: 2017-10-26

  • 14:21:55 UTC - nfs-file-10 started to report RPC errors in the system log.
  • 14:23:04 UTC - We were alerted that GitLab.com was returning errors.
  • 14:24:52 UTC - Last RPC error on nfs-file-10.

Incident Analysis

  • How was the incident detected?

Humans reported the problem before our monitoring alerted us.

  • Is there anything that could have been done to improve the time to detection?

Probably not. A more granular error detection policy could lead to false positives.

  • How was the root cause discovered?

By looking at the graphs on the Azure portal.

  • Was this incident triggered by a change?

No.

  • Was there an existing issue that would have either prevented this incident or reduced the impact?

Plans to iterate on the storage layer for GitLab.com have been in the works for a long time.

What went well

  • Only a small number of engineers were involved, and they were able to pinpoint where the issue was. Everyone else carried on with their normal work, reducing the overall cost of the incident.

What can be improved

  • We need a better storage layer.
  • We need to understand why the IOPS metrics differ between Prometheus and Azure (a rough comparison sketch follows this list).
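
A rough sketch of how the Prometheus and Azure read-IOPS series could be lined up for comparison, assuming both have been exported to CSV as timestamp,value pairs; the file names, column layout, and the CEST-to-UTC offset handling are illustrative assumptions, not how the data was actually collected.

```python
# Rough sketch: align two exported IOPS series (Prometheus vs. Azure portal)
# on minute boundaries and print where they diverge. File names and CSV
# layout are assumptions for illustration only.
import csv
from datetime import datetime, timedelta

def load(path, tz_offset_hours=0):
    """Load timestamp,value rows into a dict keyed by minute (UTC)."""
    series = {}
    with open(path) as f:
        for row in csv.reader(f):
            try:
                ts = datetime.fromisoformat(row[0]) - timedelta(hours=tz_offset_hours)
            except (ValueError, IndexError):
                continue  # skip header or malformed rows
            series[ts.replace(second=0, microsecond=0)] = float(row[1])
    return series

prom = load("prometheus_read_iops.csv")                   # assumed already in UTC
azure = load("azure_read_iops.csv", tz_offset_hours=2)    # portal shows CEST (UTC+2)

for ts in sorted(prom.keys() & azure.keys()):
    diff = prom[ts] - azure[ts]
    print(f"{ts:%H:%M} UTC  prometheus={prom[ts]:8.0f}  azure={azure[ts]:8.0f}  diff={diff:+8.0f}")
```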

Graphs

  • Bytes read / bytes written in Azure (times in CEST): Screen_Shot_2017-10-26_at_17.08.20
  • Read IOPS / write IOPS in Azure (times in CEST): Screen_Shot_2017-10-26_at_17.08.38
  • Disk utilisation in Prometheus: Screen_Shot_2017-10-26_at_17.09.54