High read IOPS on nfs-file-10

Context

An abnormal rate of read IOPS caused nfs-file-10 to consume all the available IOPS on its disks. This resulted in GitLab.com returning errors and being generally very slow.

A sharp rise in read IOPS is visible just before the outage, pointing to a sudden increase in read traffic.
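
For reference, below is a minimal sketch of how such a rise could be pulled out of Prometheus after the fact. It assumes the node exporter metric node_disk_reads_completed, an instance label matching nfs-file-10, and a placeholder Prometheus address; the exact metric and label names in our setup may differ.

```python
# Minimal sketch: query Prometheus for the read IOPS rate on nfs-file-10
# around the incident window. The server URL, metric name and labels are
# assumptions and may not match the production setup exactly.
import requests

PROMETHEUS = "http://prometheus.example.com:9090"  # assumed address
QUERY = 'rate(node_disk_reads_completed{instance=~"nfs-file-10.*"}[1m])'

resp = requests.get(
    f"{PROMETHEUS}/api/v1/query_range",
    params={
        "query": QUERY,
        "start": "2017-10-26T14:00:00Z",
        "end": "2017-10-26T14:30:00Z",
        "step": "15s",
    },
    timeout=10,
)
resp.raise_for_status()

# Print the peak read IOPS seen per disk device in the window.
for series in resp.json()["data"]["result"]:
    device = series["metric"].get("device", "?")
    peak = max(float(value) for _, value in series["values"])
    print(f"{device}: peak read IOPS ~ {peak:.0f}")
```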

Timeline

Date: 2017-10-26

  • 14:21:55 UTC - nfs-file-10 started to report RPC errors in the system log.
  • 14:23:04 UTC - We were alerted that GitLab.com was returning errors.
  • 14:24:52 UTC - Last RPC error on nfs-file-10.

Incident Analysis

  • How was the incident detected?

Humans reported the problem before our monitoring alerted us.

  • Is there anything that could have been done to improve the time to detection?

Probably not. A more granular error detection policy could lead to false positives.

  • How was the root cause discovered?

By looking at the graphs on the Azure portal.

  • Was this incident triggered by a change?

No.

  • Was there an existing issue that would have either prevented this incident or reduced the impact?

Plans to iterate on the storage layer for GitLab.com have been in the works for a long time.

What went well

  • Only a small number of engineers were involved, and they were able to pinpoint where the issue was. Everyone else carried on with their normal work, reducing the overall cost of the incident.

What can be improved

  • We need a better storage layer.
  • We need to understand why the IOPS metrics differ between Prometheus and Azure (a rough comparison sketch follows this list).
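
A rough sketch of how the Prometheus and Azure read-IOPS series could be lined up for comparison, assuming both have been exported to CSV as timestamp,value pairs; the file names, column layout, and the CEST-to-UTC offset handling are illustrative assumptions, not how the data was actually collected.

```python
# Rough sketch: align two exported IOPS series (Prometheus vs. Azure portal)
# on minute boundaries and print where they diverge. File names and CSV
# layout are assumptions for illustration only.
import csv
from datetime import datetime, timedelta

def load(path, tz_offset_hours=0):
    """Load timestamp,value rows into a dict keyed by minute (UTC)."""
    series = {}
    with open(path) as f:
        for row in csv.reader(f):
            try:
                ts = datetime.fromisoformat(row[0]) - timedelta(hours=tz_offset_hours)
            except (ValueError, IndexError):
                continue  # skip header or malformed rows
            series[ts.replace(second=0, microsecond=0)] = float(row[1])
    return series

prom = load("prometheus_read_iops.csv")                   # assumed already in UTC
azure = load("azure_read_iops.csv", tz_offset_hours=2)    # portal shows CEST (UTC+2)

for ts in sorted(prom.keys() & azure.keys()):
    diff = prom[ts] - azure[ts]
    print(f"{ts:%H:%M} UTC  prometheus={prom[ts]:8.0f}  azure={azure[ts]:8.0f}  diff={diff:+8.0f}")
```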

Graphs

  • Bytes read / bytes written in Azure (times in CEST): Screen_Shot_2017-10-26_at_17.08.20
  • Read IOPS / write IOPS in Azure (times in CEST): Screen_Shot_2017-10-26_at_17.08.38
  • Disk utilisation in Prometheus: Screen_Shot_2017-10-26_at_17.09.54