2018-07-02 - Azure host update on storage nodes caused an increase of errors and impacted access to customer repositories

Summary

On July 2nd host maintenance was done across many VMs in Azure which resulted in storage and other hosts to go unavailable. The following hosts were affected:

prometheus-02
nfs-06
nfs-03
nfs-10
nfs-16
nfs-19
nfs-11

Timeline of events

2018-07-01 ~ 22:00 UTC - First pages come in for nfs-11 and prometheus-2 being unreachable.
2018-07-02 00:00 UTC - tracking issues from pingdom alerts for GitLab.com availability.
2018-07-02 03:00 UTC - The issues appear to be from our hosting in Azure seeing: " INFO: rcu_sched detected stalls on CPUs/tasks " on nfs-11, nfs-16, and nfs-19
2018-07-02 04:00 nfs-11 and nfs-19 appear to have healed after gitaly restarts. We tried redeploying nfs-16 from the Azure portal and once back, GitLab.com was back to normal operations.

Internal tracking doc: https://docs.google.com/document/d/1gMNTO61lw5daZyiODVvSHEsAJalNS-K2oGp6SzcfdoY/edit

Monitoring

Logs

Example from kern.log on nfs-16

[Mon Jul  2 02:15:06 2018] hv_utils: Heartbeat IC version 3.0
[Mon Jul  2 02:15:06 2018] hv_utils: Shutdown IC version 3.0
[Mon Jul  2 02:15:06 2018] hv_utils: VSS IC version 5.0
[Mon Jul  2 02:27:09 2018] hv_balloon: INFO_TYPE_MAX_PAGE_CNT = 14680064
[Mon Jul  2 02:44:01 2018] INFO: rcu_sched detected stalls on CPUs/tasks:
[Mon Jul  2 02:44:01 2018]      1-...: (14732 GPs behind) idle=f6c/0/0 softirq=1603277934/1603277935 fqs=0 
[Mon Jul  2 02:44:01 2018]      3-...: (0 ticks this GP) idle=17c/0/0 softirq=1585541660/1585541660 fqs=0 
[Mon Jul  2 02:44:01 2018]      4-...: (14732 GPs behind) idle=5f6/0/0 softirq=1586313422/1586313423 fqs=0 
[Mon Jul  2 02:44:01 2018]      7-...: (14732 GPs behind) idle=f10/0/0 softirq=1576574670/1576574671 fqs=0 
[Mon Jul  2 02:44:01 2018]      (detected by 0, t=15010 jiffies, g=490844846, c=490844845, q=30866)
[Mon Jul  2 02:44:01 2018] Sending NMI from CPU 0 to CPUs 1:
[Mon Jul  2 02:44:01 2018] NMI backtrace for cpu 1 skipped: idling at pc 0xffffffffba21c9c6
[Mon Jul  2 02:44:01 2018] Sending NMI from CPU 0 to CPUs 3:
[Mon Jul  2 02:44:01 2018] NMI backtrace for cpu 3 skipped: idling at pc 0xffffffffba21c9c6
[Mon Jul  2 02:44:01 2018] Sending NMI from CPU 0 to CPUs 4:
[Mon Jul  2 02:44:01 2018] NMI backtrace for cpu 4 skipped: idling at pc 0xffffffffba21c9c6
[Mon Jul  2 02:44:01 2018] Sending NMI from CPU 0 to CPUs 7:
[Mon Jul  2 02:44:01 2018] NMI backtrace for cpu 7 skipped: idling at pc 0xffffffffba21c9c6
[Mon Jul  2 02:44:01 2018] rcu_sched kthread starved for 15012 jiffies! g490844846 c490844845 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
[Mon Jul  2 02:44:01 2018] rcu_sched       S    0     8      2 0x00000000
[Mon Jul  2 02:44:01 2018] Call Trace:
[Mon Jul  2 02:44:01 2018]  __schedule+0x3b9/0x8f0
[Mon Jul  2 02:44:01 2018]  schedule+0x36/0x80
[Mon Jul  2 02:44:01 2018]  schedule_timeout+0x188/0x310
[Mon Jul  2 02:44:01 2018]  ? del_timer_sync+0x50/0x50
[Mon Jul  2 02:44:01 2018]  rcu_gp_kthread+0x551/0x910
[Mon Jul  2 02:44:01 2018]  ? rcu_gp_kthread+0x551/0x910
[Mon Jul  2 02:44:01 2018]  kthread+0x109/0x140
[Mon Jul  2 02:44:01 2018]  ? rcu_note_context_switch+0x100/0x100
[Mon Jul  2 02:44:01 2018]  ? kthread_create_on_node+0x70/0x70
[Mon Jul  2 02:44:01 2018]  ret_from_fork+0x25/0x30
[Mon Jul  2 02:47:01 2018] INFO: rcu_sched detected stalls on CPUs/tasks:
[Mon Jul  2 02:47:01 2018]      3-...: (0 ticks this GP) idle=aee/0/0 softirq=1585541660/1585541660 fqs=1 
[Mon Jul  2 02:47:01 2018]      (detected by 5, t=60009 jiffies, g=490844846, c=490844845, q=40843)

Edited Jul 02, 2018 by John Jarvis