pgio went nuts at Gitlab Staging

Last week, GL experienced a sub-optimal performance in a Staging node.

References:

https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7932 https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8026

Some log output:

[Thu Aug 29 08:34:42 2019] INFO: task kworker/u16:5:15613 blocked for more than 120 seconds.
[Thu Aug 29 08:34:42 2019]       Not tainted 4.15.0-1015-gcp #15~16.04.1-Ubuntu
[Thu Aug 29 08:34:42 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Aug 29 08:34:42 2019] kworker/u16:5   D    0 15613      2 0x80000080
[Thu Aug 29 08:34:42 2019] Workqueue: events_unbound fsnotify_mark_destroy_workfn
[Thu Aug 29 08:34:42 2019] Call Trace:
[Thu Aug 29 08:34:42 2019]  __schedule+0x3db/0x8c0
[Thu Aug 29 08:34:42 2019]  schedule+0x36/0x80
[Thu Aug 29 08:34:42 2019]  schedule_timeout+0x1db/0x370
[Thu Aug 29 08:34:42 2019]  ? wake_up_process+0x15/0x20
[Thu Aug 29 08:34:42 2019]  wait_for_completion+0xb9/0x150
[Thu Aug 29 08:34:42 2019]  ? wake_up_q+0x70/0x70
[Thu Aug 29 08:34:42 2019]  __synchronize_srcu.part.13+0x85/0xb0
[Thu Aug 29 08:34:42 2019]  ? trace_raw_output_rcu_utilization+0x50/0x50
[Thu Aug 29 08:34:42 2019]  ? _raw_spin_unlock_irqrestore+0x3a/0x40
[Thu Aug 29 08:34:42 2019]  synchronize_srcu+0x7e/0x130
[Thu Aug 29 08:34:42 2019]  ? synchronize_srcu+0x7e/0x130
[Thu Aug 29 08:34:42 2019]  fsnotify_mark_destroy_workfn+0x7c/0xe0
[Thu Aug 29 08:34:42 2019]  process_one_work+0x152/0x410
[Thu Aug 29 08:34:42 2019]  worker_thread+0x4b/0x470
[Thu Aug 29 08:34:42 2019]  kthread+0x105/0x140
[Thu Aug 29 08:34:42 2019]  ? process_one_work+0x410/0x410
[Thu Aug 29 08:34:42 2019]  ? kthread_associate_blkcg+0xa0/0xa0
[Thu Aug 29 08:34:42 2019]  ? do_syscall_64+0x7b/0x150
[Thu Aug 29 08:34:42 2019]  ret_from_fork+0x3a/0x50
[Thu Aug 29 08:36:43 2019] INFO: task kworker/u16:5:15613 blocked for more than 120 seconds.
[Thu Aug 29 08:36:43 2019]       Not tainted 4.15.0-1015-gcp #15~16.04.1-Ubuntu
[Thu Aug 29 08:36:43 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Aug 29 08:36:43 2019] kworker/u16:5   D    0 15613      2 0x80000080
[Thu Aug 29 08:36:43 2019] Workqueue: events_unbound fsnotify_mark_destroy_workfn
[Thu Aug 29 08:36:43 2019] Call Trace:
[Thu Aug 29 08:36:43 2019]  __schedule+0x3db/0x8c0
[Thu Aug 29 08:36:43 2019]  schedule+0x36/0x80
[Thu Aug 29 08:36:43 2019]  schedule_timeout+0x1db/0x370
[Thu Aug 29 08:36:43 2019]  ? wake_up_process+0x15/0x20
[Thu Aug 29 08:36:43 2019]  wait_for_completion+0xb9/0x150
[Thu Aug 29 08:36:43 2019]  ? wake_up_q+0x70/0x70
[Thu Aug 29 08:36:43 2019]  __synchronize_srcu.part.13+0x85/0xb0
[Thu Aug 29 08:36:43 2019]  ? trace_raw_output_rcu_utilization+0x50/0x50
[Thu Aug 29 08:36:43 2019]  ? _raw_spin_unlock_irqrestore+0x3a/0x40
[Thu Aug 29 08:36:43 2019]  synchronize_srcu+0x7e/0x130
[Thu Aug 29 08:36:43 2019]  ? synchronize_srcu+0x7e/0x130
[Thu Aug 29 08:36:43 2019]  fsnotify_mark_destroy_workfn+0x7c/0xe0
[Thu Aug 29 08:36:43 2019]  process_one_work+0x152/0x410
[Thu Aug 29 08:36:43 2019]  worker_thread+0x4b/0x470
[Thu Aug 29 08:36:43 2019]  kthread+0x105/0x140
[Thu Aug 29 08:36:43 2019]  ? process_one_work+0x410/0x410
[Thu Aug 29 08:36:43 2019]  ? kthread_associate_blkcg+0xa0/0xa0
[Thu Aug 29 08:36:43 2019]  ? do_syscall_64+0x7b/0x150
[Thu Aug 29 08:36:43 2019]  ret_from_fork+0x3a/0x50