pgio went nuts at Gitlab Staging
Last week, GL experienced a sub-optimal performance in a Staging node.
References:
https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7932 https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8026
Some log output:
[Thu Aug 29 08:34:42 2019] INFO: task kworker/u16:5:15613 blocked for more than 120 seconds.
[Thu Aug 29 08:34:42 2019] Not tainted 4.15.0-1015-gcp #15~16.04.1-Ubuntu
[Thu Aug 29 08:34:42 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Aug 29 08:34:42 2019] kworker/u16:5 D 0 15613 2 0x80000080
[Thu Aug 29 08:34:42 2019] Workqueue: events_unbound fsnotify_mark_destroy_workfn
[Thu Aug 29 08:34:42 2019] Call Trace:
[Thu Aug 29 08:34:42 2019] __schedule+0x3db/0x8c0
[Thu Aug 29 08:34:42 2019] schedule+0x36/0x80
[Thu Aug 29 08:34:42 2019] schedule_timeout+0x1db/0x370
[Thu Aug 29 08:34:42 2019] ? wake_up_process+0x15/0x20
[Thu Aug 29 08:34:42 2019] wait_for_completion+0xb9/0x150
[Thu Aug 29 08:34:42 2019] ? wake_up_q+0x70/0x70
[Thu Aug 29 08:34:42 2019] __synchronize_srcu.part.13+0x85/0xb0
[Thu Aug 29 08:34:42 2019] ? trace_raw_output_rcu_utilization+0x50/0x50
[Thu Aug 29 08:34:42 2019] ? _raw_spin_unlock_irqrestore+0x3a/0x40
[Thu Aug 29 08:34:42 2019] synchronize_srcu+0x7e/0x130
[Thu Aug 29 08:34:42 2019] ? synchronize_srcu+0x7e/0x130
[Thu Aug 29 08:34:42 2019] fsnotify_mark_destroy_workfn+0x7c/0xe0
[Thu Aug 29 08:34:42 2019] process_one_work+0x152/0x410
[Thu Aug 29 08:34:42 2019] worker_thread+0x4b/0x470
[Thu Aug 29 08:34:42 2019] kthread+0x105/0x140
[Thu Aug 29 08:34:42 2019] ? process_one_work+0x410/0x410
[Thu Aug 29 08:34:42 2019] ? kthread_associate_blkcg+0xa0/0xa0
[Thu Aug 29 08:34:42 2019] ? do_syscall_64+0x7b/0x150
[Thu Aug 29 08:34:42 2019] ret_from_fork+0x3a/0x50
[Thu Aug 29 08:36:43 2019] INFO: task kworker/u16:5:15613 blocked for more than 120 seconds.
[Thu Aug 29 08:36:43 2019] Not tainted 4.15.0-1015-gcp #15~16.04.1-Ubuntu
[Thu Aug 29 08:36:43 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Aug 29 08:36:43 2019] kworker/u16:5 D 0 15613 2 0x80000080
[Thu Aug 29 08:36:43 2019] Workqueue: events_unbound fsnotify_mark_destroy_workfn
[Thu Aug 29 08:36:43 2019] Call Trace:
[Thu Aug 29 08:36:43 2019] __schedule+0x3db/0x8c0
[Thu Aug 29 08:36:43 2019] schedule+0x36/0x80
[Thu Aug 29 08:36:43 2019] schedule_timeout+0x1db/0x370
[Thu Aug 29 08:36:43 2019] ? wake_up_process+0x15/0x20
[Thu Aug 29 08:36:43 2019] wait_for_completion+0xb9/0x150
[Thu Aug 29 08:36:43 2019] ? wake_up_q+0x70/0x70
[Thu Aug 29 08:36:43 2019] __synchronize_srcu.part.13+0x85/0xb0
[Thu Aug 29 08:36:43 2019] ? trace_raw_output_rcu_utilization+0x50/0x50
[Thu Aug 29 08:36:43 2019] ? _raw_spin_unlock_irqrestore+0x3a/0x40
[Thu Aug 29 08:36:43 2019] synchronize_srcu+0x7e/0x130
[Thu Aug 29 08:36:43 2019] ? synchronize_srcu+0x7e/0x130
[Thu Aug 29 08:36:43 2019] fsnotify_mark_destroy_workfn+0x7c/0xe0
[Thu Aug 29 08:36:43 2019] process_one_work+0x152/0x410
[Thu Aug 29 08:36:43 2019] worker_thread+0x4b/0x470
[Thu Aug 29 08:36:43 2019] kthread+0x105/0x140
[Thu Aug 29 08:36:43 2019] ? process_one_work+0x410/0x410
[Thu Aug 29 08:36:43 2019] ? kthread_associate_blkcg+0xa0/0xa0
[Thu Aug 29 08:36:43 2019] ? do_syscall_64+0x7b/0x150
[Thu Aug 29 08:36:43 2019] ret_from_fork+0x3a/0x50