Alertmanager stuck on CPU#3
I got 100% CPU load on CPU 3, and every console showed messages that alertmanager was stuck on that CPU.
The only way to kill the process was to reset the server (a normal reboot was not possible).
Using GitLab CE 11.8.3 (Omnibus).
dmesg:
[121994.895231] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay cfg80211 rfkill cpufreq_powersave cpufreq_conservative cpufreq_userspace intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm iTCO_wdt irqbypass iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel evdev lpc_ich intel_cstate serio_raw mxm_wmi sg shpchp ppdev mfd_core intel_uncore intel_rapl_perf button video parport_pc parport wmi ip_tables x_tables autofs4 ext4 crc16 jbd2 fscrypto ecb mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx
[121994.897767] xor raid6_pq libcrc32c crc32c_generic raid0 multipath linear raid1 md_mod sd_mod crc32c_intel ahci libahci aesni_intel aes_x86_64 glue_helper lrw libata gf128mul ablk_helper cryptd xhci_pci ehci_pci i2c_i801 xhci_hcd ehci_hcd i2c_smbus scsi_mod r8169 mii usbcore usb_common fan thermal
[121994.898944] CPU: 3 PID: 18357 Comm: alertmanager Tainted: G B L 4.9.0-8-amd64 #1 Debian 4.9.130-2
[121994.899013] Hardware name: MSI MS-7816/H87-G43 (MS-7816), BIOS V2.14B14 07/13/2018
[121994.899079] task: ffffa01fe9b4e040 task.stack: ffffbdc78a254000
[121994.899132] RIP: 0010:[<ffffffff90382c4f>] [<ffffffff90382c4f>] filemap_map_pages+0xaf/0x3e0
[121994.899265] RSP: 0000:ffffbdc78a257db8 EFLAGS: 00000246
[121994.899317] RAX: dead000000000100 RBX: ffffde131e64c340 RCX: 0000000000000000
[121994.899382] RDX: ffffde131e64c340 RSI: 0000000000000011 RDI: 0000000000000001
[121994.899447] RBP: ffffa0215b42daa0 R08: 000000000001bc60 R09: 000000000000003f
[121994.899512] R10: 0000000000000020 R11: 0000000000000000 R12: 000000000000002f
[121994.899580] R13: ffffbdc78a257e70 R14: 0400000000000080 R15: 0000000000000001
[121994.899648] FS: 00007f236ffe6700(0000) GS:ffffa021deac0000(0000) knlGS:0000000000000000
[121994.899754] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[121994.899806] CR2: 000000000042a790 CR3: 000000060b4a8000 CR4: 0000000000160670
[121994.899871] Stack:
[121994.899920] ffffa02165f838d8 ffffa02165f838d0 000000000000002e ffffa020e3014600
[121994.900112] ffffde131e64c300 000000000000002f 0000000000000040 ffffbdc78a257e68
[121994.900300] 0000000000000000 2e37a3f414ade478 ffffa01fec75fe10 000000000000002f
[121994.900513] Call Trace:
[121994.900563] [<ffffffff903babb6>] ? handle_mm_fault+0xde6/0x1310
[121994.900617] [<ffffffff902622d5>] ? __do_page_fault+0x255/0x4f0
[121994.900670] [<ffffffff903c2950>] ? SyS_brk+0x160/0x180
[121994.900723] [<ffffffff9081a358>] ? page_fault+0x28/0x30
[121994.900774] Code: 00 48 8b 55 00 48 85 d2 0f 84 8e 00 00 00 48 89 d0 83 e0 03 0f 85 1b 01 00 00 48 8b 42 20 48 8d 58 ff a8 01 48 0f 44 da 8b 4b 1c <85> c9 74 d2 8d 79 01 48 8d 73 1c 89 c8 f0 0f b1 7b 1c 39 c1 89
[122022.896622] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [alertmanager:18357]
[122022.896721] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay cfg80211 rfkill cpufreq_powersave cpufreq_conservative cpufreq_userspace intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm iTCO_wdt irqbypass iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel evdev lpc_ich intel_cstate serio_raw mxm_wmi sg shpchp ppdev mfd_core intel_uncore intel_rapl_perf button video parport_pc parport wmi ip_tables x_tables autofs4 ext4 crc16 jbd2 fscrypto ecb mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx
[122022.899976] xor raid6_pq libcrc32c crc32c_generic raid0 multipath linear raid1 md_mod sd_mod crc32c_intel ahci libahci aesni_intel aes_x86_64 glue_helper lrw libata gf128mul ablk_helper cryptd xhci_pci ehci_pci i2c_i801 xhci_hcd ehci_hcd i2c_smbus scsi_mod r8169 mii usbcore usb_common fan thermal
[122022.901186] CPU: 3 PID: 18357 Comm: alertmanager Tainted: G B L 4.9.0-8-amd64 #1 Debian 4.9.130-2
[122022.901256] Hardware name: MSI MS-7816/H87-G43 (MS-7816), BIOS V2.14B14 07/13/2018
[122022.901322] task: ffffa01fe9b4e040 task.stack: ffffbdc78a254000
[122022.901374] RIP: 0010:[<ffffffff90382c4f>] [<ffffffff90382c4f>] filemap_map_pages+0xaf/0x3e0
[122022.901476] RSP: 0000:ffffbdc78a257db8 EFLAGS: 00000246
[122022.901538] RAX: dead000000000100 RBX: ffffde131e64c340 RCX: 0000000000000000
[122022.901642] RDX: ffffde131e64c340 RSI: 0000000000000011 RDI: 0000000000000001
[122022.901727] RBP: ffffa0215b42daa0 R08: 000000000001bc60 R09: 000000000000003f
[122022.901792] R10: 0000000000000020 R11: 0000000000000000 R12: 000000000000002f
[122022.901857] R13: ffffbdc78a257e70 R14: 0400000000000080 R15: 0000000000000001
[122022.901923] FS: 00007f236ffe6700(0000) GS:ffffa021deac0000(0000) knlGS:0000000000000000
[122022.902000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[122022.902082] CR2: 000000000042a790 CR3: 000000060b4a8000 CR4: 0000000000160670
[122022.902165] Stack:
[122022.902211] ffffa02165f838d8 ffffa02165f838d0 000000000000002e ffffa020e3014600
[122022.902460] ffffde131e64c300 000000000000002f 0000000000000040 ffffbdc78a257e68
[122022.902670] 0000000000000000 2e37a3f414ade478 ffffa01fec75fe10 000000000000002f
[122022.902860] Call Trace:
[122022.902909] [<ffffffff903babb6>] ? handle_mm_fault+0xde6/0x1310
[122022.902963] [<ffffffff902622d5>] ? __do_page_fault+0x255/0x4f0
[122022.903015] [<ffffffff903c2950>] ? SyS_brk+0x160/0x180
[122022.903068] [<ffffffff9081a358>] ? page_fault+0x28/0x30
[122022.903119] Code: 00 48 8b 55 00 48 85 d2 0f 84 8e 00 00 00 48 89 d0 83 e0 03 0f 85 1b 01 00 00 48 8b 42 20 48 8d 58 ff a8 01 48 0f 44 da 8b 4b 1c <85> c9 74 d2 8d 79 01 48 8d 73 1c 89 c8 f0 0f b1 7b 1c 39 c1 89
[122032.933154] INFO: rcu_sched self-detected stall on CPU
[122032.933272] 3-...: (1328365 ticks this GP) idle=f95/140000000000001/0 softirq=3322082/3322082 fqs=623317
[122032.933341] (t=1328587 jiffies g=5397083 c=5397082 q=2582883)
[122032.933425] Task dump for CPU 3:
[122032.933473] alertmanager R running task 0 18357 1932 0x0000000c
[122032.933591] ffffffff90f18ec0 ffffffff902a7dcb 0000000000000003 ffffffff90f18ec0
[122032.933780] ffffffff9038112b ffffa021dead96c0 ffffffff90e4fd00 0000000000000000
[122032.933969] ffffffff90f18ec0 00000000ffffffff ffffffff902e36fa 0000000000000001
[122032.934159] Call Trace:
[122032.934207] <IRQ>
[122032.934244] [<ffffffff902a7dcb>] ? sched_show_task+0xcb/0x130
[122032.934340] [<ffffffff9038112b>] ? rcu_dump_cpu_stacks+0x92/0xb2
[122032.934395] [<ffffffff902e36fa>] ? rcu_check_callbacks+0x75a/0x8b0
[122032.934449] [<ffffffff902f9c30>] ? tick_sched_do_timer+0x30/0x30
[122032.934502] [<ffffffff902ea2d8>] ? update_process_times+0x28/0x50
[122032.934555] [<ffffffff902f9630>] ? tick_sched_handle.isra.12+0x20/0x50
[122032.934609] [<ffffffff902f9c68>] ? tick_sched_timer+0x38/0x70
[122032.934662] [<ffffffff902eadae>] ? __hrtimer_run_queues+0xde/0x250
[122032.934715] [<ffffffff902eb48c>] ? hrtimer_interrupt+0x9c/0x1a0
[122032.934769] [<ffffffff9081c507>] ? smp_apic_timer_interrupt+0x47/0x60
[122032.934823] [<ffffffff9081ada6>] ? apic_timer_interrupt+0x96/0xa0
[122032.934875] <EOI>
[122032.934911] [<ffffffff90382c4f>] ? filemap_map_pages+0xaf/0x3e0
[122032.935006] [<ffffffff90382f72>] ? filemap_map_pages+0x3d2/0x3e0
[122032.935059] [<ffffffff903babb6>] ? handle_mm_fault+0xde6/0x1310
[122032.935113] [<ffffffff902622d5>] ? __do_page_fault+0x255/0x4f0
[122032.935165] [<ffffffff903c2950>] ? SyS_brk+0x160/0x180
[122032.935217] [<ffffffff9081a358>] ? page_fault+0x28/0x30
[122058.898537] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [alertmanager:18357]
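
For anyone triaging similar output, here is a minimal sketch that pulls the soft lockup events out of a dmesg dump like the one above. The function name `parse_soft_lockups` and the dictionary keys are my own choices for illustration, and the regular expression assumes the exact `NMI watchdog: BUG: soft lockup` line format shown in this log (kernel 4.9); other kernel versions may word the line differently.

```python
import re

# Matches lines such as:
# [122022.896622] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [alertmanager:18357]
# The format is taken from the log in this report and is an assumption for other kernels.
SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NMI watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def parse_soft_lockups(text):
    """Return one dict per soft lockup line found in the given dmesg text."""
    events = []
    for line in text.splitlines():
        match = SOFT_LOCKUP.search(line)
        if match is None:
            continue  # not a soft lockup line (register dump, call trace, ...)
        events.append({
            "timestamp": float(match.group("ts")),   # seconds since boot
            "cpu": int(match.group("cpu")),
            "stuck_seconds": int(match.group("secs")),
            "comm": match.group("comm"),             # process name
            "pid": int(match.group("pid")),
        })
    return events


if __name__ == "__main__":
    # Two lines copied from the log above, plus one unrelated line.
    sample = (
        "[122022.896622] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [alertmanager:18357]\n"
        "[122032.933154] INFO: rcu_sched self-detected stall on CPU\n"
        "[122058.898537] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [alertmanager:18357]\n"
    )
    for event in parse_soft_lockups(sample):
        print(event)
```

Feeding it the full output of `dmesg` shows at a glance whether the same PID stays stuck on the same CPU across repeated watchdog reports, as it does here for PID 18357 on CPU 3.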