PBS_MOM killed on job exit (xpmem_close_handler)
xpmem_close_handler is forcing a SIGKILL of the current thread group. In certain cases this kills PBS_MOM. Our initial guess is that there is a race to free memory when user job processes are exiting, and the appropriate xpmem_detach is not winning that race.
Original stack trace, recovered via systemtap:
0xffffffffa1170e57 [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0x8e57/0x0]
0xffffffffa117297e [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0xa97e/0x0]
0xffffffffa1172f8e [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0xaf8e/0x0]
0xffffffffa1174295 [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0xc295/0x0]
0xffffffffa116801d [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0x1d/0x0]
0xffffffff810932b5 : __send_signal+0x245/0x450 [kernel]
0xffffffff8101bfe4 : try_stack_unwind+0x194/0x1b0 [kernel]
0xffffffff8101ae04 : dump_trace+0x64/0x3b0 [kernel]
0xffffffffa1172e88 [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0xae88/0x0]
0xffffffffa1172f8e [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0xaf8e/0x0]
0xffffffffa1174295 [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0xc295/0x0]
0xffffffffa116801d [stap_7ea83ef4a45a84471fe8e1b5f7f3ed52_33201+0x1d/0x0]
0xffffffff810932b5 : __send_signal+0x245/0x450 [kernel]
0xffffffff810934fe : send_signal+0x3e/0x80 [kernel]
0xffffffff81093d30 : force_sig_info+0xb0/0xe0 [kernel]
0xffffffff81093d76 : force_sig+0x16/0x20 [kernel]
0xffffffffa0238a01 : xpmem_close_handler+0x151/0x270 [xpmem]
0xffffffff811d774d : remove_vma+0x2d/0x70 [kernel]
0xffffffff811db09a : exit_mmap+0xea/0x150 [kernel]
0xffffffff81082edf : mmput+0x4f/0x110 [kernel]
We enabled xpmem_debug and captured the following trace along with the dmesg log.
The job was started at 16:30, so you can extract the relevant log entries with:
grep "Jun 11 16:3" r1i6n18.gbe.ice.issp.u-tokyo.ac.jp
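If grep is not handy, the same filtering can be sketched in Python. This is only an illustration of the fixed-string match above; the function name lines_in_window and the path handling are hypothetical and not part of xpmem or PBS, and the log is assumed to use syslog-style "Jun 11 16:3x" timestamps.

```python
from typing import Iterable, List


def lines_in_window(lines: Iterable[str], stamp_prefix: str = "Jun 11 16:3") -> List[str]:
    """Return every line containing stamp_prefix, with the trailing newline removed.

    This mirrors a fixed-string grep: the default prefix matches 16:30 through 16:39
    on Jun 11, which is the window right after the job started.
    """
    return [line.rstrip("\n") for line in lines if stamp_prefix in line]


def filter_file(path: str, stamp_prefix: str = "Jun 11 16:3") -> List[str]:
    """Apply lines_in_window to a log file on disk (path is supplied by the caller)."""
    # errors="replace" keeps going if the log contains bytes that are not valid UTF-8.
    with open(path, errors="replace") as fh:
        return lines_in_window(fh, stamp_prefix)
```

Calling filter_file on the attached log and printing the result should give the same lines as the grep command. Filtering the result further for "xpmem" would narrow it to the xpmem_debug output, assuming those messages carry that string.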