Skip to content

[NARS1] [estess] Handle the possibility of an UNWIND in a condition handler in threaded code

Narayanan Iyer requested to merge nars1:pthreadlock into master
The v63000/gtm8394 subtest failed an assert with the following stack trace.

 #0  0x00007f3f038734c7 in kill () from /usr/lib64/libc.so.6
 #1  0x00000000006c2413 in gtm_dump_core () at R110/sr_unix/gtm_dump_core.c:69
 #2  0x00000000006d5dd0 in gtm_fork_n_core () at R110/sr_unix/gtm_fork_n_core.c:211
 #3  0x0000000000695b5b in ch_cond_core () at R110/sr_unix/ch_cond_core.c:59
 #4  0x000000000087e6ba in rts_error_va (csa=0x0, argcnt=7, var=0x7f3ef864e178) at R110/sr_unix/rts_error.c:153
 #5  0x000000000087dca4 in rts_error_csa (csa=0x0, argcnt=7) at R110/sr_unix/rts_error.c:85
 #6  0x0000000000916610 in hashtab_rehash_ch (arg=150373340) at R110/sr_port/hashtab_rehash_ch.c:33
 #7  0x000000000087ec12 in rts_error_va (csa=0x0, argcnt=5, var=0x7f3ef864e438) at R110/sr_unix/rts_error.c:153
 #8  0x000000000087dca4 in rts_error_csa (csa=0x0, argcnt=5) at R110/sr_unix/rts_error.c:85
 #9  0x00000000008fa778 in raise_gtmmemory_error () at R110/sr_port/gtm_malloc_src.h:1074
 #10 0x00000000008f5ee2 in gtm_malloc (size=835672) at R110/sr_port/gtm_malloc_src.h:724
 #11 0x0000000000978722 in init_hashtab_intl_int8 (table=0x7f3ef864e780, minsize=24594, old_table=0x10e8718 <murgbl+88>) at R110/sr_port/hashtab_implementation.h:392
 #12 0x000000000097971e in expand_hashtab_int8 (table=0x10e8718 <murgbl+88>, minsize=24594) at R110/sr_port/hashtab_implementation.h:436
 #13 0x000000000097a063 in add_hashtab_intl_int8 (table=0x10e8718 <murgbl+88>, key=0x7f3f04b32190, value=0x7f3f04b32190, tabentptr=0x7f3ef864eaa0, changing_table_size=0) at R110/sr_port/hashtab_implementation.h:499
 #14 0x000000000097a005 in add_hashtab_int8 (table=0x10e8718 <murgbl+88>, key=0x7f3f04b32190, value=0x7f3f04b32190, tabentptr=0x7f3ef864eaa0) at R110/sr_port/hashtab_implementation.h:483
 #15 0x000000000052a9cc in mur_back_processing_one_region (mur_back_options=0x7f3ef864ee40) at R110/sr_port/mur_back_process.c:1064
 #16 0x0000000000523e09 in mur_back_phase1 (rctl=0x2e8fc20) at R110/sr_port/mur_back_process.c:535
 #17 0x00000000006e75b8 in gtm_multi_thread_helper (tparm=0x7ffe5753ef30) at R110/sr_unix/gtm_multi_thread.c:228
 #18 0x00007f3f03629e25 in start_thread () from /usr/lib64/libpthread.so.0
 #19 0x00007f3f0393634d in clone () from /usr/lib64/libc.so.6

This is a test where a memory-error is forced (using limit vmemorysize). And various rollbacks are run. One of them runs with multiple threads and one thread gets a memory error during hashtable expansion. Normally a memory error causes the thread to exit and in turn that signals other threads to exit which is handled fine. But in this case, the condition handler hashtab_rehash_ch() did an UNWIND because it decided an out-of-memory situation implies we will abort the expansion and continue with the previous hashtable (this was a good-to-expand call, not a need-to-expand call). And the UNWIND macro had an assert that we better not be inside multi-threaded code. But that is exactly where we were in this failure.

The reason why the UNWIND has that logic is because in pro it would return control to the erroring thread and let it continue processing but we would not have released the pthread-mutex-lock that we obtained in rts_error_va() for this thread. That means all other threads will not be able to get this lock for various actions they do until the erroring thread tries to obtain the lock again (at which point we would check that we already hold the lock and not try to get the lock again) and later when we release it, other threads will be able to get the thread lock.

The fix is to make sure we release the thread-level lock in the UNWIND macro (and assert that we do hold the lock in dbg).

The pro implication of this issue is that a MUPIP JOURNAL command that encounters a memory error in some cases could in the worst case transform a multi-threaded recovery to a non-threaded recovery command thereby slowing it down. No other user-visible implications are expected out of this.

Merge request reports