[#1083] Salvage relinkctl file latch from dead pid in 1 second (not 30 seconds)
Background
Below issue was discovered while analyzing #1083 (closed). The text is pasted from #1083 (comment 1931300377)
- I did notice that once a process ended up with a SIG-11, the next process to try attach to the relinkctl shared memory took 1 minute to recover the latch (that the SIG-11 process took but did not release). There were 2 latches that were obtained which need to be recovered/salvaged. And each of them took 30 seconds for a total of 1 minute. The logic is supposed to check for a dead latch holder pid a lot faster than 30 seconds. But it did not.
Issue
-
There are many calls to
grab_latch()
from 2 functionsrtnobj_shm_malloc()
andrtnobj_shm_free()
as shown below.sr_unix/rtnobj.c
465 sm_uc_ptr_t rtnobj_shm_malloc(zro_hist *zhist, int fd, off_t objSize, gtm_uint64_t objhash) 466 { 516 if (!grab_latch(&relinkrec->rtnobj_latch, RLNKREC_LATCH_TIMEOUT_SEC, NOT_APPLICABLE, NULL)) 530 if (!grab_latch(&shm_hdr->relinkctl_latch, RLNKSHM_LATCH_TIMEOUT_SEC, NOT_APPLICABLE, NULL)) 636 if (!grab_latch(&shm_hdr->relinkctl_latch, RLNKSHM_LATCH_TIMEOUT_SEC, NOT_APPLICABLE, NULL)) 988 void rtnobj_shm_free(rhdtyp *rhead, boolean_t latch_grabbed) 989 { 1061 if (!latch_grabbed && !grab_latch(&relinkrec->rtnobj_latch, RLNKREC_LATCH_TIMEOUT_SEC, NOT_APPLICABLE, NULL)) 1117 if (!grab_latch(&shm_hdr->relinkctl_latch, RLNKSHM_LATCH_TIMEOUT_SEC, NOT_APPLICABLE, NULL))
-
If a process terminates abnormally after having done a
grab_latch()
call but before it does arel_latch()
call, the next process that tries to use a relinkctl file shared memory segment to auto relink an M routine will be waiting for a really long time (30 seconds) before it decides to check for a dead process that holds the latch and will then clear the lock holder (i.e. salvage it). -
The salvage happens as part of
performCASLatchCheck()
call inside theREST_FOR_LATCH
macro in line 87 below.sr_unix/grab_latch.
85 if (!skip_time_calc) 86 { 87 REST_FOR_LATCH(latch, USEC_IN_NSEC_MASK, retries); 88 sys_get_curr_time(&cur_time); 89 remain_time = sub_abs_time(&end_time, &cur_time); 90 if (0 > remain_time.tv_sec) 91 break;
-
But inside the
REST_FOR_LATCH
macro, we do theperformCASLatchCheck()
call in line 71 below only if theif
check at line 70 is TRUE. And that turns out to be TRUE only after a really long wait.sr_unix/gtm_rel_quant.h
63 /* Sleep/rel_quant <= 1 micro-second every 4 iterations and also perform caslatch check every ~4 seconds */ 64 #define REST_FOR_LATCH(LATCH, MAX_SLEEP_MASK, RETRIES) \ 65 MBSTART { \ 66 if (0 == (RETRIES & LOCK_SPIN_HARD_MASK)) /* On every so many passes, sleep rather than spinning */ \ 67 { \ 68 GTM_REL_QUANT((MAX_SLEEP_MASK)); /* Release processor to holder of lock (hopefully) */ \ 69 /* Check if we're due to check for lock abandonment check or holder wakeup */ \ 70 if (0 == (RETRIES & (LOCK_CASLATCH_CHKINTVL_USEC - 1))) \ 71 performCASLatchCheck(LATCH, TRUE); \ 72 } \ 73 } MBEND
-
It is because of the value of the
LOCK_CASLATCH_CHKINTVL_USEC
macro.
Fix
-
In all calls to
grab_latch()
above, a timeout is passed. And in that case, we already check the current time after a spin loop. This can be seen in line 88 above. -
Therefore, we can add the
performCASLatchCheck()
call if we know at least 1 second has elapsed between checks. And this is a better check than an iteration based check which is not guaranteed to happen after a certain elapsed time. -
Therefore, the
REST_FOR_LATCH
call is removed and instead replaced with aGTM_REL_QUANT()
call just like the macro had it. And a separateperformCASLatchCheck()
call is added based on elapsed time instead of the iterations-based check that is in theREST_FOR_LATCH
macro. -
This ensures we will do a dead pid check once a second or so which is more user-friendly than doing it once every 30 seconds or so.