[#1083] Salvage relinkctl file latch from dead pid in 1 second (not 30 seconds) (!1530) · Merge requests · YottaDB / DB / YDB

Narayanan Iyer requested to merge nars1/YDB:ydb1083 into master May 31, 2024

Background

Below issue was discovered while analyzing #1083 (closed). The text is pasted from #1083 (comment 1931300377)

I did notice that once a process ended up with a SIG-11, the next process to try attach to the relinkctl shared memory took 1 minute to recover the latch (that the SIG-11 process took but did not release). There were 2 latches that were obtained which need to be recovered/salvaged. And each of them took 30 seconds for a total of 1 minute. The logic is supposed to check for a dead latch holder pid a lot faster than 30 seconds. But it did not.

Issue

There are many calls to grab_latch() from 2 functions rtnobj_shm_malloc() and rtnobj_shm_free() as shown below.

sr_unix/rtnobj.c

  465 sm_uc_ptr_t rtnobj_shm_malloc(zro_hist *zhist, int fd, off_t objSize, gtm_uint64_t objhash)
  466 {
  516         if (!grab_latch(&relinkrec->rtnobj_latch, RLNKREC_LATCH_TIMEOUT_SEC, NOT_APPLICABLE, NULL))
  530                         if (!grab_latch(&shm_hdr->relinkctl_latch, RLNKSHM_LATCH_TIMEOUT_SEC, NOT_APPLICABLE, NULL))
  636                         if (!grab_latch(&shm_hdr->relinkctl_latch, RLNKSHM_LATCH_TIMEOUT_SEC, NOT_APPLICABLE, NULL))

  988 void    rtnobj_shm_free(rhdtyp *rhead, boolean_t latch_grabbed)
  989 {
 1061         if (!latch_grabbed && !grab_latch(&relinkrec->rtnobj_latch, RLNKREC_LATCH_TIMEOUT_SEC, NOT_APPLICABLE, NULL))
 1117         if (!grab_latch(&shm_hdr->relinkctl_latch, RLNKSHM_LATCH_TIMEOUT_SEC, NOT_APPLICABLE, NULL))

If a process terminates abnormally after having done a grab_latch() call but before it does a rel_latch() call, the next process that tries to use a relinkctl file shared memory segment to auto relink an M routine will be waiting for a really long time (30 seconds) before it decides to check for a dead process that holds the latch and will then clear the lock holder (i.e. salvage it).

The salvage happens as part of performCASLatchCheck() call inside the REST_FOR_LATCH macro in line 87 below.

sr_unix/grab_latch.

   85                 if (!skip_time_calc)
   86                 {
   87                         REST_FOR_LATCH(latch, USEC_IN_NSEC_MASK, retries);
   88                         sys_get_curr_time(&cur_time);
   89                         remain_time = sub_abs_time(&end_time, &cur_time);
   90                         if (0 > remain_time.tv_sec)
   91                                 break;

But inside the REST_FOR_LATCH macro, we do the performCASLatchCheck() call in line 71 below only if the if check at line 70 is TRUE. And that turns out to be TRUE only after a really long wait.

sr_unix/gtm_rel_quant.h

   63 /* Sleep/rel_quant <= 1 micro-second every 4 iterations and also perform caslatch check every ~4 seconds */
   64 #define REST_FOR_LATCH(LATCH, MAX_SLEEP_MASK, RETRIES)                                                                  \
   65 MBSTART {                                                                                                               \
   66         if (0 == (RETRIES & LOCK_SPIN_HARD_MASK))       /* On every so many passes, sleep rather than spinning */       \
   67         {                                                                                                               \
   68                 GTM_REL_QUANT((MAX_SLEEP_MASK));        /* Release processor to holder of lock (hopefully) */           \
   69                 /* Check if we're due to check for lock abandonment check or holder wakeup */                           \
   70                 if (0 == (RETRIES & (LOCK_CASLATCH_CHKINTVL_USEC - 1)))                                                 \
   71                         performCASLatchCheck(LATCH, TRUE);                                                              \
   72         }                                                                                                               \
   73 } MBEND

It is because of the value of the LOCK_CASLATCH_CHKINTVL_USEC macro.

Fix

In all calls to grab_latch() above, a timeout is passed. And in that case, we already check the current time after a spin loop. This can be seen in line 88 above.
Therefore, we can add the performCASLatchCheck() call if we know at least 1 second has elapsed between checks. And this is a better check than an iteration based check which is not guaranteed to happen after a certain elapsed time.
Therefore, the REST_FOR_LATCH call is removed and instead replaced with a GTM_REL_QUANT() call just like the macro had it. And a separate performCASLatchCheck() call is added based on elapsed time instead of the iterations-based check that is in the REST_FOR_LATCH macro.
This ensures we will do a dead pid check once a second or so which is more user-friendly than doing it once every 30 seconds or so.

[#1083] Salvage relinkctl file latch from dead pid in 1 second (not 30 seconds)

Background

Issue

Fix

Merge request reports