Skip to content

[#1083] Salvage relinkctl file latch from dead pid in 1 second (not 30 seconds)

Narayanan Iyer requested to merge nars1/YDB:ydb1083 into master

Background

Below issue was discovered while analyzing #1083 (closed). The text is pasted from #1083 (comment 1931300377)

  • I did notice that once a process ended up with a SIG-11, the next process to try attach to the relinkctl shared memory took 1 minute to recover the latch (that the SIG-11 process took but did not release). There were 2 latches that were obtained which need to be recovered/salvaged. And each of them took 30 seconds for a total of 1 minute. The logic is supposed to check for a dead latch holder pid a lot faster than 30 seconds. But it did not.

Issue

  • There are many calls to grab_latch() from 2 functions rtnobj_shm_malloc() and rtnobj_shm_free() as shown below.

    sr_unix/rtnobj.c

      465 sm_uc_ptr_t rtnobj_shm_malloc(zro_hist *zhist, int fd, off_t objSize, gtm_uint64_t objhash)
      466 {
      516         if (!grab_latch(&relinkrec->rtnobj_latch, RLNKREC_LATCH_TIMEOUT_SEC, NOT_APPLICABLE, NULL))
      530                         if (!grab_latch(&shm_hdr->relinkctl_latch, RLNKSHM_LATCH_TIMEOUT_SEC, NOT_APPLICABLE, NULL))
      636                         if (!grab_latch(&shm_hdr->relinkctl_latch, RLNKSHM_LATCH_TIMEOUT_SEC, NOT_APPLICABLE, NULL))
    
      988 void    rtnobj_shm_free(rhdtyp *rhead, boolean_t latch_grabbed)
      989 {
     1061         if (!latch_grabbed && !grab_latch(&relinkrec->rtnobj_latch, RLNKREC_LATCH_TIMEOUT_SEC, NOT_APPLICABLE, NULL))
     1117         if (!grab_latch(&shm_hdr->relinkctl_latch, RLNKSHM_LATCH_TIMEOUT_SEC, NOT_APPLICABLE, NULL))
  • If a process terminates abnormally after having done a grab_latch() call but before it does a rel_latch() call, the next process that tries to use a relinkctl file shared memory segment to auto relink an M routine will be waiting for a really long time (30 seconds) before it decides to check for a dead process that holds the latch and will then clear the lock holder (i.e. salvage it).

  • The salvage happens as part of performCASLatchCheck() call inside the REST_FOR_LATCH macro in line 87 below.

    sr_unix/grab_latch.

       85                 if (!skip_time_calc)
       86                 {
       87                         REST_FOR_LATCH(latch, USEC_IN_NSEC_MASK, retries);
       88                         sys_get_curr_time(&cur_time);
       89                         remain_time = sub_abs_time(&end_time, &cur_time);
       90                         if (0 > remain_time.tv_sec)
       91                                 break;
  • But inside the REST_FOR_LATCH macro, we do the performCASLatchCheck() call in line 71 below only if the if check at line 70 is TRUE. And that turns out to be TRUE only after a really long wait.

    sr_unix/gtm_rel_quant.h

       63 /* Sleep/rel_quant <= 1 micro-second every 4 iterations and also perform caslatch check every ~4 seconds */
       64 #define REST_FOR_LATCH(LATCH, MAX_SLEEP_MASK, RETRIES)                                                                  \
       65 MBSTART {                                                                                                               \
       66         if (0 == (RETRIES & LOCK_SPIN_HARD_MASK))       /* On every so many passes, sleep rather than spinning */       \
       67         {                                                                                                               \
       68                 GTM_REL_QUANT((MAX_SLEEP_MASK));        /* Release processor to holder of lock (hopefully) */           \
       69                 /* Check if we're due to check for lock abandonment check or holder wakeup */                           \
       70                 if (0 == (RETRIES & (LOCK_CASLATCH_CHKINTVL_USEC - 1)))                                                 \
       71                         performCASLatchCheck(LATCH, TRUE);                                                              \
       72         }                                                                                                               \
       73 } MBEND
  • It is because of the value of the LOCK_CASLATCH_CHKINTVL_USEC macro.

Fix

  • In all calls to grab_latch() above, a timeout is passed. And in that case, we already check the current time after a spin loop. This can be seen in line 88 above.

  • Therefore, we can add the performCASLatchCheck() call if we know at least 1 second has elapsed between checks. And this is a better check than an iteration based check which is not guaranteed to happen after a certain elapsed time.

  • Therefore, the REST_FOR_LATCH call is removed and instead replaced with a GTM_REL_QUANT() call just like the macro had it. And a separate performCASLatchCheck() call is added based on elapsed time instead of the iterations-based check that is in the REST_FOR_LATCH macro.

  • This ensures we will do a dead pid check once a second or so which is more user-friendly than doing it once every 30 seconds or so.

Merge request reports