1. 12 Mar, 2019 2 commits
    • [DEBUG_ONLY] Enhance read_regions() in gtmsource_readfiles.c in debug builds... · cb70f10b
      Narayanan Iyer authored
      [DEBUG_ONLY] Enhance read_regions() in gtmsource_readfiles.c in debug builds to capture more information in case a given seqno will never be found (TR_WILL_NOT_BE_FOUND)
      
      We occasionally see test failures where the source server fails an assert in debug builds because
      it decided a given seqno can never be found in any of the journal files even though that seqno
      is later seen to be present in one of the journal files. The assert that fails is the following.
      
      %YDB-F-ASSERT, Assert failed in sr_unix/gtmsource_readfiles.c line 1922 for expression (!*brkn_trans || (ydb_white_box_test_case_enabled && ((WBTEST_REPLBRKNTRANS == ydb_white_box_test_case_number) || (WBTEST_MURUNDOWN_KILLCMT06 == ydb_white_box_test_case_number) || (WBTEST_JNL_FILE_LOST_DSKADDR == ydb_white_box_test_case_number))))
      
      From the core file, we do not have enough information to analyze the issue since the logic that
      led to this conclusion involves multiple regions and journal files. Therefore, the code is enhanced
      for debug builds to capture more trace information as we iterate through each region in the
      read_regions() function trying to locate a given seqno. This will help us better analyze such
      assert failures in the future.
      cb70f10b
    • Fix Issue #421 as well as undocumented but innocuous bug that added : to... · 10ab0f51
      K.S. Bhaskar authored
      Fix Issue #421 as well as undocumented but innocuous bug that added : to LD_LIBRARY_PATH. Use PATH instead of alias for executables (except GDE), and add $ydb_dist/plugin/bin to PATH if it exists
      10ab0f51
  2. 11 Mar, 2019 1 commit
    • [#425] Various fixes to ydb_message() · 2bfc5311
      Narayanan Iyer authored
      1) Issue PARAMINVALID error if msg_buff input parameter is NULL.
      2) Fix incorrect fao count in rts_error_csa(INVSTRLEN) which caused additional SYSTEM-E-UNKNOWN to be printed.
      3) Reword UNKNOWNSYSERR error to indicate the integer error code which corresponds to an unknown error.
              Previously this printed a string (!AD) when there was no string available to be displayed.
      2bfc5311
  3. 07 Mar, 2019 2 commits
    • [#423] Eliminate a slight chance for a stack corruption when an interrupt occurs during execution. · 3540477a
      Stephen Johnson authored
      Because ARM64 has 8 registers for passing arguments to a function, if more
      than 8 arguments need to be passed, the stack is used. But in all these
      functions, that stack is modified before calling another function. A new
      value must be passed and room must be made in the parameter registers so one
      parameter register must be moved to the stack. This entails copying the
      existing stack, adding a new entry, and shifting the parameter registers.
      ARM64 requires that pushes and pops to the stack go in 16 byte chunks (2 64
      bit values at a time). The original attempt used a general purpose register
      and wrote one 8 byte value at a time, starting at the current stack pointer.
      When all the 8 byte values were on the stack, the stack pointer was set to
      point at the proper location. Unfortunately, during the stack copy time, the
      stack pointer wasn't always pointing to the real top-of-stack, only being
      updated when the copy was complete. If an interrupt occurred during this
      interval when the stack pointer wasn't correct, the interrupt handling
      routines could use the stack at the stack pointer, corrupting values.
      
      To fix the problem, the stack pointer is used to push the values - making sure
      that everything is done 16 bytes at a time.
      
      Other cosmetic changes included in this commit are
      
      * Remove the line of code that includes the file "stack.si". There are no usages of
        stack.si macros in the current version of these files. The macros existed in prior
        versions, but were eliminated as the code evolved. Removing the references to stack.si
        should have been done earlier, but wasn't.
      
      * Remove these files as they are no longer used.
      
        ci_restart.s - should not have been included in the first place. It was
        mistakenly created because an i386 assembly file of the same
        name existed.
      
        stack.si - due to code revisions and simplifications, the macros in this
        file were no longer used anywhere in the assembly source code.
        There is no reason to keep it.
      
      * Change the first argument passed to gvcmz_neterr() to be a 64 bit zero.
        Currently, the argument, w0, is only 32 bits as is the zero register, wzr.
      
      * Change the branch instruction to b.ne for consistency with the other conditional
        branch instructions in the code.
      
      * Change the reference to the 64 bit register X0 to x0 (upper case X to lower
        case x) for consistency with the other 64 bit registers in the assembly code.
      3540477a
  4. 12 Feb, 2019 1 commit
    • [#419] Avoid hangs in source server due to trying to flush the journal buffers... · 021f6346
      Narayanan Iyer authored
      [#419] Avoid hangs in source server due to trying to flush the journal buffers while instance freeze is ON
      
      The DO_JNL_FLUSH_IF_POSSIBLE macro is a request to flush only if possible.  If the act of flushing
      is going to hang due to a frozen instance, it is better to skip the jnl flush and avoid the hang.
      That is what is done as the fix in this commit.
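
The idea behind the fix can be sketched as below. This is only an illustration under stated assumptions, not the actual macro: `instance_is_frozen`, `flush_journal()`, `flush_count` and `flush_if_possible()` are hypothetical stand-ins, and the real code consults replication instance state that is not modeled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for replication instance state; not YottaDB names. */
static bool instance_is_frozen = false;
static int  flush_count = 0;

/* Hypothetical flush routine. In the real code the flush could block until
 * the instance is unfrozen, which is what led to the hang. */
static void flush_journal(void)
{
    assert(!instance_is_frozen);    /* must never be attempted while frozen */
    flush_count++;
}

/* Best-effort flush: returns true if a flush was done, false if it was skipped. */
static bool flush_if_possible(void)
{
    if (instance_is_frozen)
        return false;               /* flushing would wait for the unfreeze, so skip it */
    flush_journal();
    return true;
}
```

A caller that only wants an opportunistic flush can ignore the return value; a caller that needs the data on disk would have to retry after the unfreeze.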
      
      This addresses a hang seen in the v62000/gtm8086 subtest where the source server was
      stuck waiting for the instance to be unfrozen (while trying to flush the journal file using the
      DO_JNL_FLUSH_IF_POSSIBLE macro) while the test script (which does the unfreeze) was waiting for the
      source server to clear some backlog. Below is the C-stack of the stuck source server for the record.
      
      (gdb) where
       #0  clock_nanosleep () from /usr/lib64/libc.so.6
       #1  m_usleep () at sr_unix/sleep.c:25
       #2  wait_for_repl_inst_unfreeze_nocsa_jpl () at sr_port/anticipatory_freeze.h:490
       #3  wait_for_repl_inst_unfreeze () at sr_port/anticipatory_freeze.h:513
       #4  jnl_write_attempt () at sr_port/jnl_write_attempt.c:335
       #5  jnl_flush () at sr_port/jnl_flush.c:57
       #6  update_max_seqno_info () at sr_unix/gtmsource_readfiles.c:741
       #7  first_read () at sr_unix/gtmsource_readfiles.c:881
       #8  read_regions () at sr_unix/gtmsource_readfiles.c:1711
       #9  read_and_merge () at sr_unix/gtmsource_readfiles.c:1544
       #10 gtmsource_readfiles () at sr_unix/gtmsource_readfiles.c:1974
       #11 gtmsource_get_jnlrecs () at sr_unix/gtmsource_process_ops.c:980
       #12 gtmsource_process () at sr_unix/gtmsource_process.c:1544
       #13 gtmsource () at sr_unix/gtmsource.c:528
       #14 mupip_main () at sr_unix/mupip_main.c:124
       #15 dlopen_libyottadb () at sr_unix/dlopen_libyottadb.c:148
       #16 main () at sr_unix/mupip.c:19
      021f6346
  5. 08 Feb, 2019 1 commit
    • [#418] ydb_file_id_free()/ydb_file_is_identical()/ydb_file_name_to_id() and... · ae786148
      Narayanan Iyer authored
      [#418] ydb_file_id_free()/ydb_file_is_identical()/ydb_file_name_to_id() and ydb_file_id_free_t()/ydb_file_is_identical_t()/ydb_file_name_to_id_t() issue PARAMINVALID error if input filename/fileid pointer is NULL
      
      The code previously assumed that the fileid input parameter was non-NULL, so passing a NULL
      pointer resulted in a SIG-11.
      
      In case of ydb_file_name_to_id(), a NULL filename parameter caused a return immediately with no
      fileid determination.
      
      In both cases, the code now issues a PARAMINVALID error. This is safer/better.
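
A minimal sketch of this kind of up-front parameter validation follows. The type `sketch_fileid`, the status codes `SKETCH_OK` and `SKETCH_PARAMINVALID`, and the function `sketch_file_is_identical()` are all hypothetical and exist only for illustration; the real signatures and error values are in the YottaDB headers and are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes and file id type, for illustration only. */
#define SKETCH_OK             0
#define SKETCH_PARAMINVALID (-1)

typedef struct { unsigned long dev; unsigned long inode; } sketch_fileid;

/* Compare two file ids, validating every pointer before any dereference. */
static int sketch_file_is_identical(const sketch_fileid *a, const sketch_fileid *b, int *identical)
{
    if ((NULL == a) || (NULL == b) || (NULL == identical))
        return SKETCH_PARAMINVALID;     /* an error code instead of a SIG-11 */
    *identical = (a->dev == b->dev) && (a->inode == b->inode);
    return SKETCH_OK;
}

/* Usage: a NULL input is reported as an error, valid inputs are compared. */
static void sketch_fileid_demo(void)
{
    sketch_fileid x = { 1, 2 }, y = { 1, 2 };
    int same = 0;

    assert(SKETCH_PARAMINVALID == sketch_file_is_identical(NULL, &y, &same));
    assert(SKETCH_OK == sketch_file_is_identical(&x, &y, &same) && same);
}
```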
      ae786148
  6. 07 Feb, 2019 2 commits
    • Reduce gtmsecshr executable size (real exe, not wrapper) (#387 changes were... · f0ff0912
      Narayanan Iyer authored
      Reduce gtmsecshr executable size (real exe, not wrapper) (#387 changes were missed out on this executable)
      
      While at this, remove commented out executables libgtmcrypt and libmumps.  Also remove libgtmtls
      and maskpass since they are part of the encryption plugin which is not distributed as part of the
      YottaDB release binary tarball. They need to be built from source anyways at the customer environment.
      
      The size of $ydb_dist/gtmsecshrdir/gtmsecshr on x86_64 was reduced from 1MB to 0.5MB by this change.
      f0ff0912
    • Clarify FAQ sections about libtinfo5 and binutils · 9f7ef3b9
      Christopher Edwards authored
      The libtinfo5 and binutils FAQ entries can apply to more than just
      Ubuntu 18.10. Clarify the sections to indicate that this could apply
      to more Linux distributions.
      9f7ef3b9
  7. 06 Feb, 2019 1 commit
  8. 04 Feb, 2019 2 commits
  9. 01 Feb, 2019 1 commit
  10. 31 Jan, 2019 3 commits
  11. 30 Jan, 2019 1 commit
  12. 28 Jan, 2019 2 commits
    • Prepare for YottaDB r1.24 release · e1e394d8
      Narayanan Iyer authored
      e1e394d8
    • [#362] Avoid data differences between primary and secondary in case of kill -9 on primary side · 7935825c
      Narayanan Iyer authored
      The first commit of #362 was done in 2018 (SHA 438aabe7) where
      the manually_start/dual_fail2_no_ipcrm2 subtest failed with a data difference. In that commit,
      mur_back_process.c was fixed so it treats an automatically generated NULL record as a broken
      transaction only in case of a fetchresync rollback.
      
      But we had another failure of the same manually_start/dual_fail2_no_ipcrm2 subtest yesterday
      where the automatically generated NULL record did not get treated as a broken transaction
      because this NULL record was generated on side B where B was the secondary when the kill -9
      happened (with no ipcrm) and the NULL record was generated by a non-fetchresync rollback
      because B was coming up as a primary (as part of a failover). Side A which was a primary
      before B was crashed, had a valid non-NULL record for the same seqno and since the rollback
      on Side B did not treat the NULL record as a broken transaction, it played that forward on B.
      This meant side B had a NULL record whereas side A had a non-NULL record corresponding to the
      same seqno resulting in data discrepancy between the two sides once B started as the primary
      replicating to A.
      
      In thinking about the original commit of #362, it is not clear why the change was restricted
      to only a fetchresync rollback. It seems more correct (and safer) to treat it as a broken
      transaction in all cases. That will address the most recent test failure too.
      7935825c
  13. 26 Jan, 2019 1 commit
    • [#205] Restore non-YottaDB signal handlers only after YottaDB exit handlers have been invoked · 6706c483
      Narayanan Iyer authored
      The v62002/gtm6638 subtest failed once in a while on some systems with the following diff.
      
      39c39,40
      < Pass
      ---
      > Alarm clock
      > Fail: expected=559 actual=
      
      This test runs simplethreadapi 3n+1 for a while. As part of recent changes for YottaDB signal
      handlers to co-exist with non-YottaDB signal handlers, there is now code in ydb_exit() that
      resets the signal handlers to their non-YottaDB versions. But this is done BEFORE invoking
      gtm_exit_handler() in the MAIN worker thread in case ydb_exit() is invoked in some other thread in
      a SimpleThreadAPI process. This means that it is possible that a SIGALRM timer is still active
      at the time we reset the SIGALRM signal handler in ydb_exit() but before gtm_exit_handler()
      (which does a cancel_timer()) has been invoked. If due to timing scenarios, this timer actually
      pops before the cancel_timer() is done, the non-YottaDB SIGALRM handler will kick in. And I think
      the system default handler for SIGALRM prints the "Alarm clock" message and just exits the process
      (without also invoking YottaDB exit handler). This is most likely what caused the test failure.
      
      The flow in ydb_exit() has been reworked so the signal handler reset happens AFTER the MAIN
      worker thread has exited (i.e. after it has invoked gtm_exit_handler()).
      
      Also noticed there was a pre-existing race condition in ydb_exit() (with multiple concurrent
      invocations from different threads) which is now addressed by ensuring we hold the ydb engine
      thread lock for the entire duration of the ydb_exit() even in the SimpleThreadAPI case
      (when the now-nixed "wait_for_main_worker_thread_to_die" variable was TRUE).
      
      Additionally, noticed the "struct sigaction" structure is not memset() to 0 before setting
      the SIGALRM handler in init_timers() in gt_timers.c. Fixed that too just in case it can cause
      other issues.
      6706c483
  14. 25 Jan, 2019 2 commits
  15. 24 Jan, 2019 1 commit
    • [#205] Fix multi-thread safety issues with CALLINAFTERXIT and INVAPIMODE... · 94a9cb2b
      Narayanan Iyer authored
      [#205] Fix multi-thread safety issues with CALLINAFTERXIT and INVAPIMODE errors; Ensure errstr is filled in case of ydb_*_st() or ydb_*_t() calls which return these two errors; Nix INVAPIMODE error and instead add SIMPLEAPINOTALLOWED and THREADEDAPINOTALLOWED errors
      
      Note that whenever a SimpleThreadAPI function is mentioned below, it refers to functions of the form
      ydb_*_st() or ydb_*_t().
      
      * The primary issue is that SimpleThreadAPI functions could return with a YDB_ERR_CALLINAFTERXIT
        error in case the YottaDB engine has been shutdown (e.g. ydb_exit()).  But in that case, the "errstr"
        parameter was not filled in. It would contain garbage strings which is not user-friendly. It
        would be desirable to also fill "errstr" with the actual error string corresponding to
        CALLINAFTERXIT error. Towards this, a new macro SET_STAPI_ERRSTR_MULTI_THREAD_SAFE has been
        introduced which sets the errstr to the $zstatus corresponding to any passed in valid error code
        (e.g. YDB_ERR_CALLINAFTERXIT). This is a multi-thread safe macro (i.e. does not use the YottaDB
        engine other than to read the error message string table which is a read-only structure anyways)
        and hence can be safely invoked from SimpleThreadAPI function calls.
      
      * While analyzing the above primary issue, I realized that YDB_ERR_CALLINAFTERXIT can also be issued
        from ydb_init() which could be implicitly called by any of the SimpleThreadAPI functions through
        the LIBYOTTADB_RUNTIME_CHECK* macros. Therefore those macros were redesigned to pass an "errstr"
        if the caller has access to it (i.e. if the caller is a SimpleThreadAPI function). Callers which
        do not have access to an "errstr" (e.g. ydb_*_s() functions) will pass a NULL parameter instead.
        The LIBYOTTADB_RUNTIME_CHECK* macros now invoke SET_STAPI_ERRSTR_MULTI_THREAD_SAFE to set "errstr"
        in case ydb_init() returns a non-zero status.
      
      * By a similar logic, the VERIFY_THREADED_API* macros are also enhanced to pass an additional
        "errstr" parameter. This is because this macro is called by all SimpleThreadAPI functions
        (in addition to the LIBYOTTADB_RUNTIME_CHECK* macro) before invoking ydb_stm_args*().
      
      * Since the VERIFY_THREADED_API* macros can issue an INVAPIMODE error, this error also needs to fill
        in errstr (like the CALLINAFTERXIT error is fixed above). But this message has parameters
        that need to be substituted. Since this error is issued in SimpleThreadAPI functions when
        they do not yet run in the MAIN worker thread, the INVAPIMODE error issuing logic (which used
        SETUP_GENERIC_ERROR_4PARMS, a routine that is not multi-thread safe) had to be reworked to
        instead use SET_STAPI_ERRSTR_MULTI_THREAD_SAFE. But in order to use that, we needed an error
        message with no parameters.  Since the INVAPIMODE message has only two possibilities, we instead
        create two new messages with the appropriate text. That way, the two new messages do not need any
        parameters and hence can be used with SET_STAPI_ERRSTR_MULTI_THREAD_SAFE. SIMPLEAPINOTALLOWED
        and THREADEDAPINOTALLOWED are the new messages and INVAPIMODE is nixed.
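
Why a parameterless message makes the macro thread safe can be sketched as follows. Everything here is an assumption made for illustration: the buffer type `sketch_buffer`, the numeric codes, the message wording and `sketch_set_errstr()` are hypothetical and do not reproduce the real macro or the real message table. The point is that the routine only reads a constant table and writes the caller's own buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical caller-owned buffer descriptor. */
typedef struct {
    unsigned int len_alloc;     /* bytes available in buf_addr */
    unsigned int len_used;      /* bytes filled in */
    char        *buf_addr;
} sketch_buffer;

/* Hypothetical codes and texts; the real values and wording differ. */
enum { SKETCH_ERR_CALLINAFTERXIT = -1, SKETCH_ERR_SIMPLEAPINOTALLOWED = -2 };

static const struct { int code; const char *text; } sketch_messages[] = {
    { SKETCH_ERR_CALLINAFTERXIT,      "CALLINAFTERXIT, engine already shut down" },
    { SKETCH_ERR_SIMPLEAPINOTALLOWED, "SIMPLEAPINOTALLOWED, threaded API already active" },
};

/* Copy the parameterless message for `code` into errstr. No shared mutable
 * state is touched, so this can be called from any thread. */
static void sketch_set_errstr(sketch_buffer *errstr, int code)
{
    size_t i, n;

    if ((NULL == errstr) || (NULL == errstr->buf_addr))
        return;                         /* caller has no errstr to fill */
    errstr->len_used = 0;
    for (i = 0; i < sizeof(sketch_messages) / sizeof(sketch_messages[0]); i++) {
        if (sketch_messages[i].code != code)
            continue;
        n = strlen(sketch_messages[i].text);
        if (n > errstr->len_alloc)
            n = errstr->len_alloc;      /* truncate rather than overflow */
        memcpy(errstr->buf_addr, sketch_messages[i].text, n);
        errstr->len_used = (unsigned int)n;
        return;
    }
}

/* Usage: threaded callers pass their errstr, other callers pass NULL. */
static void sketch_errstr_demo(void)
{
    char raw[64];
    sketch_buffer errstr = { sizeof(raw), 0, raw };

    sketch_set_errstr(&errstr, SKETCH_ERR_CALLINAFTERXIT);
    assert(0 == memcmp(errstr.buf_addr, "CALLINAFTERXIT", 14));
    sketch_set_errstr(NULL, SKETCH_ERR_CALLINAFTERXIT);
}
```

A message with substitution parameters would need formatting state beyond a constant lookup, which is why the sketch has no such path.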
      94a9cb2b
  16. 22 Jan, 2019 3 commits
  17. 18 Jan, 2019 3 commits
    • [#205] Generate usable core files for raised fatal signals (e.g.... · 417b88f4
      Narayanan Iyer authored
      [#205] Generate usable core files for raised fatal signals (e.g. SIGSEGV/SIG-11) in multi-thread processes
      
      Problem statement
      -----------------
      In a multi-thread process, if a SIG-11 happens in say the TP worker thread, the signal handler
      generic_signal_handler() is invoked. That in turn notices the current thread is not the MAIN
      worker thread and so forwards the SIG-11 from the TP worker thread to the MAIN worker thread.
      And generic_signal_handler() is invoked again in the MAIN worker thread. It is this invocation
      that will call gtm_fork_n_core() for the fatal SIGSEGV signal. And since gtm_fork_n_core() does
      a fork and then dumps the core file, it is the MAIN worker thread's C-stack that will be captured
      in the core file (a fork only inherits the current thread's C-stack) making the core file unusable
      since we are interested in the SIG-11 that happened in the TP worker thread.
      
      Fix
      ---
      As part of the FORWARD_SIG_TO_MAIN_THREAD_IF_NEEDED macro invoked by the TP worker thread (in the
      above example on a SIG-11), after it forwards the signal to the MAIN worker thread, it does not
      immediately return but waits for a signal (through a new global variable "safe_to_fork_n_core")
      from the MAIN worker thread to indicate it is safe to do a "gtm_fork_n_core" call from the TP
      worker thread even though it is not the MAIN worker thread. The MAIN worker thread pauses execution
      while this core dump happens in the TP worker thread and then continues with its cleanup.
      
      A new macro MULTI_THREAD_AWARE_FORK_N_CORE is invoked by the MAIN worker thread in generic_signal_handler()
      wherever it needs to do a gtm_fork_n_core(). This macro checks if this is a RAISED signal (e.g. SIGSEGV,
      SIGILL etc.) and if so sets safe_to_fork_n_core to TRUE and waits for this to be reset to FALSE (will be
      reset by the TP worker thread or whichever thread got the SIGSEGV and is in the
      FORWARD_SIG_TO_MAIN_THREAD_IF_NEEDED macro). If it is not a RAISED signal, this macro does a gtm_fork_n_core()
      in the MAIN worker thread itself.
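
The handshake described above can be sketched with C11 atomics and pthreads. Only the variable name `safe_to_fork_n_core` comes from the commit message; `core_dumped_by`, the function names and the use of `sched_yield()` for waiting are assumptions made to keep the example small, and the core dump itself is replaced by a marker store.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool safe_to_fork_n_core = false; /* name from the commit; usage is a sketch */
static atomic_int  core_dumped_by = 0;          /* hypothetical: 1 = MAIN, 2 = faulting thread */

/* Runs in the thread that took the fatal signal (e.g. a TP worker thread),
 * after it has forwarded the signal to the MAIN worker thread. */
static void *sketch_faulting_thread(void *arg)
{
    (void)arg;
    while (!atomic_load(&safe_to_fork_n_core))
        sched_yield();                          /* wait for MAIN's go-ahead */
    atomic_store(&core_dumped_by, 2);           /* stands in for gtm_fork_n_core() */
    atomic_store(&safe_to_fork_n_core, false);  /* tell MAIN the dump is done */
    return NULL;
}

/* Runs in the MAIN worker thread wherever it would otherwise dump core itself. */
static void sketch_multi_thread_aware_fork_n_core(bool raised_signal)
{
    if (!raised_signal) {
        atomic_store(&core_dumped_by, 1);       /* dump core in MAIN itself */
        return;
    }
    atomic_store(&safe_to_fork_n_core, true);
    while (atomic_load(&safe_to_fork_n_core))
        sched_yield();                          /* pause until the other thread is done */
}

/* Usage: a raised signal makes the faulting thread, not MAIN, do the dump. */
static void sketch_core_demo(void)
{
    pthread_t t;

    assert(0 == pthread_create(&t, NULL, sketch_faulting_thread, NULL));
    sketch_multi_thread_aware_fork_n_core(true);
    assert(2 == atomic_load(&core_dumped_by));
    assert(0 == pthread_join(t, NULL));
}
```

Because the dump happens in the thread that received the signal, the forked child would carry that thread's C-stack, which is the one of interest.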
      417b88f4
    • [#205] Handle case where MAIN worker thread is asked to exit right away... · c2486ced
      Narayanan Iyer authored
      [#205] Handle case where MAIN worker thread is asked to exit right away without waiting for a logical point to be reached (i.e. forced_simplethreadapi_exit == FALSE)
      
      A new function ydb_stm_thread_exit() is added to sr_unix/ydb_stm_thread.c.
      This takes care of exit handling related activities for the MAIN worker thread.
      
      It is invoked from ydb_stm_thread() when a logical point has been reached where it is safe to exit.
      This is the case where "forced_simplethreadapi_exit" is TRUE.
      
      ydb_stm_thread_exit() is also invoked (from gtm_exit_handler()) in case "forced_simplethreadapi_exit"
      is FALSE. In this case, it is possible we will never reach a logical point for safe exit (i.e. TP
      worker thread can never terminate as it is waiting for MAIN worker thread to service request which
      has been interrupted to handle exit handler request). To avoid deadlock in these cases, the MAIN
      worker thread does not wait indefinitely for the TP worker thread to terminate.  Instead we wait
      for 1000 iterations of 1 microsecond each for a total of around 1 milli-second per TP worker thread
      before moving on with YottaDB exit handling.
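
A bounded wait of this shape might look like the sketch below. The iteration count comes from the text above; the completion flag `done`, the function name and the use of `nanosleep()` are assumptions, and the real code may detect thread termination differently. Note that on most systems each 1 microsecond sleep takes noticeably longer than requested, so the total is an approximation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

#define SKETCH_MAX_WAIT_ITERS 1000  /* iteration count from the commit message */

/* Poll `done` up to SKETCH_MAX_WAIT_ITERS times, sleeping about 1 microsecond
 * between polls. Returns true if the thread finished, false on timeout. */
static bool sketch_wait_for_thread_exit(atomic_bool *done)
{
    const struct timespec one_usec = { 0, 1000 };  /* 1000 ns = 1 microsecond */
    int i;

    for (i = 0; i < SKETCH_MAX_WAIT_ITERS; i++) {
        if (atomic_load(done))
            return true;
        nanosleep(&one_usec, NULL);
    }
    return atomic_load(done);
}

/* Usage: a finished thread is seen at once, a stuck one is given up on. */
static void sketch_wait_demo(void)
{
    atomic_bool finished = true, stuck = false;

    assert(sketch_wait_for_thread_exit(&finished));
    assert(!sketch_wait_for_thread_exit(&stuck));
}
```

On a false return, the caller moves on with exit handling instead of blocking.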
      
      Additionally, changes to sr_port/deferred_signal_handler.c and sr_unix/generic_signal_handler.c
      that deferred exit processing in case we are the MAIN worker thread (done in a prior commit
      305fe697) are now reverted. This is because it is possible the MAIN
      worker thread is servicing a long running command (e.g. ydb_ci_t of a call-in M program that runs
      for ever until stopped using a kill -15 as is done in the dual_fail_extend/dual_fail2_mustop_sigquit
      subtest). In that case if we defer the signal, the MAIN worker thread that is in ydb_ci_t() will
      never come back to ydb_stm_threadq_dispatch() which means the process will never terminate if we
      defer exit handling.  With the introduction of ydb_stm_thread_exit() in the current commit, the
      MAIN worker thread will wait for TP worker threads to terminate and time out (instead of waiting
      indefinitely) and continue with exit processing. This wait should address the original issue raised
      by the prior commit and so it is okay to revert these two module changes from that commit.
      c2486ced
    • [#205] If MAIN worker thread gets SIG-15, wait for MAIN/TP worker threads to... · 305fe697
      Narayanan Iyer authored
      [#205] If MAIN worker thread gets SIG-15, wait for MAIN/TP worker threads to reach logical point before starting exit handler processing
      
      We had a test failure (in the dual_fail_extend/dual_fail2_mustop_sigquit subtest) where a SimpleThreadAPI
      process was sent a SIG-15 by the test and the signal got delivered to the MAIN worker thread but it
      went ahead with exit handler processing (including rolling back an active TP transaction) while a
      TP worker thread was concurrently running the TP callback function without realizing all of this going on.
      The TP worker thread effectively got an INVTPTRANS error since it was using a non-zero tptoken in a
      ydb_set_st() call when there was no active TP transaction (due to the exit handler doing an op_trollback()).
      
      The fix is to defer exit processing in generic_signal_handler.c if we find out that we are the
      MAIN worker thread. This way the MAIN worker thread will invoke the exit handler gtm_exit_handler()
      inside ydb_stm_thread() when it knows it is a logical/safe point to do so.
      
      In addition, deferred_signal_handler() is now fixed to skip invoking the exit handler in case we
      are the MAIN worker thread. This is because ydb_stm_thread() has an already established scheme
      (using "forced_simplethreadapi_exit" global variable) to determine the logical point and then invoke
      gtm_exit_handler().
      
      Below is the C-stack of all threads at the time of the core for the record.
      
      (gdb) thread apply all bt
      
      Thread 3 (Thread 0x7fde4cb67700 (LWP 14698)):
       #0  fsync () from /usr/lib64/libc.so.6
       #1  jnl_fsync (reg=0x55af6c90e7b8, fsync_addr=38517184) at sr_unix/jnl_fsync.c:134
       #2  wcs_flu (options=519) at sr_unix/wcs_flu.c:413
       #3  gds_rundown (cleanup_udi=1) at sr_unix/gds_rundown.c:608
       #4  gv_rundown () at sr_port/gv_rundown.c:123
       #5  gtm_exit_handler () at sr_unix/gtm_exit_handler.c:216
       #6  __run_exit_handlers () from /usr/lib64/libc.so.6
       #7  exit () from /usr/lib64/libc.so.6
       #8  gtm_image_exit (status=-15) at sr_unix/gtm_image_exit.c:27
       #9  generic_signal_handler (sig=15, info=0x7fde4cb66830, context=0x7fde4cb66700) at sr_unix/generic_signal_handler.c:380
       #10 <signal handler called>
       #11 pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
       #12 ydb_stm_thread (parm=0x0) at sr_unix/ydb_stm_thread.c:123
       #13 start_thread () from /usr/lib64/libpthread.so.0
       #14 clone () from /usr/lib64/libc.so.6
      
      Thread 2 (Thread 0x7fde510c6dc0 (LWP 14695)):
       #0  do_futex_wait.constprop () from /usr/lib64/libpthread.so.0
       #1  __new_sem_wait_slow.constprop.0 () from /usr/lib64/libpthread.so.0
       #2  ydb_stm_args (callblk=0x55af6c96b550) at sr_unix/ydb_stm_args.c:183
       #3  ydb_stm_args5 (tptoken=0, errstr=0x0, calltyp=16, p1=94211928230125, p2=140733677288928, p3=94211928265280, p4=1, p5=140733677288912) at sr_unix/ydb_stm_args.c:320
       #4  ydb_tp_st (tptoken=0, errstr=0x0, tpfn=0x55af6c8408ed <tpfn_stage1>, tpfnparm=0x7fff1cd7bde0, transid=0x55af6c849240 <tptypebuff> "BATCH", namecount=1, varnames=0x7fff1cd7bdd0) at sr_unix/ydb_tp_st.c:33
       #5  impjob (childnum=2) at simplethreadapi_imptp.c:1148
       #6  main (argc=1, argv=0x7fff1cd7c198) at simplethreadapi_imptp.c:602
      
      Thread 1 (Thread 0x7fde47fff700 (LWP 14705)):
       #0  pthread_kill () from /usr/lib64/libpthread.so.0
       #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:72
       #2  ch_cond_core () at sr_unix/ch_cond_core.c:76
       #3  rts_error_va (csa=0x0, argcnt=7, var=0x7fde47ffeaa0) at sr_unix/rts_error.c:194
       #4  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
       #5  ydb_stm_args (callblk=0x7fde40000b20) at sr_unix/ydb_stm_args.c:126
       #6  ydb_stm_args4 (tptoken=7085, errstr=0x0, calltyp=12, p1=94211928265184, p2=2, p3=94211928261632, p4=94211928263568) at sr_unix/ydb_stm_args.c:298
       #7  ydb_set_st (tptoken=7085, errstr=0x0, varname=0x55af6c8491e0 <ygbl_arandom>, subs_used=2, subsarray=0x55af6c848400 <subscr>, value=0x55af6c848b90 <ybuff_val>) at sr_unix/ydb_set_st.c:33
       #8  tpfn_stage1 (tptoken=7085, errstr=0x0, parm_array=0x7fff1cd7bde0) at simplethreadapi_imptp.c:1384
       #9  ydb_stm_tpthreadq_process (curTPWorkQHead=0x7fde48024c40, forced_simplethreadapi_exit_seen=0x7fde47ffeea8) at sr_unix/ydb_stm_tpthread.c:225
       #10 ydb_stm_tpthread (parm=0x0) at sr_unix/ydb_stm_tpthread.c:84
       #11 start_thread () from /usr/lib64/libpthread.so.0
       #12 clone () from /usr/lib64/libc.so.6
      305fe697
  18. 17 Jan, 2019 7 commits
    • [#205] If exit handling is deferred in SimpleThreadAPI process (e.g. TP... · 8c2e7f3a
      Narayanan Iyer authored
      [#205] If exit handling is deferred in SimpleThreadAPI process (e.g. TP transaction in final retry when SIGTERM/SIG-15 is received), MAIN/TP worker threads should service requests without YDB_ERR_CALLINAFTERXIT errors until the TP transaction commits
      
      If SIGTERM is sent and generic_signal_handler() gets invoked, if we find that DEFER_EXIT_PROCESSING
      is TRUE, we do not invoke the exit handler right away but instead defer it until it is safe to
      start exit handler processing. But we do invoke SET_FORCED_EXIT_STATE which would set the global
      variable "forced_thread_exit" to TRUE. And since this is the variable currently relied upon by the
      SimpleThreadAPI worker threads (ydb_stm_thread.c and ydb_stm_tpthread.c), they would return
      YDB_ERR_CALLINAFTERXIT on all pending requests in their work queues. But this means that if a TP
      transaction is active and in the final retry (which also means we are holding crit on the database)
      at the time the SIGTERM was received and exit handling was deferred, this TP transaction will never
      commit cleanly because any ydb_*_s() requests done inside this callback function after the SIGTERM signal
      got sent would return with a YDB_ERR_CALLINAFTERXIT error. This is not desirable as we want the
      crit-holding transaction to be done as soon as possible and in a clean fashion.
      
      Therefore, the worker threads design is reworked a bit to now rely on "forced_simplethreadapi_exit",
      a new global variable. On seeing this, they will exit right away (what they previously used to do
      when they saw "forced_thread_exit" to be TRUE).
      
      ydb_exit() and generic_signal_handler() will invoke SET_FORCED_EXIT_STATE to set "forced_thread_exit"
      to TRUE whenever they want the SimpleThreadAPI process to terminate at the next logical point.
      The MAIN worker thread, before it attempts to service any request, will check "forced_thread_exit"
      and if it finds this to be TRUE, but finds "forced_simplethreadapi_exit" to be FALSE and OK_TO_INTERRUPT
      is TRUE, it will set "forced_simplethreadapi_exit" to be TRUE to indicate the logical point has been
      reached and that the worker thread should exit right away.
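
The two-flag scheme in the last paragraph can be sketched as a small check run before each request. The two flag names come from the text above; the function `sketch_check_exit_before_request()` and its `ok_to_interrupt` parameter (standing in for the OK_TO_INTERRUPT check) are hypothetical, and the real flags are shared between threads, which this single-threaded sketch does not model.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag names come from the commit message; everything else is hypothetical. */
static bool forced_thread_exit = false;          /* exit requested (signal or ydb_exit()) */
static bool forced_simplethreadapi_exit = false; /* logical point reached, exit now */

/* Called by the MAIN worker thread before servicing each request.
 * Returns true if the worker should stop servicing requests and exit. */
static bool sketch_check_exit_before_request(bool ok_to_interrupt)
{
    if (forced_thread_exit && !forced_simplethreadapi_exit && ok_to_interrupt)
        forced_simplethreadapi_exit = true;      /* promote the request to "exit now" */
    return forced_simplethreadapi_exit;
}

/* Usage: requests keep being serviced until it is safe to interrupt. */
static void sketch_exit_demo(void)
{
    assert(!sketch_check_exit_before_request(true));   /* no exit requested yet */
    forced_thread_exit = true;                         /* e.g. SIGTERM arrived */
    assert(!sketch_check_exit_before_request(false));  /* TP in final retry: keep servicing */
    assert(sketch_check_exit_before_request(true));    /* logical point reached: exit */
}
```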
      8c2e7f3a
    • [#406] Move mupip journal recovery and deletion of old prior generation... · a0472d5e
      K.S. Bhaskar authored
      [#406] Move mupip journal recovery and deletion of old prior generation journal files and mupip journal recovery logs from ydb to ydb_env_set; Remove mupip rundown - it's not needed since database is recovered on startup
      a0472d5e
    • [#205] Handle potential YDB_ERR_CALLINAFTERXIT return from TP callback function in TP worker thread · 1312538f
      Narayanan Iyer authored
      The v60000/gtm4525b subtest failed (1 out of 100 runs or so) with the following assert failure.
      The TP worker thread asked for a op_trollback to be done by the MAIN worker thread (using the
      LYDB_RTN_TP_ROLLBACK_TLVL0 opcode) but instead got a YDB_ERR_CALLINAFTERXIT status returned.
      This is possible because the test does a MUPIP STOP (i.e. kill -15) of processes which would
      cause the MAIN/TP worker threads to be signaled to terminate at which point, any more requests
      from them will return with YDB_ERR_CALLINAFTERXIT. The assert is now modified to take this into
      account. Although this did not show up in the test failure, a similar issue exists with the opcode
      LYDB_RTN_TP_ROLLBACK_TLVL0 and so even in that case we now handle the possibility of
      YDB_ERR_CALLINAFTERXIT.
      
      Below is a gdb session of the core failure for the record.
      
       #0 pthread_kill () from /usr/lib64/libpthread.so.0
       #1 gtm_dump_core () at sr_unix/gtm_dump_core.c:72
       #2 ch_cond_core () at sr_unix/ch_cond_core.c:76
       #3 rts_error_va () at sr_unix/rts_error.c:194
       #4 rts_error_csa () at sr_unix/rts_error.c:101
       #5 ydb_stm_tpthreadq_process () at sr_unix/ydb_stm_tpthread.c:259
       #6 ydb_stm_tpthread (parm=0x0) at sr_unix/ydb_stm_tpthread.c:83
       #7 start_thread () from /usr/lib64/libpthread.so.0
       #8 clone () from /usr/lib64/libc.so.6
      
      (gdb) f 5
       #5  ydb_stm_tpthreadq_process () at sr_unix/ydb_stm_tpthread.c:259
      259                                                     assert(YDB_TP_ROLLBACK == rlbk_retval);
      
      (gdb) p rlbk_retval
      $1 = -150381530
      
      (gdb) p int_retval
      $2 = -150381530
      
      libydberrors.h:#define YDB_ERR_CALLINAFTERXIT -150381530
      1312538f
    • [#205] Make ydb_init() multi-thread safe in case of STAPIFORKEXEC and... · a3b3158d
      Narayanan Iyer authored
      [#205] Make ydb_init() multi-thread safe in case of STAPIFORKEXEC and CALLINAFTERXIT errors; Ensure ydb_zstatus() returns error string after a CALLINAFTERXIT return from ydb_init
      
      Ensure we get the YottaDB engine pthread mutex before checking for STAPIFORKEXEC or CALLINAFTERXIT
      errors. This is needed since SETUP_GENERIC_ERROR macro (which is invoked in both these error scenarios)
      operates on global variables (e.g. dollar_zstatus) and therefore has to be multi-thread safe since
      ydb_init() is supposed to be multi-thread safe.
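
The ordering described above can be sketched as follows. This is an illustration under stated assumptions, not the actual ydb_init() code: engine_mutex, exit_already_done, zstatus_buf, sketch_init() and sketch_zstatus() are all hypothetical names, and the error string format is simplified.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define YDB_OK			0
#define YDB_ERR_CALLINAFTERXIT	-150381530	/* value quoted from libydberrors.h elsewhere in this log */

/* Hypothetical stand-ins for engine globals; not the actual YottaDB definitions. */
static pthread_mutex_t	engine_mutex = PTHREAD_MUTEX_INITIALIZER;
static int		exit_already_done;	/* set once the engine has shut down */
static char		zstatus_buf[128];	/* plays the role of dollar_zstatus */

/* Hypothetical init routine: take the engine mutex BEFORE looking at the error
 * condition, because recording the error writes to shared global state.
 */
static int sketch_init(void)
{
	int	status = YDB_OK;

	pthread_mutex_lock(&engine_mutex);
	if (exit_already_done)
	{
		status = YDB_ERR_CALLINAFTERXIT;
		/* Simplified error text; the real $ZSTATUS format is richer. */
		snprintf(zstatus_buf, sizeof(zstatus_buf), "%d,CALLINAFTERXIT", status);
	}
	pthread_mutex_unlock(&engine_mutex);
	return status;
}

/* Hypothetical accessor: copy the recorded error string out under the same mutex. */
static void sketch_zstatus(char *dst, size_t dst_len)
{
	assert(NULL != dst);
	pthread_mutex_lock(&engine_mutex);
	snprintf(dst, dst_len, "%s", zstatus_buf);
	pthread_mutex_unlock(&engine_mutex);
}
```

Because both the check and the write happen under one lock, two threads calling the init routine concurrently cannot interleave their updates to the shared status buffer.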
      
      Invoke SETUP_GENERIC_ERROR macro in case of a CALLINAFTERXIT error in ydb_init(). This error was being
      issued in two places in the same function. The duplication is now removed too.
      a3b3158d
    • Narayanan Iyer's avatar
      [#205] Do not print CALLINAFTERXIT error message from a ydb_init() call onto... · 0638d296
      Narayanan Iyer authored
      [#205] Do not print CALLINAFTERXIT error message from a ydb_init() call onto stderr/syslog; Just return YDB_ERR_CALLINAFTERXIT
      
      Writing to stderr or syslog (through gtm_putmsg_csa or send_msg_csa) is user-unfriendly for what is
      a programming error. It is better to return the error from the invoked YottaDB function so the
      caller can handle it as appropriate.
      0638d296
    • Narayanan Iyer's avatar
      [#205] Fix potential lost wake-up (and accompanying deadlock in pthread_join... · 22e1fcce
      Narayanan Iyer authored
      [#205] Fix potential lost wake-up (and accompanying deadlock in pthread_join during exit handling) by instead doing non-blocking join of MAIN/TP worker threads in a sleep-loop and sending multiple wake-ups
      
      * The main changes are in sr_unix/gtmci.c and sr_unix/ydb_stm_thread.c.
        These are necessary to fix a deadlock that happens when the thread invoking ydb_exit() does
        a "pthread_cond_signal" to wake up a MAIN/TP worker thread but the receiving thread is not yet
        in a "pthread_cond_wait". The wake-up signal is therefore lost, and the "pthread_join" that
        the ydb_exit() thread then runs hangs forever if the receiving worker thread goes on to do a
        "pthread_cond_wait" soon afterwards. The fix is to do a non-blocking join (using the
        Linux-specific pthread_tryjoin_np() function) in a sleep-loop and to do a
        "pthread_cond_signal" in each iteration of the loop. Additionally, the cond/mutex variables
        across the various structures (stmWorkQueue, stmTPWorkQueue) are now destroyed only after they
        have been used for waiting/signaling a wake-up and once they are definitely no longer needed.
        This meant moving the destroy logic for those cond/mutex variables used by the MAIN worker thread
        to the ydb_exit()-invoking thread and moving the destroy logic for those cond/mutex variables
        used by the TP worker thread to the MAIN worker thread.
      
      * Also fixed cosmetic tab vs space issues in sr_unix/libyottadb.h
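
The sleep-loop described in the first bullet can be sketched as below. This is a simplified illustration, not the code in sr_unix/gtmci.c: struct work_queue, worker_main(), start_worker() and stop_worker() are hypothetical names, and the exit flag is modeled as a C11 atomic that is set without holding the mutex, which is what opens the window for a lost wake-up.

```c
#define _GNU_SOURCE		/* needed for pthread_tryjoin_np() on glibc */
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

/* Hypothetical stand-in for a work queue such as stmWorkQueue. */
struct work_queue
{
	pthread_mutex_t	mutex;
	pthread_cond_t	cond;
	atomic_int	exit_requested;
};

/* Worker: is busy for a while, so it may reach pthread_cond_wait() only after
 * the exiting thread has already sent its first wake-up.
 */
static void *worker_main(void *arg)
{
	struct work_queue	*q = arg;

	usleep(20 * 1000);	/* simulate being busy when the first signal arrives */
	pthread_mutex_lock(&q->mutex);
	/* The flag is set without the mutex, so a signal sent between this check and
	 * the wait below would be lost; the repeated signals in stop_worker() cover that.
	 */
	while (!atomic_load(&q->exit_requested))
		pthread_cond_wait(&q->cond, &q->mutex);
	pthread_mutex_unlock(&q->mutex);
	return NULL;
}

static int start_worker(struct work_queue *q, pthread_t *tid)
{
	assert(NULL != q);
	pthread_mutex_init(&q->mutex, NULL);
	pthread_cond_init(&q->cond, NULL);
	atomic_init(&q->exit_requested, 0);
	return pthread_create(tid, NULL, worker_main, q);
}

/* Returns the number of wake-ups sent before the join succeeded, or -1 on error. */
static int stop_worker(struct work_queue *q, pthread_t tid)
{
	int	rc, wakeups = 0;

	atomic_store(&q->exit_requested, 1);
	for ( ; ; )
	{
		pthread_cond_signal(&q->cond);		/* re-sent every iteration in case an earlier one was lost */
		wakeups++;
		rc = pthread_tryjoin_np(tid, NULL);	/* non-blocking: EBUSY means "still running" */
		if (0 == rc)
			break;
		if (EBUSY != rc)
			return -1;
		usleep(1000);
	}
	/* Destroy the cond/mutex only now, when no thread can wait on them anymore. */
	pthread_cond_destroy(&q->cond);
	pthread_mutex_destroy(&q->mutex);
	return wakeups;
}
```

A caller would pair start_worker(&q, &tid) with a later stop_worker(&q, tid). A blocking pthread_join() after a single signal could hang here, whereas the loop keeps nudging the worker until the join succeeds.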
      22e1fcce
    • Narayanan Iyer's avatar
      [#205] In ydb_fork_n_core/gtm_fork_n_core, preserve C-stack of all threads in... · 8bed423e
      Narayanan Iyer authored
      [#205] In ydb_fork_n_core/gtm_fork_n_core, preserve C-stack of all threads in core for better debugging (only for DEBUG builds)
      
      A prior commit enabled a similar change in sr_unix/ch_cond_core.c but that was done only in case of
      a fatal error in the YottaDB engine. It is possible for ydb_fork_n_core/gtm_fork_n_core to be called
      without the engine encountering a fatal error (e.g. a test C program that uses the SimpleThreadAPI
      could encounter a YDB_ASSERT macro failure which will invoke ydb_fork_n_core) and we want the
      C-stack of all threads even in that case for better debugging.
      
      Note that because of this change, the code flow for PRO vs DBG builds is different. In PRO, a call
      to ydb_fork_n_core/gtm_fork_n_core generates a core (a snapshot of the process state for later
      debugging) and the process continues, whereas in DBG the process creates the core and terminates
      right then. Given this is done only in DBG builds, it is considered okay.
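
For context, a fork-then-core flow of the PRO kind can be sketched as follows. This is a hypothetical helper, not the YottaDB implementation. It also shows why such a core cannot hold every thread's C-stack: a forked child contains only the thread that called fork().

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: snapshot the process by forking and letting the child abort.
 * Returns the signal that ended the child, or -1 on failure. The parent keeps
 * running afterwards (the PRO-style flow described above).
 */
static int fork_and_core(void)
{
	pid_t	child;
	int	status;

	child = fork();
	if (-1 == child)
		return -1;
	if (0 == child)
		abort();	/* child: SIGABRT terminates it and may write a core file */
	if (-1 == waitpid(child, &status, 0))
		return -1;
	assert(WIFSIGNALED(status) || WIFEXITED(status));
	return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

Whether a core file is actually written depends on system settings such as the core size limit.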
      8bed423e
  19. 16 Jan, 2019 1 commit
    • Narayanan Iyer's avatar
      Fix few more build warnings from gcc LTO; Speed up LTO time by parallelizing link · fec905f3
      Narayanan Iyer authored
      The LTO (Link Time Optimization) build of gcc showed a few warnings only on an Ubuntu 16.04 system.
      The warnings were that a global variable was declared with inconsistent types across multiple C files.
      In all these cases, the GBLDEF type was taken as the correct one and all the GBLREF usages were
      fixed to match the GBLDEF. This also meant in some cases adding a type cast before calling system
      functions (e.g. rename(), unlink() etc.).
      
      Since the global variable "source_name_len" was fixed to be an "unsigned short" everywhere, for
      consistency the global variable "object_name_len" was also fixed to be an "unsigned short" everywhere.
      
      In addition, the cmake script now determines the # of CPUs available on the system and passes it
      to the -flto option to let the link also be parallelized. This greatly cut the link time for
      libyottadb.so (on one system, total build time dropped from 3.25 minutes to 1.75 minutes).
      fec905f3
  20. 15 Jan, 2019 2 commits
    • Narayanan Iyer's avatar
      [#205] Request TROLLBACK in TP worker thread only if still in TP; Avoids... · 038c6f37
      Narayanan Iyer authored
      [#205] Request TROLLBACK in TP worker thread only if still in TP; Avoids YDB_ERR_INVTPTRANS errors while queueing the LYDB_RTN_TP_ROLLBACK_TLVL0 request (issue exposed by r124/ydb383 subtest in SimpleThreadAPI mode)
      038c6f37
    • Narayanan Iyer's avatar
      [#205] Ensure exit handler code always runs in MAIN worker thread in SimpleThreadAPI mode · aca5db63
      Narayanan Iyer authored
      * In gtm_exit_handler(), which is the function guaranteed to be invoked when a YottaDB process needs
        to exit, if SimpleThreadAPI is in effect and we are not the MAIN worker thread, call ydb_exit() so
        worker/tp threads are signaled to exit, exit handler is driven from the MAIN worker thread, and we
        wait for all those threads to terminate before ydb_exit() returns.
      
      * In ydb_exit(), at function entry, if simpleThreadAPI_active is TRUE and we are not the MAIN
        worker thread, set a global variable "forced_thread_exit" to TRUE and send a signal through
        pthread_cond_signal() to indicate to the MAIN and TP worker threads that they need to exit at
        a logical point. Then wait for those threads to die and return to the caller. Also fix
        various edge cases in ydb_exit() so it is always multi-thread safe. A lot of the SimpleThreadAPI
        related cleanup code has now been moved to ydb_stm_thread.c where the MAIN worker thread runs; it
        does all this cleanup when it exits. Some "SEE TODO" items are also taken care of now and so removed.
      
      * ydb_stm_args*() functions now do not wait for the call block to be serviced by the MAIN worker thread
        in case "forced_thread_exit" is TRUE. They return with a CALLINAFTERXIT error in this case. This ensures
        that new calls to SimpleThreadAPI functions after a ydb_exit() is done no longer queue a request to
        the MAIN worker thread (which is non-existent or in the process of concurrently exiting).
      
      * In the MAIN worker thread (ydb_stm_thread.c), check if forced_thread_exit is TRUE. If so, go through the
        queue and service each waiting request with a CALLINAFTERXIT error as the return value. Then
        invoke the exit handler gtm_exit_handler(), wait for TP worker threads to terminate, and do
        various SimpleThreadAPI data structure cleanup before exiting from the worker thread.
      
      * In the TP worker thread (ydb_stm_tpthread.c), check if forced_thread_exit is TRUE. If so, go through the
        queue and service each waiting request with a CALLINAFTERXIT error as the return value. Then exit.
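
The queue-draining step that both worker threads perform can be sketched as below. This is a simplified illustration: struct call_block, struct request_queue and drain_queue_on_exit() are hypothetical, and the locking and per-request wake-ups that real SimpleThreadAPI code would need are omitted.

```c
#include <assert.h>
#include <stddef.h>

#define YDB_ERR_CALLINAFTERXIT	-150381530	/* value quoted from libydberrors.h elsewhere in this log */

/* Hypothetical, simplified call block; real SimpleThreadAPI structures differ. */
struct call_block
{
	struct call_block	*next;
	int			retval;
	int			serviced;
};

struct request_queue
{
	struct call_block	*head;
};

static int forced_thread_exit;	/* stand-in for the global flag named in this commit */

/* Complete every queued request with CALLINAFTERXIT once exit has been forced.
 * Returns how many requests were drained.
 */
static int drain_queue_on_exit(struct request_queue *q)
{
	struct call_block	*cb;
	int			drained = 0;

	assert(NULL != q);
	if (!forced_thread_exit)
		return 0;
	while (NULL != (cb = q->head))
	{
		q->head = cb->next;	/* unlink the request from the queue */
		cb->next = NULL;
		cb->retval = YDB_ERR_CALLINAFTERXIT;
		cb->serviced = 1;	/* the waiting caller can now pick up retval */
		drained++;
	}
	return drained;
}
```

Draining rather than discarding the queue means every thread already waiting on a request gets a definite status instead of waiting on a worker that is about to exit.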
      aca5db63
  21. 11 Jan, 2019 1 commit