  1. 21 Jul, 2021 1 commit
    • [#754] Add new ydbinstall script options to build and install YDB from a git repository · f1a8a0b0
      Brad Westhafer authored
      This commit adds the `--from-source` and `--branch` options to the ydbinstall script to build and install YottaDB from a git repository.
      
      The `--from-source` option does a git clone on the specified git repo, builds YottaDB from this repo and passes all relevant arguments on to the build's version of ydbinstall to install the build.
      
      The `--branch` option specifies a git branch to build from when the `--from-source` option is also specified. This does a git checkout command to select the branch before building YottaDB. If the specified branch exists, it will build the branch. Otherwise, the build fails with an appropriate error message.
  2. 20 Jul, 2021 4 commits
    • [DEBUG-ONLY] Fix rare v61000_intrpt_wcs_wtstart subtest failure by accounting... · c48395c4
      Narayanan Iyer authored
      [DEBUG-ONLY] Fix rare v61000_intrpt_wcs_wtstart subtest failure by accounting for WBTEST_SLEEP_IN_WCS_WTSTART
      
      * The `v61000_0/intrpt_wcs_wtstart` subtest in the `YDBTest` project occasionally failed in in-house testing
        with the following symptom.
      
        ```diff
        > %YDB-F-ASSERT, Assert failed in sr_port/t_end.c line 1325 for expression (ydb_white_box_test_case_enabled && (WB_PHASE2_COMMIT_ERR || (WBTEST_JNL_FILE_LOST_DSKADDR == ydb_white_box_test_case_number)))
        ```
      
      * This turns out to be a longstanding issue in code designed for the `WBTEST_SLEEP_IN_WCS_WTSTART` white-box
        test case. And since it is related to white-box code, it is only an issue in Debug builds, not Release builds.
      
      * In the failed test run, the value of the env var `gtm_white_box_test_case_count` was `706`. This meant
        it would exercise the `SLEEP_ON_WBOX_COUNT(6)` code in `sr_unix/wcs_wtstart.c` which would effectively
        cause the process to go into a sleep loop just before the `DECR_INTENT_WTSTART` call (i.e. `cnl->intent_wtstart`
        w...
    • [DOC] Adding release notes to YDB repo · aad46677
      Jaahanavee Sikri authored
       - Added r1.32 release notes
    • Fix race conditions in file descriptor validity check code by reverting... · 8c554336
      Narayanan Iyer authored
      Fix race conditions in file descriptor validity check code by reverting D9I11-002714 changes (from GT.M V5.3-004)
      
      * The `stress_1/concurr` subtest failed once on an in-house `x86_64` system due to an assert failure in the
        update helper writer process (on the receiver side of the replicated test) with the following symptom.
      
        ```sh
        $ cat RCVR_22_02_14.log.uhw_49346
        Wed Jul  7 22:02:14 2021 : Helper writer started. PID 49346 [0xC0C2]
        %YDB-F-ASSERT, Assert failed in sr_unix/gtm_dbjnl_dupfd_check.c line 76 for expression (FALSE)
        Wed Jul  7 22:02:23 2021 : Helper exiting...
        ```
      
      * The corresponding C-stack in the generated core file is pasted below.
      
        ```c
        #0  pthread_kill () from /usr/lib64/libpthread.so.0
        #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
        #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
        #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
        #4  rts_error_va (csa=0x0, argcnt=7, var=0x7ffc11714d60) at sr_unix/rts_error.c:192
        #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
        #6  gtm_check_fd_is_valid (reg=0x110f730, is_db=0, fd=13) at sr_unix/gtm_dbjnl_dupfd_check.c:76
        #7  gtm_dupfd_check_specific (reg=0x110f730, open_fdarray=0x7ffc117150b0, fd=13, is_db=0) at sr_unix/gtm_dbjnl_dupfd_check.c:98
        #8  gtm_dbjnl_dupfd_check () at sr_unix/gtm_dbjnl_dupfd_check.c:164
        #9  jnl_file_open (reg=0x110f3e0, init=1) at sr_unix/jnl_file_open.c:213
        #10 jnl_ensure_open (reg=0x110f3e0, csa=0x1128840) at sr_port/jnl_ensure_open.c:71
        #11 updhelper_writer () at sr_port/updhelper_writer.c:148
        #12 mupip_main (argc=4, argv=0x7ffc11718db8, envp=0x7ffc11718de0) at sr_unix/mupip_main.c:122
        #13 dlopen_libyottadb (argc=4, argv=0x7ffc11718db8, envp=0x7ffc11718de0, main_func=0x401470 "mupip_main") at sr_unix/dlopen_libyottadb.c:151
        #14 main (argc=4, argv=0x7ffc11718db8, envp=0x7ffc11718de0) at sr_unix/mupip.c:22
        ```
      
      * This turned out to be an incorrect assert failure due to a race condition.
      
      * In this case, what happened is that the update process (say `P1`) was in the process of switching the
        journal file of the `EFREG` region and had finished executing `Line 218` but had not started to execute
        `Line 219` below (all while holding crit).
      
        ```c
        jnl_file_close.c
        ----------------
            213   /* jnl_file_id should be nullified only after the jnl file header has been written to disk.
            214    * Nullifying the jnl_file_id signals that the jnl file has been switched. The replication source server
            215    * assumes that the jnl file has been completely written to disk (including the header) before the switch is
            216    * signalled.
            217    */
            218   NULLIFY_JNL_FILE_ID(csa);
            219   jb->cycle++;    /* increment shared cycle so all future callers of jnl_ensure_open recognize journal switch */
        ```
      
      * At the same time, an update helper writer process (say `P2`) was in the process of opening the journal file
        of the `DEFAULT` region. As part of that, it executes Debug-only logic which checks all open database and/or
        journal file descriptors in the process to make sure they are valid. The helper writer process had already
        opened a journal file for the `EFREG` region and so that descriptor got checked below (note: we don't
        hold crit on the `EFREG` region at this point).
      
      * In `P2`, the left hand side of the `&&` check at `Line 74` below evaluated to `TRUE` (because `Line 218` above
        has completed executing in `P1`). At the same time, the right hand side of the `&&` check in the same line
        evaluated to `TRUE` (because `Line 219` above has not yet executed in `P1`).
      
        ```c
        gtm_dbjnl_dupfd_check.c
        -----------------------
             71    /* If fd does not point back to journal file, it could be because of a concurrent journal switch.
             72     * Check that. If that fails as well, go ahead and fix the journal file descriptor.
             73     */
             74    if (!is_gdid_stat_identical(JNL_GDID_PTR(csa), &stat_buf) && !JNL_FILE_SWITCHED(csa->jnl))
             75    {
             76            FIX_CORRUPT_JNLFD(reg); /* Journal file fd is corrupt. Fix it. */
        ```
      
      * And therefore, we went into `Line 76` above thinking there was a corrupt fd. But had `P2` waited for
        `Line 219` in `P1` to finish, it would not have come to this incorrect conclusion and would not have
        failed the assert.
      
      * A Debug/DBG build would fail an assert, as happened in the test failure. A Release/PRO build would generate
        a core file for later analysis, send a GVFAILCORE syslog message and continue processing as if this check had
        succeeded. So the user would not notice any disruption other than a core file.
      
      * But the conditions required for this to be an issue in Release/PRO builds are that the process opens more
        than 256 file descriptors (which is in itself a rare thing) and then encounters the above small timing window
        (`P1` is in between `Line 218` and `Line 219` and `P2` is in `Line 74` above). This set of conditions is almost
        impossible to achieve in practice, but it needs to be fixed nevertheless.
      
      * In my understanding, the correct fix for this is for `P2` to grab the critical section lock in case `Line 74`
        evaluates to `TRUE` and then redo the check and if it is still `TRUE` then go ahead with the `FIX_CORRUPT_JNLFD`
        macro like it currently does. This would fix the issue because grabbing the critical section (aka `crit`) would
        have to wait for `P1` to release the critical section lock which in turn means it would wait for `P1` to finish
        executing `Line 219` thereby avoiding the race condition that `Line 74` sees when outside of `crit`.
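
        A minimal sketch of that double-checked approach (ultimately not adopted, per the next bullet); the
        `was_crit` local and the `grab_crit()`/`rel_crit()` calls are assumptions, while the checks follow
        the `gtm_dbjnl_dupfd_check.c` excerpt above:

        ```c
        if (!is_gdid_stat_identical(JNL_GDID_PTR(csa), &stat_buf) && !JNL_FILE_SWITCHED(csa->jnl))
        {       /* The unprotected check failed. Recheck while holding crit so a concurrent
                 * journal switch (P1 between Line 218 and Line 219) cannot make us
                 * misdiagnose a corrupt fd.
                 */
                was_crit = csa->now_crit;
                if (!was_crit)
                        grab_crit(reg); /* waits for any in-progress journal switch (and Line 219) to finish */
                if (!is_gdid_stat_identical(JNL_GDID_PTR(csa), &stat_buf) && !JNL_FILE_SWITCHED(csa->jnl))
                        FIX_CORRUPT_JNLFD(reg); /* still TRUE under crit : the fd really is corrupt */
                if (!was_crit)
                        rel_crit(reg);
        }
        ```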
      
      * But grabbing `crit` is an overhead for what seems like debug code.
      
      * I therefore looked some more at `why` the duplicate fd checking code was added in the first place. The
        comments suggest `D9I11-002714`.
      
      * The corresponding fixes happened in GT.M V5.3-004A (released in the year `2009`). The relevant release note
        is at http://tinco.pair.com/bhaskar/gtm/doc/articles/GTM_V5.3-004A_Release_Notes.html#S9I11_002714_Better_protection
        and is pasted below.
      
        ```
        S9I11-002714 Better protection against damaged journal file descriptors
        -----------------------------------------------------------------------
        GT.M now checks the validity of the journal file descriptor if it encounters an ENO9 (EBADF) or ENO29
        (ESPIPE) error while writing to the journal file. If GT.M finds the descriptor invalid, GT.M now creates a
        core file, closes that journal file and resumes operation. Furthermore, when a GT.M process opens a database
        file or journal file, it now checks whether the new file descriptor matches that of any open database
        or journal file. If so, GT.M creates a core file in the current working directory of the process. Then,
        if the invalid file descriptor corresponds to a journal file the process closes that journal file, and
        continues operation; if the invalid file descriptor corresponds to a database file, the process terminates
        abnormally in order to avoid potential database damage. This change was in response to a handful of reports
        of unexplained journal file closings and one report of writes to a journal file incorrectly redirected
        to a database file resulting in database and journal file damage all on AIX pSeries. We were not able
        to determine the root cause - whether in GT.M, in AIX, or in calls to application code written in C -
        but this change protects against that error whatever the cause. [UNIX] (S9I11-002714)
        ```
      
      * The issue seems to be an `AIX`-only issue, and `AIX` is a platform that YottaDB does not support. As far as
        I can remember, this file descriptor validity check has never caught a real issue in the code since then.
      
        In my opinion, this validity check is not as useful for `YottaDB` as it might be for the upstream `GT.M`.
      
      * Given the validity check has race conditions that need to be fixed and that it is not as useful for `YottaDB`,
        I decided to remove this file descriptor validity check code altogether in `YottaDB`.
      
        This involved removing calls to `GTM_FD_TRACE_ONLY()`/`gtm_dbjnl_dupfd_check()`/`gtm_check_fd_is_valid()`
        in the following files.
        - sr_port/gvcst_init.c
        - sr_unix/jnl_file_open.c
        - sr_unix/jnl_output_sp.c
      
        This also meant the following files were no longer needed, so they are removed.
        - sr_unix/gtm_dbjnl_dupfd_check.c
        - sr_unix/gtm_dbjnl_dupfd_check.h
      
        Note that the change to `sr_unix/jnl_file_open.c` alone was enough to address the current test failure.
      
      * Additionally, I noticed that the `GTM_FD_TRACE` macro, when defined, redirected the `open()`, `close()`, etc.
        system calls to interludes that recorded more information for debugging purposes. This recording was no longer
        necessary in Release/PRO builds, as its main purpose was the database/journal file descriptor validity
        checking which was removed as part of the previous bullet.
      
        Therefore, made a change to define the `GTM_FD_TRACE` macro only in Debug builds (i.e. if `DEBUG` macro is
        defined). This change was done in `sr_port/mdef.h`. This way we continue to have the fd trace debugging
        (which tells us where in the code an `fd` was opened, closed etc.) enabled for Debug/DBG builds where we might
        need it but do not carry that overhead in Release/PRO builds.
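
        A sketch of the kind of guard described (the exact placement and surrounding conditions in
        `sr_port/mdef.h` may differ):

        ```c
        /* Enable the fd tracing interludes (which record where an fd was
         * opened, closed etc.) only in Debug builds.
         */
        #ifdef DEBUG
        #define GTM_FD_TRACE
        #endif
        ```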
    • [DEBUG-ONLY] Log WBTEST_HOLD_GTMSOURCE_SRV_LATCH related message in source... · d209df26
      Narayanan Iyer authored
      [DEBUG-ONLY] Log WBTEST_HOLD_GTMSOURCE_SRV_LATCH related message in source server log to fix v61000_1/gtm7858 subtest timing failure
  3. 16 Jul, 2021 2 commits
    • [#705] New --plugins-only ydbinstall option installs plugins for an existing YottaDB installation · 8fe25164
      Brad Westhafer authored
      This commit adds a new `--plugins-only` option to the ydbinstall script which allows for the YDBAIM, YDBCrypt, YDBOcto, YDBPosix and YDBZlib plugins to be installed for an existing YottaDB installation. Previously, the ydbinstall script could only install plugins for a new YottaDB installation.
      
      The plugins selected via the options `--aim`, `--encplugin`, `--posix`, `--octo` and `--zlib` are installed to the YottaDB version pointed to by `--installdir` (if it exists). Otherwise, they are installed to `$ydb_dist` (if it exists) or to the version pointed to by pkg-config if neither `--installdir` nor `$ydb_dist` is set. If no plugins are selected, the `--plugins-only` option will do nothing.
    • [YottaDB/DBMS/YDBOcto#716] Update `copyright.py` to refer to the current year dynamically · 5ea482c3
      Ashok Bhaskar authored
      * Updates tools/ci/copyright.py to get the current year based on local time programmatically instead of using
        a hard-coded value that needs to be manually updated each year
      
      * Tracked by YottaDB/DBMS/YDBOcto#716
  4. 14 Jul, 2021 1 commit
    • [#757] SET x=$ZYHASH(x) no longer makes x undefined · 6041f67e
      Narayanan Iyer authored
      * The issue was in `op_fnzyhash.c` where it did not handle the case where the source and destination
        mvals (`string` and `ret` respectively) could be identical. In that case, the first step it did was
        to set `ret->mvtype` to `0`. This unintentionally also reset `string->mvtype` to `0`. And that meant
        later code that did an `MV_FORCE_STR(string)` got a `%YDB-E-LVUNDEF` error (due to invoking `underr()`
        inside the `MV_FORCE_DEFINED` macro).
      
      * This is now fixed by checking if the two mvals are identical and, if so, skipping the clearing of
        `ret->mvtype`, since `ret` does not hold an uninitialized value in that case (which it would if `ret`
        and `string` were different).
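
        A minimal sketch of that aliasing guard (variable names follow the description above; the surrounding
        function body is elided):

        ```c
        /* Clear the destination's type only when source and destination are
         * distinct mvals; if they alias, clearing ret->mvtype would also wipe
         * string->mvtype and make the source appear undefined.
         */
        if (ret != string)
                ret->mvtype = 0;
        ```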
  5. 13 Jul, 2021 1 commit
    • [#756] Remove ydb_env_set from ydbinstall option --zlib and fix ydb_env_set error message · 01df2a19
      Brad Westhafer authored
      The ydbinstall option `--zlib` sometimes produced a `ydbinstall: 29: [: UTF-8: unexpected operator` error message when running `source ydb_env_set`; the reported line 29 actually referred to line 29 of ydb_env_set, not line 29 of ydbinstall. After removing the `source ydb_env_set` call from the `--zlib` option, the zlib install still works fine, so this commit removes the call to ydb_env_set from the `--zlib` option.
      
      This commit also fixes the ydb_env_set error message itself, which was due to the use of `==` instead of `=`. While `==` works fine in bash, ydb_env_set runs under sh, which only recognizes `=` inside `[ ]`, so the `==` caused the error. The fix was to change the `==` on that line to `=`. The same error message also showed up when running `source ydb_env_set` directly from a terminal on a system where `ydbinstall --zlib` produced it.
  6. 12 Jul, 2021 1 commit
  7. 08 Jul, 2021 5 commits
    • [r1.32] Prepare for YottaDB r1.32 release · 3aec5570
      Narayanan Iyer authored
    • [r1.32] Enhance ydbinstall.sh to handle new binary tarball naming convention · a8139fd7
      Narayanan Iyer authored
      * The tarball naming convention changed in r1.32. Previously, we used to default to a platform
        of `linux` and use a tarball that had that name in it. Now there is no `linux` tarball. All
        tarballs have the Linux distribution name in them (e.g. `ubuntu`, `debian` etc.) and we only
        pick a tarball that matches the target system's Linux distribution.
      
        If we don't find a tarball with a matching distribution, we issue an error (pre-existing logic).
      
      * For comparison, below are the names of the binary tarballs for various supported platforms
        between YottaDB r1.30 and r1.32.
      
        ```
        r1.30
        ------
        yottadb_r130_linux_aarch64_pro.tgz
        yottadb_r130_linux_armv6l_pro.tgz
        yottadb_r130_linux_armv7l_pro.tgz
        yottadb_r130_centos8_x8664_pro.tgz
        yottadb_r130_debian10_x8664_pro.tgz
        yottadb_r130_linux_x8664_pro.tgz
        yottadb_r130_rhel7_x8664_pro.tgz
        yottadb_r130_ubuntu2004_x8664_pro.tgz
      
        r1.32
        -----
        yottadb_r132_aarch64_debian11_pro.tgz
        yottadb_r132_aarch64_ubuntu2004_pro.tgz
        yottadb_r132_armv6l_debian11_pro.tgz
        yottadb_r132_x8664_debian11.tgz
        yottadb_r132_x8664_rhel7_pro.tgz
        yottadb_r132_x8664_rhel8_pro.tgz
        yottadb_r132_x8664_ubuntu2004_pro.tgz
        ```
      
      * Additionally, ydbinstall.sh treats `ARMV7L` as equivalent to `ARMV6L` in terms of picking a tarball.
        And treats `CentOS` as equivalent to `RHEL`.
      
      * Also, the logic that recognizes the older binary tarball names in case of pre-r1.32 YottaDB release
        installs is preserved by moving it into a separate `else` block.
    • [r1.32] Update minimum supported OS versions on various architectures in ydbinstall.sh · 665d4ffe
      Narayanan Iyer authored
      * For `x86_64` architecture
        - `Ubuntu 20.04 LTS` replaces `Ubuntu 18.04 LTS` as the Supported OS level.
        - `RHEL 8` replaces `CentOS 8` as the Supported platform.
        - `Debian 11` replaces `Debian 10` as the Supported OS level.
      
      * For `aarch64` architecture
        - `Ubuntu 20.04 LTS` replaces `Ubuntu 18.04 LTS` as the Supported OS level.
        - `Debian 11` is a Supported OS level.
      
      * For `armv7l` architecture
        - There is no longer an `armv7l` specific distribution. Use the `armv6l` distribution for `armv7l` platforms.
      
      * For `armv6l` architecture
        - `Debian` replaces `Raspbian` as a Supported OS.
        - `Debian 11` replaces `Debian 9` as the Supported OS level.
      
      * Note that one can bypass the checks for Supported Linux distributions and minimum OS versions using `ydbinstall --force-install` if one has confidence that the platform will run the selected distribution.
      
      * Also note that if one does a build from source of YottaDB on a Supportable platform, as long
        as the build runs clean, `ydbinstall` does not require `--force-install`, which it did previously.
    • [#654] Remove checks for libtinfo.so.5 on Ubuntu as YottaDB should now be... · eebfdf23
      Brad Westhafer authored
      [#654] Remove checks for libtinfo.so.5 on Ubuntu as YottaDB should now be built with libtinfo.so.6 on Ubuntu 20.04 LTS.
      
      This commit removes the requirement for libtinfo.so.5 to be installed on Ubuntu. In Ubuntu 18.10, the libtinfo library was upgraded to version 6, which caused installs of older YottaDB versions built with the then-supported Ubuntu 18.04 LTS to fail with a %YDB-E-DLLNOOPEN error because they couldn't find version 5 of libtinfo.so. Since the supported version of Ubuntu will be 20.04 LTS going forward, there is no longer a need to look for an old version of libtinfo.
    • [#755] MUPIP REORG -TRUNCATE can cause database damage in very rare cases · b137a1a3
      Narayanan Iyer authored
      Background
      ----------
      * This is an issue that was noticed in internal testing of the `stress/concurr` subtest (see
        `stress/u_inref/concurr.csh` in `YDBTest` repo for test details).
      
      * When the test is run with a Debug build of YottaDB, it does a lot of cache recoveries (forced by the
        test using white-box variables) to really stress the cache recovery logic (`wcs_recover()`). The test
        also runs `mupip reorg -truncate` in the background both on the source side and the receiver side of
        a replicated environment.
      
      * In rare cases (once in 100 test runs), the update process on the receiver side failed occasionally
        with the following assert failure.
      
        ```c
        %YDB-F-ASSERT, Assert failed in sr_port/gvcst_search.c line 450 for expression (CDB_STAGNATE > t_tries)
        ```
      
      * The symptom was that the update process read the root block of a GVT (Global Variable Tree) and found
        the block to have a level of 0, which is an out-of-design state for a root block (it has to
        have a level of at least 1). And it found this to be the case even in the final retry, when
        it holds the critical section lock on the entire region.
      
      * The issue turned out to be that the `mupip reorg -truncate` that was running concurrently in the
        background had moved the root block, but the update process did not recognize this (and so did not
        re-read the new root block). It therefore kept trying to use the old/stale root block and found that
        the block had now become a level-0 block.
      
      * Even though we saw only an assert failure in this case, this issue has the potential to cause database
        damage if the stale root block turned out to be a level-1 or higher block. In that case, we could
        potentially continue using the wrong root block and update the wrong portions of the database file,
        resulting in integrity errors.
      
      Analysis
      --------
      * After some analysis and review of the code, I had a theory on the root cause of the issue (an
        interesting edge case). Every process reads the root block of a GVT once (by going through the directory
        tree) and from then on caches this value thereby avoiding a directory tree traversal. The root block
        of a GVT stays the same for the most part except when MUPIP REORG -TRUNCATE operates. That can move
        root blocks around as part of truncating the database file size. In this case, we have a mechanism
        for the truncate process to signal other concurrent yottadb processes to invalidate their cache of GVT
        root blocks. This is done by bumping the root cycle in the shared memory field `cnl->root_search_cycle`.
      
      * Each yottadb process maintains a copy of this shared field in a private field `csa->root_search_cycle`
        at db open time. And as part of validating each database reference/update (in t_end.c, tp_tend.c or
        tp_hist.c), we check if the shared and private cycles match (done using the `MISMATCH_ROOT_CYCLES` macro
        check). If they don't, it means a concurrent reorg process did a truncate and we have to invalidate our
        private cache. This is recorded in the yottadb process by the special restart code `cdb_sc_gvtrootmod2`
        (line 467 below). We then copy over the shared cycle into the private cycle (using the `SYNC_ROOT_CYCLE`
        macro at line 475 below).
      
        ```c
        sr_port/tp_hist.c
        -----------------
            453         if (MISMATCH_ROOT_CYCLES(csa, cnl))
                        {
              .
            467                 status = cdb_sc_gvtrootmod2;
              .
            475                         SYNC_ROOT_CYCLES(csa);
              .
            478         }
            479         if (si->start_tn <= cnl->last_wcs_recover_tn)
            480         {
              .
            488                 status = cdb_sc_wcs_recover;
            489         }
              .
            495         return status;
        ```
      
      * The restart code `cdb_sc_gvtrootmod2` is handled specially by tp_restart.c (line 674 below) by calling
        the `RESET_ALL_GVT_CLUES` macro. This invalidates the cache of root blocks so the next database reference
        will automatically re-read the directory tree to find the new GVT root block number.
      
        ```c
        sr_port/tp_restart.c
        --------------------
            671                 switch (status)
            672                 {
            673                         case cdb_sc_gvtrootmod2:        /* restarted due to MUPIP REORG moving root blocks */
            674                                 RESET_ALL_GVT_CLUES;
        ```
      
      * This is the general scheme and works well for the most part. In t_end.c and tp_tend.c, the restart code
        is properly detected and signaled/returned to the caller for later handling by t_retry/tp_restart.
      
      * But in tp_hist.c, there is an issue. And that is line 488 above. If the if check at line 479 is
        TRUE, we could potentially `overwrite` the `status` variable from its original value of `cdb_sc_gvtrootmod2`
        to `cdb_sc_wcs_recover`. And that means we would return a modified restart code to tp_restart.c
        which would then not do the RESET_ALL_GVT_CLUES macro. This would then cause the cache to not be
        invalidated. But since we already did the SYNC_ROOT_CYCLES invocation at line 475 above, the yottadb
        process would see the private cycle in sync with the shared cycle. The consequence of this is that
        future database reads would use the stale root block number and end up in errors, possibly even
        database damage.
      
      * For this overwrite of restart codes in tp_hist.c to happen, we need a `mupip reorg -truncate` (to
        trigger the `cdb_sc_gvtrootmod2` restart code) and at the same time a `wcs_recover()` call (to trigger the
        `cdb_sc_wcs_recover` restart code). The stress_1/concurr subtest triggers both these events by running
        reorg -truncate processes in the background even on the secondary side and inducing wcs_recover()
        calls by a white-box env var which simulates an error in the commit logic (and tests that it is
        handled by finishing the commit properly in spite of the error).
      
      Main Fix
      --------
      * The fix is in `tp_hist.c` to not `overwrite` the `status` variable if it already has a value that
        is not `cdb_sc_normal`. This way if it had a value of `cdb_sc_gvtrootmod2` at line 467, that is
        what will get returned by `tp_hist()` and will eventually get propagated to `tp_restart()` so it
        will take care of invoking `RESET_ALL_GVT_CLUES` and invalidating the cache of root block numbers
        that this process had already built up.
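
        A minimal sketch of that guarded assignment, using the names from the `tp_hist.c` excerpt above
        (surrounding code is elided and the exact form in the source may differ):

        ```c
        /* Only record the wcs_recover restart code if no other restart code
         * (e.g. cdb_sc_gvtrootmod2) has been noted already, so the earlier
         * code is not clobbered before tp_restart() can act on it.
         */
        if ((si->start_tn <= cnl->last_wcs_recover_tn) && (cdb_sc_normal == status))
                status = cdb_sc_wcs_recover;
        ```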
      
      Fix 2 (Debug-only issue)
      ------------------------
      * The same `stress_1/concurr` subtest that failed and exposed the above longstanding issue also
        failed with another symptom shown below.
      
        ```diff
        > hostname:stress_1_90/concurr/stress_reorg1.out
        > %YDB-F-ASSERT, Assert failed in sr_port/wcs_recover.c line 180 for expression ((CDB_STAGNATE > t_tries) || TREF(ok_to_call_wcs_recover) || process_exiting)
        ```

      * This turned out to be a longstanding Debug-only issue in the code in `t_retry.c` where we did not
        set the `ok_to_call_wcs_recover` variable to `TRUE` around the `GRAB_UNFROZEN_CRIT` call, even though
        that macro can in turn call `grab_crit()`. We already do this around similar code that can call
        `grab_crit()` a few lines above, so similar statements are now done around the
        `GRAB_UNFROZEN_CRIT` call.
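
        A minimal sketch of that pattern; the macro arguments are assumptions, while `TREF(ok_to_call_wcs_recover)`
        is the flag checked by the assert shown above:

        ```c
        /* Permit a wcs_recover() call in case GRAB_UNFROZEN_CRIT ends up in grab_crit() */
        TREF(ok_to_call_wcs_recover) = TRUE;
        GRAB_UNFROZEN_CRIT(reg, csa);
        TREF(ok_to_call_wcs_recover) = FALSE;
        ```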
  8. 03 Jul, 2021 4 commits
    • [r1.32] Regenerate GTMDefinedTypesInit*.m for sr_x86_64/sr_aarch64/sr_armv6l · cbd36699
      Narayanan Iyer authored
      * Note: `sr_armv6l` and `sr_armv7l` architectures share the same file (the one under `sr_armv7l`).
    • Fix ci/needs_copyright.sh to not require symbolic link files to need a copyright · 3eb30d08
      Narayanan Iyer authored
      * Took this opportunity to remove an unnecessary check of deleted files in `ci/commit_verify.sh`
        as that check is anyway done by the `ci/needs_copyright.sh` script, which is called right
        after the deleted-file check.
    • Create sr_armv6l as a soft link to sr_armv7l directory · 72850991
      Narayanan Iyer authored
      * This is needed by the script that we use to invoke `sr_unix/scantypedefs.m`. This in turn
        generates the file `GTMDefinedTypesInit.m`.
      
      * Previously this script used to be run on an `armv7l` system which worked because it looked for
        header files under `sr_armv7l` directory.
      
        But in r1.32, we are not doing a 32-bit `armv7l` release. We are instead releasing binary
        tarballs only on 32-bit `armv6l` (the other binary tarball that is released is on `aarch64`).
      
        So the `sr_armv7l` version of `GTMDefinedTypesInit.m` needed to be generated on a system running
        the `armv6l` architecture. But the script now looks for header files (e.g. `emit_code_sp.h`)
        under `sr_armv6l` which is missing (the file exists under the `sr_armv7l` directory).
      
      * To fix this issue and any other such confusions that might arise in the future, it is considered
        safer to create a soft link from `sr_armv6l` to `sr_armv7l` since both architectures use files
        from the exact...
    • [#497] Remove ydbinstall requirement for --force-install to install a... · 86d2dfba
      Brad Westhafer authored
      [#497] Remove ydbinstall requirement for --force-install to install a distribution built on the same system
      
      This commit eliminates the need to use `--force-install` to install a distribution built on the same system. Previously, ydbinstall would require the `--force-install` option to install YottaDB if the system was not supported, even if it was installing a build of YottaDB that had just been built on the same system.
      
      To make this change, there were 2 requirements. First, CMake now saves a file called `build_os_release`, which is a copy of /etc/os-release for the system where it built YottaDB. Second, the ydbinstall script looks at this file and, if the OS and version match, it considers the system supported, removing the need for `--force-install`.
  9. 01 Jul, 2021 5 commits
    • Remove hack in sr_unix/ydbenv.mpt now that #709 and #715 are fixed · 5ceca27d
      Narayanan Iyer authored
      * The author of this change is @ksbhaskar.
      
      * The hack was introduced in commit 282b5569 as part of
        changes to `sr_unix/ydbenv.mpt`.
    • [#709] $VIEW("GVFILE",<region>) no longer creates database file if <region>... · 901ab769
      Narayanan Iyer authored
      [#709] $VIEW("GVFILE",<region>) no longer creates database file if <region> has AutoDB set in the global directory
      
      * `$VIEW("GVFILE")` used to do a heavyweight open of the database to just find out the full path of the
        database file corresponding to a given region name in the global directory. This had another
        unintended side effect of automatically creating any AUTODB region database files that did not
        already exist.
      
      * All this is now fixed by not doing the database file open (`gv_init_reg()` call) as part of `$VIEW("GVFILE")`.
        Instead we call `dbfilopn()` with a new second parameter set to `TRUE` to indicate that we only want to
        figure out the database file name of the segment after expansion of any `$` usages (env vars) but otherwise
        not open the database file. And to avoid doing this multiple times for each `$VIEW("GVFILE")` call, we
        use a new field `seg_fname_initialized` in the `gd_region` structure to note the fact that the expansion
        has happened once.
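
        A minimal sketch of the new flow; the second-parameter semantics follow the description above, while
        the exact signature and the guard placement are assumptions:

        ```c
        /* Expand any $ (env var) usages in the segment's file name without
         * opening (and hence possibly creating) the database file itself.
         */
        if (!reg->seg_fname_initialized)
        {
                dbfilopn(reg, TRUE);    /* TRUE : name expansion only, no file open */
                reg->seg_fname_initialized = TRUE;      /* skip this on later $VIEW("GVFILE") calls */
        }
        ```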
    • [#715] MUPIP JOURNAL RECOVER/ROLLBACK BACKWARD work fine even if an AutoDB... · 1ea23782
      Narayanan Iyer authored
      [#715] MUPIP JOURNAL RECOVER/ROLLBACK BACKWARD work fine even if an AutoDB database file does not exist
      
      * See #715 for more background.
      
      * The main fix is to `sr_port/read_db_files_from_gld.c` to check whether the region is an AUTODB region
        and if so, check if the database file does exist. If not, do not include this region in the region list.
      
        In addition, we now issue a `NOREGION` error in case all regions correspond to AUTODB regions and
        none of the corresponding database files exist.
      
      * Because `read_db_files_from_gld` can now return a linked list of regions that has potentially fewer
        regions than the number of regions stored in the gld (`gd_header->n_regions`), the function interface
        has been changed to return this count too. This is used by `sr_port/mur_open_files.c` to update the
        `max_reg_total` variable instead of assuming it is always `gd_header->n_regions`. This ensures
        later logic in `mur_open_files()` processes only the correct number of regions.
      
      * Because of this interface change, another caller `updproc_init.c` had to be changed to pass `NULL`
        as the new second parameter.
    • Fix ydbinstall --aim option to set $ydb_dist · 0310af22
      Brad Westhafer authored
      The ydbinstall script's --aim option did not set $ydb_dist before installing YDBAIM, which caused the YDBAIM installer to look at the pkg-config file to find YottaDB. If newer YottaDB versions are already installed on a machine, this could result in YDBAIM being installed under the wrong version. This commit fixes this by setting $ydb_dist before calling the YDBAIM installer.
    • [#749] Fix MEMORY/GTMASSERT2/JNLCNTRL/REPLJNLCLOSED errors from huge... · 5213c9b1
      Narayanan Iyer authored
      [#749] Fix MEMORY/GTMASSERT2/JNLCNTRL/REPLJNLCLOSED errors from huge transactions in journaled database files
      
      * See #749 for more background on the issue.
      
      * The main fix is to convert all places where `jgbl.cumul_jnl_rec_len` (which maintains the current
        total journal record size requirements of all replicated journal records in the current transaction)
        is incremented to additionally check if the total becomes greater than or equal to 1GiB. If so,
        we issue a new `REPLTRANSJNL1GB` error. This prevents the various issues that were seen before.
        A new macro `INCREMENT_JGBL_CUMUL_JNL_REC_LEN` takes care of this.
      
        This limit was originally set at `2GiB` in a former version of this commit. But we noticed various
        test failures. This is because replication code (e.g. `gtmsource_get_jnlrecs()`) has various length
        related fields hardcoded as type `int` instead of `unsigned int`, various functions return a length
        of `-1` to signify an error code path, and various places check `0 > len`. None of these work right
        with a `2GiB` max length, since in some places we allocate twice this max length, which would make
        the buffer length greater than `2GiB` and cause the `int` value to be treated as a negative value
        instead of a positive value. To avoid these issues, the limit was cut to half of `2GiB`, hence the
        `1GiB` limit.
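
        A minimal sketch of the kind of guarded increment the new macro performs; the limit constant and the
        error-raising step are placeholders, while `jgbl.cumul_jnl_rec_len` and the macro name are from the
        description above:

        ```c
        #define CUMUL_JNL_REC_LEN_MAX   (((gtm_uint64_t)1) << 30)       /* 1GiB */

        #define INCREMENT_JGBL_CUMUL_JNL_REC_LEN(LEN)                   \
        {                                                               \
                jgbl.cumul_jnl_rec_len += (LEN);                        \
                if (CUMUL_JNL_REC_LEN_MAX <= jgbl.cumul_jnl_rec_len)    \
                {       /* raise the new REPLTRANSJNL1GB error here */  \
                }                                                       \
        }
        ```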
      
      * In `sr_port/tp_tend.c` and `sr_port/t_end.c`, the types of the `total_jnl_rec_size` and `tot_jrec_size`
        variables were changed from `uint4` to `gtm_uint64_t` (i.e. 4-byte to a 8-byte unsigned value) and
        various calculations are now typecast as `(uint4)` to prevent values greater than `2GiB` but less than
        `4GiB` from being treated as a negative value. Similar changes were also done in `sr_port/tp.h` to the macro
        `TOTAL_TPJNL_REC_SIZE`. This ensures the correct check happens for the `JNLTRANS2BIG` error.
      
        With the previous checks, it was possible we missed out on issuing a `JNLTRANS2BIG` error due to
        the value overflowing past `4GiB` and circling back to a small positive 4-byte value. We would
        then proceed to commit this transaction and `rsrv_freeaddr` would end up getting bumped to a value
        greater than `4GiB` which would cause it to circle back to a small positive 4-byte value and then
        result in a situation where `jb->freeaddr` is less than `jb->dskaddr` which is an out-of-design
        situation and results in the `JNLCNTRL` and/or `REPLJNLCLOSED` errors.
      
      * `sr_port/tp_tend.c` : `si->tot_jrec_size` field was replaced with a local variable `tot_jrec_size`
        and removed from the `sgm_info` structure in `sr_port/tp.h` as this was the only use of it.
      
      * Changed the type of the global variable `gtmsource_cmpmsgbufsiz` from `int` to `unsigned int` to allow
        greater than `2GiB` (but less than `4GiB`) sizes since the compressed buffer can be at most twice
        the regular buffer size (which can be greater than `1GiB` but less than `2GiB`).
      
        Because of this, an assert in `sr_unix/gtm_zlib.h` was changed to just check for non-zero values (since
        the check for being positive was okay previously for an `int` type but is no longer appropriate for an
        `unsigned int` type).
  10. 29 Jun, 2021 2 commits
    • [#727] TZTWORM/UZTWORM journal record reflects post-trigger (not pre-trigger) value of $ZTWORMHOLE · f105073f
      Narayanan Iyer authored
      Background
      ----------
      * The purpose of the $ZTWORMHOLE intrinsic special variable is to provide context from a primary
        instance originating updates to a secondary instance receiving and applying updates. An example
        of context is values of select local variable nodes, since a trigger starts with no local variables.
        If application code sets a value in $ZTWORMHOLE before invoking an update and a trigger invoked by
        the update references $ZTWORMHOLE, the $ZTWORMHOLE value is propagated in the replication stream
        so that a trigger on the receiving instance can use that context to execute its logic.
      
      * However, trigger logic might not just use context generated by pre-trigger application code. It might
        also generate context, e.g., a timestamp. In the current implementation, if a trigger sets $ZTWORMHOLE,
        that value is not propagated in the replication stream; only values set in application code, before
        triggers are invoked, are propagated, and that too only if the trigger logic references them.
      
      * This commit aims to overcome this limitation by propagating the value of $ZTWORMHOLE AFTER the trigger
        has been invoked (instead of the value BEFORE the trigger has been invoked).
      
      * This makes it easy for applications to ensure that the trigger logic on the receiver side has access to
        the same context that the trigger logic on the source side of a replicated environment generated, thereby
        ensuring that the database updates done inside the trigger on the source and receiver sides are identical.
      
      Fixes
      -----
      * The previous flow of a logical update was to invoke `jnl_format(JNL_ZTWORM,...)` as part of the
        `JNL_FORMAT_ZTWORM_IF_NEEDED` macro to insert a provisional ZTWORM journal record in the list of
        formatted journal records. And then insert the logical record (SET/KILL/ZTRIG) in the format list.
        Then invoke the trigger code. And check if $ZTWORMHOLE was referenced in that code AND if it has
        a non-empty value after the trigger returns. If not, we remove the ZTWORM journal record from the
        jnl format list.
      
      * With the new requirement, we have the added need to `update` the already formatted ZTWORM journal record
        with a new value in case it was updated `inside` the trigger code. This meant redoing most of what
        was done by the first call to `jnl_format(JNL_ZTWORM,...)`. But unlike the first call which generates
        a `ztworm_jfb`, the second call needs to pass in a pre-existing `ztworm_jfb`. So this meant a slight
        interface change. But I did not want existing callers of `jnl_format()` to be affected. So what I did
        was to add a new `JNL_ZTWORM_POST_TRIG` (after `JNL_ZTWORM`) to the `jnl_action_code` enum. And pass
        this new opcode for the second call to `jnl_format(...)`. In this second call, we pass in `ztworm_jfb`
        in the second parameter (`key`), which is otherwise `NULL` for the `jnl_format(JNL_ZTWORM,...)` case.
        And `jnl_format()` was then reworked to handle the new `JNL_ZTWORM_POST_TRIG` opcode (a lot of the
        jfb allocation logic is skipped in this case etc.).
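
        A minimal sketch of the resulting two-call flow; the `ztworm_val` argument and trailing parameters are
        placeholders, while the opcodes and the `key`-parameter reuse follow the description above:

        ```c
        /* Pre-trigger : format a provisional ZTWORM record and remember its jfb */
        ztworm_jfb = jnl_format(JNL_ZTWORM, NULL, ztworm_val, 0);
        /* ... trigger code runs here and may update $ZTWORMHOLE ... */
        /* Post-trigger : update the already formatted record in place */
        jnl_format(JNL_ZTWORM_POST_TRIG, (gv_key *)ztworm_jfb, ztworm_val, 0);
        ```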
      
      * Various pre-existing clang-tidy warnings in the files affected by the above bullets were fixed.
        A new macro `UNUSED` in `sr_port/gtm_common_defs.h` helps avoid a lot of `Value stored to ... is never read`
        type of warnings.
      
      * `ztwormhole_used` variable was more appropriately renamed to `write_ztworm_jnl_rec` since `ZTWORM`
        journal record is now written even if `$ZTWORMHOLE` is updated inside the trigger logic (not just
        referenced like before).
      
      * `jgbl.save_ztworm_ptr` was nixed and instead a new field `jgbl.pre_trig_ztworm_ptr` was introduced to
        note down pointer to the pre-trigger value of $ZTWORMHOLE stored somewhere in the jnl format list.
      
      * The `JS_NULL_ZTWORM_MASK` bit in the `nodeflags` portion of the journal record is now nixed as it was
        not maintained accurately anyway and resulted in the bit ending up with different values in the journal
        file on the source vs receiver side (since the update process was relying on this incorrect bit to set
        the `nodeflags` on the receiver side).
      
      * `SET_JGBL_PRE_TRIG_ZTWORM_PTR` macro was renamed to a more appropriate `UPDATE_JGBL_FIELDS_IF_ZTWORM`.
      
      * Since there is encryption related code in the `jnl_format(JNL_ZTWORM_POST_TRIG,...)` reinvocation,
        the `triggers` test was run with `-replic` and `-encrypt` to verify it still runs fine after the above
        code changes.
      
      * The `JNL_FORMAT_ZTWORM_IF_NEEDED` and `REMOVE_ZTWORM_JFB_IF_NEEDED` macros were moved into new files/functions
        `sr_port/jnl_format_ztworm_plus_logical.c` and `sr_port/jnl_format_ztworm_remove_if_needed.c` as they
        became a lot more complex after the above changes (a function is easier to debug, etc.).
    • Fix icu errors when installing YDBPosix, YDBZlib and YDBCrypt plugins in UTF-8 mode · 3a7dfe24
      Brad Westhafer authored
      Internally, we discovered a couple of issues with the ydbinstall script when installing the YDBPosix, YDBZlib and YDBCrypt plugins together. Firstly, the YDBPosix install would unset $ydb_icu_version, causing the installs to fail for YDBZlib and YDBCrypt. Secondly, the YDBZlib plugin would produce %YDB-E-ICUSYMNOTFOUND errors when compiling the M mode version of the plugin if the ICU version was specified as `default`. Both of these are fixed by removing unnecessary calls to ydb_env_set and ydb_env_unset from the YDBPosix install and by moving the command to compile YDBZlib in M mode to after the UTF-8 commands.
      
      The command that failed was
      
      `sudo ./ydbinstall --posix --zlib --encplugin --utf8 default --force-install`
      
      This commit also makes a change to sr_unix/gtm_icu.c based on a review comment to !998. This change was intended to be included in !998 but was accidentally left out and this was not discovered until after !998 had already been merged.
  11. 28 Jun, 2021 1 commit
    • [#704] Fix valgrind uninitialized warnings from weekend tests · 33a66747
      Brad Westhafer authored
      The r132/ydb704 test failed with 2 different types of uninitialized variable warnings on pro builds this past weekend.
      
      1. On AARCH64 machines, the test failed because valgrind thought `shm_buf.shm_nattch` was uninitialized in `sr_unix/gds_rundown.c`. The only place shm_buf was used before the supposedly uninitialized read was when it was passed to shmctl in the if condition preceding the else branch in question. We believe this to be a false alert but have initialized the field to 0 to get rid of the valgrind warning. The alert also caused valgrind to think that `we_are_last_user` was uninitialized and throw warnings. All valgrind warnings for this file go away once shm_buf.shm_nattch is initialized.
      2. On some x86_64 machines, valgrind thought the boolean `symbols_renamed` in `sr_unix/gtm_icu.c` was uninitialized on pro builds. For dbg builds, it is initialized to -1 before the for loop; on pro builds, it was not initialized before the loop. We believe this is a false alert: the variable is not accessed on the first iteration of the for loop (where `findx` is 0), and on later iterations (when `findx` is > 0) it has already been assigned. This is addressed by initializing it to FALSE on pro builds, after which valgrind stopped throwing warnings for this file.
  12. 24 Jun, 2021 2 commits
    • [#704] Fix YDBDISTUNVERIF error when YottaDB is invoked by valgrind · 76851b73
      Brad Westhafer authored
      This commit fixes a %YDB-E-YDBDISTUNVERIF error when invoking YottaDB via valgrind. The problem is that valgrind changes the process_name to `MEMCHECK-AMD64-` or a platform-specific equivalent rather than `YOTTADB`. This causes the $ydb_dist verification to fail. The fix is to disable this check when YottaDB is invoked by valgrind (i.e. the global `process_name` starts with `MEMCHECK`).
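
      A sketch of the check described; the helper name is hypothetical, while `process_name` and the `MEMCHECK` prefix are from the description above:

      ```c
      /* Skip the $ydb_dist verification when running under valgrind, whose
       * tools rename the process (e.g. MEMCHECK-AMD64- instead of YOTTADB).
       */
      if (0 != memcmp(process_name, "MEMCHECK", sizeof("MEMCHECK") - 1))
              verify_ydb_dist();      /* hypothetical name for the existing check */
      ```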
      
      This commit also fixes several valgrind uninitialized variable warnings in `valgrind MUPIP INTEG` that were caught by the test.
    • Update Debian Dockerfile to use Bullseye instead of Buster · 29976b98
      Brad Westhafer authored
      This commit updates the Debian Dockerfile to use the newer Bullseye release instead of Buster. This is being done because the imptp-rs pipeline in YDBTest is failing due to an outdated version of cargo in Buster installing an outdated version of Rust which doesn't support mem::take.
      
      The go-pinning code is removed because it is no longer necessary since Bullseye's version of go is 1.15 which is newer than the minimum required version (1.13).
      
      The bitnami/minideb image was also removed. While it reduced the image size from 1.41GB to 1.36GB, it had to be removed due to pipeline failures.
  13. 21 Jun, 2021 1 commit
    • [#744] Limit %CDS^%H to dates through 12/31/9999 · 164699b0
      Brad Westhafer authored
      This commit limits %CDS^%H to dates through 12/31/9999 (%DT=2980013). This limitation is needed because the year field only supports 4 digits, so later dates would overflow to 1/1/0000. For dates after 12/31/9999 (%DT of 2980014 or higher), %CDS^%H returns an empty string.
  14. 17 Jun, 2021 1 commit
    • [#744] Make %CDS^%H work for %DT>94657 · 083afb59
      Brad Westhafer authored
      This commit fixes a %CDS^%H bug that returned an empty string for %DAT when %DT was greater than 94657 (equivalent to the date February 28th, 2100). The empty string was returned because %CDS explicitly checked that %DT was less than 94658. The fix involved removing this check, which allows 94658 to correctly return a date of March 1st, 2100 instead of an empty string. Values should work up to the maximum for $ZDATE, which is called by %CDS^%H.
  15. 04 Jun, 2021 3 commits
    • Enhance JNLBUFINFO syslog message to record journal buffer size to help with... · 051d3f8d
      Narayanan Iyer authored
      Enhance JNLBUFINFO syslog message to record journal buffer size to help with debugging JNLCNTRL errors
      
      * In the journal buffer, `jb->dsk` is always supposed to be in sync with `jb->dskaddr`. That is, `jb->dskaddr`
        modulo `jb->size` (the journal buffer size) should always be equal to `jb->dsk` except for a short window
        of instructions when one of the two fields is being updated for an in progress journal file disk write.
      
      * A `JNLCNTRL` error is currently issued in the syslog if ever `jb->dsk` gets out of sync with `jb->dskaddr`.
      
      * In that case, the accompanying `JNLBUFINFO` message records a lot of the journal buffer fields. This will help
        analyze the cause. An example sequence of such syslog messages is pasted below.
      
        - %YDB-I-JNLSENDOPER, pid = 0x00004B10 : status = 0x08F691A2 : jpc_status = 0x00000000 : jpc_status2 = 0x00000000 : iosb.cond = 0x0000, %YDB-E-JNLCNTRL, Journal control unsynchronized for dbregion.mjl.
        - %YDB-I-JNLBUFINFO, Pid 0x00004B10 dsk 0x0046A8D8 free 0x00EC9430 bytcnt 0x84173698 io_in_prog 0x00000000 fsync_in_prog 0x00000000 dskaddr 0xFF36B8D8 freeaddr 0x0BEBE430 qiocnt 0x0278B8B0 now_writer 0x00000000 fsync_pid 0x00000000 filesize 0x007FFFFF cycle 0x0000015E errcnt 0x00000000 wrtsize 0x0046A8D8 fsync_dskaddr 0xFF36B8D8 rsrv_free 0x00EC9430 rsrv_freeaddr 0x0BEBE430 phase2_commit_index1 0x0000039B phase2_commit_index2 0x0000039B next_align_addr 0xFFFFFFF0
      
      * But `jb->size` is not part of the above `JNLBUFINFO` output.
      
      * This commit fixes that shortcoming by recording that too in the `JNLBUFINFO` message.
    • [#743] Fix cause of <%SYSTEM-E-ENO22, Invalid argument> error from shmdt() in very rare cases · af57a3c2
      Narayanan Iyer authored
      Background
      ----------
      * In-house testing on an AARCH64 system showed a test failure in the `stress_1/concurr` subtest with
        the following failure symptom from a `mupip integ -online`.
      
        ```
        %YDB-E-SYSCALL, Error received from system call shmdt() -- called from module sr_unix/ss_context_mgr.c at line 198
        %YDB-I-TEXT, Shared segment address: 0x455453595325202c
        %SYSTEM-E-ENO22, Invalid argument
        ```
      
      * After some analysis, found one possible explanation of the cause. The function `jnl_file_close_timer()`
        did not have a `shmdt()` invocation until `GT.M V6.3-007` changes were merged into YottaDB (in r1.26
        release). The function/macro call sequence that makes this possible is
        `jnl_file_close_timer()` -> `SS_RELEASE_IF_NEEDED` -> `ss_destroy_context()` -> `shmdt()`.
      
      * So if we are in exit handling code invoking `ss_release()` -> `ss_destroy_context()`, it is possible a
        timer interrupt happens just before we are about to do the `shmdt()` and the `jnl_file_close_timer()`
        code gets invoked as part of the timer handler and does the `shmdt()`. In this case, once control
        returns back from the timer interrupt, the outer `shmdt()` call will proceed with no clue about the
        nested `shmdt()` call and in turn get the `%SYSTEM-E-ENO22` (aka `EINVAL`) error.
      
      Fix
      ---
      * The issue is that `lcl_ss_ctx->attach_shmid` is a global variable that is used by the `ss_destroy_context()`
        function to determine if a `shmdt()` is needed. But it is possible the function is called recursively
        due to a timer handler and, if the timing is right just before the `SHMDT()` call, the nested invocation
        would look at the same global variable and see that it holds a valid shmid and invoke `SHMDT()` and then
        return to non-interrupt code which will continue with the `SHMDT()` call oblivious of the fact that one
        already happened.
      
      * To fix this, we note down the `lcl_ss_ctx->attach_shmid` global variable in a local variable and clear the
        global variable. We then use the local variable to check if a `SHMDT()` is needed, and do the noting/clearing
        steps inside a `DEFER_INTERRUPTS`/`ENABLE_INTERRUPTS` window to prevent timer interrupts.
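
        A minimal sketch of that fix; the interrupt-type argument, the saved attach address and `INVALID_SHMID`
        are assumptions, while the overall shape follows the description above:

        ```c
        /* Move the shmid into a local and clear the global inside a
         * no-interrupt window so a nested, timer-driven shmdt() of the
         * same segment cannot race with this one.
         */
        DEFER_INTERRUPTS(INTRPT_IN_SS_RELEASE, prev_intrpt_state);
        attach_shmid = lcl_ss_ctx->attach_shmid;
        lcl_ss_ctx->attach_shmid = INVALID_SHMID;
        ENABLE_INTERRUPTS(INTRPT_IN_SS_RELEASE, prev_intrpt_state);
        if (INVALID_SHMID != attach_shmid)
                SHMDT(ss_shmaddr);      /* ss_shmaddr : hypothetical saved attach address */
        ```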
      
      * A similar fix is also done for `lcl_ss_ctx->shdw_fd`. In this case though, there is no user-visible issue
        since the `CLOSEFILE_RESET()` macro does not issue an error in case of an error return from the `close()`
        call (due to duplicate `close()`). But it is better to have clean code and so is fixed.
  16. 03 Jun, 2021 1 commit
    • [#742] Non-zero ZHALT argument gets bubbled up as ydb_ci() return value; Allow... · cb99790b
      Narayanan Iyer authored
      [#742] Non-zero ZHALT argument gets bubbled up as ydb_ci() return value; Allow for 4-byte (and not 1-byte) ZHALT argument in this case
      
      * The commit message of 0b4e31e9 mentions the following changes.
      
        - ZHALT no longer halts but returns its argument to the call-in caller as the return value.
      
        - sr_port/goframes.c
          - Change the logic where a frame expecting a return code is unwound and a value placed
            such that if in a call-in and ZHALT has specified a return value, use that value
            instead of the default NULL/0 value. Allows ZHALT to return its value to call-in user.
      
        - sr_port/op_halt.c
          - The return code back to dm_start should be SUCCESS (1) instead of zero to prevent
            ydb_ci[p]() from treating this as an error return which bypasses setting all output
            values (including return values) that the call-in was expected to fill. HALT
            depends on unwind processing in goframes() to supply a return value if one was
            expected from this call-in.
      
        - sr_port/op_zhalt.c
          - Set the return code to dm_start to be SUCCESS instead of the argument value. The
            argument value is saved in the gtmci_retval thread global for goframes() to find
            and use.
          - Return SUCCESS to dm_start().
      
      * There are a lot of changes to the above design. They are described below.
      
        - At a high level, `ZHALT` continues to no longer halt the process if invoked inside a `ydb_ci()` call.
          But its argument is now treated differently. The reason is that, in my opinion, a non-zero argument to
          `ZHALT` should be considered as an abnormal termination of the `ydb_ci()`/`gtm_ci()` call and so is now
          forwarded as the return value of `ydb_ci()` (instead of the prior behavior of modifying the return value
          of the function that is being invoked by `ydb_ci()`).
      
        - `sr_port/gtm_threadgbl_defs.h` : `TREF(gtmci_retval)` used to be an `mval *`. It has been renamed to
          `TREF(zhalt_retval)` (since the only purpose of this global is to store a ZHALT argument and not much
          to do with gtmci processing). And is of type `int` and is used to store the `zhalt` argument. This
          simplifies a lot of the processing in various files that previously used `TREF(gtmci_retval)`. There is
          no longer a need to store this in the stringpool and/or mark the stringpool as usable/unusable (using the
          `DBG_MARK_STRINGPOOL_UNUSABLE`/`DBG_MARK_STRINGPOOL_USABLE` macros) for short periods of time (from
          `op_zhalt()` to `goframes()`).
      
          Because of this change (what used to be an 8-byte pointer type now became a 4-byte integer type) there
          were a few changes in the clang-tidy warning report. This meant changes to the following files.
          - ci/tidy_warnings_debug.ref
          - ci/tidy_warnings_release.ref
      
        - `sr_port/goframes.c` : This previously copied the zhalt argument as the return value of the function
          invoked in the `ydb_ci()` call. But now that we return this as the `ydb_ci()` return value itself, there
          is no need for any fancy `TREF(zhalt_retval)` processing. All we do now is set the return value of the
          function to `literal_null` in case a return value is expected.
      
          Additionally, because of this design change, we no longer issue a `NOTEXTRINSIC` error in case a `zhalt`
          is done with an argument while inside a call-in function invocation. This is because the `zhalt` argument
          has to do with the `ydb_ci()` return value and nothing to do with the call-in function return value.
      
        - `sr_port/mdb_condition_handler.c` : Code that cleared `TREF(gtmci_retval)` to NULL was deemed no longer
          necessary. It is not clear why it was considered necessary in the first place.
      
        - `sr_port/op_zhalt.c` : We now note down the integer argument of `ZHALT` directly in the global variable
          `TREF(zhalt_retval)`. There is no need for any integer-to-string conversion and/or stringpool processing.

          A `ZHALT 0` is considered a success. A `ZHALT x` where `x` is a non-zero value is considered a failure,
          and `ydb_ci()` accordingly returns this non-zero value.
      
          Additionally, `ZHALT` now accepts a 4-byte integer code (not just a 1-byte integer code like before).
          This is useful in case of a return to a `ydb_ci()` call, which is fine with a 4-byte integer return value.
          If there is no `ydb_ci()` involved, any `ZHALT` return code greater than 255 is truncated to a 1-byte
          return code like before.
      
        - `sr_unix/gtmci.c` : The new global variable `TREF(zhalt_retval)` is reset to 0 just before the `dm_start()`
          call in `ydb_ci_exec()` and `ydb_cij()`. Immediately after `dm_start()` returns, we check whether this
          global variable is set to a non-zero value. If so, that is a `ZHALT` with a non-zero argument, and we
          forward this argument as the return value of `ydb_ci_exec()`/`ydb_cij()` (see the sketch below).

          Note that in this error return case, we do not initialize any return parameter and/or output parameters
          of the actual `ydb_ci()` call. This means none of those parameter values are reliable in case `ydb_ci()`
          returns with a non-zero error code.
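
      * For illustration, below is a minimal sketch of the new caller-visible behavior. Only `ydb_ci()` and the
        `ZHALT` semantics come from this commit message; the call-in table entry name `myrtn` (and its M routine)
        is a hypothetical assumption.

        ```c
        #include <stdio.h>
        #include "libyottadb.h"

        int main(void)
        {
        	int	status;

        	/* Assume the M routine behind the (hypothetical) call-in entry "myrtn" executes: ZHALT 42 */
        	status = ydb_ci("myrtn");
        	if (0 != status)
        		printf("ZHALT argument bubbled up as the ydb_ci() return value: %d\n", status);	/* 42 */
        	return 0;
        }
        ```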
      cb99790b
  17. 02 Jun, 2021 2 commits
    • Narayanan Iyer's avatar
      [#695] Avoid REQRLNKCTLRNDWN error in case relinkctl file was created but... · dd9a037b
      Narayanan Iyer authored
      [#695] Avoid REQRLNKCTLRNDWN error in case relinkctl file was created but initialization was not complete
      
      Background
      ----------
      * While analyzing an in-house test failure, I noticed a `REQRLNKCTLRNDWN` error on an ARMV7L system.
        This was even after the fixes in !928 (1eb0c717e85c0c67f751ff8eeeef2a6cdab0deca) were merged.
      
        ```sh
        $ cat v63007_1/gtm8665/safegde_Vxxx_R131_17788_20210530054338_0.outx
        %YDB-E-REQRLNKCTLRNDWN, Error accessing relinkctl file /tmp/relinkdir/user/ydb-relinkctl-7fea786e90d1dd8a6e28e0bd466452c1 for $ZROUTINES directory v63007_1/gtm8665. Must be rundown
        ```
      
      * After some analysis, I found a simple test case that uses the debugger to reproduce the issue. It is pasted below.
      
        ```sh
        $ cat x.m
                set $etrap="zwrite $zstatus set $ecode="""""
                set $zroutines=".*"
      
        $ gdb $ydb_dist/yottadb
        Reading symbols from yottadb...
        (gdb) b ftruncate
        Function "ftruncate" not defined.
        Breakpoint 1 (ftruncate) pending.
        (gdb) r -run x
        Starting program: yottadb -run x
        [Thread debugging using libthread_db enabled]
        Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
      
        Breakpoint 1, __ftruncate64 (fd=3, length=336) at ../sysdeps/unix/sysv/linux/ftruncate64.c:28
        28      ../sysdeps/unix/sysv/linux/ftruncate64.c: No such file or directory.
        (gdb) where
        #0  __ftruncate64 (fd=3, length=336) at ../sysdeps/unix/sysv/linux/ftruncate64.c:28
        #1  0x00007ffff71d43db in relinkctl_open (linkctl=0x7ffffffc4ab0, object_dir_missing=0) at sr_unix/relinkctl.c:340
        #2  0x00007ffff71d3684 in relinkctl_attach (obj_container_name=0x7ffffffc7e20, objpath=0x0, objpath_alloc_len=0) at sr_unix/relinkctl.c:215
        #3  0x00007ffff78559e6 in zro_load (str=0x555555585950) at sr_unix/zro_load.c:193
        #4  0x00007ffff74dc8e5 in op_svput (varnum=23, v=0x555555585940) at sr_port/op_svput.c:177
        (gdb) quit
        A debugging session is active.
      
                Inferior 1 [process 35441] will be killed.
      
        Quit anyway? (y or n) y
      
        $ $ydb_dist/yottadb -run x
        $ZSTATUS="150383578,+2^x,%YDB-E-REQRLNKCTLRNDWN, Error accessing relinkctl file /tmp/ydb-relinkctl-35337a26ca9941eed6b7782ee5f9199d for $ZROUTINES directory /tmp. Must be rundown"
        ```
      
      * This is a case where the process that created the relinkctl file got killed before it could finish
        the various steps involved in the creation. This causes later processes that access the incomplete
        relinkctl file to issue the `REQRLNKCTLRNDWN` error.
      
      * Note that the failing test does a `kill -15` (i.e. `mupip stop`), not a `kill -9` like the debugger does
        in the above test case. But it has the same effect in that a `kill -15` can create orphaned relinkctl
        files. Since a `mupip stop` can create such files, it is all the more important to fix this issue (if only
        a `kill -9` could create it, the fix would have been less important since `kill -9` is not a recommended
        operation).
      
      Fix
      ---
      * The test case described above is a scenario where the relinkctl file exists (i.e. `rctl_exists` is `TRUE`
        in the file `sr_unix/relinkctl.c`) but a field in the file header indicates that initialization is not yet
        complete (i.e. `hdr->initialized` is `FALSE`).
      
      * In this scenario, `relinkctl_open()` currently sleeps for `MAX_RCTL_INIT_WAIT_RETRIES` (i.e. 1000) iterations,
        each 1 millisecond long, for a total of around 1 second, before giving up and proceeding with the open.
        But it then soon falls through into code that issues the `REQRLNKCTLRNDWN` error.
      
      * The fix is to set `rctl_existed` to `FALSE` before we fall through (as sketched below), and also to set a
        few other related variables to appropriate values so that we create the shared memory and semaphores with
        the right user/permissions. Most importantly, we no longer issue a `REQRLNKCTLRNDWN` error in this case.
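
      * Below is a minimal sketch of the revised wait-and-fall-through logic. `hdr->initialized`, `rctl_existed`
        and `MAX_RCTL_INIT_WAIT_RETRIES` come from this commit message; the sleep call and the surrounding details
        are simplified for illustration.

        ```c
        int	i;

        for (i = 0; i < MAX_RCTL_INIT_WAIT_RETRIES; i++)
        {
        	if (hdr->initialized)
        		break;
        	usleep(1000);	/* 1 millisecond per iteration, around 1 second in total */
        }
        if (!hdr->initialized)
        {	/* The creator process was killed mid-initialization. Treat the file as if it did not
        	 * previously exist so we redo the initialization (shared memory, semaphores, permissions)
        	 * ourselves instead of issuing a REQRLNKCTLRNDWN error.
        	 */
        	rctl_existed = FALSE;
        }
        ```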
      dd9a037b
    • Narayanan Iyer's avatar
      Fix [-Wmisleading-indentation] and [-Warray-parameter=] compiler warnings from gcc 11.1.0 · 31a210bd
      Narayanan Iyer authored
      * After patching/upgrading an in-house Arch Linux system, we noticed the gcc version got bumped from
        `10.2.0` to `11.1.0`. Along with that came a few new warnings on pre-existing code.
      
      * Below is a warning in `sr_unix/iopi_open.c`. There is no real issue, but the indentation in the macro is
        now fixed to avoid this warning (a sketch of the cleaned-up macro follows the warning output).
      
        ```c
        sr_unix/iopi_open.c: In function 'parse_pipe':
        sr_unix/iopi_open.c:79:17: warning: this 'if' clause does not guard... [-Wmisleading-indentation]
           79 |                 if (NULL != buf)  free(buf); if (NULL != dir_in_path) free(dir_in_path);if (NULL != command2) free(command2); }
              |                 ^~
        sr_unix/iopi_open.c:258:25: note: in expansion of macro 'FREE_ALL'
          258 |                         FREE_ALL;
              |                         ^~~~~~~~
        sr_unix/iopi_open.c:79:46: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the 'if'
           79 |                 if (NULL != buf)  free(buf); if (NULL != dir_in_path) free(dir_in_path);if (NULL != command2) free(command2); }
              |                                              ^~
        sr_unix/iopi_open.c:258:25: note: in expansion of macro 'FREE_ALL'
          258 |                         FREE_ALL;
              |                         ^~~~~~~~
        sr_unix/iopi_open.c:79:17: warning: this 'if' clause does not guard... [-Wmisleading-indentation]
           79 |                 if (NULL != buf)  free(buf); if (NULL != dir_in_path) free(dir_in_path);if (NULL != command2) free(command2); }
              |                 ^~
        sr_unix/iopi_open.c:265:9: note: in expansion of macro 'FREE_ALL'
          265 |         FREE_ALL;
              |         ^~~~~~~~
        sr_unix/iopi_open.c:79:46: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the 'if'
           79 |                 if (NULL != buf)  free(buf); if (NULL != dir_in_path) free(dir_in_path);if (NULL != command2) free(command2); }
              |                                              ^~
        sr_unix/iopi_open.c:265:9: note: in expansion of macro 'FREE_ALL'
          265 |         FREE_ALL;
              |         ^~~~~~~~
        ```
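
      * The cleanup presumably amounts to splitting the one-line macro body so that each `if` guards exactly one
        statement on its own line, along the lines of the following sketch (the macro body is reconstructed from
        the warning text above).

        ```c
        #define FREE_ALL			\
        {					\
        	if (NULL != buf)		\
        		free(buf);		\
        	if (NULL != dir_in_path)	\
        		free(dir_in_path);	\
        	if (NULL != command2)		\
        		free(command2);		\
        }
        ```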
      
      * Below is a warning in `sr_unix/mem_access.c`. Here there was indeed a bound mismatch between the declaration
        and the definition. This is now fixed in `sr_port/mem_access.h` by declaring the array size as `[2]` instead
        of `[]` (the revised declarations are shown after the warning output).
      
        ```c
        sr_unix/mem_access.c:16:34: warning: argument 1 of type 'unsigned char *[2]' with mismatched bound [-Warray-parameter=]
           16 | void set_noaccess(unsigned char *na_page[2], unsigned char *prvprt)
              |                   ~~~~~~~~~~~~~~~^~~~~~~~~~
        In file included from sr_unix/mem_access.c:13:
        sr_port/mem_access.h:16:34: note: previously declared as 'unsigned char *[]'
           16 | void set_noaccess(unsigned char *na_page[], unsigned char *prvprt);
              |                   ~~~~~~~~~~~~~~~^~~~~~~~~
        sr_unix/mem_access.c:28:34: warning: argument 1 of type 'unsigned char *[2]' with mismatched bound [-Warray-parameter=]
           28 | void reset_access(unsigned char *na_page[2], unsigned char oldprt)
              |                   ~~~~~~~~~~~~~~~^~~~~~~~~~
        In file included from sr_unix/mem_access.c:13:
        sr_port/mem_access.h:17:34: note: previously declared as 'unsigned char *[]'
           17 | void reset_access(unsigned char *na_page[], unsigned char oldprt);
              |                   ~~~~~~~~~~~~~~~^~~~~~~~~
        ```
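
      * For reference, the revised declarations in `sr_port/mem_access.h` (derived from the warning text above)
        now carry the same bound as the definitions in `sr_unix/mem_access.c`.

        ```c
        void set_noaccess(unsigned char *na_page[2], unsigned char *prvprt);
        void reset_access(unsigned char *na_page[2], unsigned char oldprt);
        ```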
      31a210bd
  18. 28 May, 2021 2 commits
    • Narayanan Iyer's avatar
      [#741] SIG-11 from DSE REMOVE -RECORD when block has DBCOMPTOOLRG integrity errors · 9b6a1ce5
      Narayanan Iyer authored
      Background
      ----------
      * Below is a simple example that demonstrates the issue.
      
      * We first create a leaf level block (`Block 3`) containing the `DBCOMPTOOLRG` integrity error.
      
        ```sh
        $ export ydb_gbldir=yottadb.gld
        $ rm -f yottadb.gld yottadb.dat
        $ $ydb_dist/yottadb -run GDE 'change -segment DEFAULT -file=yottadb.dat'
        $ $ydb_dist/mupip create
        $ $ydb_dist/yottadb -run %XCMD 'set ^x=1'
        $ $ydb_dist/dse add -block=3 -rec=2 -key=^x -data="\FF\FF\FF\00"
        $ $ydb_dist/dse dump -block=3
      
        Block 3   Size 20   Level 0   TN 2 V6
      
        Rec:1  Blk 3  Off 10  Size 8  Cmpc 0  Key ^x
              10 : |  8  0  0  0 78  0  0 31                                    |
                   |  .  .  .  .  x  .  .  1                                    |
      
        Rec:2  Blk 3  Off 18  Size 8  Cmpc 3  Key ^x
              18 : |  8  0  3  0 FF FF FF  0                                    |
                   |  .  .  .  .  .  .  .  .                                    |
      
        $ $ydb_dist/mupip integ -reg "*"
      
        Block:Offset Level
        %YDB-E-DBCOMPTOOLRG,         Nature: #Data
               3:18     0  Record has too large compression count
                           Directory Path:  1:10, 2:10
                           Path:  4:10, 3:18
        Keys from ^x to the end are suspect.
        ```
      
      * We basically added a bad record (`Rec: 2` in `Blk 3`) using `dse`. Now if we try to remove the
        pre-existing good record (`Rec: 1` in `Blk 3`), we get a SIG-11.
      
        ```sh
        $ $ydb_dist/dse remove -block=3 -rec=1
        %YDB-F-KILLBYSIGSINFO1, DSE process 64418 has been killed by a signal 11 at address 0x00007F667334495E (vaddr 0x00005590C2F80000)
        %YDB-F-SIGMAPERR, Signal was caused by an address not mapped to an object
        ```
      
      * This is a day one issue in `dse remove -record`.
      
      * At this point, it is not clear whether the issue exists in index blocks (Level 1 and higher) too. But
        the expectation is that it does.
      
      Fix
      ---
      * The issue turned out to be in `sr_port/dse_rmrec.c`, where we have a loop that scans a key until we find
        the second zero byte. The code tries to limit the scan to the end of the current record, but there is a
        bug in that we could go 1 byte past the end of the current record if we do not find two zero bytes within
        the record limits (possible in case of a block with an integ error like the `DBCOMPTOOLRG` error seen in
        the above example).
      
        And because of this `key_top` (pointer to the end of the key in the current record) ended up being 1 byte
        after `r_top` (pointer to the end of the current record).
      
        This disturbed the size calculations such that a call to `memmove()` (a few lines later) ended up being
        invoked with a size of `-1`, which caused the SIG-11: the negative value got interpreted as a huge
        positive value (the size parameter is unsigned), and there was not enough accessible memory for that
        length to be moved.

        The issue happened if, after removing the record, the only record left is a bad record whose end has the
        1st zero byte but not the 2nd zero byte.
      
      * The fix for this is to check during the scan whether we ever go beyond the record limit (`r_top`) any time
        we do a `key_top++` (which happens twice in each iteration of the loop). One check was already there in
        the for loop termination condition. The other was missing and is now added between the two successive
        `key_top++` usages in the if condition (see the sketch after this list).
      
      * While fixing this, I noticed that similar logic exists in various other `dse_*.c` modules. I searched for
        the pattern `for.*key_top++.*key_top++` to find all such usages and fixed those too. It is unlikely those
        have any real issues; if any module does, I expect it to be `dse add -record`, but I am not sure and am
        not spending the time to figure out whether there is a user-visible issue in those other cases. I do know
        that the revised code is definitely safer and guaranteed not to go beyond record boundaries. Code without
        this fix might work fine in most cases, since 1 byte past a record boundary would usually be accessible,
        so `dse dump` (for example) might just report incorrect data in that rare case, if there is an issue at
        all.
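
      * Below is a minimal sketch of the bounds-checked key scan. `key_top` and `r_top` come from this commit
        message; the record layout details are simplified for illustration. A key ends at the second zero byte,
        but a corrupt record may end before one is found, so every `key_top++` is guarded by a check against
        `r_top`.

        ```c
        for ( ; key_top < r_top; )
        {	/* The (key_top < r_top) check inside the if condition is the newly added one: without it, a
        	 * record whose last byte is the 1st zero byte would let key_top advance 1 byte past r_top.
        	 */
        	if (('\0' == *key_top++) && (key_top < r_top) && ('\0' == *key_top++))
        		break;	/* found the key terminator (two successive zero bytes) */
        }
        ```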
      9b6a1ce5
    • K.S. Bhaskar's avatar
      [#740] Change YDBOCTO region default to 1MiB record size, 1019 byte key size,... · d7ca3b3b
      K.S. Bhaskar authored
      [#740] Change YDBOCTO region default to 1MiB record size, 1019 byte key size, 2048 byte block size, 10000 allocation, 20000 extension, 2000 global buffers
      d7ca3b3b
  19. 27 May, 2021 1 commit
    • Narayanan Iyer's avatar
      [#629] Unary + operator on lvn returned by $ORDER with a value of $ZYSQLNULL... · 792390b4
      Narayanan Iyer authored
      [#629] Unary + operator on lvn returned by $ORDER with a value of $ZYSQLNULL correctly returns $ZYSQLNULL
      
      Background
      ----------
      * Below is a simple example illustrating how the `+` operator on `$ZYSQLNULL` results in a `$ZYSQLNULL` value.
      
        ```m
        YDB>set x=$ZYSQLNULL
        YDB>set y=+x
        YDB>zwrite x,y
        x=$ZYSQLNULL
        y=$ZYSQLNULL
        ```
      
      * But this stops working if the `$ZYSQLNULL` was returned by a `$ORDER` on an lvn. Below is an example. The
        variable `y` initially holds the value `$ZYSQLNULL` after the `$ORDER` operation. But once the `set
        z=+y` is done, the variable `y` holds the value `""` instead, and the result of the `+` in the variable
        `z` ends up being `0`. We expect `z` to hold the value `$ZYSQLNULL`.
      
        ```m
        YDB>set x($ZYSQLNULL)=""
      
        YDB>set y=$ORDER(x(""))
      
        YDB>zwrite x,y
        x($ZYSQLNULL)=""
        y=$ZYSQLNULL
      
        YDB>set z=+y
      
        YDB>zwrite y,z
        y=""
        z=0
        ```
      
      * This issue was found while trying to fix YottaDB/DBMS/YDBOcto#574 (as part of YottaDB/DBMS/YDBOcto!743)
        to use the `+` operator on all numeric operands to force numeric coercion. The incorrect
        value of the `+` computation caused a query to execute incorrectly. Because of this issue,
        YottaDB/DBMS/YDBOcto!743 was fixed to handle `$ZYSQLNULL` first before attempting the `+` operator
        (in the `$$ForceNumeric^%ydboctoplanhelpers` function).
      
      * In Debug builds of YottaDB, the above test case results in the following assert failure.
      
        ```c
        %YDB-F-ASSERT, Assert failed in sr_port/s2n.c line 40 for expression (!MV_IS_SQLNULL(u))
        ```
      
      * Later, I found a much simpler query that demonstrates the issue.
      
        ```sql
        OCTO> SELECT (SELECT NULL::boolean) FROM names ORDER BY 1;
        %YDB-F-ASSERT, Assert failed in sr_port/s2n.c line 40 for expression (!MV_IS_SQLNULL(u))
        ```
      
        It turned out this is easily worked around on the `YDBOcto` side, so YottaDB/DBMS/YDBOcto#676 handled the
        fix and avoided requiring #629 to be fixed.
      
      Fix
      ---
      * When a `$ZYSQLNULL` value is stored as a subscript in an lv tree, the `MV_NM` bit that is stored as part
        of the `literal_sqlnull` mval is lost because `lvAvlTreeNodeInsert()` (in `sr_port/lv_tree.c`) determines
        that the subscript to be stored is not a numeric value. In that case it stores `node->key_mvtype` as only
        the value `MV_STR | MV_SQLNULL`.
      
      * Later, when this subscript is returned by a `$ORDER()` call, the `MV_NM` bit is no longer there, so when
        one invokes the unary `+` operator, `sr_x86_64/op_forcenum.s` needs to invoke `s2n()` as part of the
        `mv_force_num` assembly macro (defined in `sr_x86_64/mval_def.si`). And `s2n()` has an assert that we
        never see a `$ZYSQLNULL`, which fails.
      
      * This assert was added in a prior commit (5b1b5b1b). No reason for it is mentioned in that commit message.
        My suspicion is that the assert was valid for the test cases constructed until then and so it was left
        around. There is no out-of-design situation that gets created in this case.
      
      * Therefore, I went ahead with removing that assert and instead checking whether the string length is 0. If
        so, it could be either the empty string (`""`) or `$ZYSQLNULL`, and depending on which it is, the return
        value is set to `literal_null` (like before) or `literal_sqlnull` (newly done in the `$ZYSQLNULL` case).
        A sketch of this check appears at the end of this commit message.
      
      * With these changes, the assert failure no longer happens in Debug builds and in Release builds the
        test case described in the `Background` section works fine.
      
        I also verified by checking out one commit before YottaDB/DBMS/YDBOcto@714f0ea2, building `YDBOcto`,
        running the query in the last bullet of the `Background` section, and verifying that there are no assert
        failures but instead valid output.
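
      * Below is a minimal sketch of the revised zero-length check. `MV_IS_SQLNULL`, `literal_null` and
        `literal_sqlnull` are named above; the surrounding `s2n()` logic is elided and the assignment form is
        illustrative.

        ```c
        if (0 == u->str.len)
        {	/* A zero-length string is either the empty string or $ZYSQLNULL; the assert that
        	 * claimed the latter was impossible is removed.
        	 */
        	if (MV_IS_SQLNULL(u))
        		*u = literal_sqlnull;	/* $ZYSQLNULL survives the unary + as $ZYSQLNULL */
        	else
        		*u = literal_null;	/* the empty string coerces as before */
        }
        ```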
      792390b4