[!377] Do not terminate with a fatal GTMASSERT2 message (and core) in case JOB command fails to create a new process

Narayanan Iyer requested to merge nars1/YDB:forkcore into master

This merge request contains 3 fixes. The first 2 are in ojstartchild.c and the 3rd is in op_job.c.

  1. While trying to get the manually_start/sem_counter subtest running (it starts 32K processes), we saw the following message in the syslog.

    %YDB-F-GTMASSERT2, YottaDB r998 Linux x86_64 - Assert failed sr_unix/ojstartchild.c line 387 for expression (EAGAIN == errno || ENOMEM == errno)

    Investigation showed that fork() had failed with errno set to ENOSPC. The maximum number of processes, defined in /proc/sys/kernel/pid_max, was at its default of 32K, and this test pushed the total number of processes in the system above 32K. Below is the actual error detail corresponding to ENOSPC.

    %SYSTEM-E-ENO28, No space left on device

    The fork() man page, however, does not list ENOSPC as a possible value of errno. This turns out to be an issue in the Linux kernel (https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1759316.html) that is expected to be fixed soon.

    Nevertheless, YottaDB should not fail with a fatal GTMASSERT2 error when a system call returns an undocumented errno. Issuing a trappable error is better in that case.

    This fix issues a JOBFAIL error with a SYSCALL error detail that shows up as follows if the fork() done in the parent fails.

    %YDB-E-JOBFAIL, JOB command failure
    %YDB-E-SYSCALL, Error received from system call fork() -- called from module sr_unix/ojstartchild.c at line 383
    %SYSTEM-E-ENO28, No space left on device

    It is also possible that the fork() done in the middle child fails. In that case, the parent shows the following error message.

    %YDB-E-JOBFAIL, JOB command failure
    %YDB-I-TEXT, Job error in fork
    %SYSTEM-E-ENO28, No space left on device

    Note that in order to communicate the middle child failure in fork to the parent, a pre-existing (but unused) failure code "joberr_frk" was used.

  2. While at it, the FORK_RETRY macro has been removed. No benefit was seen in retrying after a fork() failure with EAGAIN or ENOMEM. Moreover, the retry was an unbounded while loop, so if the system was short of dynamic kernel memory resources, the JOB command could run indefinitely without honoring any user-specified timeout. It seems more user-friendly to issue an error the first time a fork() failure is seen.

  3. Finally, while reviewing op_job.c, a mistake was noticed in the issuing of a JOBFAIL error (parameters were passed in the wrong order); that is now fixed. This code path is highly unlikely to be exercised, since it requires a read from the pipe (set up between the middle child and the parent) to fail in the first place.
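
The approach described in fixes 1 and 2 can be illustrated with a minimal, standalone sketch. It is not the code in ojstartchild.c: the names `job_status_t`, `job_result_t`, `check_fork`, `describe_job_result` and `start_job` are hypothetical and exist only for this example. It assumes a parent, a middle child and a grandchild connected by a pipe, tries each fork() exactly once, and reports whatever errno comes back instead of asserting on a fixed set of values.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical status codes for this sketch (not YottaDB's joberr values). */
typedef enum { JOB_OK = 0, JOB_ERR_PARENT_FORK, JOB_ERR_MIDDLE_FORK, JOB_ERR_PIPE } job_status_t;

typedef struct {
    job_status_t status;
    int          saved_errno;   /* errno captured right after the failing call */
} job_result_t;

/* Turn a fork() return value into a result. Any errno is accepted, documented
 * or not, so an unexpected value such as ENOSPC is reported, not asserted on. */
job_result_t check_fork(pid_t pid, int err, job_status_t on_fail)
{
    job_result_t res = { JOB_OK, 0 };

    if (pid < 0) {
        res.status = on_fail;
        res.saved_errno = err;
    }
    return res;
}

/* Format a result as a one-line message; returns the snprintf() result. */
int describe_job_result(job_result_t res, char *buf, size_t len)
{
    static const char *where[] = { "none", "fork in parent", "fork in middle child", "pipe to middle child" };

    assert(NULL != buf);    /* guards a programming error, not a system call outcome */
    if (JOB_OK == res.status)
        return snprintf(buf, len, "job started");
    return snprintf(buf, len, "job failed (%s): %s", where[res.status], strerror(res.saved_errno));
}

/* Start a job through a middle child. Each fork() is tried once, with no retry loop. */
job_result_t start_job(void)
{
    job_result_t res = { JOB_OK, 0 };
    int          fds[2];
    pid_t        mid, job;
    ssize_t      nread;

    if (pipe(fds) < 0)
        return (job_result_t){ JOB_ERR_PIPE, errno };
    mid = fork();
    if (mid < 0) {      /* fork failed in the parent: clean up and report */
        res = check_fork(mid, errno, JOB_ERR_PARENT_FORK);
        close(fds[0]);
        close(fds[1]);
        return res;
    }
    if (0 == mid) {     /* middle child */
        close(fds[0]);
        job = fork();
        if (0 == job)
            _exit(0);   /* grandchild: the actual job would run here */
        res = check_fork(job, errno, JOB_ERR_MIDDLE_FORK);
        /* Report success or failure (with errno) to the parent through the pipe. */
        if (write(fds[1], &res, sizeof(res)) != (ssize_t)sizeof(res))
            _exit(2);
        _exit(JOB_OK == res.status ? 0 : 1);
    }
    /* parent: read the middle child's report, then reap the middle child */
    close(fds[1]);
    nread = read(fds[0], &res, sizeof(res));
    if (nread != (ssize_t)sizeof(res))
        res = (job_result_t){ JOB_ERR_PIPE, (nread < 0) ? errno : 0 };
    close(fds[0]);
    waitpid(mid, NULL, 0);
    return res;
}
```

A caller would invoke `start_job()` and pass the result to `describe_job_result()`; in a real runtime that is the point where a trappable error would be raised. Carrying the saved errno alongside a status code is one simple way to let the parent distinguish its own fork() failure from one in the middle child.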
