Large numbers of short-lived processes that access M code using auto-relink work correctly
Final Release Note
Large numbers of short-lived processes that access M code using auto-relink work correctly. Previously, this could result in processes terminating abnormally with GTMASSERT2 errors. Such a workload might be encountered with web-server CGI processes on a fast machine under load. This issue was detected in development and test environments and was never reported by a user. [#872 (closed)]
Description
The v63003/gtm8850 subtest in the YDBTest project recently failed in internal testing with the following error.
5a6,18
> v63003_0_18/gtm8850/x1.out
> %YDB-F-GTMASSERT2, YottaDB r999 Linux x86_64 - Assert failed sr_unix/relinkctl.c line 384 for expression (MAX_RCTL_RUNDOWN_RETRIES > rctl_rundown_count++)
.
.
This happens more often in Release builds of YottaDB than in Debug builds. Eventually, I found a test case simple enough to demonstrate the same problem even with Debug builds on a 16-core in-house system (I could reproduce this failure only on the fastest systems).
Below is that test case. It consists of two C-shell scripts: test.csh, the main script, and helper.csh, a helper script used by test.csh. There are 16 parallel jobs running, each of which invokes yottadb to open a relinkctl file (because the gtmroutines env var is set to .*(.)) and immediately exits. This causes heavy contention for the relinkctl file, which is constantly created and deleted.
When I run it using a Debug build of YottaDB from the master branch, I get failures like the following.
$ source test.csh
x.mjo0:14:16:10 : %YDB-F-GTMASSERT2, YottaDB r999 Linux x86_64 - Assert failed sr_unix/relinkctl.c line 384 for expression (MAX_RCTL_RUNDOWN_RETRIES > rctl_rundown_count++)
x.mjo3:14:16:10 : %YDB-F-GTMASSERT2, YottaDB r999 Linux x86_64 - Assert failed sr_unix/relinkctl.c line 384 for expression (MAX_RCTL_RUNDOWN_RETRIES > rctl_rundown_count++)
x.mjo8:14:16:10 : %YDB-F-GTMASSERT2, YottaDB r999 Linux x86_64 - Assert failed sr_unix/relinkctl.c line 384 for expression (MAX_RCTL_RUNDOWN_RETRIES > rctl_rundown_count++)
The scripts are pasted below for reference.
test.csh
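# Clean up artifacts from any previous run
rm -f core* YDB_FATAL* helper.mj* STOP
# tmp.m is an effectively empty routine, so each yottadb process exits right away
echo " " > tmp.m
unsetenv gtm_linktmpdir
# The trailing "*" enables auto-relink for the current directory
setenv gtmroutines ".*(.)"
@ cnt2 = 0
while ($cnt2 < 10)
    # Start one helper in the background; its output goes to helper.mjo<N>
    source helper.csh >& helper.mjo$cnt2 &
    @ cnt2 = $cnt2 + 1
end
wait
grep GTMASSERT helper.mjo*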
helper.csh
@ cnt = 0
while ($cnt < 10)
    # Stop early if another helper has already seen a failure
    if (-e STOP) then
        break
    endif
    $ydb_dist/yottadb -run tmp
    if ($status) then
        # Signal all other helpers to stop
        touch STOP
        set exit_status = 1
        break
    endif
    @ cnt = $cnt + 1
end
The above test also fails with the upstream GT.M code base: GT.M V6.3-014 fails the same way. I am confident that GT.M V7.0-001 (the latest GT.M release at this time) will fail the same way too, but I did not have a build to verify that.
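
For systems without csh, the two scripts above could be combined into a single POSIX sh harness along the following lines. This is only a sketch and not part of the original test. The CMD, JOBS and ITERS variables and the helper and run_test function names are hypothetical additions, introduced so the harness can be dry-run with a stand-in command. The defaults mirror the loop counts in test.csh and helper.csh.

```sh
#!/bin/sh
# Sketch: POSIX sh port of test.csh + helper.csh (an assumption, not the original test).
# CMD, JOBS and ITERS are hypothetical overrides; defaults mirror the csh scripts.
CMD=${CMD:-"$ydb_dist/yottadb -run tmp"}
JOBS=${JOBS:-10}      # parallel helpers started by test.csh
ITERS=${ITERS:-10}    # yottadb invocations per helper in helper.csh

# Equivalent of helper.csh: run CMD repeatedly, stop on the first failure.
helper() {
    cnt=0
    while [ "$cnt" -lt "$ITERS" ]; do
        # Another helper has already failed: stop early.
        [ -e STOP ] && break
        if ! $CMD; then
            touch STOP
            return 1
        fi
        cnt=$((cnt + 1))
    done
}

# Equivalent of test.csh: set up, start the helpers in parallel, report failures.
run_test() {
    rm -f core* YDB_FATAL* helper.mj* STOP
    echo " " > tmp.m
    unset gtm_linktmpdir
    gtmroutines=".*(.)"
    export gtmroutines
    j=0
    while [ "$j" -lt "$JOBS" ]; do
        helper > "helper.mjo$j" 2>&1 &
        j=$((j + 1))
    done
    wait
    # Like the csh version, this prints matching lines and returns 1 if none exist.
    grep GTMASSERT helper.mjo*
}
```

To use it, set ydb_dist as for the csh scripts and call run_test at the end of the file. As in test.csh, no output means no GTMASSERT2 failure was observed in that run.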