[Pipeline] Fix tsigusr1 failures (!67) · Merge requests · YottaDB / Util / YDBAIM

Sam Habiel requested to merge shabiel/YDBAIM:pipeline-sigusr1 into master Apr 24, 2023

We found that the tsigusr1 test fails in 15 out of 100 runs.

This test is very timing-sensitive, and a loaded machine makes it behave differently.

The test previously:

1. Jobs off an indexing job
2. Waits for the two child jobs of the main indexing job to show up via their globals (check every .01 seconds)
3. Waits for the global index to show up (check every .0001 seconds)
4. Sends SIGUSR1
5. Waits for the file from SIGUSR1 to show up (check every .01 seconds)
6. Counts the data in the index 6 times
7. Verifies that the counts keep going up
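
The "check every N seconds" waits in the steps above follow a common polling pattern. As a rough illustration only (the actual test is written in M; the Python helper below, including the name `wait_until` and the `timeout` parameter, is hypothetical and not taken from the YDBAIM test suite):

```python
import time

def wait_until(condition, interval, timeout=10.0):
    """Poll condition() every `interval` seconds.

    Returns True as soon as condition() is true, or the result of one
    last check once `timeout` seconds have elapsed.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return condition()
```

A wait like "for the file from SIGUSR1 to show up" would then look something like `wait_until(lambda: os.path.exists(path), 0.01)`, where `path` is a placeholder. Note that a polling wait only proves the condition became true at some point; it says nothing about how long the signalled processes stay suspended afterwards.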

Our failures are in step 7. After a lot of debugging, it turned out that processes receiving the interrupt may be suspended for longer than the 6 milliseconds during which the counts are performed (one count per millisecond). The assumption in step 7 therefore no longer holds: we found that the counts were all the same, as if the AIM processes had stopped indexing. That was not the case, though: a final count taken after indexing stopped showed that indexing had continued.

The solution is to change the test from "verify that the counts keep going up" to "verify that the first count is less than the final count", where the final count is a new count taken after indexing is done.
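
To make the difference between the two checks concrete, here is a minimal sketch in Python (not the M code used by the test). The function names and the sample numbers are made up for illustration:

```python
def counts_strictly_increase(counts):
    """Old check: every count must be larger than the one before it."""
    return all(a < b for a, b in zip(counts, counts[1:]))

def first_less_than_final(first_count, final_count):
    """New check: only compare the first count with the count taken
    after indexing has finished."""
    return first_count < final_count

# Hypothetical data: the indexing processes were suspended while the six
# counts were taken, so all six counts are identical.
stalled_counts = [120, 120, 120, 120, 120, 120]
# Hypothetical count taken after indexing completed.
final_count = 5000
```

With these numbers, `counts_strictly_increase(stalled_counts)` is false even though indexing did continue, while `first_less_than_final(stalled_counts[0], final_count)` is true. The new check no longer depends on how long the interrupted processes stay suspended.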

This brings us to the other major change: in both tsigusr1 and tsigusr2, we now wait for the indexing process to finish. Previously we didn't, and since both tests rely on the same data, the data from tsigusr1 was bleeding into the tsigusr2 test. However, tsigusr2 still produced a valid result, as it interrupted an indexing run in progress and stopped it without fail.

Edited Apr 25, 2023 by Sam Habiel
