
[Pipeline] Fix tsigusr1 failures

Sam Habiel requested to merge shabiel/YDBAIM:pipeline-sigusr1 into master

We found that out of 100 runs, 15 fail in the tsigusr1 test.

This test is very timing-sensitive, and a loaded machine makes it behave differently.

The test previously:

  1. Jobs off an indexing job
  2. Waits for the two child jobs of the main indexing job to show up via their globals (checking every .01 seconds)
  3. Waits for the global index to show up (checking every .0001 seconds)
  4. Sends SIGUSR1
  5. Waits for the file from SIGUSR1 to show up (checking every .01 seconds)
  6. Counts the data in the index 6 times
  7. Verifies that the counts keep going up
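
Steps 2, 3 and 5 are all "check every N seconds until something shows up" loops. The following is only an illustrative Python sketch of that polling pattern, not the actual test code; `wait_until`, its parameters, and the added timeout are hypothetical names and choices made for this example.

```python
import time


def wait_until(predicate, interval, timeout):
    """Poll predicate() every `interval` seconds.

    Returns True as soon as predicate() is true, or False once
    `timeout` seconds have elapsed without it becoming true.
    (Hypothetical helper; the real test has its own wait loops.)
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A loop like this only guarantees that the condition was true at the moment it was checked; on a loaded machine, the time between that check and the next step can vary a lot.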

Our failures are in step 7. After a lot of debugging, it turns out that processes receiving the interrupt may be suspended for longer than the 6 milliseconds during which the counts are performed (one count each millisecond). The assumption in step 7 therefore no longer holds: we found that the counts were all the same, as if the AIM processes had stopped indexing. That was not the case, though: a final count taken after indexing finished showed that indexing had continued.

The solution is to change the test from "verify that the counts keep going up" to "verify that the first count is less than the final count". The final count is a new count, taken after the indexing is done.
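
To make the difference between the two checks concrete, here is a small Python sketch (again, not the test's real code). The function names and the sample numbers are made up for illustration; the samples mimic a run in which the indexing processes were suspended for the whole sampling window.

```python
def strictly_increasing(counts):
    """Old-style check: every sample must be larger than the previous one."""
    return all(a < b for a, b in zip(counts, counts[1:]))


def made_progress(first_count, final_count):
    """New-style check: the index grew between the first sample and the end."""
    return first_count < final_count


# Made-up numbers: six samples taken while the indexers were suspended,
# then one count taken after indexing finished.
stalled_samples = [120, 120, 120, 120, 120, 120]
final_count = 5000

old_check_passes = strictly_increasing(stalled_samples)
new_check_passes = made_progress(stalled_samples[0], final_count)
```

With these samples the old check fails even though indexing did continue, while the new check passes because it does not depend on how long the processes stayed suspended.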

This brings us to the other major change: in both tsigusr1 and tsigusr2, we now wait for the indexing process to finish. Previously we didn't, and since both tests rely on the same data, the data from tsigusr1 was bleeding into the tsigusr2 test. tsigusr2 was nevertheless still producing a valid result, because it interrupts an indexing in progress and stops it without fail.
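
As an analogy only, the idea of not leaving a background job running when a test ends could look like this in Python. The real tests use jobbed-off processes rather than `subprocess`; `run_to_completion`, the command, and the timeout are assumptions made for this sketch.

```python
import subprocess
import sys


def run_to_completion(cmd, timeout):
    """Start a background job and block until it exits.

    Returns the job's exit code. If it does not finish within
    `timeout` seconds, it is killed and the timeout is re-raised,
    so nothing is left running to disturb the next test.
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
        raise


# Stand-in for the indexing job: a child Python process that exits at once.
exit_code = run_to_completion([sys.executable, "-c", "pass"], timeout=30)
```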

Edited by Sam Habiel
