The log of the failed build shows a two-second time delta in the file's metadata. There should be none, so something is certainly amiss.
Don't run tests that need filesystem mtimes under Docker.
Don't use mtime; use something like a checksum of the inputs instead. This would require storing the checksums somewhere between runs, and I'm not sure where that would be.
Always generate new output, and drop the "don't generate output unless needed" functionality.
We did it that way (source >= output) in order to cope with low-precision timestamps and rapid scripted input changes.
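For reference, a minimal Python sketch of what a "source >= output" rule looks like. This is not docgen's actual code; `is_stale` is a made-up name, and I'm assuming the comparison is on mtimes and that a tie counts as stale.

```python
import os


def is_stale(sources, output):
    """Return True if the output is missing or any source mtime is >= the output mtime.

    Hypothetical helper, not docgen's real check. Treating a tie as stale is
    what copes with coarse timestamps: with one- or two-second resolution, an
    input edited right after a build can carry the same mtime as the output.
    """
    try:
        out_mtime = os.stat(output).st_mtime_ns
    except FileNotFoundError:
        return True
    return any(os.stat(src).st_mtime_ns >= out_mtime for src in sources)
```

The cost of `>=` is the failure seen here: an output written within the same timestamp tick as its inputs is regenerated on the next run even though nothing changed.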
If we insert a sleep before running docgen the first time, that would ensure the output document's timestamp is later than the input timestamps, which might make things more reliable.
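On the checksum option, here is a rough Python sketch of how it could work without relying on mtime at all. Everything in it is an assumption on my part: the function names are made up, and keeping the digest in a small sidecar "stamp" file next to the output is just one possible answer to where the checksums would live between runs.

```python
import hashlib
import json
from pathlib import Path


def digest_inputs(paths):
    """Return one SHA-256 hex digest covering the names and contents of all inputs."""
    h = hashlib.sha256()
    for p in sorted(Path(x) for x in paths):
        h.update(str(p).encode("utf-8"))
        h.update(b"\0")
        h.update(p.read_bytes())
        h.update(b"\0")
    return h.hexdigest()


def needs_rebuild(inputs, output, stamp):
    """True if the output or stamp is missing, or the inputs differ from the recorded digest."""
    output, stamp = Path(output), Path(stamp)
    if not output.exists() or not stamp.exists():
        return True
    try:
        recorded = json.loads(stamp.read_text())["inputs_sha256"]
    except (ValueError, KeyError, TypeError):
        # An unreadable or malformed stamp is treated as "rebuild".
        return True
    return recorded != digest_inputs(inputs)


def record_build(inputs, stamp):
    """Write the current input digest to the (hypothetical) sidecar stamp file."""
    Path(stamp).write_text(json.dumps({"inputs_sha256": digest_inputs(inputs)}))
```

The caller would check `needs_rebuild(...)` before generating and call `record_build(...)` after a successful run. This reads every input on every run, which is slower than a stat call, but the result does not depend on timestamp resolution or on how Docker handles mtimes.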