Over the past month (at least, maybe two), we have been having curious build timeouts, where out of two builds with the same compiler and source revision, one will time out and the other will succeed. Is there something going on in the backend? For example, https://builds.tuxbuild.com/20c9r8xZ2zlODkKcO92WuHmvWLg/ timed out but https://builds.tuxbuild.com/20cn4xZ4nDlIF8Q4pSsTUGwHdIJ/ succeeded; given there were only four hours between them (same source revision) and clang was last built on October 27th (from a revision on June 28th), I would expect the results to be exactly the same. As a side note, it would be nice to have metadata.json for timed-out builds as well, so we can see the exact compiler version string.
I have compared the two builds. It looks like the timeout happened just after the 3 hr boundary, but the build was nearing the completion stage, i.e. tarring all the modules. It could just be that the build ran on a slower machine. We will look into recent timeouts to see if this is a pattern.
@nathanchance We have agreed to increase the build timeout from 3 hrs to 4 hrs, since these builds are taking more than 3 hrs to finish. We also plan to add timestamps to the tuxmake logs to reflect where the build actually stopped executing.
```
$ tuxmake -a arm64 -k allmodconfig -K CONFIG_WERROR=n -r podman -t clang-12 CROSS_COMPILE_COMPAT=arm-linux-gnueabihf- LLVM=1 LLVM_IAS=1
...
I: config: PASS in 0:00:07.691308
I: default: PASS in 0:13:43.800073
I: kernel: PASS in 0:00:05.664149
I: xipkernel: SKIP in 0:00:00.293470
I: modules: PASS in 0:07:28.587888
I: dtbs: PASS in 0:00:08.067669
I: dtbs-legacy: SKIP in 0:00:00.390786
I: debugkernel: PASS in 0:00:11.339249
I: headers: PASS in 0:00:02.748245
...
```
If I am reading that right, the kernel build took 13m 43s and installing then archiving the modules took 7m 28s, so roughly a third of the total tuxmake time was spent archiving the modules. That seems excessive to me; perhaps tuxmake#161 (closed) is relevant here? When I benchmarked compression algorithms a year and a half ago, xz was significantly slower than all the other algorithms at both compression and decompression, even though it offered the best compression ratio. Perhaps switching to an algorithm like zstd is worth exploring, as it offers better compression and decompression speed than xz with a comparable compression ratio, meaning more builds within a set timespan.
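As an illustration of the kind of comparison I mean, something like the following could be run against a modules tarball from a local tuxmake build (the file names and compression levels here are just assumptions for the sketch, not measurements from the builds above):

```sh
# Illustrative comparison only; assumes a modules.tar from a local tuxmake
# build sits in the current directory. Results will vary with hardware and
# kernel configuration.
time xz -6 --keep modules.tar                     # xz default level -> modules.tar.xz
time zstd -19 modules.tar -o modules-19.tar.zst   # high zstd level
time zstd -3  modules.tar -o modules-3.tar.zst    # zstd default level
ls -lh modules.tar modules.tar.xz modules-19.tar.zst modules-3.tar.zst
```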
As I commented in tuxmake#161 (closed), we could play around with the different xz compression levels.
For example, in a simple local test with an arm64 modules tarball:

Default compression level:
```
$ time xz --keep modules.tar

real    0m16,839s
user    0m16,781s
sys     0m0,048s

tmp$ ll modules.tar*
-rw-r--r-- 1 terceiro terceiro 7,5M nov 19 13:55 modules.tar.xz
-rw-r--r-- 1 terceiro terceiro  43M nov 19 13:55 modules.tar
```
Lowest compression level (faster):
```
$ time xz --keep -0 modules.tar

real    0m1,855s
user    0m1,827s
sys     0m0,028s

tmp$ ll modules.tar*
-rw-r--r-- 1 terceiro terceiro 8,8M nov 19 13:55 modules.tar.xz
-rw-r--r-- 1 terceiro terceiro  43M nov 19 13:55 modules.tar
```
In this case, -0 is one order of magnitude faster than the default, and the resulting compression is not bad: 20% of the original size, versus 17% with the default compression level.
For reference, the build that produced that modules tarball took 21s in the modules target.
Well, just doing a search and replace would be easy, but I would not want to be locked into a new compressor; I would rather have a mechanism to choose. Also, switching compression methods might break automation that people have in place (e.g. looking for *.xz).
Something that would help a lot without any change, right now, would be setting XZ_OPT=-0 in the environment for the tuxmake build in the TuxBuild backend.
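For what it's worth, a minimal sketch of that could look like the invocation below (assuming tuxmake/tar shell out to the system xz, which reads default options from XZ_OPT; the specific tuxmake arguments are just the ones from the example build above):

```sh
# Minimal sketch: xz picks up default options from XZ_OPT, so exporting it
# for the build should apply to the modules tarball compression.
# (Assumes tuxmake/tar invoke the system xz binary.)
XZ_OPT=-0 tuxmake -a arm64 -k allmodconfig -r podman -t clang-12 LLVM=1
```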
Based on the discussions in this issue, we have implemented the following changes, which should help us reduce and better diagnose long-running builds:
Build timeout increased to 240 minutes
build_debug.log is published with timestamped lines to show how long each step took (build.log is still available as before); see the sketch after this list for one way such timestamps can be generated
tuxbuild now uses xz level 0 compression for the modules instead of the default level 6, which may help in reducing the overall time taken for large builds
compression time should drop by one order of magnitude
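For illustration, timestamped log lines like the ones in build_debug.log can be produced with something along these lines (a sketch assuming GNU awk; not necessarily how tuxbuild implements it):

```sh
# Sketch only: prefix each line of the build output with a wall-clock
# timestamp so long-running steps stand out. Assumes GNU awk (strftime).
make ARCH=arm64 LLVM=1 -j"$(nproc)" 2>&1 \
    | awk '{ print strftime("[%Y-%m-%d %H:%M:%S]"), $0; fflush() }' \
    | tee build_debug.log
```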
This is one of the longest builds, https://builds.tuxbuild.com/21OB64WZb0jESRxrm2s67TRLirm/, which took about 2 hrs 40 minutes to finish today. Looking at the logs, module tarball creation took 10 minutes even with the latest optimization. It looks like tarball creation is also contributing to the timeouts.
It appears the changes implemented have generally helped, as I see a lot fewer timeouts (most of our recent failures have been due to GitHub Actions). However, this build is rather peculiar because it says that it timed out on the build step but it produced an Image.gz, which appears to boot and work fine in QEMU. Did something go wrong on the backend? Additionally, the result that we pulled down says the result is "unknown" and that it is running. Was there a race with the tuxsuite client finishing the build and our CI trying to get the result?
@nathanchance this build timed out while building modules, hence you have the Image.gz but no modules. We will increase the timeout marginally so that the builds don't time out.
@nathanchance we would like to get your answers to the following:
Is there a possibility of the toolchain getting into a loop, causing the build to hang? Have you observed such issues in the past?
How can we check whether the clang toolchain processes are still making progress during the build?
We have tried debugging on a live machine running a build that had been going for more than 3 hours, and we saw that cached builds sometimes hang, and that uncached builds (such as those using the clang nightly toolchain, which do not use sccache) also hang, across different machine configurations. Hence the questions above, which will help us debug on a live machine while a build is happening and find the root cause.
@stylesen I was just getting ready to make a comment around sccache because I saw this build timed out, even though metadata.json showed that the build was mostly cached (21,520 hits, 12 misses). If uncached builds hang though, it might not be that.
If you are debugging this while the build is happening in real time: I usually use htop to see which file clang is building when it hangs, so that I can try to build that single file (make <path>.o) and see if it hangs that way. I am not really sure of a way to do this programmatically... it is possible you could use the etime field from ps to get the elapsed time of all clang invocations, and if one runs longer than, say, a minute, kill the build and report the unit that hung?
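To make that idea concrete, a rough watchdog sketch might look like this (it uses etimes, i.e. elapsed seconds, rather than etime to make the comparison easy; the 60-second threshold, the polling interval, and the process name are all assumptions for illustration, not a tested tuxbuild feature):

```sh
# Rough sketch of a hang watchdog: flag any clang process that has been
# running longer than 60 seconds. Threshold, interval, and the process name
# ("clang" vs. e.g. "clang-14") are illustrative assumptions.
while sleep 30; do
    ps -C clang -o etimes=,pid=,args= | while read -r secs pid args; do
        if [ "$secs" -gt 60 ]; then
            echo "possible hang: PID $pid after ${secs}s: $args"
            # A real watchdog could kill the build here and report $args
            # as the translation unit that hung.
        fi
    done
done
```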
If clang-nightly builds are hanging without sccache and those hangs are reproducible on many different machines, let us know, because that is likely an issue with upstream; we would need to reduce the problematic file and get it reported so it can be fixed before the compiler releases. If the hangs are not reproducible from machine to machine, it is going to be harder to figure out what is going on (it is likely not the compiler at that point).
@nathanchance I have logged into a machine where the build had been running for 3 hrs and checked what was going on. I saw that lld had been running for more than 3 hrs. Here is the process:
@vishalbhoj would it be possible to get the .o files mentioned in those commands? I cannot reproduce this locally with tuxmake; could it be a caching problem?
@nathanchance I don't think it is a problem related to caching. We have seen this behaviour on clang-nightly based builds as well, which do not use the cache. The issue is not reproducible, as the same build finishes fine when resubmitted.
Hmmm, I don't, although it seems like there might be a race condition somewhere as the traceback says it is waiting for something? Perhaps @nickdesaulniers has an idea? The fact that it is not reproducible with the exact same build artifacts is really troubling, as that seems to point to some sort of non-determinism. You said these timeouts happen on various different machines, correct? Is there anything common between them, such as CPU, memory, or hard drive type?