CI live job log delay
We're running a self-hosted Omnibus installation of GitLab: a single instance with a locally installed GitLab Runner. The VM has 16 cores and 32 GB of RAM, so resource allocation is not an issue.
For any running CI job, the live log output can be delayed anywhere from a few seconds to tens of seconds, regardless of how much data the job logs. This leads to users hammering the F5 button in the hope of seeing some output (which doesn't actually help).
For example, for jobs that build a Docker image, the log is usually stuck at the first few lines, where the Docker-in-Docker (DinD) service is starting:
```
Running with gitlab-runner 14.2.0 (58ba2b95)
on Docker-in-Docker runner RaUqcUVg
Resolving secrets 00:00
Preparing the "docker" executor 00:06
Using Docker executor with image docker:20.10.6 ...
Starting service docker:20.10.6-dind ...
Using locally found image version due to "if-not-present" pull policy
Using docker image sha256:dc8c389414c80f3c6510d3690cd03c29fc99d66f58955f138248499a34186bfa for docker:20.10.6-dind with digest docker@sha256:aa1e1fab842a35f16cf6e8fcb82b77b29be46f6aa8cd0fe69db0908940256be7 ...
Waiting for services to be up and running...
```
Then, 30-45 seconds later, 100+ lines of `docker build` output appear at once, and the log is stuck again at something like:
```
8ed16eb05fcd: Preparing
9bb2782fa907: Preparing
3595095e4a8a: Preparing
beee9f30bc1f: Preparing
8ed16eb05fcd: Waiting
9bb2782fa907: Waiting
3595095e4a8a: Waiting
beee9f30bc1f: Waiting
37b2ecfeb183: Waiting
8a5619c5c487: Waiting
64f3d9ebca72: Pushed
b43e54e85876: Pushed
444493f7f64a: Pushed
```
It then stays stuck, and the log only finishes rendering a few seconds after the job itself has completed.
The same happens with other types of jobs, such as compiling C# or Java applications, or deployment jobs. For short-running jobs, or jobs that output only a few lines, the log can remain completely empty and then suddenly appear in full only after the job is already complete.
Referring to the job logs documentation, I've narrowed the problem down to the first stage of the log data flow, somewhere in the "Runner => Puma => file storage" pipeline, because the temporary log file on disk shows the same delays as the web UI. I couldn't find anything in the documentation, or here in the issues, that describes this problem. I suspect it has something to do with buffering/caching in either the Runner or Puma, but I couldn't find anything covering how to configure that.
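For anyone wanting to reproduce the diagnosis, one way is to watch the temporary log file grow on the GitLab host and compare the timestamps of growth bursts with when lines appear in the web UI. This is only a sketch: the path below assumes the default Omnibus layout (`/var/opt/gitlab/gitlab-ci/builds/<YYYY_mm>/<project_id>/<job_id>.log`), and the date, project ID, and job ID are hypothetical placeholders for a currently running job.

```shell
#!/bin/sh
# Sketch: print the size of a job's temporary log once a second so its
# growth can be compared against the web UI. Path layout and IDs below
# are assumptions; replace them with those of a currently running job.

watch_log() {
  file=$1
  iterations=$2
  i=0
  while [ "$i" -lt "$iterations" ]; do
    # wc -c on stdin avoids printing the file name alongside the count
    printf '%s %s bytes\n' "$(date +%T)" "$(wc -c < "$file")"
    i=$((i + 1))
    sleep 1
  done
}

# Hypothetical job log path on an Omnibus install
LOG=/var/opt/gitlab/gitlab-ci/builds/2021_09/1/123.log
if [ -e "$LOG" ]; then
  watch_log "$LOG" 60
fi
```

If the file grows in the same 30-45 second bursts as the UI, the buffering happens before the file is written; if it grows steadily while the UI lags, the problem is further down the pipeline.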
The same document describes the "Incremental logging architecture", but I have trouble understanding that part:
- It doesn't say whether enabling it would give a true real-time trace of the job log
- If it does, setting up object storage infrastructure just to get faster job logging seems like massive overkill
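For completeness, if someone does want to experiment with incremental logging, my reading of the docs is that it is toggled with the `ci_enable_live_trace` feature flag after object storage is configured. The exact command shape is an assumption on my part:

```shell
# Assumption: incremental logging is controlled by the ci_enable_live_trace
# feature flag; object storage for job logs must already be configured.
sudo gitlab-rails runner "Feature.enable(:ci_enable_live_trace)"
```

I haven't tried this myself, since the object storage requirement is exactly what I'm hoping to avoid.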
Is there anything that can be done to reduce the delay in showing logs with the default log storage? I don't mind increased disk writes or resource usage, as this is a low-load deployment on an oversized VM that could easily handle more aggressive polling and flushing by the runners.
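As a side note, for users who can't wait for the UI, the raw trace can also be fetched through the Jobs API; this only sidesteps the browser refresh, not the server-side delay described above. The host, project ID, job ID, and token below are placeholders:

```shell
# Fetch the raw job log via the Jobs API (placeholder IDs and token)
curl --header "PRIVATE-TOKEN: <your_token>" \
  "https://gitlab.example.com/api/v4/projects/1/jobs/123/trace"
```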