Decouple fetch and collect steps in the content aggregator
The content aggregator performs two steps on each remote (managed) repository:
- It fetches the git index from the git server if it hasn't yet been pulled down or the fetch option is enabled.
- It walks the git tree for each reference and collects the files into an aggregate.
Currently, these steps are performed concurrently (using Promise.all with a limit). This is done as an optimization to make use of dead time caused by network latency. However, this behavior has recently become a problem when using repositories hosted on GitHub. GitHub has started aggressively resetting a connection if it is paused for even a rather brief amount of time. Such a pause occurs when Node.js context switches from one concurrent operation to another. If a collect operation takes longer than whatever that timeout is, the active connection may be reset by the time Node.js returns to process it, resulting in a connection error.
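To make the coupling concrete, here's a minimal sketch of the current shape of this phase, assuming a p-limit style limiter; the names fetchRepository, collectFilesFromRepository, and aggregateCoupled are illustrative placeholders, not the actual content aggregator code.

```ts
// Sketch of the current, coupled behavior (illustrative only; not Antora's internals).
// Each repository's fetch and collect run inside the same limited task, so a
// long-running collect can leave a neighboring fetch's connection paused long
// enough for the server to reset it.
import pLimit from 'p-limit'

interface RepositorySource { url: string }
interface FileAggregate { origin: string; files: string[] }

// Network I/O: download refs and packfile from the git server (stubbed for the sketch).
async function fetchRepository (source: RepositorySource): Promise<void> {}

// CPU/disk bound: walk the git tree for each reference and read the files (stubbed).
async function collectFilesFromRepository (source: RepositorySource): Promise<FileAggregate> {
  return { origin: source.url, files: [] }
}

async function aggregateCoupled (sources: RepositorySource[], concurrency = 4): Promise<FileAggregate[]> {
  const limit = pLimit(concurrency)
  return Promise.all(
    sources.map((source) =>
      limit(async () => {
        await fetchRepository(source)
        // the event loop may be occupied here while another repository's fetch
        // connection sits idle, which is what triggers the reset
        return collectFilesFromRepository(source)
      })
    )
  )
}
```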
The short-term workaround is to turn off this concurrency in the content aggregator. That can be achieved using the following setting in the playbook:
git:
  fetch_concurrency: 1
However, it may still be possible to use concurrency in the content aggregator if the only concurrent operations are the fetches themselves. That's because fetch requests tend to context switch more frequently, keeping any pauses under the server timeout. In contrast, walking a git tree and reading and collecting the files can take considerably longer and is much slower to give up control of the thread (less frequent context switches). There's still some risk that using concurrency only for the fetches will result in a connection error, but it's much less likely. Regardless, the git.fetch_concurrency setting should still be honored.
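For comparison, here's a sketch of what the decoupled flow might look like, reusing the placeholder functions and types from the sketch above. This is an assumption about the shape of the change, not the actual implementation: only the fetches run under the git.fetch_concurrency limit, fetch errors are reported as soon as that phase completes, and collecting happens afterward with no network connections in play.

```ts
// Sketch of a decoupled flow (illustrative; reuses the placeholders defined earlier).
async function aggregateDecoupled (
  sources: RepositorySource[],
  fetchConcurrency = 1 // would honor git.fetch_concurrency from the playbook
): Promise<FileAggregate[]> {
  // Phase 1: network only. Fetch requests context switch frequently, so pauses stay short.
  const limit = pLimit(fetchConcurrency)
  const results = await Promise.allSettled(
    sources.map((source) => limit(() => fetchRepository(source)))
  )
  // Report fetch failures as soon as the fetch phase completes, before any collecting begins.
  const failures = results.filter((r): r is PromiseRejectedResult => r.status === 'rejected')
  if (failures.length) {
    throw new AggregateError(failures.map((r) => r.reason), 'one or more fetches failed')
  }
  // Phase 2: no network connections involved, so maximum concurrency should be safe.
  return Promise.all(sources.map((source) => collectFilesFromRepository(source)))
}
```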
The other reason this change is advantageous is that it reduces confusion about how errors are reported in this phase. Currently, errors have to be held until all operations have finished to avoid corrupting a repository. But since collecting happens concurrently with fetching, a lot of work may be done before Antora fails due to a network error. Decoupling these steps will allow Antora to report errors sooner.
It's probably not risky to use maximum concurrency when collecting files once the fetches are complete since no network connections are involved at that point. However, we may need to consider allowing that concurrency limit to be controlled as well in the future.