Performance: Parallel stat measurements
While we can obviously run stat collection on different repos at the same time, even running a single stat through 2k+ days of one repo often takes ~5 minutes. Speed isn't one of repotracer's design goals, but it would be nice to run a fair bit faster than that.
I had thought we could make eg 8 copies of the repo and have 8 workers each use one of them, but the copies would take up a lot of disk space.
Something I realized when trying local clones in !3 is that we could share the .git object store and just give each of the 8 workers its own temp folder, into which git "re-hydrates" that worker's particular commit. We'd still end up with eg 8 checkouts on disk, but each is only the state of the tree at one point in time, not a full clone with history. It might cost some extra disk during the run, and at startup we'd be asking git to check out each copy from scratch, but we could delete them all once we're done.
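This shared-object-store, one-checkout-per-worker setup is essentially what `git worktree` provides natively: each worktree gets its own working directory but reuses the main repo's objects. A rough sketch of what the per-worker setup/teardown could look like (the helper names here are hypothetical, not anything repotracer currently has):

```python
import subprocess
import tempfile
from pathlib import Path

def add_worktree(repo: str, commit: str) -> Path:
    """Check out `commit` into a fresh temp directory that shares
    the main repo's object store (no duplicated .git objects)."""
    path = Path(tempfile.mkdtemp(prefix="repotracer-wt-"))
    subprocess.run(
        ["git", "-C", repo, "worktree", "add", "--detach", str(path), commit],
        check=True, capture_output=True,
    )
    return path

def remove_worktree(repo: str, path: Path) -> None:
    """Delete a worker's checkout when the run is finished.
    --force in case the stat run left the tree dirty."""
    subprocess.run(
        ["git", "-C", repo, "worktree", "remove", "--force", str(path)],
        check=True, capture_output=True,
    )
```

Each worker would then `git checkout <commit>` inside its own worktree as it walks its slice of the history, so the "re-clone from scratch" cost is paid only once per worker at startup.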
This should allow for parallelizing, possibly even with more than one worker per CPU core, given how IO-intensive the git operations can be.
I'm not planning on doing this immediately, but this issue should track that progress.