Add go benchmarks
The project needs to be in the `$GOPATH` to build.
The Go benchmarks are impressive: I get a max pause of ~0.30ms on my machine with Go 1.7.1. That's compared to ~60ms for Haskell on my machine (GHC 8.0.1). This is to be expected, since the Go GC mostly runs concurrently and has been specifically tuned for low latency.
See the docs for the GC debug output here: https://golang.org/pkg/runtime/. I'm only sedding out the STW wall-clock times. Also, my sed skills are awful; no doubt there's a better way of extracting the desired numbers.
The changes were merged into master. The source branch has been removed.
Thanks! I will not merge right away as I would like to test it locally first, but the code looks straightforward and I'm happy to host more languages.
Did you experiment with older Go versions? I know there has been a lot of churn in their GC algorithms and otherwise runtime choices over the last few releases, so it would be interesting if an older version behaved differently.
I have not experimented with different versions of Go. I know there has been a lot of improvement in GC pause times in recent versions, as is shown in these slides. I don't have time to do it now, but I'm planning on writing some of this stuff up in a blog post, so I'll have a go at it then.
I tried on my machine (Go 1.6), the PR works as expected (I get 1-2ms pause times) but I had to make the two following changes:
- in your Makefile, use `rm -f main`; otherwise it fails on a clean repo.
- the `go` target depends on `main`, which does not exist in a clean build; is `main.go` the correct fix?
Could you clean that up before I merge?
I see that you rely on the GODEBUG measurements for maximal latency. There is a risk with that methodology: there could be other latencies in the runtime that contribute to the maximum latency but are not accounted for inside GC pauses. For this particular benchmark, I found that a robust way to tell whether that is the case is to tweak the program to compute the maximum time elapsed between two iterations (by asking for the current time after each operation, and maintaining a mutable reference containing the maximum time difference). For OCaml and Racket this gives exactly the same results as using the GC reports, and this is why I trust the GC reports. Whenever you decide to invest more work into this, could you maybe check that this also holds for the Go program? If the two measurements differ noticeably, we may have an issue.
(Edit: I initially suggested printing each iteration time and measuring the max pause externally, but I think this may add more noise to the measurement than in-program computation of the max.)
Thanks for looking over it. Apologies for the silly mistakes in the Makefile. These should now be fixed.
Great idea with measuring times for each individual push; it did need a sanity check. I've pushed this in the latest commit. As it turns out, I'm getting worst-case times of ~35ms! I'm not sure what the big discrepancy is. I'll look into it tomorrow -- let's hold off merging this for now until I've had a chance to investigate.
Maybe I should actually convert the other benchmarks to use the manual time-measuring trick instead of GC settings. It would make it easier to port to other languages in a known-to-be-reliable way. (That said, then we might get into the business of measuring constant-time vs. amortized associative map implementations.) I didn't do it initially as I started from the Haskell benchmark where this makes the codebase conceptually heavier (you need IO and possibly State), and wanted to keep things as elegant as possible.
OK, I believe I've figured out what is happening: I noticed that the time I'm manually calculating matched up closely with the maximum concurrent phase of a GC. I also noticed that it was the call to `mkMessage` that caused one of these pauses, which appeared to be making the concurrent phase of the GC STW. My theory is that the heuristics the GC uses to decide when to run are leaving it "too late": the heap fills up, and when a call to `mkMessage` leads to a new allocation, a GC must run, but the mutator is blocked during the concurrent phase because the heap is full. If I force the GC to run at regular intervals, the latencies go right down to <1ms, which supports that hypothesis. What isn't clear is why more memory cannot be allocated as soon as enough has been freed in the concurrent phase. I'll create a SO question about that.
That sounds rather similar (in result, not in the technical details) to the Racket code that can also be tweaked, by adding extra GC calls, to be more incremental (that was mentioned in my blog post). What's the throughput overhead of those GC calls? In any case, I think it would be more interesting to report latencies of the straightforward implementation in the benchmark base, and maybe have the tuned version available conditionally (off by default). Tunability is an important point, but the natural-looking code is what people will actually write at first.
Interesting. It does sound similar to the Racket case. And I agree that having a "vanilla" version is most important, with a possibly optimised one as optional. I'll have a go at that tomorrow.
I think in general we need to check both the times reported by RTS debug instrumentation and by manually timing each iteration, and check if they look roughly the same. That's because I worry that manually timing each iteration may hit some kind of worst case insert into an amortised O(1)/O(log n) map, as you mentioned. If there's a discrepancy in the two times, we need to investigate whether it is the GC or the data structure.
One thing we could try is to switch to just a plain old array of `window_size` entries, inserting at `index % window_size` (no need to explicitly delete then), which would make it clear that the only latency costs are in the GC (or that your language's array support is broken). When I have time I'll try to see if we could do this for the previous languages and get relatively stable results.
OK, I've switched to using an array instead of a map. It caused the manually measured worst-case pause to go down to ~7ms. My hypothesis is that this was being caused by https://github.com/golang/go/issues/10345, which is a generalisation of https://github.com/golang/go/issues/9477, caused by large maps. This was suggested as a possible cause when I submitted the same question to the golang-nuts mailing list. Also, the map version of the benchmark drops to similar worst-case latencies when I run the HEAD version of the compiler.
7ms is still higher than the STW pause times. It was suggested that it could be https://github.com/golang/go/issues/16528. I had a look in the trace tool, and it does look like the mutator gets blocked on all threads at regular intervals (it gets blocked by as much as ~15ms when the tracer is running). Hopefully this gets resolved in Go 1.8.
Is this good to merge for now at least? That's as the non-manually tuned version.
You're right that it would be nice to switch to mutable arrays for the other languages. I could have a crack at Haskell.
This is very nice, thanks.
(I liked the old world where the results seemed rather independent of the associative data structure used (map, mutable or not, hash table, etc.), but I guess this is the price to pay for generalization.)
Status changed to merged
FYI, I wrote up our findings in a blog post: https://blog.pusher.com/golangs-real-time-gc-in-theory-and-practice/
It'll be interesting to see how this followup issue progresses: https://github.com/golang/go/issues/18155