Lightning-fast Windows threadvars!
(Well, this by no means suggests they are free; you’ll still want to cache pointers.)
Does what @FPK2 probably tried to do 16 years ago, but a little more in depth. At the cost of 40 lines of assembly and digging into the TEB structure, this makes SysRelocateThreadvar as fast as theoretically possible; you can squeeze out the last drop of speed only by calling it directly instead of using CurrentTM (that’s a proposal btw; it will have more obvious positive consequences, like not linking all of the threading into each application).
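On caching pointers: a minimal sketch of the pattern, not code from the patch. `Counter`, `BumpDirect` and `BumpCached` are made-up names for illustration. The direct version pays for the threadvar lookup on every reference, while the cached version resolves the address once per call.

```pascal
program CacheThreadvarPointer;
{$mode objfpc}

type
  PCounter = ^PtrUInt;

threadvar
  Counter: PtrUInt; // hypothetical per-thread counter; every thread gets its own copy

procedure BumpDirect(n: SizeInt);
var
  i: SizeInt;
begin
  // Each reference to Counter goes through the threadvar address lookup.
  for i := 1 to n do
    Counter := Counter + 1;
end;

procedure BumpCached(n: SizeInt);
var
  i: SizeInt;
  p: PCounter;
begin
  // Resolve the address once. The pointer is valid only for the thread
  // that obtained it, so it must not be handed to another thread.
  p := @Counter;
  for i := 1 to n do
    p^ := p^ + 1;
end;

begin
  BumpDirect(1000);
  BumpCached(1000);
  writeln(Counter); // prints 2000
end.
```

The cached pointer should stay a local of the routine (or otherwise stay on the thread that took it); storing it in a global would defeat the point of the threadvar.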
Speedup on my computer is 10 → <2 ns per threadvar access. This observably speeds up applications that access a nontrivial number of threadvars; as heap.inc relies on threadvars, this means applications that often allocate and free memory, which is, as you may have guessed, virtually any application.
- GetMem and FreeMem speed up by 30%.
- FPC as a whole speeds up by 2.5~3% when compiling my application, and when compiling itself, and even when compiling this (which does 140M calls to GetMem and FreeMem, × 8 ns each = 1.1 seconds faster, / 45 seconds = 2.4%). FPCUpDeluxe compiler invocations before packages run in 80 seconds in total, their GetMems and FreeMems take 10.3 s before and 7.8 s after, (10.3 − 7.8) / 80 = 3.1%.
- Benchmark from here speeds up (again) as follows:
x86-64                          before        after
Test 1: 5000000 done in         1.948 sec     1.555 sec
Test 2: 5000000 done in         1.475 sec     1.180 sec
Test 3: 5000000 done in         1.440 sec     1.201 sec
s := 'a' + s x10:               218 ns/call   196 ns/call
s := 'a' + s + 'a' x10:         334 ns/call   321 ns/call

i386                            before        after
Test 1: 5000000 done in         1.747 sec     1.338 sec
Test 2: 5000000 done in         2.321 sec     1.955 sec
Test 3: 5000000 done in         1.353 sec     1.040 sec
s := 'a' + s x10:               184 ns/call   184 ns/call
s := 'a' + s + 'a' x10:         348 ns/call   334 ns/call
- Another funny example: this person mentioned his “thread-safe Mersenne Twister” implemented with threadvars, whose performance was abysmal: 180 / 85 Mb/s (x86-64 / i386). With this patch, it increases to 710 / 390 Mb/s. For reference, global variables give 1 / 1.45 Gb/s, and a “correct” threadvar accessed through a cached pointer must now be very close.
Tested on Windows XP and Windows 10.
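For anyone who wants to get a rough per-access figure on their own machine, here is a hedged microbenchmark sketch. It is not the benchmark used for the numbers above, and `TV`, `GV`, `Iterations` and the procedure names are invented for illustration. It times the same increment loop on a threadvar, on a threadvar through a cached pointer, and on a global. The timings include loop overhead, so the difference between the rows is more meaningful than the absolute values.

```pascal
program ThreadvarBench;
{$mode objfpc}
uses
  SysUtils;

const
  Iterations = 100 * 1000 * 1000; // hypothetical count; adjust to taste

type
  PValue = ^PtrUInt;

threadvar
  TV: PtrUInt; // accessed through the threadvar machinery

var
  GV: PtrUInt; // plain global, for reference

procedure Report(const name: string; startTick: QWord);
begin
  // GetTickCount64 has millisecond resolution; convert to ns per iteration.
  writeln(name, ': ',
    ((GetTickCount64 - startTick) * 1e6 / Iterations):0:2, ' ns/iteration');
end;

procedure BenchThreadvar;
var
  i: SizeInt;
  t: QWord;
begin
  t := GetTickCount64;
  for i := 1 to Iterations do
    TV := TV + 1; // threadvar lookup on every reference
  Report('threadvar       ', t);
end;

procedure BenchCachedPointer;
var
  i: SizeInt;
  t: QWord;
  p: PValue;
begin
  t := GetTickCount64;
  p := @TV; // single lookup, valid for this thread only
  for i := 1 to Iterations do
    p^ := p^ + 1;
  Report('cached pointer  ', t);
end;

procedure BenchGlobal;
var
  i: SizeInt;
  t: QWord;
begin
  t := GetTickCount64;
  for i := 1 to Iterations do
    GV := GV + 1;
  Report('global variable ', t);
end;

begin
  BenchThreadvar;
  BenchCachedPointer;
  BenchGlobal;
  // Print the totals so the loops have an observable result.
  writeln(TV, ' ', GV);
end.
```

Building it once against an RTL without the patch and once with it should show the change in the first row, while the other two rows should stay roughly the same.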
Historical version, absolutely pathetic.