Lightning-fast Windows threadvars! (!689) · Merge requests · FPC / FPC / FPC Source

(Well, this by no means suggests they are free; you’ll still want to cache pointers.)

Does what @FPK2 probably tried to do 16 years ago, but a little more in depth. At the cost of 40 lines of assembly and digging into the TEB structure, this makes SysRelocateThreadvar as fast as theoretically possible. The only way to squeeze out the last drop of speed is to call it directly instead of going through CurrentTM (that’s a proposal, by the way; it would have more obvious positive consequences, like not linking all of the threading into each application).

The speedup on my computer is from 10 ns to under 2 ns per threadvar access. This observably speeds up applications that access threadvars a nontrivial number of times. Since heap.inc relies on threadvars, that means applications that often allocate and free memory, which is, as you may have guessed, virtually any application.

GetMem and FreeMem speed up by 30%.
FPC as a whole speeds up by 2.5~3% when compiling my application, when compiling itself, and even when compiling this (which does 140M calls to GetMem and FreeMem; × 8 ns saved each ≈ 1.1 seconds faster, / 45 seconds = 2.4%). The FPCUpDeluxe compiler invocations before packages run in 80 seconds in total; their GetMems and FreeMems take 10.3 s before and 7.8 s after, so (10.3 − 7.8) / 80 = 3.1%.
Benchmark from here speeds up (again) as follows:

x86-64                        before          after
Test 1: 5000000 done in     1.948 sec       1.555 sec
Test 2: 5000000 done in     1.475 sec       1.180 sec
Test 3: 5000000 done in     1.440 sec       1.201 sec
s := 'a' + s x10:             218 ns/call     196 ns/call
s := 'a' + s + 'a' x10:       334 ns/call     321 ns/call

i386                          before          after
Test 1: 5000000 done in     1.747 sec       1.338 sec
Test 2: 5000000 done in     2.321 sec       1.955 sec
Test 3: 5000000 done in     1.353 sec       1.040 sec
s := 'a' + s x10:             184 ns/call     184 ns/call
s := 'a' + s + 'a' x10:       348 ns/call     334 ns/call

Another funny example: this person mentioned his “thread-safe Mersenne Twister” implemented with threadvars, and its performance was abysmal: 180 / 85 Mb/s (x86-64 / i386). With this patch, it increases to 710 / 390 Mb/s. For reference, global variables reach 1 / 1.45 Gb/s, and a “correct” threadvar accessed through a cached pointer should now be very close.
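For anyone wondering what “cached pointer” means here, a minimal sketch follows. The names (Counter, BumpDirect, BumpCached) are made up for illustration and are not from the patch or from the Mersenne Twister code mentioned above; the only assumption is that taking the address of a threadvar resolves it once for the current thread.

```pascal
program CachedThreadvar;
{$mode objfpc}

threadvar
  Counter: Cardinal; // hypothetical per-thread state, zero-initialized

// Direct form: each access to Counter resolves the threadvar address again.
procedure BumpDirect(n: SizeInt);
var
  i: SizeInt;
begin
  for i := 1 to n do
    Counter := Counter + 1;
end;

// Cached form: resolve the address once, then work through the pointer.
procedure BumpCached(n: SizeInt);
var
  p: PCardinal;
  i: SizeInt;
begin
  p := @Counter; // only valid on the thread that took the address
  for i := 1 to n do
    p^ := p^ + 1;
end;

begin
  BumpDirect(1000);
  BumpCached(1000);
  WriteLn(Counter);
end.
```

The cached pointer must never be handed to another thread, because it points at the calling thread’s copy of the variable.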

Tested on Windows XP and Windows 10. 😇

Historical version, absolutely pathetic.

