heap.inc with incremental formatting and instant recycling of fixed chunks.
Tried to rewrite heap.inc
, mostly to get rid of this killer issue. Main changes are:
-
Fixed OS chunks don’t require expensive “formatting” before their use or expensive “cleanup” afterwards. This fundamentally eliminates the problem above; both freeing and reusing OS chunks have now only a trivial cost and are performed freely.
-
By default, fixed sizes are:
┌──── step = 16 ────┐┌─── step = 32 ────┐┌──── step = 48 ───┐┌ step 64 ┐
16, 32, 48, 64, 80, 96, 128, 160, 192, 224, 272, 320, 368, 416, 480, 544
Trunk uses constant step
up to 512 + step
: on 32-bit (step
= 16),
16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512, 528
and on 64-bit (step
= 32),
32, 64, 96, 128, 160, 192, 224, 256, 288, 320, 352, 384, 416, 448, 480, 512, 544
I think the 32-bit case is definitely too extreme, while 64-bit is not granulated enough for small sizes. In the absolutely worst case, my version might use 11% (64-bit, 320 / 288) ~ 14% (32-bit, 128 / 112) more memory, but in practice, my version is often better because of fewer sizes (32-bit) or better granulation of small sizes (64-bit). Mind you, each size is a separate collection of half-filled OS chunks; do you really want to differentiate between 432 and 448 with default MaxKeptOSChunks = 4
?
At the same time, these sizes are easily adjustable, so I left the possibility to get the trunk behavior by setting FixedSizesVer = 2
.
-
Header size is 4 bytes even on 64-bit; more than enough for fixed chunks (I thought about making it 3-byte...), and variable chunks store high 32 bits separately.
-
FreeMem
on an orphaned OS chunk transfers the entire chunk to the calling thread. This is tailored to the following case: dead thread was a worker, and one that fired theFreeMem
request is logically the new owner of its data, which I find extremely plausible.
Adding some QueryPerformanceCounter
s and inc
s gives these results (without inline
s for fairness; measured until the final end.
):
- Individual FPC invocations with FPCUpDeluxe might look not that good at all, e.g.
ppc3.exe ... pp.pas
x86-64/win64 before after
GetMem fixed: 0.48 s 0.47 s (20.2M)
FreeMem fixed: 0.29 s 0.26 s (20.2M)
GetMem var: 90 ms 45 ms (200~190K)
FreeMem var: 13 ms 12 ms (200~190K)
SysOSAlloc: 1535 1292
SysOSFree: 1208 1074
MaxHeapUsed: 290M 270M
MaxHeapSize: 295M 278M
remove_freed_fixed_chunks: 1.8M
i386/win32 before after
GetMem fixed: 0.37 s 0.41 s (19.8M)
FreeMem fixed: 0.25 s 0.24 s (19.8M)
GetMem var: 38 ms 20 ms (120K)
FreeMem var: 7 ms 6 ms (120K)
SysOSAlloc: 1394 916
SysOSFree: 1219 761
MaxHeapUsed: 160M 160M
MaxHeapSize: 168M 165M
remove_freed_fixed_chunks: 2M
Sums over all invocations (before packages; >120) look nicer though:
x86-64/win64 before after
GetMem fixed: 3.7 s 2.5 s (195M~196M)
FreeMem fixed: 1.5 s 1.2 s (195M~196M)
GetMem var: 0.74 s 0.49 s (2.8M~2.6M)
FreeMem var: 0.23 s 0.17 s (2.8M~2.6M)
SysOSAlloc: 59K 48K
SysOSFree: 53K 44K
MaxHeapUsed: 6.5G 6.1G
MaxHeapSize: 6.8G 6.4G
remove_freed_fixed_chunks: 50M
i386/win32 before after
GetMem fixed: 2.5 s 2.2 s (191M)
FreeMem fixed: 1.1 s 0.9 s (191M)
GetMem var: 309 ms 195 ms (1.4M)
FreeMem var: 80 ms 55 ms (1.4M)
SysOSAlloc: 60K 36K
SysOSFree: 55K 32K
MaxHeapUsed: 3.7G 3.8G
MaxHeapSize: 4G 4G
remove_freed_fixed_chunks: 50M
(remove_freed_fixed_chunks
is the amount of iterations of its inner loop, or the amount of work wasted on reformatting fixed chunks.)
- Results on
ucd_pack.pas
/ucd_pack_c.pas
from !179. They inadvertently hit the #40447 case badly, but unlike #40447 are helped by increasingMaxKeptOSChunks
even in trunk:
ucd_pack_c.pas
x86-64/win64 before after
MaxKeptOSChunks = 4 8 12 16 4 8
GetMem fixed: 0.26 s 141 ms 141 ms 130 ms 153 ms 146 ms
FreeMem fixed: 142 ms 96 ms 96 ms 92 ms 96 ms 92 ms
GetMem var: 51 ms ← -"- ← -"- ← -"- 33 ms ← -"-
FreeMem var: 16 ms ← -"- ← -"- ← -"- 15 ms ← -"-
SysOSAlloc: 2070 542 167 64 3877 35
SysOSFree: 2057 524 142 34 3864 18
MaxHeapUsed: 17.6M ← -"- ← -"- ← -"- 17.6M ← -"-
MaxHeapSize: 19.3M 20.6M ← -"- ← -"- 19.2M ← -"-
remove_freed_fixed_chunks: 4.2M 915K 99K 9.3K
i386/win32 before after
MaxKeptOSChunks = 4 8 12 16 4 8
GetMem fixed: 0.86 s 0.49 s 142 ms 130 ms 161 ms 153 ms
FreeMem fixed: 0.35 s 0.22 s 102 ms 92 ms 110 ms 106 ms
GetMem var: 46 ms ← -"- ← -"- ← -"- 28 ms ← -"-
FreeMem var: 15 ms ← -"- ← -"- ← -"- 13 ms ← -"-
SysOSAlloc: 27.6K 7076 1000 64 6337 33
SysOSFree: 27.6K 7059 980 34 6326 18
MaxHeapUsed: 12.9M ← -"- ← -"- ← -"- 12.9M ← -"-
MaxHeapSize: 14.5M 15.1M 16.3M ← -"- 14.9M ← -"-
remove_freed_fixed_chunks: 45M 12.8M 814K 106K
ucd_pack.pas
x86-64/win64 before after
MaxKeptOSChunks = 4 8 4 8
GetMem fixed: 14 ms 12 ms 17 ms 16 ms
FreeMem fixed: 11 ms 10 ms 13 ms 11 ms
GetMem var: 22 ms ← -"- 16 ms 15 ms
FreeMem var: 10 ms ← -"- 9 ms 8 ms
SysOSAlloc: 114 61 1670 32
SysOSFree: 101 44 1658 16
MaxHeapUsed: 16M ← -"- 16M ← -"-
MaxHeapSize: 17.6M 19.2M 17.9M ← -"-
remove_freed_fixed_chunks: 15.3K 5248
i386/win32 before after
MaxKeptOSChunks = 4 8 12 16 4 8
GetMem fixed: 0.5 s 54 ms 16 ms ← -"- 18 ms 17 ms
FreeMem fixed: 158 ms 24 ms 11 ms ← -"- 15 ms 12 ms
GetMem var: 24 ms ← -"- ← -"- ← -"- 16 ms ← -"-
FreeMem var: 9 ms ← -"- ← -"- ← -"- 9 ms 8 ms
SysOSAlloc: 9026 1230 351 63 2783 29
SysOSFree: 9013 1212 329 37 2773 15
MaxHeapUsed: 12.9M ← -"- ← -"- ← -"- 12.9M ← -"-
MaxHeapSize: 14.3M 14.9M 16.5M ← -"- 14.6M 15.2M
remove_freed_fixed_chunks: 11.8M 922K 84K 5971
I had some further ideas, like red-black tree for variable freelist and faster orphaning and adopting of variable chunks, but let’s start with this one.