Skip to content

heap.inc with incremental formatting and instant recycling of fixed chunks.

Rika requested to merge runewalsh/source:heap into main

Tried to rewrite heap.inc, mostly to get rid of this killer issue. Main changes are:

  • Fixed OS chunks don’t require expensive “formatting” before their use or expensive “cleanup” afterwards. This fundamentally eliminates the problem above; both freeing and reusing OS chunks have now only a trivial cost and are performed freely.

  • By default, fixed sizes are:

┌──── step = 16 ────┐┌─── step = 32 ────┐┌──── step = 48 ───┐┌ step 64 ┐
16, 32, 48, 64, 80, 96, 128, 160, 192, 224, 272, 320, 368, 416, 480, 544

Trunk uses constant step up to 512 + step: on 32-bit (step = 16),

16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512, 528

and on 64-bit (step = 32),

32, 64, 96, 128, 160, 192, 224, 256, 288, 320, 352, 384, 416, 448, 480, 512, 544

I think the 32-bit case is definitely too extreme, while 64-bit is not granulated enough for small sizes. In the absolutely worst case, my version might use 11% (64-bit, 320 / 288) ~ 14% (32-bit, 128 / 112) more memory, but in practice, my version is often better because of fewer sizes (32-bit) or better granulation of small sizes (64-bit). Mind you, each size is a separate collection of half-filled OS chunks; do you really want to differentiate between 432 and 448 with default MaxKeptOSChunks = 4?

At the same time, these sizes are easily adjustable, so I left the possibility to get the trunk behavior by setting FixedSizesVer = 2.

  • Header size is 4 bytes even on 64-bit; more than enough for fixed chunks (I thought about making it 3-byte...), and variable chunks store high 32 bits separately.

  • FreeMem on an orphaned OS chunk transfers the entire chunk to the calling thread. This is tailored to the following case: dead thread was a worker, and one that fired the FreeMem request is logically the new owner of its data, which I find extremely plausible.


Adding some QueryPerformanceCounters and incs gives these results (without inlines for fairness; measured until the final end.):

  • Individual FPC invocations with FPCUpDeluxe might look not that good at all, e.g.

ppc3.exe ... pp.pas

x86-64/win64                before       after

GetMem fixed:               0.48 s       0.47 s   (20.2M)
FreeMem fixed:              0.29 s       0.26 s   (20.2M)
GetMem var:                   90 ms        45 ms  (200~190K)
FreeMem var:                  13 ms        12 ms  (200~190K)
SysOSAlloc:                 1535         1292
SysOSFree:                  1208         1074
MaxHeapUsed:                290M         270M
MaxHeapSize:                295M         278M
remove_freed_fixed_chunks:  1.8M

i386/win32                  before       after

GetMem fixed:               0.37 s       0.41 s   (19.8M)
FreeMem fixed:              0.25 s       0.24 s   (19.8M)
GetMem var:                   38 ms        20 ms  (120K)
FreeMem var:                   7 ms         6 ms  (120K)
SysOSAlloc:                 1394          916
SysOSFree:                  1219          761
MaxHeapUsed:                160M         160M
MaxHeapSize:                168M         165M
remove_freed_fixed_chunks:    2M

Sums over all invocations (before packages; >120) look nicer though:

x86-64/win64                before       after
									  
GetMem fixed:                3.7 s        2.5 s  (195M~196M)
FreeMem fixed:               1.5 s        1.2 s  (195M~196M)
GetMem var:                 0.74 s       0.49 s  (2.8M~2.6M)
FreeMem var:                0.23 s       0.17 s  (2.8M~2.6M)
SysOSAlloc:                  59K          48K
SysOSFree:                   53K          44K
MaxHeapUsed:                6.5G         6.1G
MaxHeapSize:                6.8G         6.4G
remove_freed_fixed_chunks:   50M

i386/win32                  before       after

GetMem fixed:                2.5 s        2.2 s  (191M)
FreeMem fixed:               1.1 s        0.9 s  (191M)
GetMem var:                  309 ms       195 ms (1.4M)
FreeMem var:                  80 ms        55 ms (1.4M)
SysOSAlloc:                  60K          36K
SysOSFree:                   55K          32K
MaxHeapUsed:                3.7G         3.8G
MaxHeapSize:                  4G           4G
remove_freed_fixed_chunks:   50M

(remove_freed_fixed_chunks is the amount of iterations of its inner loop, or the amount of work wasted on reformatting fixed chunks.)

  • Results on ucd_pack.pas / ucd_pack_c.pas from !179. They inadvertently hit the #40447 case badly, but unlike #40447 are helped by increasing MaxKeptOSChunks even in trunk:

ucd_pack_c.pas

x86-64/win64                                before                          after

MaxKeptOSChunks =                4        8       12       16            4        8  
                                                                     
GetMem fixed:                 0.26 s    141 ms   141 ms   130 ms       153 ms   146 ms
FreeMem fixed:                 142 ms    96 ms    96 ms    92 ms        96 ms    92 ms
GetMem var:                     51 ms  ← -"-    ← -"-    ← -"-          33 ms  ← -"- 
FreeMem var:                    16 ms  ← -"-    ← -"-    ← -"-          15 ms  ← -"- 
SysOSAlloc:                   2070      542      167       64         3877       35  
SysOSFree:                    2057      524      142       34         3864       18  
MaxHeapUsed:                 17.6M     ← -"-    ← -"-    ← -"-       17.6M     ← -"- 
MaxHeapSize:                 19.3M    20.6M     ← -"-    ← -"-       19.2M     ← -"- 
remove_freed_fixed_chunks:    4.2M     915K      99K     9.3K

i386/win32                                  before                          after

MaxKeptOSChunks =                4        8       12       16            4        8  
                                                                     
GetMem fixed:                 0.86 s   0.49 s    142 ms   130 ms       161 ms   153 ms
FreeMem fixed:                0.35 s   0.22 s    102 ms    92 ms       110 ms   106 ms
GetMem var:                     46 ms  ← -"-    ← -"-    ← -"-          28 ms  ← -"- 
FreeMem var:                    15 ms  ← -"-    ← -"-    ← -"-          13 ms  ← -"- 
SysOSAlloc:                  27.6K     7076     1000       64         6337       33  
SysOSFree:                   27.6K     7059      980       34         6326       18  
MaxHeapUsed:                 12.9M     ← -"-    ← -"-    ← -"-       12.9M     ← -"- 
MaxHeapSize:                 14.5M    15.1M    16.3M     ← -"-       14.9M     ← -"- 
remove_freed_fixed_chunks:     45M    12.8M     814K     106K

ucd_pack.pas

x86-64/win64                        before                   after
												   
MaxKeptOSChunks =                4         8              4         8  
                                                      
GetMem fixed:                   14 ms     12 ms          17 ms     16 ms
FreeMem fixed:                  11 ms     10 ms          13 ms     11 ms
GetMem var:                     22 ms   ← -"-            16 ms     15 ms
FreeMem var:                    10 ms   ← -"-             9 ms      8 ms
SysOSAlloc:                    114        61           1670        32  
SysOSFree:                     101        44           1658        16  
MaxHeapUsed:                   16M      ← -"-           16M      ← -"- 
MaxHeapSize:                 17.6M     19.2M          17.9M      ← -"- 
remove_freed_fixed_chunks:   15.3K      5248

i386/win32                                   before                           after

MaxKeptOSChunks =                4         8        12       16            4        8  
                                                                       
GetMem fixed:                  0.5 s      54 ms     16 ms  ← -"-          18 ms    17 ms
FreeMem fixed:                 158 ms     24 ms     11 ms  ← -"-          15 ms    12 ms
GetMem var:                     24 ms   ← -"-     ← -"-    ← -"-          16 ms   ← -"- 
FreeMem var:                     9 ms   ← -"-     ← -"-    ← -"-           9 ms     8 ms
SysOSAlloc:                   9026      1230       351       63         2783        29  
SysOSFree:                    9013      1212       329       37         2773        15  
MaxHeapUsed:                 12.9M      ← -"-     ← -"-    ← -"-       12.9M      ← -"- 
MaxHeapSize:                 14.3M     14.9M     16.5M     ← -"-       14.6M      15.2M 
remove_freed_fixed_chunks:   11.8M      922K       84K     5971

I had some further ideas, like red-black tree for variable freelist and faster orphaning and adopting of variable chunks, but let’s start with this one.

Edited by Rika

Merge request reports