"if item in fieldset" should avoid generating BT [mem]
## Summary

On x86_64 (and maybe on i386), "if item in struct.fieldset" generates a BT [mem] for a 32-bit set, but it is 10 times slower than BT reg on modern CPUs. Smaller sets use a register, and the compiler should do the same for sets up to the register size.

## System Information

- **Operating system:** ANY
- **Processor architecture:** x86-64 (and maybe i386)
- **Compiler version:** ANY

## Steps to reproduce

For a set stored in a 32-bit variable/register:

```
type
  TMyItem = (s1, s2, s3, s4, s5, s6, s7, s8, s9, s10,
             s11, s12, s13, s14, s15, s16, s17, s18, s19, s20);
  TMySet = set of TMyItem;
  TMyClass = class
    s: TMySet;
  end;

procedure test(c: TMyClass; s: TMyItem);
begin
  if s in c.s then
    exclude(c.s, s);
end;
```

generates in -O3

```
00000000004BCCD0 4889F8                   mov rax,rdi
if s in c.s then
00000000004BCCD3 400FB6D6                 movzx edx,sil
00000000004BCCD7 0FA35008                 bt [rax+$08],edx
00000000004BCCDB 730A                     jnb +$0A
exclude(c.s, s);
00000000004BCCDD 81E6FF000000             and esi,$000000FF
00000000004BCCE3 0FB37008                 btr [rax+$08],esi
end;
00000000004BCCE7 C3                       ret
```

According to Agner Fog's instruction tables, BT [mem],reg takes 10 unfused micro-ops instead of 1 for BT reg,reg (figures here are for a recent Intel CPU; it is somewhat better on AMD, but still 7 instead of 1):

```
BT    r,r/i    1    1    p06        1    0.5
BT    m,r     10   10                    5
BT    m,i      2    2    p06 p23         0.5
```

That is a lot for a very common instruction.

Note that for a constant/immediate item (`if s2 in c.s`) the compiler generates TEST [mem],imm instead, which has a latency of 1 cycle, and is shorter and optimal:

```
TEST  r,r/i    1    1    p0156       1    0.25
TEST  m,r/i    1    2    p0156 p23   1    0.5
```

## Possible fixes

We could use a temporary register, as the compiler currently does e.g. with a set of 7 items:

```
00000000004BCCD7 0FB64808                 movzx ecx,byte ptr [rax+$08]
00000000004BCCDB 0FA3D1                   bt ecx,edx
```

This is fine if the item is in a register/variable.
So for our 32-bit set, the compiler could generate something like:

```
mov ecx,dword ptr [rax+$08]
bt ecx,edx
```

We could just extend this optimization to all sets of size 1, 2, 4 (and 8 on CPU64) bytes. Of course, sets larger than the pointer size could continue to use BT [mem],reg.

**Side note:** `include/exclude` could also benefit from using a transient register, but it is perhaps of lower priority, since algorithms usually read the set more often than they write it. So instead of:

```
00000000004BCCDD 81E6FF000000             and esi,$000000FF
00000000004BCCE3 0FB37008                 btr [rax+$08],esi
```

we could generate:

```
mov ecx,dword ptr [rax+$08]
btr ecx,esi
mov dword ptr [rax+$08],ecx
```

since the memory,register forms of BTR/BTS/BTC are similarly expensive, at 10 or more micro-ops:

```
BTR BTS BTC   r,r/i    1    1    p06             1    0.5
BTR BTS BTC   m,r     10   11                         5
BTR BTS BTC   m,i      3    4    p06 p4 p23           1
```
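Until the code generator does this itself, a source-level workaround might be to copy the set field into a local variable before testing it. The sketch below is only an illustration of that idea: `TestViaLocal` and `tmp` are hypothetical names not taken from the report, and whether the compiler actually keeps `tmp` in a register and emits BT/BTR reg,reg is an assumption that depends on the compiler version and optimization level, so the generated assembly should be checked.

```pascal
program SetInRegisterSketch;
{$mode objfpc}

type
  TMyItem = (s1, s2, s3, s4, s5, s6, s7, s8, s9, s10,
             s11, s12, s13, s14, s15, s16, s17, s18, s19, s20);
  TMySet = set of TMyItem;
  TMyClass = class
    s: TMySet;
  end;

// Hypothetical workaround: read the field once into a local,
// test and update the local copy, then write it back.
// Assumption: at -O2/-O3 the local may be kept in a register,
// so the bit test no longer has a memory operand.
procedure TestViaLocal(c: TMyClass; item: TMyItem);
var
  tmp: TMySet;
begin
  tmp := c.s;               // single load of the 32-bit set
  if item in tmp then
  begin
    Exclude(tmp, item);     // modifies the local copy only
    c.s := tmp;             // single store back to the field
  end;
end;

var
  c: TMyClass;
begin
  c := TMyClass.Create;
  try
    c.s := [s2, s5];
    TestViaLocal(c, s5);
    // Sanity check: s5 must have been removed, s2 kept.
    if c.s = [s2] then
      WriteLn('ok')
    else
      WriteLn('unexpected set content');
  finally
    c.Free;
  end;
end.
```

This has the same load / bit operation / store shape as the sequence proposed above for `exclude`, only written by hand in the source rather than produced by the compiler.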