"if item in fieldset" should avoid generating BT [mem]
## Summary
On x86_64 (and maybe on i386), `if item in struct.fieldset` generates a BT [mem],reg for a 32-bit set, which is about 10 times slower than BT reg,reg on modern CPUs.
Smaller sets are loaded into a register first, and the compiler should do the same for all sets up to the register size.
## System Information
- **Operating system:** ANY
- **Processor architecture:** x86-64 (and maybe i386)
- **Compiler version:** ANY
## Steps to reproduce
For a set stored in a 32-bit variable/register:
```
type
  TMyItem = (s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14, s15, s16, s17, s18, s19, s20);
  TMySet = set of TMyItem;
  TMyClass = class
    s: TMySet;
  end;

procedure test(c: TMyClass; s: TMyItem);
begin
  if s in c.s then
    exclude(c.s, s);
end;
```
With -O3, this generates:
```
00000000004BCCD0 4889F8 mov rax,rdi
if s in c.s then
00000000004BCCD3 400FB6D6 movzx edx,sil
00000000004BCCD7 0FA35008 bt [rax+$08],edx
00000000004BCCDB 730A jnb +$0A
exclude(c.s, s);
00000000004BCCDD 81E6FF000000 and esi,$000000FF
00000000004BCCE3 0FB37008 btr [rax+$08],esi
end;
00000000004BCCE7 C3 ret
```
According to Agner Fog's instruction tables, BT [mem],reg takes 10 unfused micro-ops instead of 1 for BT reg,reg. The figures below are for a recent Intel CPU; it is somewhat better on AMD, but still 7 instead of 1:
```
BT r,r/i 1 1 p06 1 0.5
BT m,r 10 10 5
BT m,i 2 2 p06 p23 0.5
```
That is a lot for such a common instruction.
Note that for a constant/immediate item (`if s2 in c.s`) the compiler generates TEST [mem],imm instead, which has a latency of 1 cycle, is shorter, and is optimal:
```
TEST r,r/i 1 1 p0156 1 0.25
TEST m,r/i 1 2 p0156 p23 1 0.5
```
## Possible fixes
We could use a temporary register, as the compiler currently does e.g. with a set of 7 items:
```
00000000004BCCD7 0FB64808 movzx ecx,byte ptr [rax+$08]
00000000004BCCDB 0FA3D1 bt ecx,edx
```
This is fine when the item is in a register/variable.
So for our 32-bit set, the compiler could generate something like:
```
mov ecx,dword ptr [rax+$08]
bt ecx,edx
```
We could just extend this optimization to all sets of 1, 2 or 4 bytes (and 8 bytes on CPU64).
Of course, sets larger than the pointer size could continue to use BT [mem],reg.
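Until the code generator does this itself, the same load-then-test pattern can be attempted at source level. The following is only a minimal sketch: `TestViaLocal` and `tmp` are hypothetical names, and whether the compiler actually keeps `tmp` in a register (and emits BT reg,reg) is an assumption that depends on the compiler version and optimization level, so the disassembly should be checked.

```pascal
program SetInDemo;

{$mode objfpc}

type
  TMyItem = (s1, s2, s3, s4, s5, s6, s7, s8, s9, s10,
             s11, s12, s13, s14, s15, s16, s17, s18, s19, s20);
  TMySet = set of TMyItem;
  TMyClass = class
    s: TMySet;
  end;

// Hypothetical variant of test(): copy the set field into a local first,
// in the hope that the membership test is done on a register copy
// rather than directly on memory. Not verified against any compiler version.
procedure TestViaLocal(c: TMyClass; item: TMyItem);
var
  tmp: TMySet;
begin
  tmp := c.s;            // single load of the set field
  if item in tmp then    // test performed on the local copy
    Exclude(c.s, item);
end;

var
  c: TMyClass;
begin
  c := TMyClass.Create;
  try
    c.s := [s2, s5];
    TestViaLocal(c, s2);   // s2 is present, so it gets removed
    WriteLn(s2 in c.s);
    WriteLn(s5 in c.s);
  finally
    c.Free;
  end;
end.
```

The behavior is identical to the original `test` procedure; only the generated code may differ.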
**Side note:**
`include/exclude` could also benefit from using a transient register, but this is perhaps of lower priority, since algorithms usually read a set more often than they write to it.
So instead of:
```
00000000004BCCDD 81E6FF000000 and esi,$000000FF
00000000004BCCE3 0FB37008 btr [rax+$08],esi
```
we could generate:
```
mov ecx,dword ptr [rax+$08]
btr ecx,esi
mov dword ptr [rax+$08],ecx
```
This is because BTR/BTS/BTC with a memory operand have about the same cost of 10 unfused micro-ops:
```
BTR BTS BTC r,r/i 1 1 p06 1 0.5
BTR BTS BTC m,r 10 11 5
BTR BTS BTC m,i 3 4 p06 p4 p23 1
```