reference-counted strings put the HTML bit in the middle of the reference count
I noticed this a long time ago and it has been on my list of things to fix, but I thought I should actually document the situation.
lib/cgraph/refstr.c implements reference-counted strings with the structure refstr_t. Since the initial commit of cgraph, d7767d4b, HTML_BIT has been (IMHO incorrectly) set based on unsigned int instead of unsigned long. Presumably this was not an issue because most machines were 32-bit x86 at the time, where sizeof(unsigned int) == sizeof(unsigned long).
In the transition to x86-64 machines, this setup became awkward. The reference count is 8 bytes wide, but HTML_BIT is set to be bit 31. That is, the bit indicating that a string is HTML is now in the middle of the reference count. A string whose reference count reaches 2³¹, something that is possible on an x86-64 machine, now accidentally sets the HTML bit. Note that this ecosystem transition also made the comment about HTML_BIT, /* msbit of unsigned long */, incorrect.
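To make the collision concrete, here is a minimal sketch in C. It is not the code from refstr.c: packed_refstr_t, packed_is_html, and packed_add_ref are hypothetical names invented for illustration, and the HTML_BIT definition is a stand-in that assumes unsigned int is 32 bits wide.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the flag described above: bit 31, derived from
 * the width of unsigned int (assumed to be 32 bits here). */
#define HTML_BIT (UINT64_C(1) << (sizeof(unsigned int) * CHAR_BIT - 1))

/* Hypothetical simplified string header: count and flag share one field. */
typedef struct {
  uint64_t refcnt; /* reference count with HTML_BIT mixed in */
} packed_refstr_t;

/* Does the string claim to be HTML? */
static bool packed_is_html(const packed_refstr_t *s) {
  return (s->refcnt & HTML_BIT) != 0;
}

/* Take one more reference to the string. */
static void packed_add_ref(packed_refstr_t *s) {
  assert(s->refcnt < UINT64_MAX && "reference count overflow");
  ++s->refcnt;
}
```

Starting from a count of 2³¹ - 1, where packed_is_html returns false, a single packed_add_ref call carries into bit 31 and packed_is_html starts returning true, even though nothing ever marked the string as HTML.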
d1244c80 did a tree-wide replacement of unsigned long with uint64_t, which further compounded the situation. Now the HTML bit is in the incorrect position on both 32-bit and 64-bit x86.
It's debatable whether this bug has any effect or is latent, because it's effectively masked by another bug, !1857 (merged). Nevertheless, I think we should fix it.
The cleanest solution is to stop storing two things in one field. We can use a bitfield to implement the correct semantics in a safer way, making the compiler do the shifting and masking for us:
uint64_t refcnt: sizeof(uint64_t) * 8 - 1;
uint64_t is_html: 1;
It is questionable whether this even needs to be a bitfield, and not simply a uint64_t and a bool, but I think we should leave this space optimization in place for now.
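As a rough sketch of how the bitfield version behaves, the two members proposed above can be wrapped in a struct. bitfield_refstr_t and bitfield_add_ref are hypothetical names for illustration only, and the real refstr_t has further members. Note that bit-fields of type uint64_t are an implementation-defined extension in standard C, though GCC, Clang, and MSVC all accept them.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simplified header; the real refstr_t has further members. */
typedef struct {
  uint64_t refcnt : sizeof(uint64_t) * 8 - 1; /* 63-bit reference count */
  uint64_t is_html : 1;                       /* separate 1-bit flag */
} bitfield_refstr_t;

/* Take one more reference; the compiler confines the update to refcnt. */
static void bitfield_add_ref(bitfield_refstr_t *s) {
  /* Largest value the 63-bit count can hold. */
  const uint64_t max = (UINT64_C(1) << 63) - 1;
  assert(s->refcnt < max && "reference count overflow");
  ++s->refcnt;
}
```

With this layout, incrementing a count of 2³¹ - 1 yields 2³¹ and leaves is_html untouched, because the flag no longer shares bits with the count.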