mbrtowc UTF-8 decoding and invalid sequences

Check Sortix for mbrtowc calls that can go beyond the end of a buffer when an invalid UTF-8 sequence ends with a zero byte: the zero byte is treated as part of the invalid sequence, so the terminator is skipped and decoding continues past it. Possibly the right method is to memset ps to zero on error, then, if the offending byte wasn't the first of a sequence, restart mbrtowc from that byte onwards. Inspect all Sortix code and possibly convert it to this pattern. Without these changes, there is both the bug of dropping the next valid character after an invalid UTF-8 sequence and a risk of buffer overflow if a zero byte ends an invalid UTF-8 sequence. This ticket only applies to code that tries to recover from an invalid sequence and keep decoding, rather than erroring out. It might also be good to make the insertion of U+FFFD REPLACEMENT CHARACTER consistent while here.
