Clarify how to handle Unicode Byte Order Mark (BOM) characters
A UTF-8 encoded text file may or may not begin with a Byte Order Mark (BOM: the character U+FEFF, zero-width no-break space, encoded in UTF-8 as the bytes EF BB BF), depending on the software that produced the file.
If this is a text/gemini file, and the contents of the file are sent verbatim to a client, the client may fail to detect the type of the first line. This has been demonstrated to occur in some clients. Therefore, the protocol specification or best practices documentation should provide some guidance as to how to handle the BOM in a consistent manner.
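To illustrate the failure mode, here is a minimal sketch in Python. The `line_type` classifier is a simplified, hypothetical stand-in for a client's text/gemini parser (real clients differ), and the sample document is made up for the example.

```python
BOM = "\ufeff"  # U+FEFF as a decoded character


def line_type(line: str) -> str:
    """Classify a text/gemini line by its leading characters (simplified)."""
    if line.startswith("```"):
        return "preformat toggle"
    if line.startswith("=>"):
        return "link"
    if line.startswith("#"):
        return "heading"
    if line.startswith("* "):
        return "list item"
    if line.startswith(">"):
        return "quote"
    return "text"


def strip_leading_bom(text: str) -> str:
    """Drop a single leading U+FEFF, if present."""
    return text[len(BOM):] if text.startswith(BOM) else text


# A document saved with a UTF-8 BOM (bytes EF BB BF) and served verbatim.
raw = b"\xef\xbb\xbf# Welcome\n=> gemini://example.org/ Example\n"

# Python's plain "utf-8" codec keeps the BOM as U+FEFF in the decoded text.
first_line = raw.decode("utf-8").splitlines()[0]

print(line_type(first_line))                     # text
print(line_type(strip_leading_bom(first_line)))  # heading
```

Because the invisible U+FEFF precedes the `#`, the first line no longer starts with the heading marker and falls through to plain text.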
The possible options that I can see are (in order of decreasing strictness):
- The server MUST remove the BOM when serving UTF-8 content that begins with a BOM. [a client-side hash/checksum would be different than the original one]
- The client MUST ignore the BOM when receiving UTF-8 content from the server that begins with a BOM.
- (Best practices) the client SHOULD ignore the BOM when receiving UTF-8 content from the server that begins with a BOM.
- Consider this out-of-scope for the protocol itself. Recommend that UTF-8 text/gemini should not include a BOM.
- Clients MUST always treat U+FEFF as a zero-width no-break space (ZWNBSP, i.e., a word joiner), even when it is present at the start of the file (see comment below).
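As a rough sketch of the first option (server-side stripping) and its checksum caveat, the following Python uses only the standard library. The function name `strip_utf8_bom` and the sample content are assumptions made for illustration, not part of any existing server.

```python
import codecs
import hashlib


def strip_utf8_bom(body: bytes) -> bytes:
    """Return body without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if body.startswith(codecs.BOM_UTF8):
        return body[len(codecs.BOM_UTF8):]
    return body


# Hypothetical file content as stored on disk, with a BOM.
original = codecs.BOM_UTF8 + "# Title\n".encode("utf-8")
served = strip_utf8_bom(original)

# Stripping is idempotent: content without a BOM passes through unchanged.
assert served == b"# Title\n"
assert strip_utf8_bom(served) == served

# The caveat: a digest computed over the served bytes no longer
# matches a digest computed over the original file.
digest_original = hashlib.sha256(original).hexdigest()
digest_served = hashlib.sha256(served).hexdigest()
print(digest_original == digest_served)  # False
```

A client implementing one of the "ignore" options could apply the same check to the received body, or to the decoded text, before parsing the first line.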
Points to consider:
- From a user's point of view, presence of the BOM is often unclear as it is essentially an invisible character in the document.
- The BOM carries no byte-order information in UTF-8, which has a fixed byte order, so it's safe to ignore it.
- Unicode libraries used by a server or a client may or may not ignore/strip the BOM automatically. To minimize this unpredictable behavior, mandating that the BOM be stripped as early as possible is preferable, so that all other components/parties don't have to worry about it. Skipping the BOM would be an ugly wrinkle to mandate in all text/gemini parsers...
- Recommendation in RFC 3629:
A protocol SHOULD also forbid use of U+FEFF as a signature for those textual protocol elements for which the protocol provides character encoding identification mechanisms, when it is expected that implementations of the protocol will be in a position to always use the mechanisms properly.
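The point above about Unicode libraries behaving differently can be seen within a single standard library. As one example, Python ships two UTF-8 codecs: `utf-8` keeps a leading BOM as U+FEFF in the decoded text, while `utf-8-sig` removes it. The sample bytes below are made up for illustration.

```python
with_bom = b"\xef\xbb\xbf# Title\n"
without_bom = b"# Title\n"

kept = with_bom.decode("utf-8")              # BOM survives as U+FEFF
dropped = with_bom.decode("utf-8-sig")       # leading BOM is removed
unchanged = without_bom.decode("utf-8-sig")  # no BOM: decodes as usual

print(kept.startswith("\ufeff"))     # True
print(dropped.startswith("\ufeff"))  # False
print(dropped == unchanged)          # True
```

Which codec an implementation happens to use therefore decides whether its parser ever sees the BOM, which is the unpredictability that explicit guidance would remove.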