Handle multilingual content (xml:lang) correctly
After some discussion in #78 (closed), there was debate on the channel about how to handle the xml:lang attribute.
As per RFC 6120, the XMPP stream SHOULD declare a language via xml:lang. This could be useful for indicating which languages you can accept when receiving automated responses (e.g. error messages or bot messages).
However, I could not find any information about xml:lang being required on <message> stanzas. If it is set, it is inherited by default by all child elements that don't have an xml:lang attribute of their own, as per the XML 1.0 spec:
The language specified by xml:lang applies to the element where it is specified (including the values of its attributes), and to all elements in its content unless overridden with another instance of xml:lang. In particular, the empty value of xml:lang is used on an element B to override a specification of xml:lang on an enclosing element A, without specifying another language. Within B, it is considered that there is no language information available, just as if xml:lang had not been specified on B or any of its ancestors. Applications determine which of an element's attribute values and which parts of its character content, if any, are treated as language-dependent values described by xml:lang.
So when parsing a message:
- no xml:lang: the language is inherited from the stream if any, not set otherwise
- xml:lang="": the language for this element is unknown
- xml:lang="LANG": the language for this element is known and is LANG
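The three cases above can be sketched with Python's standard xml.etree.ElementTree. This is only an illustration, not a proposed implementation: resolve_lang and body_langs are hypothetical names, the stream language is assumed to be passed in by the caller, and None stands for "no language information" (which, per the XML 1.0 quote above, covers both the empty value and the case where nothing is set anywhere).

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def resolve_lang(element, inherited=None):
    """Return the effective language of `element`, or None if unknown."""
    value = element.get(XML_LANG)
    if value is None:
        # No xml:lang: inherit from the parent (or the stream), if any.
        return inherited
    if value == "":
        # Empty xml:lang: explicitly drop any inherited language.
        return None
    return value


def body_langs(stanza_xml, stream_lang=None):
    """List (language, text) pairs for each <body> child of a message."""
    message = ET.fromstring(stanza_xml)
    message_lang = resolve_lang(message, stream_lang)
    return [
        (resolve_lang(child, message_lang), child.text)
        for child in message
        # Compare the local name so a namespaced tag also matches.
        if child.tag.rsplit("}", 1)[-1] == "body"
    ]


stanza = (
    '<message xml:lang="en">'
    "<body>Hello</body>"
    '<body xml:lang="fr">Bonjour</body>'
    '<body xml:lang="">???</body>'
    "</message>"
)
print(body_langs(stanza))
```

Here the first body inherits "en" from the message, the second overrides it with "fr", and the third ends up with no language at all.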
However, when receiving a message from another client, the sender's stream element (which is a c2s protocol element) is not there to inherit the lang attribute from. Instead, as per the XMPP spec, the sender's server sets the lang attribute on the stanza (example: Prosody) according to the c2s stream.
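As a rough sketch of that server-side behaviour, assuming the server only adds the attribute when the stanza carries none and never rewrites an existing one (my reading of RFC 6120, not something checked against Prosody's code), the logic would look something like this. stamp_stream_lang is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def stamp_stream_lang(stanza, stream_lang):
    """Add the c2s stream language to a stanza that has no xml:lang.

    Assumption: an existing xml:lang, including the empty value,
    is left untouched.
    """
    if stream_lang and XML_LANG not in stanza.attrib:
        stanza.set(XML_LANG, stream_lang)
    return stanza


plain = ET.fromstring("<message><body>hi</body></message>")
unknown = ET.fromstring('<message xml:lang=""><body>hi</body></message>')

print(repr(stamp_stream_lang(plain, "en").get(XML_LANG)))
print(repr(stamp_stream_lang(unknown, "en").get(XML_LANG)))
```

Under that assumption, a stanza sent without xml:lang gets the stream language stamped on it, while one sent with xml:lang="" keeps its empty value.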
So it appears the only way for a client to properly avoid saying what language a message is in (because it has no clue) is to use xml:lang="". However, it's not clear how automated systems that want to know which languages you accept could fetch your preferred language (the stream lang) in that case, leaving two possibilities:
- the automated system sends the message in as many locales as possible, letting the client use the one it pleases (unnecessary network bloat)
- fetch the language another way, for example via the LANG key in the vCard RFC
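For the first possibility, the client-side selection could look roughly like the sketch below. pick_body is a hypothetical helper, the (language, text) pairs are assumed to come from whatever parsing the client already does, and matching is exact and case-insensitive only (no "en-GB" to "en" fallback).

```python
def pick_body(bodies, preferred):
    """Pick the text of the first body matching a preferred language.

    bodies: list of (language or None, text) pairs.
    preferred: language tags, most wanted first.
    """
    for wanted in preferred:
        for lang, text in bodies:
            if lang is not None and lang.lower() == wanted.lower():
                return text
    # Nothing matched: fall back to the first body, if there is one.
    return bodies[0][1] if bodies else None


bodies = [("en", "Hello"), ("fr", "Bonjour")]
print(pick_body(bodies, ["fr", "en"]))
```

This keeps the choice entirely on the receiving side, at the cost of the sender shipping every translation it has.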