The TR46 non-transitional preprocessing removes these characters and also several others. RFC 5890 basically defines a 'label' (the parts separated by dots in a domain name) as consisting only of ASCII letters, digits and hyphens. So yes, this is expected behavior with IDN2_NONTRANSITIONAL.
IDN2_TRANSITIONAL would leave those characters in place. This is definitely more backward compatible with IDNA 2003 and with domain names made obsolete by IDNA 2008.
BTW, you can omit IDN2_NFC_INPUT. It is implied by IDN2_NONTRANSITIONAL and IDN2_TRANSITIONAL.
Hm, can you describe where exactly in the RFC this behaviour is described? https://tools.ietf.org/html/rfc5891#section-5.4 gives a list of specific disallowed characters (which "_" is not on afaics), and then says “The string that has now been validated for lookup is converted to ACE form by applying the Punycode algorithm to the string and then adding the ACE prefix ("xn--").”. Nowhere is stripping of characters mentioned.
Are labels which contain underscore the only concern there? There could be a flag which skips these labels for processing, allowing behavior similar to libidn where one could pass resource records similarly to hostnames. A quick and dirty proof of concept is attached.
> Hm, can you describe where exactly in the RFC this behaviour is described?
The RFCs don't specifically say to drop these characters during processing; that's libidn2 behavior. The RFCs define labels as something containing only specific ASCII characters.
So this is a general problem... but I think TR46 proposes a flag for that. Not sure if it is functional (IDN2_ALLOW_UNASSIGNED).
We may want to use this flag then for that. I experimented passing verbatim characters not in a map when this flag is present but got an error later in processing.
Let me try again: it is TR46 that filters the character out. From IdnaMappingTable.txt:
005B..0060 ; disallowed_STD3_valid # 1.1 LEFT SQUARE BRACKET..GRAVE ACCENT
From the spec:
4.1.1 UseSTD3ASCIIRules
If UseSTD3ASCIIRules=false, then the validity tests for ASCII characters are not provided by the table status values, but are implementation-dependent. For example, if an implementation allows the characters [\u002Da-zA-Z0-9] and also the underbar ( _ ), then it needs to use the table values for UseSTD3ASCIIRules=false, and test for any other ASCII characters as part of its validity criteria. These ASCII characters may have resulted from a mapping: for example, a U+005F ( _ ) LOW LINE (underbar) may have originally been a U+FF3F ( ＿ ) FULLWIDTH LOW LINE.
There are a very small number of non-ASCII characters with the data file status disallowed_STD3_valid:
U+2260 ( ≠ ) NOT EQUAL TO
U+226E ( ≮ ) NOT LESS-THAN
U+226F ( ≯ ) NOT GREATER-THAN
Those characters are disallowed with UseSTD3ASCIIRules=true because the set of characters in their canonical decompositions are not entirely in the valid set (Step 7 of the Table Derivation). However, they are allowed with UseSTD3ASCIIRules=false, because the base characters of their canonical decompositions, U+003D ( = ) EQUALS SIGN, U+003C ( < ) LESS-THAN SIGN, and U+003E ( > ) GREATER-THAN SIGN, are each valid under that option. If an implementation uses UseSTD3ASCIIRules=false but disallows any of these three ASCII characters, then it must also disallow the corresponding precomposed character for its negation.
I think we don't have the STD3ASCII flag implemented yet, do we?
We already have these flags (TR46_FLG_DISALLOWED_STD3_VALID and TR46_FLG_DISALLOWED_STD3_MAPPED) in the character map, but just don't expose a flag for them in the API.
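To make the table status concrete, here is a rough sketch of how a disallowed_STD3_valid entry could gate a character depending on UseSTD3ASCIIRules. The table below is a hypothetical excerpt containing only the entries quoted in this thread, and the names `DISALLOWED_STD3_VALID`, `is_std3_gated` and `allowed` are made up for illustration; this is not libidn2's internal representation.

```python
# Hypothetical excerpt of the TR46 mapping table: only the entries quoted
# in this thread, all with status disallowed_STD3_valid.
DISALLOWED_STD3_VALID = [
    (0x005B, 0x0060),  # LEFT SQUARE BRACKET..GRAVE ACCENT (includes "_")
    (0x2260, 0x2260),  # NOT EQUAL TO
    (0x226E, 0x226F),  # NOT LESS-THAN, NOT GREATER-THAN
]


def is_std3_gated(cp: int) -> bool:
    """True if the code point is disallowed_STD3_valid in the excerpt."""
    return any(lo <= cp <= hi for lo, hi in DISALLOWED_STD3_VALID)


def allowed(ch: str, use_std3_ascii_rules: bool) -> bool:
    """Validity of a character covered by the excerpt.

    disallowed_STD3_valid means: disallowed when UseSTD3ASCIIRules=true,
    valid when UseSTD3ASCIIRules=false.
    """
    if is_std3_gated(ord(ch)):
        return not use_std3_ascii_rules
    # Characters outside the excerpt are not modelled in this sketch.
    raise LookupError(f"U+{ord(ch):04X} is not in this excerpt")


print(allowed("_", use_std3_ascii_rules=True))   # False
print(allowed("_", use_std3_ascii_rules=False))  # True
```

As the spec text says, with UseSTD3ASCIIRules=false the caller still has to apply its own validity test to the remaining ASCII characters.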
@keszybz Allowed characters are first defined in RFC952:
A "name" (Net, Host, Gateway, or Domain name) is a text string up to 24 characters drawn from the alphabet (A-Z), digits (0-9), minus sign (-), and period (.). Note that periods are only allowed when they serve to delimit components of "domain style names". (See RFC-921, "Domain Name System Implementation Schedule", for background). No blank or space characters are permitted as part of a name. No distinction is made between upper and lower case. The first character must be an alpha character. The last character must not be a minus sign or period.
RFC1123 also allowed a digit as first character.
AFAIK, this is still true. IDNA transforms international strings/domains into this old naming scheme (doing some processing and then using the punycode_encode algorithm). I wish we could simply use UTF-8 instead.
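For illustration, a minimal sketch of the letter-digit-hyphen rule described above, and of why the Punycode form fits into it. The helper name `is_ldh_label` is hypothetical; the 63-character limit is the DNS label limit from RFC 1035 rather than something stated in RFC 952; and the raw Punycode call skips all IDNA mapping and validation, so it only shows the final encoding step.

```python
import string

# Letters, digits and hyphen: the "LDH" alphabet for host name labels.
LDH = set(string.ascii_letters + string.digits + "-")


def is_ldh_label(label: str) -> bool:
    """Check one label against RFC 952 as relaxed by RFC 1123."""
    if not 0 < len(label) <= 63:  # label length limit from RFC 1035
        return False
    if any(ch not in LDH for ch in label):
        return False
    # RFC 1123 allows a leading digit; a leading or trailing hyphen is
    # still rejected.
    return not (label.startswith("-") or label.endswith("-"))


# Raw Punycode of the label plus the ACE prefix (no mapping or validation).
ace = "xn--" + "faß".encode("punycode").decode("ascii")

print(ace)                   # xn--fa-hia
print(is_ldh_label("faß"))   # False: non-ASCII
print(is_ldh_label("_tcp"))  # False: underscore is not LDH
print(is_ldh_label(ace))     # True
```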
I'm not sure what the right solution is, so let me describe the problem better:
Underscores are used in DNS names for example to specify service fields (_tcp, _http, …, e.g. RFC 6698). The underscore is used because it is not allowed in host names (RFC 1123, §2.1) [as you wrote above while I was typing this...] but allowed in DNS labels. Such labels are automatically constructed by combining a user-specified domain and the prefix (e.g. _443._tcp. to resolve TLS certificates for HTTPS). In particular, this might be done for a domain like faß.de.
What we did so far was to take the address and pass it through IDNA encoding, and resolve that. With libidn, we had _443._tcp.faß.de encoded as _443._tcp.fass.de. With libidn2 and IDN2_NONTRANSITIONAL I get 443.tcp.xn--fa-hia.de, which cannot work. With libidn2 and IDN2_TRANSITIONAL I get _443._tcp.fass.de. But I really need _443._tcp.xn--fa-hia.de, i.e. the new rules but with underscores preserved.
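The old behavior can be reproduced without libidn, as a rough illustration: Python's built-in `idna` codec implements IDNA 2003 (RFC 3490) and does not apply the STD3 ASCII rules, so it gives the same result as the libidn case above. The helper `service_name` is a hypothetical name for the "prefix plus user-specified domain" construction; this sketch does not call libidn or libidn2.

```python
def service_name(port: int, proto: str, domain: str) -> str:
    """Build a name such as _443._tcp.<domain> from its parts."""
    return f"_{port}._{proto}.{domain}"


name = service_name(443, "tcp", "faß.de")

# IDNA 2003 via the standard library: "ß" is mapped to "ss" and the
# underscores are left alone.
idna2003 = name.encode("idna").decode("ascii")

print(name)      # _443._tcp.faß.de
print(idna2003)  # _443._tcp.fass.de
```

The result wanted here differs only in the faß label, which should become xn--fa-hia instead of fass.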
I have very strong doubts about anything which is not round-trippable, but I need to look at this some more. I'll give your patch a test.
Default for IDNA2008/TR46 processing is UseSTD3ASCIIRules=true. What you need is UseSTD3ASCIIRules=false, which we didn't implement yet (maybe Nikos's patch above does it). With that you have to check your domain string for validity yourself because you circumvent some of the internal tests.
What you could do right now is to pass only the last part from your string to the idn2_ function. You know already that the first part is fine and needs no processing (_443._tcp.). IDNA processing is always label-by-label, so it's fine to split the input string that way.
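A minimal sketch of that label-by-label split, under loud assumptions: `encode_label` is a stand-in for the real IDNA step (NFC, lowercasing and raw Punycode only, not the full TR46 mapping and validation), and both function names are hypothetical. In a real caller, the non-underscore labels would be passed to libidn2 (for example `idn2_lookup_u8` with IDN2_NONTRANSITIONAL) instead.

```python
import unicodedata


def encode_label(label: str) -> str:
    """Stand-in for IDNA processing of one label. NOT full TR46."""
    if label.isascii():
        return label
    mapped = unicodedata.normalize("NFC", label).lower()
    return "xn--" + mapped.encode("punycode").decode("ascii")


def encode_name(name: str) -> str:
    """Pass service labels (leading "_") through verbatim, encode the rest."""
    out = []
    for label in name.split("."):
        if label.startswith("_"):
            out.append(label)  # known-good prefix such as _443 or _tcp
        else:
            out.append(encode_label(label))
    return ".".join(out)


print(encode_name("_443._tcp.faß.de"))  # _443._tcp.xn--fa-hia.de
```

The caller takes over responsibility for the labels it skips, since they bypass the library's checks entirely.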
> What you could do right now is to pass only the last part from your string to the idn2_ function.
This would be problematic. Right now the client constructs a name and sends a query to a daemon to have it resolved, as UTF-8. And the daemon takes care of IDN processing (for DNS) or not (e.g. for LLMNR). So doing that would require both the client to be much smarter, and extra communication about the meaning of specific labels… I'd rather not go there.
Another patch which takes advantage of Tim's advice above
$ systemd-resolve _443._tcp.faß.de
_443._tcp.faß.de: 72.52.4.119 (_443._tcp.xn--fa-hia.de)
-- Information acquired via protocol DNS in 1.6ms.
-- Data is authenticated: no
Fixed up @nmav's patch and added --usestd3asciirules to idn2, changing the default behavior to not use STD3 ASCII rules. These rules can be enabled with the IDN2_USE_STD3_ASCII_RULES flag.
Unicode's TR46 document wants STD3 to be enabled by default... so I am not sure if we should work against it. The plus is that with patch !51 (closed) we follow the old libidn/IDNA2003 behavior.