Handle HTML entities when parsing URLs
I recently saw HTML entities within URLs, e.g. www.example.com/&uuml;bel.txt (i.e. www.example.com/übel.txt).
Firefox translates &uuml; into UTF-8 and thus uses %C3%BC within a GET request.
Wget 1.x does no conversion and thus tries to GET &uuml;..., which was accepted by the server in my case as well. I did not test whether a server like Apache translates this into the correct filename. Any opinions on how we should treat this?
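To make the Firefox-like behaviour concrete, here is a minimal sketch in Python (not Wget's C, and not code from either project). The function name encode_path_like_firefox is hypothetical; it assumes the path should be entity-decoded first and then percent-encoded as UTF-8, and that existing %XX escapes should be left alone.

```python
import html
from urllib.parse import quote


def encode_path_like_firefox(raw_path: str) -> str:
    """Decode HTML character references, then percent-encode as UTF-8.

    Hypothetical helper for illustration only.
    """
    # html.unescape handles named (&uuml;) and numeric (&#252;, &#xFC;) references.
    decoded = html.unescape(raw_path)
    # Keep "/" and "%" as-is so path separators and existing escapes survive.
    return quote(decoded, safe="/%", encoding="utf-8")


print(encode_path_like_firefox("/&uuml;bel.txt"))  # /%C3%BCbel.txt
```

Whether Wget should decode entities at this stage or only when extracting links from HTML is a separate design question that this sketch does not answer.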
There are also &#nnnn; and &#xhhhh; variants, which need Unicode code point to UTF-8 conversion (which is straightforward; I already have a working function).
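For readers who want to see what that conversion involves, here is a rough sketch in Python. It is not the function mentioned above; the names codepoint_to_utf8 and numeric_refs_to_percent are made up for illustration. As a simplification it percent-encodes every numeric reference, including ASCII ones such as &#47;, which a real implementation would probably want to treat more carefully.

```python
import re


def codepoint_to_utf8(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 bytes (1 to 4 bytes)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value: %r" % cp)
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Matches &#nnnn; (decimal) and &#xhhhh; (hexadecimal) references.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def numeric_refs_to_percent(text: str) -> str:
    """Replace numeric character references with %XX-escaped UTF-8."""
    def repl(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        cp = int(hex_digits, 16) if hex_digits is not None else int(dec_digits)
        return "".join("%%%02X" % b for b in codepoint_to_utf8(cp))
    return _NUMERIC_REF.sub(repl, text)


print(numeric_refs_to_percent("/&#252;bel.txt"))  # /%C3%BCbel.txt
```

Named entities such as &uuml; would additionally need a lookup table from name to code point before this conversion applies.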
For a list of named entities see http://www.w3.org/TR/html4/sgml/entities.html