On hosts without the 'en_US.UTF-8' locale, output from popen_with_timeout will be US-ASCII encoded rather than UTF-8
Gitaly sets the environment variable LANG=en_US.UTF-8 for all child git processes, as well as for Gitaly-Ruby. On hosts that do not have the en_US.UTF-8 locale installed, this causes Ruby to fall back to US-ASCII (7-bit) as its default external encoding, even if the host uses a different UTF-8 locale.
As a result, UTF-8 encoded output from popen_with_timeout is tagged as US-ASCII and raises ArgumentError: invalid byte sequence in US-ASCII as soon as a string parsing method is called on it. A customer hit this in sanitize_url when UpdateRemoteMirror returned an error, and the same failure can probably occur in other places as well.
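The failure mode can be reproduced without Gitaly. The following is a minimal sketch, not the actual popen_with_timeout code path: it assumes a hypothetical German error message and uses force_encoding to simulate how Ruby tags data read from a child process when Encoding.default_external is US-ASCII.

```ruby
# Bytes of a hypothetical UTF-8 error message containing a non-ASCII character.
utf8_bytes = "Zugriff verweigert f\u00FCr Benutzer".b

# Simulate output that was read while Encoding.default_external was US-ASCII:
# the bytes are unchanged, only the encoding tag differs.
mislabeled = utf8_bytes.dup.force_encoding(Encoding::US_ASCII)

# Regexp-based string methods refuse to operate on a string whose bytes are
# invalid for its tagged encoding.
error_message =
  begin
    mislabeled.gsub(/verweigert/, "[redacted]")
    nil
  rescue ArgumentError => e
    e.message
  end

puts mislabeled.valid_encoding?
puts error_message

# The same bytes tagged as UTF-8 can be processed normally.
relabeled = utf8_bytes.dup.force_encoding(Encoding::UTF_8)
puts relabeled.gsub(/verweigert/, "[redacted]")
```

The last step only shows that the bytes themselves are valid UTF-8 and merely carry the wrong encoding tag. It is not meant as the fix for this issue.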
On a host using de_DE.UTF-8 as its locale, without en_US.UTF-8 installed:
$ LANG=de_DE.UTF-8 /opt/gitlab/embedded/ruby -e 'p Encoding.default_external'
#<Encoding:UTF-8>
$ LANG=en_US.UTF-8 /opt/gitlab/embedded/ruby -e 'p Encoding.default_external'
#<Encoding:US-ASCII>
$ locale -a
C
C.UTF-8
de_DE.utf8
POSIX
$ locale -v
LANG=de_DE.UTF-8
LANGUAGE=
LC_CTYPE="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_PAPER="de_DE.UTF-8"
LC_NAME="de_DE.UTF-8"
LC_ADDRESS="de_DE.UTF-8"
LC_TELEPHONE="de_DE.UTF-8"
LC_MEASUREMENT="de_DE.UTF-8"
LC_IDENTIFICATION="de_DE.UTF-8"
LC_ALL=
On a host with both en_US.UTF-8 and de_DE.UTF-8 installed, both work:
$ LANG=en_US.UTF-8 /opt/gitlab/embedded/ruby -e 'p Encoding.default_external'
#<Encoding:UTF-8>
$ LANG=de_DE.UTF-8 ruby -e 'p Encoding.default_external'
#<Encoding:UTF-8>
$ LANG=ar_EG.UTF-8 ruby -e 'p Encoding.default_external' # Another UTF-8 locale, not installed on this host
#<Encoding:US-ASCII>
$ locale -a
C
C.UTF-8
POSIX
de_DE.utf8
en_US.utf8
$ locale -v
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES=en_US.UTF-8
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
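Note that LANG uses the spelling en_US.UTF-8 while locale -a lists en_US.utf8. The sketch below checks whether a given LANG value is present in locale -a output. The helper names normalize_locale and locale_installed? are hypothetical, and the codeset normalization (lowercase, alphanumerics only) is an assumption modelled on glibc behaviour. Modifiers such as @euro are not handled.

```ruby
# Hypothetical helper: normalize a locale name so that "en_US.UTF-8"
# and "en_US.utf8" compare equal (assumes glibc-style codeset normalization).
def normalize_locale(name)
  lang, codeset = name.strip.split(".", 2)
  return lang if codeset.nil?

  "#{lang}.#{codeset.downcase.gsub(/[^a-z0-9]/, '')}"
end

# Hypothetical helper: true if `wanted` appears in the text printed by `locale -a`.
def locale_installed?(wanted, locale_a_output)
  installed = locale_a_output.lines.map { |line| normalize_locale(line) }
  installed.include?(normalize_locale(wanted))
end

# `locale -a` output copied from the two hosts above.
first_host  = "C\nC.UTF-8\nde_DE.utf8\nPOSIX\n"
second_host = "C\nC.UTF-8\nPOSIX\nde_DE.utf8\nen_US.utf8\n"

puts locale_installed?("en_US.UTF-8", first_host)
puts locale_installed?("de_DE.UTF-8", first_host)
puts locale_installed?("en_US.UTF-8", second_host)
puts locale_installed?("ar_EG.UTF-8", second_host)
```

On a live host, the second argument could come from running locale -a instead of a hard-coded string.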
Edited by Will Chandler (ex-GitLab)