1. 31 Jan, 2019 1 commit
    • Support working-tree-encoding "UTF-16LE-BOM" · aab2a1ae
      Torsten Bögershausen authored
      Users who want UTF-16 files in the working tree set the .gitattributes
      like this:
      test.txt working-tree-encoding=UTF-16
      
      The Unicode standard itself defines 3 allowed ways to encode UTF-16.
      The following 3 versions all convert back to 'g' 'i' 't' in UTF-8:
      
      a) UTF-16, without BOM, big endian:
      $ printf "\000g\000i\000t" | iconv -f UTF-16 -t UTF-8 | od -c
      0000000    g   i   t
      
      b) UTF-16, with BOM, little endian:
      $ printf "\377\376g\000i\000t\000" | iconv -f UTF-16 -t UTF-8 | od -c
      0000000    g   i   t
      
      c) UTF-16, with BOM, big endian:
      $ printf "\376\377\000g\000i\000t" | iconv -f UTF-16 -t UTF-8 | od -c
      0000000    g   i   t
      
      Git uses libiconv to convert from UTF-8 in the index into UTF-16 in the
      working tree.
      After a checkout, the resulting file has a BOM and is encoded in "UTF-16",
      in the version (c) above.
      This is what iconv generates, more details follow below.
      
      iconv (and libiconv) can generate UTF-16, UTF-16LE or UTF-16BE:
      
      d) UTF-16
      $ printf 'git' | iconv -f UTF-8 -t UTF-16 | od -c
      0000000  376 377  \0   g  \0   i  \0   t
      
      e) UTF-16LE
      $ printf 'git' | iconv -f UTF-8 -t UTF-16LE | od -c
      0000000    g  \0   i  \0   t  \0
      
      f)  UTF-16BE
      $ printf 'git' | iconv -f UTF-8 -t UTF-16BE | od -c
      0000000   \0   g  \0   i  \0   t
      
      There is no way to generate version (b) from above in a Git working tree,
      but that is what some applications need.
      (All fully Unicode-aware applications should be able to read all 3 variants,
      but in practice we are not there yet.)
      
      When producing UTF-16 as an output, iconv generates the big endian version
      with a BOM. (big endian is probably chosen for historical reasons).
      
      iconv can produce little-endian UTF-16 files by using "UTF-16LE"
      as the encoding, but that output does not have a BOM.
      
      Not all users (especially under Windows) are happy with this.
      Some tools are not fully unicode aware and can only handle version (b).
      
      Today there is no way to produce version (b) with iconv (or libiconv).
      Looking into the history of iconv, it seems as if version (c) will
      be used in all future iconv versions (for compatibility reasons).
      
      Solve this dilemma by introducing a Git-specific "UTF-16LE-BOM".
      libiconv cannot handle this encoding, so Git picks it up, handles the BOM
      itself, and uses libiconv to convert the rest of the stream.
      (UTF-16BE-BOM is added for consistency.)
      Reported-by: Adrián Gimeno Balaguer <adrigibal@gmail.com>
      Signed-off-by: Torsten Bögershausen <tboegi@web.de>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
      aab2a1ae
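The split described above (Git handles the BOM, libiconv handles the rest) can be sketched as a small lookup. This is a minimal illustration, not Git's actual code: `struct bom_encoding` and `split_bom_encoding()` are hypothetical names, and the real conversion step through iconv is left out.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper type: what to emit first and what to ask iconv for. */
struct bom_encoding {
	const char *iconv_name; /* encoding name that libiconv understands */
	const char *bom;        /* BOM bytes to write first, or NULL */
	size_t bom_len;
};

/*
 * Map a Git-specific "-BOM" encoding name onto a plain iconv encoding
 * plus the BOM that the caller must write before the converted stream.
 * Any other name is passed through unchanged with no BOM.
 */
struct bom_encoding split_bom_encoding(const char *name)
{
	struct bom_encoding r = { name, NULL, 0 };

	if (!strcmp(name, "UTF-16LE-BOM")) {
		r.iconv_name = "UTF-16LE";
		r.bom = "\xFF\xFE"; /* octal 377 376, as in version (b) */
		r.bom_len = 2;
	} else if (!strcmp(name, "UTF-16BE-BOM")) {
		r.iconv_name = "UTF-16BE";
		r.bom = "\xFE\xFF"; /* octal 376 377, as in version (c) */
		r.bom_len = 2;
	}
	return r;
}
```

A caller would write `bom_len` bytes of `bom` to the output and then convert the UTF-8 content with `iconv_name` as the target encoding.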
  2. 15 Aug, 2018 1 commit
  3. 24 Jul, 2018 1 commit
  4. 22 May, 2018 1 commit
    • is_hfs_dotgit: match other .git files · 0fc333ba
      Jeff King authored
      Both verify_path() and fsck match ".git", ".GIT", and other
      variants specific to HFS+. Let's allow matching other
      special files like ".gitmodules", which we'll later use to
      enforce extra restrictions via verify_path() and fsck.
      Signed-off-by: Jeff King <peff@peff.net>
      0fc333ba
  5. 16 Apr, 2018 2 commits
  6. 06 May, 2016 1 commit
    • typofix: assorted typofixes in comments, documentation and messages · 832c0e5e
      Li Peng authored
      Many instances of duplicate words (e.g. "the the path") and
      a few typos are fixed, originally in multiple patches.
      
          wildmatch: fix duplicate words of "the"
          t: fix duplicate words of "output"
          transport-helper: fix duplicate words of "read"
          Git.pm: fix duplicate words of "return"
          path: fix duplicate words of "look"
          pack-protocol.txt: fix duplicate words of "the"
          precompose-utf8: fix typo of "sequences"
          split-index: fix typo
          worktree.c: fix typo
          remote-ext: fix typo
          utf8: fix duplicate words of "the"
          git-cvsserver: fix duplicate words
      Signed-off-by: Li Peng <lip@dtdream.com>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
      832c0e5e
  7. 17 Sep, 2015 1 commit
  8. 05 Jun, 2015 1 commit
  9. 16 Apr, 2015 1 commit
    • utf8-bom: introduce skip_utf8_bom() helper · dde843e7
      Junio C Hamano authored
      With the recent change to ignore the UTF8 BOM at the beginning of
      .gitignore files, we now have two codepaths that do such a skipping
      (the other one is for reading the configuration files).
      
      Introduce utf8_bom[] constant string and skip_utf8_bom() helper
      and teach .gitignore code how to use it.
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
      dde843e7
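A helper of this kind could look roughly like the following. The names `utf8_bom` and `skip_utf8_bom()` come from the commit message; the signature and return convention shown here are assumptions made for the sketch.

```c
#include <stddef.h>
#include <string.h>

/* The UTF-8 encoding of U+FEFF is the three bytes EF BB BF. */
const char utf8_bom[] = "\357\273\277";

/*
 * If the buffer of `len` bytes at *text starts with a UTF-8 BOM,
 * advance *text past it and return 1; otherwise leave *text alone
 * and return 0. (Assumed signature, for illustration only.)
 */
int skip_utf8_bom(const char **text, size_t len)
{
	size_t bom_len = strlen(utf8_bom);

	if (len < bom_len || memcmp(*text, utf8_bom, bom_len))
		return 0;
	*text += bom_len;
	return 1;
}
```

Both the configuration reader and the .gitignore reader could then call the same function on the first line they read, instead of each open-coding the three-byte comparison.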
  10. 17 Dec, 2014 1 commit
    • utf8: add is_hfs_dotgit() helper · 6162a1d3
      Jeff King authored
      We do not allow paths with a ".git" component to be added to
      the index, as that would mean repository contents could
      overwrite our repository files. However, asking "is this
      path the same as .git" is not as simple as strcmp() on some
      filesystems.
      
      HFS+'s case-folding does more than just fold uppercase into
      lowercase (which we already handle with strcasecmp). It may
      also skip past certain "ignored" Unicode code points, so
      that (for example) ".gi\u200ct" is mapped to ".git".
      
      The full list of folds can be found in the tables at:
      
        https://www.opensource.apple.com/source/xnu/xnu-1504.15.3/bsd/hfs/hfscommon/Unicode/UCStringCompareData.h
      
      Implementing a full "is this path the same as that path"
      comparison would require us importing the whole set of
      tables.  However, what we want to do is much simpler: we
      only care about checking ".git". We know that 'G' is the
      only thing that folds to 'g', and so on, so we really only
      need to deal with the set of ignored code points, which is
      much smaller.
      Signed-off-by: Jeff King <peff@peff.net>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
      6162a1d3
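The approach described above (fold ASCII case, skip ignored code points, compare against ".git") can be sketched as follows. This is not the real is_hfs_dotgit(): the function names are hypothetical, the decoder assumes well-formed UTF-8, the list of ignored code points is an assumption that should be checked against Apple's tables, and the sketch compares a whole string rather than a single path component.

```c
#include <stddef.h>

/* Decode one code point from well-formed UTF-8; returns 0 at the end. */
static unsigned next_cp(const char **s)
{
	const unsigned char *p = (const unsigned char *)*s;
	unsigned cp;
	int extra;

	if (*p < 0x80) { cp = *p; extra = 0; }
	else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
	else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
	else { cp = *p & 0x07; extra = 3; }
	if (*p)
		p++;
	while (extra-- > 0 && (*p & 0xC0) == 0x80)
		cp = (cp << 6) | (*p++ & 0x3F);
	*s = (const char *)p;
	return cp;
}

/* Assumed set of code points that HFS+ ignores when comparing names. */
static int hfs_ignored(unsigned cp)
{
	return (cp >= 0x200C && cp <= 0x200F) ||
	       (cp >= 0x202A && cp <= 0x202E) ||
	       (cp >= 0x206A && cp <= 0x206F) ||
	       cp == 0xFEFF;
}

/* Return 1 if `name` would be treated as ".git" under these rules. */
int looks_like_hfs_dotgit(const char *name)
{
	static const char want[] = ".git";
	size_t i = 0;

	for (;;) {
		unsigned cp = next_cp(&name);

		if (hfs_ignored(cp))
			continue;
		if (want[i] == '\0')
			return cp == 0; /* everything matched; name must end here */
		if (cp >= 'A' && cp <= 'Z')
			cp += 'a' - 'A';
		if (cp != (unsigned char)want[i])
			return 0;
		i++;
	}
}
```

Because only 'G', 'I' and 'T' fold onto the letters of ".git", plain ASCII case folding plus the ignored set is enough here; no full folding table is needed.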
  11. 10 Jul, 2013 1 commit
    • add missing "format" function attributes · 4621085b
      Jeff King authored
      For most of our functions that take printf-like formats, we
      use gcc's __attribute__((format)) to get compiler warnings
      when the functions are misused. Let's give a few more
      functions the same protection.
      
      In most cases, the annotations do not uncover any actual
      bugs; the only code change needed is that we passed a size_t
      to transfer_debug, which expected an int. Since we expect
      the passed-in value to be a relatively small buffer size
      (and cast a similar value to int directly below), we can
      just cast away the problem.
      Signed-off-by: Jeff King <peff@peff.net>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
      4621085b
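For readers unfamiliar with the attribute, here is a minimal sketch of how a printf-like function is annotated. `format_message()` and the `FORMAT_PRINTF` macro are hypothetical names used only for illustration; the attribute itself is a GCC/Clang extension.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Expand to the attribute only on compilers that are known to support it. */
#if defined(__GNUC__)
#define FORMAT_PRINTF(fmt_idx, arg_idx) \
	__attribute__((format(printf, fmt_idx, arg_idx)))
#else
#define FORMAT_PRINTF(fmt_idx, arg_idx)
#endif

/*
 * Argument 3 is the format string and the variadic arguments start at
 * position 4, so the compiler can check each conversion against the
 * type of the value actually passed.
 */
FORMAT_PRINTF(3, 4)
int format_message(char *out, size_t outlen, const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(out, outlen, fmt, ap);
	va_end(ap);
	return n;
}
```

With the annotation in place, a call such as `format_message(buf, sizeof(buf), "%d bytes", len)` with a `size_t len` draws a -Wformat warning; writing `(int)len` is the kind of cast the commit message refers to.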
  12. 18 Apr, 2013 4 commits
  13. 09 Mar, 2013 1 commit
    • format-patch: RFC 2047 says multi-octet character may not be split · 6cd3c053
      Kirill Smelkov authored
      Even though an earlier attempt (bafc478f..41dd00ba) cleaned
      up RFC 2047 encoding, pretty.c::add_rfc2047() still decides
      where to split the output line by going through the input
      one byte at a time, and potentially splits a character in
      the middle.  A subject line may end up showing like this:
      
           ".... fö?? bar".   (instead of  ".... föö bar".)
      
      if split incorrectly.
      
      RFC 2047, section 5 (3) explicitly forbids such behaviour:
      
          Each 'encoded-word' MUST represent an integral number of
          characters.  A multi-octet character may not be split across
          adjacent 'encoded-word's.
      
      that means that e.g. for
      
          Subject: .... föö bar
      
      encoding
      
          Subject: =?UTF-8?q?....=20f=C3=B6=C3=B6?=
           =?UTF-8?q?=20bar?=
      
      is correct, and
      
          Subject: =?UTF-8?q?....=20f=C3=B6=C3?=      <-- NOTE ö is broken here
           =?UTF-8?q?=B6=20bar?=
      
      is not, because the UTF-8 encoding of "ö" (C3 B6) is split here across
      adjacent encoded words.
      
      To fix the problem, make the loop grab one _character_ at a time and
      determine its output length to see where to break the output line.  Note
      that this version only knows about UTF-8, but the logic to grab one
      character is abstracted out in mbs_chrlen() function to make it possible
      to extend it to other encodings with the help of iconv in the future.
      Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
      6cd3c053
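The character-at-a-time loop described above might be sketched like this. The function names are hypothetical stand-ins (the commit's own helper is mbs_chrlen()), only UTF-8 is handled, and the Q-encoding rule is simplified: printable ASCII other than '=', '?' and '_' is assumed to pass through, and every other byte becomes "=XX".

```c
#include <stddef.h>

/* Byte length of the UTF-8 character starting at s, judged by its lead byte. */
static size_t utf8_chrlen(const char *s)
{
	unsigned char c = (unsigned char)*s;

	if (c < 0x80) return 1;
	if ((c & 0xE0) == 0xC0) return 2;
	if ((c & 0xF0) == 0xE0) return 3;
	if ((c & 0xF8) == 0xF0) return 4;
	return 1; /* invalid lead byte: treat it as a single byte */
}

/* Output width of one input byte under the simplified Q-encoding rule. */
static size_t q_width(unsigned char c)
{
	if (c > 0x20 && c < 0x7F && c != '=' && c != '?' && c != '_')
		return 1;
	return 3; /* emitted as "=XX" */
}

/*
 * Return how many input bytes fit into `budget` output columns without
 * ever cutting a multi-byte character in half.
 */
size_t q_split_point(const char *s, size_t budget)
{
	size_t used = 0, consumed = 0;

	while (s[consumed]) {
		size_t n = utf8_chrlen(s + consumed), w = 0, i;

		for (i = 0; i < n && s[consumed + i]; i++)
			w += q_width((unsigned char)s[consumed + i]);
		if (used + w > budget)
			break; /* the whole character moves to the next encoded-word */
		used += w;
		consumed += i;
	}
	return consumed;
}
```

For "föö bar", each "ö" costs six output columns ("=C3=B6"), so a budget that has room for only "=C3" stops before the character rather than after its first byte.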
  14. 11 Feb, 2013 1 commit
  15. 11 Dec, 2012 1 commit
  16. 04 Nov, 2012 1 commit
    • reencode_string(): introduce and use same_encoding() · 0e18bcd5
      Junio C Hamano authored
      Callers of reencode_string() that re-encode a string from one
      encoding to another all used ad-hoc ways to bypass the case where the
      input and the output encodings are the same.  Some did strcmp(),
      some did strcasecmp(), and yet others, when converting to UTF-8, used
      is_encoding_utf8().
      
      Introduce a same_encoding() helper function to make these callers use
      the same logic.  Notably, is_encoding_utf8() has a work-around for the
      common misconfiguration of using "utf8" to name the UTF-8 encoding; that
      name does not match "UTF-8", so strcasecmp() alone would not consider
      the two the same.  Make use of it in this helper function.
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
      0e18bcd5
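The logic described above can be sketched in a few lines. The function names are taken from the commit message, but the bodies below are an assumption about how such helpers might be written, not a copy of Git's code; strcasecmp() comes from POSIX <strings.h>.

```c
#include <strings.h>

/* Accept both the canonical "UTF-8" and the common misspelling "utf8". */
int is_encoding_utf8(const char *name)
{
	return !strcasecmp(name, "utf-8") || !strcasecmp(name, "utf8");
}

/*
 * Two encoding names are the same if both denote UTF-8 (under either
 * spelling) or if they compare equal ignoring case.
 */
int same_encoding(const char *src, const char *dst)
{
	if (is_encoding_utf8(src) && is_encoding_utf8(dst))
		return 1;
	return !strcasecmp(src, dst);
}
```

A caller can then skip the conversion entirely when `same_encoding(in, out)` is true, regardless of how the user spelled the name in the configuration.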
  17. 09 Jul, 2012 1 commit
    • git on Mac OS and precomposed unicode · 76759c7d
      Torsten Bögershausen authored
      Mac OS X mangles file names containing unicode on file systems HFS+,
      VFAT or SAMBA.  When a file using unicode code points outside ASCII
      is created on a HFS+ drive, the file name is converted into
      decomposed unicode and written to disk. No conversion is done if
      the file name is already decomposed unicode.
      
      Calling open("\xc3\x84", ...) with a precomposed "Ä" yields the same
      result as open("\x41\xcc\x88",...) with a decomposed "Ä".
      
      As a consequence, readdir() returns the file names in decomposed
      unicode, even if the user expects precomposed unicode.  Unlike on
      HFS+, Mac OS X stores files on a VFAT drive (e.g. a USB drive) in
      precomposed unicode, but readdir() still returns file names in
      decomposed unicode.  When a git repository is stored on a network
      share using SAMBA, file names are sent over the wire and written to
      disk on the remote system in precomposed unicode, but Mac OS X
      readdir() returns decomposed unicode to be compatible with its
      behaviour on HFS+ and VFAT.
      
      The unicode decomposition causes many problems:
      
      - The names "git add" and other commands get from the end user are
        often in precomposed form (the decomposed form is not easily input
        from the keyboard), but when the commands read the filesystem to
        see what is already there before updating the index, readdir()
        returns the decomposed form, which is different.
      
      - Similarly "git log", "git mv" and all other commands that need to
        compare pathnames found on the command line (often but not always
        precomposed form; a command line input resulting from globbing may
        be in decomposed) with pathnames found in the tree objects (should
        be precomposed form to be compatible with other systems and for
        consistency in general).
      
      - The same for names stored in the index, which should be
        precomposed, that may need to be compared with the names read from
        readdir().
      
      NFS mounted from Linux is fully transparent and does not suffer from
      the above.
      
      As Mac OS X treats precomposed and decomposed file names as equal,
      we can
      
       - wrap readdir() on Mac OS X to return the precomposed form, and
      
       - normalize decomposed form given from the command line also to the
         precomposed form,
      
      to ensure that all pathnames used in Git are always in the
      precomposed form.  This behaviour can be requested by setting
      "core.precomposedunicode" configuration variable to true.
      
      The code in compat/precomposed_utf8.c implements basically 4 new
      functions: precomposed_utf8_opendir(), precomposed_utf8_readdir(),
      precomposed_utf8_closedir() and precompose_argv().  The first three
      are to wrap opendir(3), readdir(3), and closedir(3) functions.
      
      The argv[] conversion makes it possible to use the TAB filename
      completion done by the shell on the command line.  It also tolerates
      other tools which use readdir() to feed decomposed file names into git.
      
      When creating a new git repository with "git init" or "git clone",
      "core.precomposedunicode" will be set to "false".
      
      The user needs to activate this feature manually.  She typically
      sets core.precomposedunicode to "true" on HFS and VFAT, or file
      systems mounted via SAMBA.
      Helped-by: Junio C Hamano <gitster@pobox.com>
      Signed-off-by: Torsten Bögershausen <tboegi@web.de>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
      76759c7d
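To make the byte-level difference concrete, the toy below folds a decomposed "base letter + U+0308" pair into its precomposed form, turning "\x41\xcc\x88" into "\xc3\x84". It is only an illustration of what precomposition means: `toy_precompose()` and its six-entry table are hypothetical and cover nothing but the diaeresis on a few letters, which is far from a complete Unicode normalization.

```c
#include <stddef.h>

/* Toy table: base letter followed by U+0308 (UTF-8: CC 88) -> precomposed. */
static const struct {
	char base;
	const char *composed;
} diaeresis[] = {
	{ 'A', "\xc3\x84" }, /* U+00C4 */
	{ 'O', "\xc3\x96" }, /* U+00D6 */
	{ 'U', "\xc3\x9c" }, /* U+00DC */
	{ 'a', "\xc3\xa4" }, /* U+00E4 */
	{ 'o', "\xc3\xb6" }, /* U+00F6 */
	{ 'u', "\xc3\xbc" }, /* U+00FC */
};

/*
 * Copy `in` to `out`, replacing each known decomposed pair with its
 * precomposed form. The output never grows, so a buffer of
 * strlen(in) + 1 bytes is sufficient.
 */
void toy_precompose(const char *in, char *out)
{
	while (*in) {
		const char *hit = NULL;
		size_t i;

		if ((unsigned char)in[1] == 0xCC && (unsigned char)in[2] == 0x88)
			for (i = 0; i < sizeof(diaeresis) / sizeof(diaeresis[0]); i++)
				if (diaeresis[i].base == in[0])
					hit = diaeresis[i].composed;
		if (hit) {
			*out++ = hit[0];
			*out++ = hit[1];
			in += 3; /* skip the base letter and the 2-byte combining mark */
		} else {
			*out++ = *in++;
		}
	}
	*out = '\0';
}
```

A wrapper around readdir() would apply a conversion of this kind to each returned name so that the rest of the program only ever sees the precomposed form.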
  18. 23 Feb, 2011 1 commit
    • strbuf: add fixed-length version of add_wrapped_text · 98acc837
      Jeff King authored
      The function strbuf_add_wrapped_text takes a NUL-terminated
      string. This makes it annoying to wrap strings we have as a
      pointer and a length.
      
      Refactoring strbuf_add_wrapped_text and all of its
      sub-functions to handle fixed-length strings turned out to
      be really ugly. So this implementation is lame; it just
      strdups the text and operates on the NUL-terminated version.
      This should be fine as the strings we are wrapping are
      generally pretty short.  If it becomes a problem, we can
      optimize later.
      Signed-off-by: Jeff King <peff@peff.net>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
      98acc837
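The copy-then-delegate pattern described above looks roughly like this. To keep the sketch self-contained, `count_words()` stands in for the existing NUL-terminated function and `count_words_len()` for the new fixed-length wrapper; both names are hypothetical and have nothing to do with strbuf itself.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for an existing function that requires a NUL-terminated string. */
size_t count_words(const char *text)
{
	size_t n = 0;
	int in_word = 0;

	for (; *text; text++) {
		if (*text == ' ') {
			in_word = 0;
		} else if (!in_word) {
			in_word = 1;
			n++;
		}
	}
	return n;
}

/*
 * Fixed-length variant: duplicate the bytes into a NUL-terminated
 * temporary, call the existing function, and free the copy.
 * Returns 0 if the allocation fails.
 */
size_t count_words_len(const char *data, size_t len)
{
	char *tmp = malloc(len + 1);
	size_t n;

	if (!tmp)
		return 0;
	memcpy(tmp, data, len);
	tmp[len] = '\0';
	n = count_words(tmp);
	free(tmp);
	return n;
}
```

The extra allocation and copy are the cost the commit message accepts in exchange for not having to rewrite every sub-function to take a length.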
  19. 20 Feb, 2010 1 commit
  20. 12 Jan, 2010 1 commit
  21. 19 Oct, 2009 1 commit
  22. 05 Feb, 2009 1 commit
  23. 07 Jan, 2008 2 commits
  24. 28 Feb, 2007 1 commit
  25. 30 Dec, 2006 1 commit
  26. 26 Dec, 2006 1 commit
  27. 24 Dec, 2006 1 commit
    • commit-tree: encourage UTF-8 commit messages. · 9e832665
      Johannes Schindelin authored
      Introduce is_utf8() to check if a text looks like it is encoded
      in UTF-8 and utf8_width() to count display width, and implement
      print_wrapped_text() using them.
      
      git-commit-tree warns if the commit message does not minimally
      conform to the UTF-8 encoding when i18n.commitencoding is either
      unset, or set to "utf-8".
      Signed-off-by: Junio C Hamano <junkio@cox.net>
      9e832665
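A check of the "looks like UTF-8" kind can be sketched as a structural scan over lead and continuation bytes. `looks_like_utf8()` is a hypothetical name, and the sketch is deliberately minimal: it does not reject overlong encodings, surrogates, or code points above U+10FFFF, so it should not be read as the function the commit introduces.

```c
/*
 * Return 1 if every byte sequence in `text` has the shape of a UTF-8
 * character (a valid lead byte followed by the right number of
 * continuation bytes), 0 otherwise.
 */
int looks_like_utf8(const char *text)
{
	const unsigned char *p = (const unsigned char *)text;

	while (*p) {
		int extra;

		if (*p < 0x80)
			extra = 0;
		else if ((*p & 0xE0) == 0xC0)
			extra = 1;
		else if ((*p & 0xF0) == 0xE0)
			extra = 2;
		else if ((*p & 0xF8) == 0xF0)
			extra = 3;
		else
			return 0; /* stray continuation byte or invalid lead byte */
		p++;
		while (extra-- > 0) {
			if ((*p & 0xC0) != 0x80)
				return 0; /* missing continuation byte */
			p++;
		}
	}
	return 1;
}
```

A message written in Latin-1, where "ö" is the single byte F6, fails this scan, which is the situation the warning in git-commit-tree is meant to catch.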