1. 08 Jan, 2019 13 commits
    • Jeff King's avatar
      prefer "hash mismatch" to "sha1 mismatch" · 01f8d594
      Jeff King authored
      To future-proof ourselves against a change in the hash, let's use the
      more generic "hash mismatch" to refer to integrity problems. Note that
      we do advertise this exact string in git-fsck(1). However, the message
      itself is marked for translation, meaning we do not expect it to be
      machine-readable.
      
      While we're touching that documentation, let's also update it for
      grammar and clarity.
      Signed-off-by: default avatarJeff King <peff@peff.net>
      Signed-off-by: default avatarJunio C Hamano <gitster@pobox.com>
      01f8d594
    • Jeff King's avatar
      sha1-file: avoid "sha1 file" for generic use in messages · 2c319886
      Jeff King authored
      These error messages say "sha1 file", which is vague and not common in
      user-facing documentation. Unlike the conversions from the previous
      commit, these do not always refer to loose objects.
      
      In finalize_object_file() we could be dealing with a packfile. Let's
      just say "unable to write file" instead; since we include the filename,
      the nature of the file is clear from the rest of the message.
      
      In force_object_loose(), we're calling into read_object(), which could
      actually be _any_ type of object. Just say "object".
      Signed-off-by: default avatarJeff King <peff@peff.net>
      Signed-off-by: default avatarJunio C Hamano <gitster@pobox.com>
      2c319886
    • Jeff King's avatar
      sha1-file: prefer "loose object file" to "sha1 file" in messages · 76011357
      Jeff King authored
      When we're reporting an error for a loose object, let's use that term.
      It's more consistent with other parts of Git, and it is future-proof
      against changes to the hash function.
      Signed-off-by: default avatarJeff King <peff@peff.net>
      Signed-off-by: default avatarJunio C Hamano <gitster@pobox.com>
      76011357
    • Jeff King's avatar
      sha1-file: drop has_sha1_file() · 5d3679ee
      Jeff King authored
      There are no callers left of has_sha1_file() or its with_flags()
      variant. Let's drop them, and convert has_object_file() from a wrapper
      into the "real" function. Ironically, the sha1 variant was just copying
      into an object_id internally, so the resulting code is actually shorter!
      
      We can also drop the coccinelle rules for catching has_sha1_file()
      callers. Since the function no longer exists, the compiler will do that
      for us.
      Signed-off-by: default avatarJeff King <peff@peff.net>
      Signed-off-by: default avatarJunio C Hamano <gitster@pobox.com>
      5d3679ee
    • Jeff King's avatar
      convert has_sha1_file() callers to has_object_file() · 98374a07
      Jeff King authored
      The only remaining callers of has_sha1_file() actually have an object_id
      already. They can use the "object" variant, rather than dereferencing
      the hash themselves.
      
      The code changes here were completely generated by the included
      coccinelle patch.
      Signed-off-by: default avatarJeff King <peff@peff.net>
      Signed-off-by: default avatarJunio C Hamano <gitster@pobox.com>
      98374a07
    • Jeff King's avatar
      sha1-file: convert pass-through functions to object_id · d7a24573
      Jeff King authored
      These two static functions, read_object() and quick_has_loose(), both
      have to hashcpy() their bare-sha1 arguments into object_id structs to
      pass them along. Since all of their callers actually have object_id
      structs in the first place, we can eliminate the copying by adjusting
      their input parameters.
      Signed-off-by: default avatarJeff King <peff@peff.net>
      Signed-off-by: default avatarJunio C Hamano <gitster@pobox.com>
      d7a24573
    • René Scharfe's avatar
      object-store: retire odb_load_loose_cache() · 8be88dbc
      René Scharfe authored
      Inline odb_load_loose_cache() into its only remaining caller,
      odb_loose_cache().  The latter offers a nicer interface for loading the
      cache, as it doesn't require callers to deal with fanout directory
      numbers directly.
      Signed-off-by: default avatarRene Scharfe <l.s.r@web.de>
      Signed-off-by: default avatarJunio C Hamano <gitster@pobox.com>
      8be88dbc
    • Jeff King's avatar
      sha1-file: modernize loose header/stream functions · 00a7760e
      Jeff King authored
      As with the open/map/close functions for loose objects that were
      recently converted, the functions for parsing the loose object stream
      use the name "sha1" and a bare "unsigned char *". Let's fix that so that
      unpack_sha1_header() becomes unpack_loose_header(), etc.
      
      These conversions are less clear-cut than the file access functions.
      You could argue that the they are parsing Git's canonical object format
      (i.e., "type size\0contents", over which we compute the hash), which is
      not strictly tied to loose storage. But in practice these functions are
      used only for loose objects, and using the term "loose_header" (instead
      of "object_header") distinguishes it from the object header found in
      packfiles (which contains the same information in a different format).
      Signed-off-by: default avatarJeff King <peff@peff.net>
      Signed-off-by: default avatarJunio C Hamano <gitster@pobox.com>
      00a7760e
    • René Scharfe's avatar
      object-store: use one oid_array per subdirectory for loose cache · 4cea1ce0
      René Scharfe authored
      The loose objects cache is filled one subdirectory at a time as needed.
      It is stored in an oid_array, which has to be resorted after each add
      operation.  So when querying a wide range of objects, the partially
      filled array needs to be resorted up to 255 times, which takes over 100
      times longer than sorting once.
      
      Use one oid_array for each subdirectory.  This ensures that entries have
      to only be sorted a single time.  It also avoids eight binary search
      steps for each cache lookup as a small bonus.
      
      The cache is used for collision checks for the log placeholders %h, %t
      and %p, and we can see the change speeding them up in a repository with
      ca. 100 objects per subdirectory:
      
      $ git count-objects
      26733 objects, 68808 kilobytes
      
      Test                        HEAD^             HEAD
      --------------------------------------------------------------------
      4205.1: log with %H         0.51(0.47+0.04)   0.51(0.49+0.02) +0.0%
      4205.2: log with %h         0.84(0.82+0.02)   0.60(0.57+0.03) -28.6%
      4205.3: log with %T         0.53(0.49+0.04)   0.52(0.48+0.03) -1.9%
      4205.4: log with %t         0.84(0.80+0.04)   0.60(0.59+0.01) -28.6%
      4205.5: log with %P         0.52(0.48+0.03)   0.51(0.50+0.01) -1.9%
      4205.6: log with %p         0.85(0.78+0.06)   0.61(0.56+0.05) -28.2%
      4205.7: log with %h-%h-%h   0.96(0.92+0.03)   0.69(0.64+0.04) -28.1%
      Reported-by: Ævar Arnfjörð Bjarmason's avatarÆvar Arnfjörð Bjarmason <avarab@gmail.com>
      Signed-off-by: default avatarRene Scharfe <l.s.r@web.de>
      Signed-off-by: default avatarJunio C Hamano <gitster@pobox.com>
      4cea1ce0
    • Jeff King's avatar
      sha1-file: modernize loose object file functions · 514c5fdd
      Jeff King authored
      The loose object access code in sha1-file.c is some of the oldest in
      Git, and could use some modernizing. It mostly uses "unsigned char *"
      for object ids, which these days should be "struct object_id".
      
      It also uses the term "sha1_file" in many functions, which is confusing.
      The term "loose_objects" is much better. It clearly distinguishes
      them from packed objects (which didn't even exist back when the name
      "sha1_file" came into being). And it also distinguishes it from the
      checksummed-file concept in csum-file.c (which until recently was
      actually called "struct sha1file"!).
      
      This patch converts the functions {open,close,map,stat}_sha1_file() into
      open_loose_object(), etc, and switches their sha1 arguments for
      object_id structs. Similarly, path functions like fill_sha1_path()
      become fill_loose_path() and use object_ids.
      
      The function sha1_loose_object_info() already says "loose", so we can
      just drop the "sha1" (and teach it to use object_id).
      Signed-off-by: default avatarJeff King <peff@peff.net>
      Signed-off-by: default avatarJunio C Hamano <gitster@pobox.com>
      514c5fdd
    • René Scharfe's avatar
      object-store: factor out odb_clear_loose_cache() · d4e19e51
      René Scharfe authored
      Add and use a function for emptying the loose object cache, so callers
      don't have to know any of its implementation details.
      Signed-off-by: default avatarRene Scharfe <l.s.r@web.de>
      Signed-off-by: default avatarJunio C Hamano <gitster@pobox.com>
      d4e19e51
    • René Scharfe's avatar
      object-store: factor out odb_loose_cache() · 0000d654
      René Scharfe authored
      Add and use a function for loading the entries of a loose object
      subdirectory for a given object ID.  It frees callers from deriving the
      fanout key; they can use the returned oid_array reference for lookups or
      forward range scans.
      Suggested-by: default avatarJeff King <peff@peff.net>
      Signed-off-by: default avatarRene Scharfe <l.s.r@web.de>
      Signed-off-by: default avatarJunio C Hamano <gitster@pobox.com>
      0000d654
    • Jeff King's avatar
      sha1-file: fix outdated sha1 comment references · cb1c8d1d
      Jeff King authored
      Commit 17e65451 (sha1_file: convert check_sha1_signature to struct
      object_id, 2018-03-12) switched to using the name "oid", but forgot to
      update the variable name in the comment.
      
      Likewise, b4f5aca4 (sha1_file: convert read_sha1_file to struct
      object_id, 2018-03-12) dropped the name read_sha1_file(), but missed a
      comment which mentions it.
      Signed-off-by: default avatarJeff King <peff@peff.net>
      Signed-off-by: default avatarJunio C Hamano <gitster@pobox.com>
      cb1c8d1d
  2. 07 Jan, 2019 1 commit
  3. 24 Nov, 2018 1 commit
  4. 14 Nov, 2018 4 commits
  5. 13 Nov, 2018 6 commits
    • Jeff King's avatar
      sha1-file: use loose object cache for quick existence check · 61c7711c
      Jeff King authored
      In cases where we expect to ask has_sha1_file() about a lot of objects
      that we are not likely to have (e.g., during fetch negotiation), we
      already use OBJECT_INFO_QUICK to sacrifice accuracy (due to racing with
      a simultaneous write or repack) for speed (we avoid re-scanning the pack
      directory).
      
      However, even checking for loose objects can be expensive, as we will
      stat() each one. On many systems this cost isn't too noticeable, but
      stat() can be particularly slow on some operating systems, or due to
      network filesystems.
      
      Since the QUICK flag already tells us that we're OK with a slightly
      stale answer, we can use that as a cue to look in our in-memory cache of
      each object directory. That basically trades an in-memory binary search
      for a stat() call.
      
      Note that it is possible for this to actually be _slower_. We'll do a
      full readdir() to fill the cache, so if you have a very large number of
      loose objects and a very small number of lookups, that readdir() may end
      up more expensive.
      
      This shouldn't be a big deal in practice. If you have a large number of
      reachable loose objects, you'll already run into performance problems
      (which you should remedy by repacking). You may have unreachable objects
      which wouldn't otherwise impact performance. Usually these would go away
      with the prune step of "git gc", but they may be held for up to 2 weeks
      in the default configuration.
      
      So it comes down to how many such objects you might reasonably expect to
      have, how much slower is readdir() on N entries versus M stat() calls
      (and here we really care about the syscall backing readdir(), like
      getdents() on Linux, but I'll just call this readdir() below).
      
      If N is much smaller than M (a typical packed repo), we know this is a
      big win (few readdirs() followed by many uses of the resulting cache).
      When N and M are similar in size, it's also a win. We care about the
      latency of making a syscall, and readdir() should be giving us many
      values in a single call. How many?
      
      On Linux, running "strace -e getdents ls" shows a 32k buffer getting 512
      entries per call (which is 64 bytes per entry; the name itself is 38
      bytes, plus there are some other fields). So we can imagine that this is
      always a win as long as the number of loose objects in the repository is
      a factor of 500 less than the number of lookups you make. It's hard to
      auto-tune this because we don't generally know up front how many lookups
      we're going to do. But it's unlikely for this to perform significantly
      worse.
      Signed-off-by: default avatarJeff King <peff@peff.net>
      Signed-off-by: default avatarJunio C Hamano <gitster@pobox.com>
      61c7711c
    • Jeff King's avatar
      object-store: provide helpers for loose_objects_cache · 3a2e0824
      Jeff King authored
      Our object_directory struct has a loose objects cache that all users of
      the struct can see. But the only one that knows how to load the cache is
      find_short_object_filename(). Let's extract that logic in to a reusable
      function.
      
      While we're at it, let's also reset the cache when we re-read the object
      directories. This shouldn't have an impact on performance, as re-reads
      are meant to be rare (and are already expensive, so we avoid them with
      things like OBJECT_INFO_QUICK).
      
      Since the cache is already meant to be an approximation, it's tempting
      to skip even this bit of safety. But it's necessary to allow more code
      to use it. For instance, fetch-pack explicitly re-reads the object
      directory after performing its fetch, and would be confused if we didn't
      clear the cache.
      Signed-off-by: default avatarJeff King <peff@peff.net>
      Signed-off-by: default avatarJunio C Hamano <gitster@pobox.com>
      3a2e0824
    • Jeff King's avatar
      sha1-file: use an object_directory for the main object dir · f0eaf638
      Jeff King authored
      Our handling of alternate object directories is needlessly different
      from the main object directory. As a result, many places in the code
      basically look like this:
      
        do_something(r->objects->objdir);
      
        for (odb = r->objects->alt_odb_list; odb; odb = odb->next)
              do_something(odb->path);
      
      That gets annoying when do_something() is non-trivial, and we've
      resorted to gross hacks like creating fake alternates (see
      find_short_object_filename()).
      
      Instead, let's give each raw_object_store a unified list of
      object_directory structs. The first will be the main store, and
      everything after is an alternate. Very few callers even care about the
      distinction, and can just loop over the whole list (and those who care
      can just treat the first element differently).
      
      A few observations:
      
        - we don't need r->objects->objectdir anymore, and can just
          mechanically convert that to r->objects->odb->path
      
        - object_directory's path field needs to become a real pointer rather
          than a FLEX_ARRAY, in order to fill it with expand_base_dir()
      
        - we'll call prepare_alt_odb() earlier in many functions (i.e.,
          outside of the loop). This may result in us calling it even when our
          function would be satisfied looking only at the main odb.
      
          But this doesn't matter in practice. It's not a very expensive
          operation in the first place, and in the majority of cases it will
          be a noop. We call it already (and cache its results) in
          prepare_packed_git(), and we'll generally check packs before loose
          objects. So essentially every program is going to call it
          immediately once per program.
      
          Arguably we should just prepare_alt_odb() immediately upon setting
          up the repository's object directory, which would save us sprinkling
          calls throughout the code base (and forgetting to do so has been a
          source of subtle bugs in the past). But I've stopped short of that
          here, since there are already a lot of other moving parts in this
          patch.
      
        - Most call sites just get shorter. The check_and_freshen() functions
          are an exception, because they have entry points to handle local and
          nonlocal directories separately.
      Signed-off-by: default avatarJeff King <peff@peff.net>
      Signed-off-by: default avatarJunio C Hamano <gitster@pobox.com>
      f0eaf638
    • Jeff King's avatar
      handle alternates paths the same as the main object dir · f3f043a1
      Jeff King authored
      When we generate loose file paths for the main object directory, the
      caller provides a buffer to loose_object_path (formerly sha1_file_name).
      The callers generally keep their own static buffer to avoid excessive
      reallocations.
      
      But for alternate directories, each struct carries its own scratch
      buffer. This is needlessly different; let's unify them.
      
      We could go either direction here, but this patch moves the alternates
      struct over to the main directory style (rather than vice-versa).
      Technically the alternates style is more efficient, as it avoids
      rewriting the object directory name on each call. But this is unlikely
      to matter in practice, as we avoid reallocations either way (and nobody
      has ever noticed or complained that the main object directory is copying
      a few extra bytes before making a much more expensive system call).
      
      And this has the advantage that the reusable buffers are tied to
      particular calls, which makes the invalidation rules simpler (for
      example, the return value from stat_sha1_file() used to be invalidated
      by basically any other object call, but now it is affected only by other
      calls to stat_sha1_file()).
      
      We do steal the trick from alt_sha1_path() of returning a pointer to the
      filled buffer, which makes a few conversions more convenient.
      Signed-off-by: default avatarJeff King <peff@peff.net>
      Signed-off-by: default avatarJunio C Hamano <gitster@pobox.com>
      f3f043a1
    • Jeff King's avatar
      sha1_file_name(): overwrite buffer instead of appending · b69fb867
      Jeff King authored
      The sha1_file_name() function is used to generate the path to a loose
      object in the object directory. It doesn't make much sense for it to
      append, since the the path we write may be absolute (i.e., you cannot
      reliably build up a path with it). Because many callers use it with a
      static buffer, they have to strbuf_reset() manually before each call
      (and the other callers always use an empty buffer, so they don't care
      either way). Let's handle this automatically.
      
      Since we're changing the semantics, let's take the opportunity to give
      it a more hash-neutral name (which will also catch any callers from
      topics in flight).
      Signed-off-by: default avatarJeff King <peff@peff.net>
      Signed-off-by: default avatarJunio C Hamano <gitster@pobox.com>
      b69fb867
    • Jeff King's avatar
      rename "alternate_object_database" to "object_directory" · 263db403
      Jeff King authored
      In preparation for unifying the handling of alt odb's and the normal
      repo object directory, let's use a more neutral name. This patch is
      purely mechanical, swapping the type name, and converting any variables
      named "alt" to "odb". There should be no functional change, but it will
      reduce the noise in subsequent diffs.
      Signed-off-by: default avatarJeff King <peff@peff.net>
      Signed-off-by: default avatarJunio C Hamano <gitster@pobox.com>
      263db403
  6. 12 Nov, 2018 1 commit
    • Torsten Bögershausen's avatar
      Upcast size_t variables to uintmax_t when printing · ca473cef
      Torsten Bögershausen authored
      When printing variables which contain a size, today "unsigned long"
      is used at many places.
      In order to be able to change the type from "unsigned long" into size_t
      some day in the future, we need to have a way to print 64 bit variables
      on a system that has "unsigned long" defined to be 32 bit, like Win64.
      
      Upcast all those variables into uintmax_t before they are printed.
      This is to prepare for a bigger change, when "unsigned long"
      will be converted into size_t for variables which may be > 4Gib.
      Signed-off-by: default avatarTorsten Bögershausen <tboegi@web.de>
      Signed-off-by: default avatarJunio C Hamano <gitster@pobox.com>
      ca473cef
  7. 22 Oct, 2018 2 commits
    • brian m. carlson's avatar
      sha1-file: provide functions to look up hash algorithms · 2f90b9d9
      brian m. carlson authored
      There are several ways we might refer to a hash algorithm: by name, such
      as in the config file; by format ID, such as in a pack; or internally,
      by a pointer to the hash_algos array.  Provide functions to look up hash
      algorithms based on these various forms and return the internal constant
      used for them.  If conversion to another form is necessary, this
      internal constant can be used to look up the proper data in the
      hash_algos array.
      Signed-off-by: brian m. carlson's avatarbrian m. carlson <sandals@crustytoothpaste.net>
      Signed-off-by: default avatarJunio C Hamano <gitster@pobox.com>
      2f90b9d9
    • brian m. carlson's avatar
      sha1-file: rename algorithm to "sha1" · 1ccf07cb
      brian m. carlson authored
      The transition plan anticipates us using a syntax such as "^{sha1}" for
      disambiguation.  Since this is a syntax some people will be typing a
      lot, it makes sense to provide a short, easy-to-type syntax.  Omitting
      the dash doesn't create any ambiguity; however, it does make the syntax
      shorter and easier to type, especially for touch typists.  In addition,
      the transition plan already uses "sha1" in this context.
      
      Rename the name of SHA-1 implementation to "sha1".
      
      Note that this change creates no backwards compatibility concerns, since
      we haven't yet used this field in any configuration settings.
      Signed-off-by: brian m. carlson's avatarbrian m. carlson <sandals@crustytoothpaste.net>
      Signed-off-by: default avatarJunio C Hamano <gitster@pobox.com>
      1ccf07cb
  8. 19 Oct, 2018 2 commits
  9. 21 Sep, 2018 1 commit
  10. 13 Sep, 2018 1 commit
    • Jonathan Tan's avatar
      fetch-object: unify fetch_object[s] functions · 8708ca09
      Jonathan Tan authored
      There are fetch_object() and fetch_objects() helpers in
      fetch-object.h; as the latter takes "struct oid_array",
      the former cannot be made into a thin wrapper around the
      latter without an extra allocation and set-up cost.
      
      Update fetch_objects() to take an array of "struct object_id"
      and number of elements in it as separate parameters, remove
      fetch_object(), and adjust all existing callers of these
      functions to use the new fetch_objects().
      Signed-off-by: default avatarJonathan Tan <jonathantanmy@google.com>
      Signed-off-by: default avatarJunio C Hamano <gitster@pobox.com>
      8708ca09
  11. 29 Aug, 2018 3 commits
    • Jeff King's avatar
      convert "hashcmp() != 0" to "!hasheq()" · 67947c34
      Jeff King authored
      This rounds out the previous three patches, covering the
      inequality logic for the "hash" variant of the functions.
      
      As with the previous three, the accompanying code changes
      are the mechanical result of applying the coccinelle patch;
      see those patches for more discussion.
      Signed-off-by: default avatarJeff King <peff@peff.net>
      Signed-off-by: default avatarJunio C Hamano <gitster@pobox.com>
      67947c34
    • Jeff King's avatar
      convert "oidcmp() != 0" to "!oideq()" · 9001dc2a
      Jeff King authored
      This is the flip side of the previous two patches: checking
      for a non-zero oidcmp() can be more strictly expressed as
      inequality. Like those patches, we write "!= 0" in the
      coccinelle transformation, which covers by isomorphism the
      more common:
      
        if (oidcmp(E1, E2))
      
      As with the previous two patches, this patch can be achieved
      almost entirely by running "make coccicheck"; the only
      differences are manual line-wrap fixes to match the original
      code.
      
      There is one thing to note for anybody replicating this,
      though: coccinelle 1.0.4 seems to miss the case in
      builtin/tag.c, even though it's basically the same as all
      the others. Running with 1.0.7 does catch this, so
      presumably it's just a coccinelle bug that was fixed in the
      interim.
      Signed-off-by: default avatarJeff King <peff@peff.net>
      Signed-off-by: default avatarJunio C Hamano <gitster@pobox.com>
      9001dc2a
    • Jeff King's avatar
      convert "oidcmp() == 0" to oideq() · 4a7e27e9
      Jeff King authored
      Using the more restrictive oideq() should, in the long run,
      give the compiler more opportunities to optimize these
      callsites. For now, this conversion should be a complete
      noop with respect to the generated code.
      
      The result is also perhaps a little more readable, as it
      avoids the "zero is equal" idiom. Since it's so prevalent in
      C, I think seasoned programmers tend not to even notice it
      anymore, but it can sometimes make for awkward double
      negations (e.g., we can drop a few !!oidcmp() instances
      here).
      
      This patch was generated almost entirely by the included
      coccinelle patch. This mechanical conversion should be
      completely safe, because we check explicitly for cases where
      oidcmp() is compared to 0, which is what oideq() is doing
      under the hood. Note that we don't have to catch "!oidcmp()"
      separately; coccinelle's standard isomorphisms make sure the
      two are treated equivalently.
      
      I say "almost" because I did hand-edit the coccinelle output
      to fix up a few style violations (it mostly keeps the
      original formatting, but sometimes unwraps long lines).
      Signed-off-by: default avatarJeff King <peff@peff.net>
      Signed-off-by: default avatarJunio C Hamano <gitster@pobox.com>
      4a7e27e9
  12. 13 Aug, 2018 2 commits
  13. 23 Jul, 2018 2 commits
  14. 16 Jul, 2018 1 commit