1. 16 Apr, 2019 1 commit
      pack-revindex: open index if necessary · 4828ce98
      Jeff King authored
      We can't create a pack revindex if we haven't actually looked at the
      index. Normally we would never get as far as creating a revindex without
      having already been looking in the pack, so this code never bothered to
      double-check that pack->index_data had been loaded.
      
      But with the new multi-pack-index feature, many code paths might not
      load the individual pack .idx at all (they'd find objects via the midx
      and then open the .pack, but not its index).
      
      This can't yet be triggered in practice, because a bug in the midx code
      means we accidentally open up the individual .idx files anyway. But in
      preparation for fixing that, let's have the revindex code check that
      everything it needs has been loaded.
      
      In most cases this will just be a quick noop. But note that this does
      introduce a possibility of error (if we have to open the index and it's
      corrupt), so load_pack_revindex() now returns a result code, and callers
      need to handle the error.
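
The change in calling convention described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `struct pack_sketch`, `open_index_sketch()`, `load_revindex_sketch()` and `disk_size_caller_sketch()` are hypothetical stand-ins, and the `idx_is_corrupt` field exists only to simulate a failing index load.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for git's struct packed_git. */
struct pack_sketch {
	const void *index_data; /* NULL until the .idx has been loaded */
	int idx_is_corrupt;     /* demo knob: pretend opening the .idx fails */
	int revindex_ready;     /* set once the reverse index has been built */
};

/* Hypothetical helper: load the .idx on demand; 0 on success, -1 on error. */
static int open_index_sketch(struct pack_sketch *p)
{
	static const char fake_idx[] = "idx";

	if (p->index_data)
		return 0; /* already loaded: the common, cheap case */
	if (p->idx_is_corrupt)
		return -1;
	p->index_data = fake_idx;
	return 0;
}

/* Returns 0 on success, -1 if the index could not be opened. */
static int load_revindex_sketch(struct pack_sketch *p)
{
	if (p->revindex_ready)
		return 0;
	if (open_index_sketch(p) < 0)
		return -1;
	/* ... a real implementation would build the revindex here ... */
	p->revindex_ready = 1;
	return 0;
}

/* A caller now has to check the result instead of assuming success. */
static int disk_size_caller_sketch(struct pack_sketch *p)
{
	if (load_revindex_sketch(p) < 0) {
		fprintf(stderr, "error: unable to load pack revindex\n");
		return -1;
	}
	return 0;
}
```

When the index is already loaded, the extra check costs one pointer test, which is the "quick noop" case.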
      Signed-off-by: Jeff King <peff@peff.net>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
  2. 04 Feb, 2019 1 commit
  3. 15 Oct, 2018 1 commit
  4. 26 Mar, 2018 1 commit
  5. 19 Jan, 2018 1 commit
  6. 30 Jan, 2017 1 commit
      use SWAP macro · 35d803bc
      René Scharfe authored
      Apply the semantic patch swap.cocci to convert hand-rolled swaps to use
      the macro SWAP.  The resulting code is shorter and easier to read, and
      the object code is effectively unchanged.
      
      The patch for object.c had to be hand-edited in order to preserve the
      comment before the change; Coccinelle tried to eat it for some reason.
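
As an illustration of the kind of conversion described above, here is a simplified sketch. `SWAP_SKETCH` is a hypothetical macro written for this example and is not git's actual SWAP definition; it assumes both arguments are lvalues of the same size.

```c
#include <string.h>

/* Simplified sketch; not git's actual SWAP definition.
 * Assumes a and b are lvalues with sizeof(a) == sizeof(b). */
#define SWAP_SKETCH(a, b) do {                  \
	unsigned char swap_tmp_[sizeof(a)];     \
	memcpy(swap_tmp_, &(a), sizeof(a));     \
	memcpy(&(a), &(b), sizeof(a));          \
	memcpy(&(b), swap_tmp_, sizeof(a));     \
} while (0)

/* Hand-rolled swap, the kind of code a semantic patch would match. */
static void swap_by_hand(int *x, int *y)
{
	int tmp = *x;
	*x = *y;
	*y = tmp;
}

/* The same operation expressed with the macro. */
static void swap_with_macro(int *x, int *y)
{
	SWAP_SKETCH(*x, *y);
}
```

The macro form removes the temporary variable from the call site, which is where the shorter code comes from.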
      Signed-off-by: Rene Scharfe <l.s.r@web.de>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
  7. 25 Sep, 2016 1 commit
      use COPY_ARRAY · 45ccef87
      René Scharfe authored
      Add a semantic patch for converting certain calls of memcpy(3) to
      COPY_ARRAY() and apply that transformation to the code base.  The result
      is shorter and safer code.  For now only consider calls where source and
      destination have the same type, or in other words: easy cases.
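
A rough sketch of why such a macro can be both shorter and safer than a bare memcpy(3) call: the element size is derived from the destination pointer, and the byte count can be checked for overflow in one place. `COPY_ARRAY_SKETCH` and `copy_array_sketch()` are hypothetical names for this example, and returning -1 on overflow is an assumption made here, not necessarily what git's COPY_ARRAY() does.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch helper: copy n elements of `size` bytes each.
 * Returns 0 on success, -1 if n * size would overflow size_t. */
static int copy_array_sketch(void *dst, const void *src, size_t n, size_t size)
{
	if (size && n > SIZE_MAX / size)
		return -1;
	if (n)
		memcpy(dst, src, n * size);
	return 0;
}

/* The element size comes from the destination type, so the caller
 * cannot pass a mismatched sizeof by accident. */
#define COPY_ARRAY_SKETCH(dst, src, n) \
	copy_array_sketch((dst), (src), (n), sizeof(*(dst)))
```

A call such as `memcpy(dst, src, n * sizeof(*dst))` would then become `COPY_ARRAY_SKETCH(dst, src, n)`.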
      Signed-off-by: Rene Scharfe <l.s.r@web.de>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
  8. 22 Feb, 2016 1 commit
  9. 21 Dec, 2015 2 commits
      pack-revindex: store entries directly in packed_git · 9d98bbf5
      Jeff King authored
      A pack_revindex struct has two elements: the revindex
      entries themselves, and a pointer to the packed_git. We need
      both to do lookups, because only the latter knows things
      like the number of objects in the pack.
      
      Now that packed_git contains the pack_revindex struct it's
      just as easy to pass around the packed_git itself, and we do
      not need the extra back-pointer.
      
      We can instead just store the entries directly in the pack.
      All functions which took a pack_revindex now just take a
      packed_git. We still lazy-load in find_pack_revindex, so
      most callers are unaffected.
      
      The exception is the bitmap code, which computes the
      revindex and caches the pointer when we load the bitmaps. We
      can continue to load, drop the extra cache pointer, and just
      access bitmap_git.pack.revindex directly.
      Signed-off-by: Jeff King <peff@peff.net>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
      pack-revindex: drop hash table · f4015337
      Jeff King authored
      The main entry point to the pack-revindex code is
      find_pack_revindex(). This calls revindex_for_pack(), which
      lazily computes and caches the revindex for the pack.
      
      We store the cache in a very simple hash table. It's created
      by init_pack_revindex(), which inserts an entry for every
      packfile we know about, and we never grow or shrink the
      hash. If we ever need the revindex for a pack that isn't in
      the hash, we die() with an internal error.
      
      This can lead to a race, because we may load more packs
      after having called init_pack_revindex(). For example,
      imagine we have one process which needs to look at the
      revindex for a variety of objects (e.g., cat-file's
      "%(objectsize:disk)" format).  Simultaneously, git-gc is
      running, which is doing a `git repack -ad`. We might hit a
      sequence like:
      
        1. We need the revindex for some packed object. We call
           find_pack_revindex() and end up in init_pack_revindex()
           to create the hash table for all packs we know about.
      
        2. We look up another object and can't find it, because
           the repack has removed the pack it's in. We re-scan the
           pack directory and find a new pack containing the
           object. It gets added to our packed_git list.
      
        3. We call find_pack_revindex() for the new object, which
           hits revindex_for_pack() for our new pack. It can't
           find the packed_git in the revindex hash, and dies.
      
      You could also replace the `repack` above with a push or
      fetch to create a new pack, though these are less likely
      (you would have to somehow learn about the new objects to
      look them up).
      
      Prior to 1a6d8b91 (do not discard revindex when re-preparing
      packfiles, 2014-01-15), this was safe, as we threw away the
      revindex whenever we re-scanned the pack directory (and thus
      re-created the revindex hash on the fly). However, we don't
      want to simply revert that commit, as it was solving a
      different race.
      
      So we have a few options:
      
        - We can fix the race in 1a6d8b91 differently, by having
          the bitmap code look in the revindex hash instead of
          caching the pointer. But this would introduce a lot of
          extra hash lookups for common bitmap operations.
      
        - We could teach the revindex to dynamically add new packs
          to the hash table. This would perform the same, but
          would mean adding extra code to the revindex hash (which
          currently cannot be resized at all).
      
        - We can get rid of the hash table entirely. There is
          exactly one revindex per pack, so we can just store it
          in the packed_git struct. Since it's initialized lazily,
          it does not add to the startup cost.
      
          This is the best of both worlds: less code and fewer
          hash table lookups.  The original code likely avoided
          this in the name of encapsulation. But the packed_git
          and reverse_index code are fairly intimate already, so
          it's not much of a loss.
      
      This patch implements the final option. It's a minimal
      conversion that retains the pack_revindex struct. No callers
      need to change, and we can do further cleanup in a follow-on
      patch.
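
The final option can be sketched as below: the lazily built cache lives in the pack struct itself, so a pack discovered after startup needs no registration in any table. This is a simplified illustration under assumptions; `struct packed_git_sketch`, `struct revindex_entry_sketch` and `revindex_for_pack_sketch()` are hypothetical, and the real structs carry many more fields.

```c
#include <stdint.h>
#include <stdlib.h>

struct revindex_entry_sketch {
	uint64_t offset; /* position of the object within the packfile */
	uint32_t nr;     /* position of the object within the .idx */
};

/* Hypothetical, trimmed-down pack struct: the cache lives in the pack. */
struct packed_git_sketch {
	uint32_t num_objects;
	const uint64_t *idx_offsets;            /* offsets in .idx order */
	struct revindex_entry_sketch *revindex; /* NULL until first use */
};

static int cmp_offset_sketch(const void *a, const void *b)
{
	const struct revindex_entry_sketch *x = a, *y = b;
	return (x->offset > y->offset) - (x->offset < y->offset);
}

/* Build the revindex on first use; later calls return the cached array.
 * Returns NULL only if allocation fails. */
static struct revindex_entry_sketch *
revindex_for_pack_sketch(struct packed_git_sketch *p)
{
	size_t alloc = p->num_objects ? p->num_objects : 1;
	uint32_t i;

	if (p->revindex)
		return p->revindex;
	p->revindex = malloc(alloc * sizeof(*p->revindex));
	if (!p->revindex)
		return NULL;
	for (i = 0; i < p->num_objects; i++) {
		p->revindex[i].offset = p->idx_offsets[i];
		p->revindex[i].nr = i;
	}
	qsort(p->revindex, p->num_objects, sizeof(*p->revindex),
	      cmp_offset_sketch);
	return p->revindex;
}
```

Because the lookup is a single pointer test on the pack, there is no separate table that can fall out of sync with the pack list.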
      Signed-off-by: Jeff King <peff@peff.net>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
  10. 26 Oct, 2015 1 commit
  11. 27 May, 2014 1 commit
  12. 16 Jan, 2014 1 commit
      do not discard revindex when re-preparing packfiles · 1a6d8b91
      Jeff King authored
      When an object lookup fails, we re-read the objects/pack
      directory to pick up any new packfiles that may have been
      created since our last read. We also discard any pack
      revindex structs we've allocated.
      
      The discarding is a problem for the pack-bitmap code, which keeps
      a pointer to the revindex for the bitmapped pack. After the
      discard, the pointer is invalid, and we may read free()d
      memory.
      
      Other revindex users do not keep a bare pointer to the
      revindex; instead, they always access it through
      revindex_for_pack(), which lazily builds the revindex. So
      one solution is to teach the pack-bitmap code a similar
      trick. It would be slightly less efficient, but probably not
      all that noticeable.
      
      However, it turns out this discarding is not actually
      necessary. When we call reprepare_packed_git, we do not
      throw away our old pack list. We keep the existing entries,
      and only add in new ones. So there is no safety problem; we
      will still have the pack struct that matches each revindex.
      The packfile itself may go away, of course, but we are
      already prepared to handle that, and it may happen outside
      of reprepare_packed_git anyway.
      
      Throwing away the revindex may save some RAM if the pack
      never gets reused (about 12 bytes per object). But it also
      wastes some CPU time (to regenerate the index) if the pack
      does get reused. It's hard to say which is more valuable,
      but in either case, it happens very rarely (only when we
      race with a simultaneous repack). Just leaving the revindex
      in place is simple and safe both for current and future
      code.
      Signed-off-by: Jeff King <peff@peff.net>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
  13. 24 Oct, 2013 1 commit
  14. 12 Jul, 2013 2 commits
      pack-revindex: radix-sort the revindex · 8b8dfd51
      Jeff King authored
      The pack revindex stores the offsets of the objects in the
      pack in sorted order, allowing us to easily find the on-disk
      size of each object. To compute it, we populate an array
      with the offsets from the sha1-sorted idx file, and then use
      qsort to order it by offsets.
      
      That does O(n log n) offset comparisons, and profiling shows
      that we spend most of our time in cmp_offset. However, since
      we are sorting on a simple off_t, we can use numeric sorts
      that perform better. A radix sort can run in O(k*n), where k
      is the number of "digits" in our number. For a 64-bit off_t,
      using 16-bit "digits" gives us k=4.
      
      On the linux.git repo, with about 3M objects to sort, this
      yields a roughly 4x speedup. Here are the best-of-five numbers for
      running
      
        echo HEAD | git cat-file --batch-check="%(objectsize:disk)"
      
      on a fully packed repository, which is dominated by time
      spent building the pack revindex:
      
                before     after
        real    0m0.834s   0m0.204s
        user    0m0.788s   0m0.164s
        sys     0m0.040s   0m0.036s
      
      This matches our algorithmic expectations. log2(3M) is ~21.5,
      so a traditional sort is ~21.5n. Our radix sort runs in k*n,
      where k is the number of radix digits. In the worst case,
      this is k=4 for a 64-bit off_t, but we can quit early when
      the largest value to be sorted is smaller. For any
      repository under 4G, k=2. Our algorithm makes two passes
      over the list per radix digit, so we end up with 4n. That
      should yield ~5.3x speedup. We see 4x here; the difference
      is probably due to the extra bucket book-keeping the radix
      sort has to do.
      
      On a smaller repo, the difference is less impressive, as
      log(n) is smaller. For git.git, with 173K objects (but still
      k=2), we see a 2.7x improvement:
      
                before     after
        real    0m0.046s   0m0.017s
        user    0m0.036s   0m0.012s
        sys     0m0.008s   0m0.000s
      
      On even tinier repos (e.g., a few hundred objects), the
      speedup goes away entirely, as the small advantage of the
      radix sort gets erased by the book-keeping costs (and at
      those sizes, the cost to generate the rev-index gets
      lost in the noise anyway).
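
The approach described in this message can be sketched as a least-significant-digit radix sort over 16-bit digits. This is a generic illustration, not the code from the commit: it sorts bare `uint64_t` offsets rather than revindex entries, and `radix_sort_offsets()`, `DIGIT_BITS` and `BUCKETS` are names chosen for this example. It makes two passes over the list per digit (count, then place) and stops early once the remaining digits of the largest value are all zero.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DIGIT_BITS 16
#define BUCKETS (1u << DIGIT_BITS)

/* Sort n offsets in ascending order. Returns 0, or -1 if malloc fails. */
static int radix_sort_offsets(uint64_t *offsets, size_t n)
{
	uint64_t *from = offsets, *to, *tmp, *swap, max = 0;
	size_t *pos, i, b, sum, count;
	unsigned shift;

	tmp = malloc((n ? n : 1) * sizeof(*tmp));
	pos = malloc(BUCKETS * sizeof(*pos));
	if (!tmp || !pos) {
		free(tmp);
		free(pos);
		return -1;
	}
	to = tmp;

	for (i = 0; i < n; i++)
		if (offsets[i] > max)
			max = offsets[i];

	/* Stop once every remaining digit of the largest value is zero. */
	for (shift = 0; shift < 64 && (max >> shift) != 0; shift += DIGIT_BITS) {
		/* Pass 1: count how many values fall into each bucket. */
		memset(pos, 0, BUCKETS * sizeof(*pos));
		for (i = 0; i < n; i++)
			pos[(from[i] >> shift) & (BUCKETS - 1)]++;

		/* Turn the counts into the start index of each bucket. */
		for (b = 0, sum = 0; b < BUCKETS; b++) {
			count = pos[b];
			pos[b] = sum;
			sum += count;
		}

		/* Pass 2: place values in bucket order, keeping ties stable. */
		for (i = 0; i < n; i++)
			to[pos[(from[i] >> shift) & (BUCKETS - 1)]++] = from[i];

		swap = from;
		from = to;
		to = swap;
	}

	/* After an odd number of digit passes the result sits in tmp. */
	if (from != offsets)
		memcpy(offsets, from, n * sizeof(*offsets));
	free(tmp);
	free(pos);
	return 0;
}
```

For offsets below 4G the loop body runs twice, matching the k=2 case discussed above. The per-digit bucket reset and prefix sum are the book-keeping cost that dominates on tiny inputs.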
      Signed-off-by: Jeff King <peff@peff.net>
      Reviewed-by: Brandon Casey <drafnel@gmail.com>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
      pack-revindex: use unsigned to store number of objects · 012b32bb
      Jeff King authored
      A packfile may have up to 2^32-1 objects in it, so the
      "right" data type to use is uint32_t. We currently use a
      signed int, which means that we may behave incorrectly for
      packfiles with more than 2^31-1 objects on 32-bit systems.
      
      Nobody has noticed because having 2^31 objects is pretty
      insane. The linux.git repo has on the order of 2^22 objects,
      which is hundreds of times smaller than necessary to trigger
      the bug.
      
      Let's bump this up to an "unsigned". On 32-bit systems, this
      gives us the correct data-type, and on 64-bit systems, it is
      probably more efficient to use the native "unsigned" than a
      true uint32_t.
      
      While we're at it, we can fix the binary search not to
      overflow in such a case if our unsigned is 32 bits.
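
The binary-search concern can be illustrated with a small sketch. The usual overflow-safe idiom computes the midpoint as `lo + (hi - lo) / 2` instead of `(lo + hi) / 2`, since the sum can wrap when both bounds are large. The function names below are hypothetical, the exact form used in the commit is not shown here, and the comments assume a 32-bit `unsigned`.

```c
#include <stdint.h>

/* lo + (hi - lo) / 2 cannot wrap when lo <= hi, unlike (lo + hi) / 2. */
static unsigned safe_midpoint(unsigned lo, unsigned hi)
{
	return lo + (hi - lo) / 2;
}

/* Look for `want` in an ascending array of n offsets.
 * Returns 1 and stores the index in *pos if found, 0 otherwise. */
static int find_offset_sketch(const uint64_t *offsets, unsigned n,
			      uint64_t want, unsigned *pos)
{
	unsigned lo = 0, hi = n; /* search the half-open range [lo, hi) */

	while (lo < hi) {
		unsigned mi = safe_midpoint(lo, hi);

		if (offsets[mi] == want) {
			*pos = mi;
			return 1;
		} else if (offsets[mi] < want) {
			lo = mi + 1;
		} else {
			hi = mi;
		}
	}
	return 0;
}
```

With bounds of 3000000000 and 4000000000, the naive sum exceeds 2^32 and wraps, while the difference-based form stays in range.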
      Signed-off-by: Jeff King <peff@peff.net>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
  15. 23 Jul, 2009 1 commit
  16. 02 Nov, 2008 1 commit
  17. 23 Aug, 2008 1 commit
  18. 24 Jun, 2008 1 commit
  19. 01 Mar, 2008 1 commit