    fsck: lazily load types under --connectivity-only · a2b22854
    Jeff King authored and Junio C Hamano committed
    
    
    The recent fixes to "fsck --connectivity-only" load all of
    the objects with their correct types. This keeps the
    connectivity-only code path close to the regular one, but it
    also introduces some unnecessary inefficiency. While getting
    the type of an object is cheap compared to actually opening
    and parsing the object (as the non-connectivity-only case
    would do), it's still not free.
    
    For reachable non-blob objects, we end up having to parse
    them later anyway (to see what they point to), making our
    type lookup here redundant.
    
    For unreachable objects, we might never hit them at all in
    the reachability traversal, making the lookup completely
    wasted. And in some cases, we might have quite a few
    unreachable objects (e.g., when alternates are used for
    shared object storage between repositories, it's normal for
    there to be objects reachable from other repositories but
    not the one running fsck).
    
    The comment in mark_object_for_connectivity() claims two
    benefits to getting the type up front:
    
      1. We need to know the types during fsck_walk(). (And not
         explicitly mentioned, but we also need them when
         printing the types of broken or dangling commits).
    
         We can address this by lazy-loading the types as
         necessary. Most objects never need this lazy-load at
         all, because they fall into one of these categories:
    
           a. Reachable from our tips, and are coerced into the
              correct type as we traverse (e.g., a parent link
              will call lookup_commit(), which converts OBJ_NONE
              to OBJ_COMMIT).
    
           b. Unreachable, but not at the tip of a chunk of
              unreachable history. We only mention the tips as
              "dangling", so for an unreachable commit which
              links to hundreds of other objects, we need only
              report the type of the tip commit.
    
      2. It serves as a cross-check that the coercion in (1a) is
         correct (i.e., we'll complain about a parent link that
         points to a blob). But we get most of this for free
         already, because right after coercing, we'll parse any
         non-blob objects. So we'd notice then if we expected a
         commit and got a blob.
    
         The one exception is when we expect a blob, in which
         case we never actually read the object contents.
    
         So this is a slight weakening, but given that the whole
         point of --connectivity-only is to sacrifice some data
         integrity checks for speed, this seems like an
         acceptable tradeoff.
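
    The two points above can be illustrated with a minimal,
    self-contained sketch. It is not git's actual code: every name
    in it (struct object's fields, lazy_type(), coerce_type(),
    lookup_type_on_disk(), demo_lookups()) is hypothetical, and
    the object store is simulated by a field on the struct. It
    only shows the shape of the idea: a type stays OBJ_NONE until
    either the traversal coerces it or somebody asks for it.

```c
/* Hypothetical stand-ins for git's object types; not git's real definitions. */
enum obj_type { OBJ_NONE = 0, OBJ_COMMIT, OBJ_TREE, OBJ_BLOB };

struct object {
	enum obj_type type;    /* OBJ_NONE until the type is actually needed */
	enum obj_type on_disk; /* simulates what the object store would report */
};

static int type_lookups; /* counts simulated object-store lookups */

/* Simulated stand-in for the "cheap but not free" type lookup. */
static enum obj_type lookup_type_on_disk(const struct object *obj)
{
	type_lookups++;
	return obj->on_disk;
}

/* Lazy load: consult the store only if the type is still unknown. */
static enum obj_type lazy_type(struct object *obj)
{
	if (obj->type == OBJ_NONE)
		obj->type = lookup_type_on_disk(obj);
	return obj->type;
}

/*
 * Coercion during traversal, as in case (1a): a link tells us what
 * type to expect, so we record it without touching the store.
 * Returns 0 on success, -1 if a different type was already recorded.
 */
static int coerce_type(struct object *obj, enum obj_type expected)
{
	if (obj->type == OBJ_NONE) {
		obj->type = expected;
		return 0;
	}
	return obj->type == expected ? 0 : -1;
}

/* Walks through the three situations and returns the lookup count. */
static int demo_lookups(void)
{
	struct object parent = { OBJ_NONE, OBJ_COMMIT };       /* case (1a) */
	struct object dangling_tip = { OBJ_NONE, OBJ_COMMIT }; /* tip we must name */
	struct object buried = { OBJ_NONE, OBJ_TREE };         /* case (1b) */

	type_lookups = 0;
	coerce_type(&parent, OBJ_COMMIT); /* typed by the parent link */
	lazy_type(&parent);               /* already known: no lookup */
	lazy_type(&dangling_tip);         /* reported as dangling: one lookup */
	(void)buried;                     /* never reported, so never looked up */
	return type_lookups;
}
```

    In this sketch only the dangling tip costs a lookup. Note that
    coerce_type() never consults the simulated store, so a wrong
    expectation is only caught if the object is parsed later; that
    mirrors the weakening described in (2) for expected blobs.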
    
    Here are before and after timings for an extreme case with
    ~5M reachable objects and another ~12M unreachable (it's the
    torvalds/linux repository on GitHub, connected to shared
    storage for all of the other kernel forks):
    
      [before]
      $ time git fsck --no-dangling --connectivity-only
      real	3m4.323s
      user	1m25.121s
      sys	1m38.710s
    
      [after]
      $ time git fsck --no-dangling --connectivity-only
      real	0m51.497s
      user	0m49.575s
      sys	0m1.776s
    
    Signed-off-by: Jeff King <peff@peff.net>
    Signed-off-by: Junio C Hamano <gitster@pobox.com>