  1. 14 Feb, 2019 1 commit
    • prune: use bitmaps for reachability traversal · fde67d68
      Jeff King authored
      Pruning generally has to traverse the whole commit graph in order to
      see which objects are reachable. This is the exact problem that
      reachability bitmaps were meant to solve, so let's use them (if they're
      available, of course).
      
      Here are timings on git.git:
      
        Test                            HEAD^             HEAD
        ------------------------------------------------------------------------
        5304.6: prune with bitmaps      3.65(3.56+0.09)   1.01(0.92+0.08) -72.3%
      
      And on linux.git:
      
        Test                            HEAD^               HEAD
        --------------------------------------------------------------------------
        5304.6: prune with bitmaps      35.05(34.79+0.23)   3.00(2.78+0.21) -91.4%
      
      The tests show a pretty optimal case, as we'll have just repacked and
      should have pretty good coverage of all refs with our bitmaps. But
      that's actually pretty realistic: normally prune is run via "gc" right
      after repacking.
      
      A few notes on the implementation:
      
        - The change is actually in reachable.c, so it would improve
          reachability traversals by "reflog expire --stale-fix", as well.
          Those aren't performed regularly, though (a normal "git gc" doesn't
          use --stale-fix), so they're not really worth measuring. There's a
          low chance of regressing that caller, since the use of bitmaps is
          totally transparent from the caller's perspective.
      
        - The bitmap case could actually get away without creating a "struct
          object", and instead the caller could just look up each object id in
          the bitmap result. However, this would be a marginal improvement in
          runtime, and it would make the callers much more complicated. They'd
          have to handle both the bitmap and non-bitmap cases separately, and
          in the case of git-prune, we'd also have to tweak prune_shallow(),
          which relies on our SEEN flags.
      
        - Because we do create real object structs, we go through a few
          contortions to create ones of the right type. This isn't strictly
          necessary (lookup_unknown_object() would suffice), but it's more
          memory efficient to use the correct types, since we already know
          them.
      Signed-off-by: Jeff King <peff@peff.net>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
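The bitmap idea above can be sketched with a toy model. This is only an illustration under stated assumptions, not Git's pack-bitmap API: every name is hypothetical, the "object store" holds 8 objects, and each tip is assumed to come with a precomputed bitmap in which bit i means "object i is reachable from this tip". Reachability then reduces to OR-ing the tip bitmaps and setting a SEEN-like flag on real object structs, which mirrors the commit's choice to keep `struct object` and its flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SEEN 1u        /* hypothetical stand-in for Git's SEEN flag */
#define TOY_NR_OBJECTS 8   /* toy object store; fits in one 64-bit word */

struct toy_object {
	unsigned flags;
};

/* OR together the precomputed reachability bitmaps of every tip. */
uint64_t toy_union_bitmaps(const uint64_t *tip_bitmaps, size_t nr_tips)
{
	uint64_t result = 0;
	size_t i;

	assert(tip_bitmaps || !nr_tips);
	for (i = 0; i < nr_tips; i++)
		result |= tip_bitmaps[i];
	return result;
}

/* Flag every object whose bit is set; return how many bits were set. */
size_t toy_mark_from_bitmap(struct toy_object *objs, uint64_t reachable)
{
	size_t i, marked = 0;

	for (i = 0; i < TOY_NR_OBJECTS; i++) {
		if (reachable & (UINT64_C(1) << i)) {
			objs[i].flags |= TOY_SEEN;
			marked++;
		}
	}
	return marked;
}
```

No graph walk happens here at all, which is where the savings in the timings would come from when bitmap coverage is good.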
  2. 29 Jun, 2018 3 commits
  3. 26 Apr, 2018 1 commit
  4. 26 Mar, 2018 1 commit
  5. 14 Mar, 2018 1 commit
    • sha1_file: convert sha1_object_info* to object_id · abef9020
      brian m. carlson authored
      Convert sha1_object_info and sha1_object_info_extended to take pointers
      to struct object_id and rename them to use "oid" instead of "sha1" in
      their names.  Update the declaration and definition and apply the
      following semantic patch, plus the standard object_id transforms:
      
      @@
      expression E1, E2;
      @@
      - sha1_object_info(E1.hash, E2)
      + oid_object_info(&E1, E2)
      
      @@
      expression E1, E2;
      @@
      - sha1_object_info(E1->hash, E2)
      + oid_object_info(E1, E2)
      
      @@
      expression E1, E2, E3;
      @@
      - sha1_object_info_extended(E1.hash, E2, E3)
      + oid_object_info_extended(&E1, E2, E3)
      
      @@
      expression E1, E2, E3;
      @@
      - sha1_object_info_extended(E1->hash, E2, E3)
      + oid_object_info_extended(E1, E2, E3)
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
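What the two rule shapes in the semantic patch do to a call site can be shown with stand-ins. This is a sketch only: `toy_object_id`, `toy_sha1_object_info` and `toy_oid_object_info` are hypothetical, and the real functions take further arguments (the `E2`, `E3` in the patch) that are left out here.

```c
#include <assert.h>

#define TOY_RAWSZ 20  /* size of a SHA-1 digest in bytes */

/* Stand-in for struct object_id: a wrapper around the raw hash bytes. */
struct toy_object_id {
	unsigned char hash[TOY_RAWSZ];
};

/* Old-style interface: takes the bare hash. Returns the first byte as a
 * dummy result so that the two styles can be compared. */
int toy_sha1_object_info(const unsigned char *sha1)
{
	return sha1[0];
}

/* New-style interface: takes a pointer to the wrapper struct. */
int toy_oid_object_info(const struct toy_object_id *oid)
{
	return toy_sha1_object_info(oid->hash);
}

/* Caller holding a struct: f(E1.hash, ...) becomes f(&E1, ...). */
int toy_lookup_by_value(struct toy_object_id oid)
{
	return toy_oid_object_info(&oid);
}

/* Caller holding a pointer: f(E1->hash, ...) becomes f(E1, ...). */
int toy_lookup_by_pointer(const struct toy_object_id *oid)
{
	return toy_oid_object_info(oid);
}
```

Both rewritten call sites reach the same bytes as before; only the type that crosses the function boundary changes.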
  6. 14 Feb, 2018 1 commit
  7. 24 Aug, 2017 1 commit
    • revision.c: --all adds HEAD from all worktrees · d0c39a49
      Duy Nguyen authored
      Unless single_worktree is set, --all now adds HEAD from all worktrees.
      
      Since reachable.c code does not use setup_revisions(), we need to call
      other_head_refs_submodule() explicitly there to have the same effect on
      "git prune", so that we won't accidentally delete objects needed by some
      other HEADs.
      
      A new FIXME is added because we would need something like
      
          int refs_other_head_refs(struct ref_store *, each_ref_fn, cb_data);
      
      in addition to other_head_refs() to handle it, which might require
      
          int get_submodule_worktrees(const char *submodule, int flags);
      
      It could be a separate topic to reduce the scope of this one.
      Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
  8. 23 Aug, 2017 1 commit
  9. 08 May, 2017 3 commits
    • object: convert parse_object* to take struct object_id · c251c83d
      brian m. carlson authored
      Make parse_object, parse_object_or_die, and parse_object_buffer take a
      pointer to struct object_id.  Remove the temporary variables inserted
      earlier, since they are no longer necessary.  Transform all of the
      callers using the following semantic patch:
      
      @@
      expression E1;
      @@
      - parse_object(E1.hash)
      + parse_object(&E1)
      
      @@
      expression E1;
      @@
      - parse_object(E1->hash)
      + parse_object(E1)
      
      @@
      expression E1, E2;
      @@
      - parse_object_or_die(E1.hash, E2)
      + parse_object_or_die(&E1, E2)
      
      @@
      expression E1, E2;
      @@
      - parse_object_or_die(E1->hash, E2)
      + parse_object_or_die(E1, E2)
      
      @@
      expression E1, E2, E3, E4, E5;
      @@
      - parse_object_buffer(E1.hash, E2, E3, E4, E5)
      + parse_object_buffer(&E1, E2, E3, E4, E5)
      
      @@
      expression E1, E2, E3, E4, E5;
      @@
      - parse_object_buffer(E1->hash, E2, E3, E4, E5)
      + parse_object_buffer(E1, E2, E3, E4, E5)
      Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
    • Convert lookup_tree to struct object_id · 740ee055
      brian m. carlson authored
      Convert the lookup_tree function to take a pointer to struct object_id.
      
      The commit was created with manual changes to tree.c, tree.h, and
      object.c, plus the following semantic patch:
      
      @@
      @@
      - lookup_tree(EMPTY_TREE_SHA1_BIN)
      + lookup_tree(&empty_tree_oid)
      
      @@
      expression E1;
      @@
      - lookup_tree(E1.hash)
      + lookup_tree(&E1)
      
      @@
      expression E1;
      @@
      - lookup_tree(E1->hash)
      + lookup_tree(E1)
      Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
    • Convert lookup_blob to struct object_id · 3aca1fc6
      brian m. carlson authored
      Convert lookup_blob to take a pointer to struct object_id.
      
      The commit was created with manual changes to blob.c and blob.h, plus
      the following semantic patch:
      
      @@
      expression E1;
      @@
      - lookup_blob(E1.hash)
      + lookup_blob(&E1)
      
      @@
      expression E1;
      @@
      - lookup_blob(E1->hash)
      + lookup_blob(E1)
      Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
  10. 27 Apr, 2017 1 commit
    • timestamp_t: a new data type for timestamps · dddbad72
      Johannes Schindelin authored
      Git's source code assumes that unsigned long is at least as precise as
      time_t. That assumption is incorrect and causes a lot of problems, in
      particular where unsigned long is only 32-bit (notably on Windows, even
      in 64-bit versions).
      
      So let's just use a more appropriate data type instead. In preparation
      for this, we introduce the new `timestamp_t` data type.
      
      By necessity, this is a very, very large patch, as it has to replace all
      timestamps' data type in one go.
      
      As we will use a data type that is not necessarily identical to `time_t`,
      we need to be very careful to use `time_t` whenever we interact with the
      system functions, and `timestamp_t` everywhere else.
      Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
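The discipline described in the last paragraph (use `time_t` only at the system boundary, `timestamp_t` everywhere else) might look roughly like this. It is a sketch under assumptions: the commit message does not say what `timestamp_t` is defined as, so a wide unsigned type is assumed here, and both helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Assumption for this sketch: timestamps live in a wide unsigned type. */
typedef uintmax_t toy_timestamp_t;

/*
 * Convert to time_t just before calling a system function. Returns 0 on
 * success, or -1 if the value does not survive the round trip.
 */
int toy_timestamp_to_time_t(toy_timestamp_t ts, time_t *out)
{
	time_t t = (time_t)ts;

	if (t < 0 || (toy_timestamp_t)t != ts)
		return -1;
	*out = t;
	return 0;
}

/* Only this function touches time_t; callers pass toy_timestamp_t. */
int toy_timestamp_year(toy_timestamp_t ts)
{
	time_t t;
	struct tm *tm;

	if (toy_timestamp_to_time_t(ts, &t) < 0)
		return -1;
	tm = gmtime(&t);
	return tm ? tm->tm_year + 1900 : -1;
}
```

Keeping the conversion in one checked helper means a platform with a narrow `time_t` reports an error instead of silently truncating.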
  11. 22 Feb, 2017 1 commit
  12. 09 May, 2016 1 commit
  13. 16 Mar, 2016 2 commits
    • list-objects: pass full pathname to callbacks · 2824e184
      Jeff King authored
      When we find a blob at "a/b/c", we currently pass this to
      our show_object_fn callbacks as two components: "a/b/" and
      "c". Callbacks which want the full value then call
      path_name(), which concatenates the two. But this is an
      inefficient interface; the path is a strbuf, and we could
      simply append "c" to it temporarily, then roll back the
      length, without creating a new copy.
      
      So we could improve this by teaching the callsites of
      path_name() this trick (and there are only 3). But we can
      also notice that no callback actually cares about the
      broken-down representation, and simply pass each callback
      the full path "a/b/c" as a string. The callback code becomes
      even simpler, then, as we do not have to worry about freeing
      an allocated buffer, nor rolling back our modification to
      the strbuf.
      
      This is theoretically less efficient, as some callbacks
      would not bother to format the final path component. But in
      practice this is not measurable. Since we use the same
      strbuf over and over, our work to grow it is amortized, and
      we really only pay to memcpy a few bytes.
      Signed-off-by: Jeff King <peff@peff.net>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
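The "append temporarily, then roll back the length" trick can be sketched with a minimal growable string. This is not Git's strbuf API: `toy_buf` and every function below are hypothetical stand-ins that only demonstrate the idea of handing the callback the full path without making a fresh copy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Minimal growable string, loosely modelled on the idea of a strbuf. */
struct toy_buf {
	char *buf;
	size_t len, alloc;
};

/* Append a NUL-terminated string, growing the allocation if needed. */
void toy_buf_addstr(struct toy_buf *sb, const char *s)
{
	size_t n = strlen(s);

	if (sb->len + n + 1 > sb->alloc) {
		size_t want = (sb->len + n + 1) * 2;
		char *p = realloc(sb->buf, want);
		if (!p)
			abort();
		sb->buf = p;
		sb->alloc = want;
	}
	memcpy(sb->buf + sb->len, s, n + 1);
	sb->len += n;
}

/* Truncate back to an earlier length; no memory is released. */
void toy_buf_setlen(struct toy_buf *sb, size_t len)
{
	assert(len <= sb->len);
	sb->len = len;
	if (sb->buf)
		sb->buf[len] = '\0';
}

typedef void (*toy_show_fn)(const char *full_path, void *data);

/* Append the entry name, show the full path, then roll back. */
void toy_show_entry(struct toy_buf *base, const char *name,
		    toy_show_fn show, void *data)
{
	size_t baselen = base->len;

	toy_buf_addstr(base, name);
	show(base->buf, data);
	toy_buf_setlen(base, baselen);
}

/* Example callback: copy the path into a caller-supplied char[64]. */
void toy_record_path(const char *full_path, void *data)
{
	char *out = data;

	strncpy(out, full_path, 63);
	out[63] = '\0';
}
```

The callback only sees a plain string, so it has nothing to free and nothing to roll back; the buffer's allocation is reused across entries.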
    • list-objects: drop name_path entirely · dc06dc88
      Jeff King authored
      In the previous commit, we left name_path as a thin wrapper
      around a strbuf. This patch drops it entirely. As a result,
      every show_object_fn callback needs to be adjusted. However,
      none of their code needs to be changed at all, because the
      only use was to pass it to path_name(), which now handles
      the bare strbuf.
      Signed-off-by: Jeff King <peff@peff.net>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
  14. 12 Feb, 2016 2 commits
    • list-objects: pass full pathname to callbacks · de1e67d0
      Jeff King authored
      When we find a blob at "a/b/c", we currently pass this to
      our show_object_fn callbacks as two components: "a/b/" and
      "c". Callbacks which want the full value then call
      path_name(), which concatenates the two. But this is an
      inefficient interface; the path is a strbuf, and we could
      simply append "c" to it temporarily, then roll back the
      length, without creating a new copy.
      
      So we could improve this by teaching the callsites of
      path_name() this trick (and there are only 3). But we can
      also notice that no callback actually cares about the
      broken-down representation, and simply pass each callback
      the full path "a/b/c" as a string. The callback code becomes
      even simpler, then, as we do not have to worry about freeing
      an allocated buffer, nor rolling back our modification to
      the strbuf.
      
      This is theoretically less efficient, as some callbacks
      would not bother to format the final path component. But in
      practice this is not measurable. Since we use the same
      strbuf over and over, our work to grow it is amortized, and
      we really only pay to memcpy a few bytes.
      Signed-off-by: Jeff King <peff@peff.net>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
    • list-objects: drop name_path entirely · bd64516a
      Jeff King authored
      In the previous commit, we left name_path as a thin wrapper
      around a strbuf. This patch drops it entirely. As a result,
      every show_object_fn callback needs to be adjusted. However,
      none of their code needs to be changed at all, because the
      only use was to pass it to path_name(), which now handles
      the bare strbuf.
      Signed-off-by: Jeff King <peff@peff.net>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
  15. 08 Oct, 2015 1 commit
  16. 25 May, 2015 2 commits
  17. 20 Apr, 2015 1 commit
    • reachable: only mark local objects as recent · 1385bb7b
      Jeff King authored
      When pruning and repacking a repository that has an
      alternate object store configured, we may traverse a large
      number of objects in the alternate. This serves no purpose,
      and may be expensive to do. A longer explanation is below.
      
      Commits d3038d22 and abcb8655 taught prune and pack-objects
      (respectively) to treat "recent" objects as tips for
      reachability, so that we keep whole chunks of history. They
      built on the object traversal in 660c889e (sha1_file: add
      for_each iterators for loose and packed objects,
      2014-10-15), which covers both local and alternate objects.
      
      In both cases, covering alternate objects is unnecessary, as
      both commands can only drop objects from the local
      repository. In the case of prune, we traverse only the local
      object directory. And in the case of repacking, while we may
      or may not include local objects in our pack, we will never
      reach into the alternate with "repack -d". The "-l" option
      is only a question of whether we are migrating objects from
      the alternate into our repository, or leaving them
      untouched.
      
      It is possible that we may drop an object that is depended
      upon by another object in the alternate. For example,
      imagine two repositories, A and B, with A pointing to B as
      an alternate. Now imagine a commit that is in B which
      references a tree that is only in A. Traversing from recent
      objects in B might prevent A from dropping that tree. But
      this case isn't worth covering. Repo B should take
      responsibility for its own objects. It would never have had
      the commit in the first place if it did not also have the
      tree, and assuming it is using the same "keep recent chunks
      of history" scheme, then it would itself keep the tree, as
      well.
      
      So checking the alternate objects is not worth doing, and
      comes with a significant performance impact. In both cases,
      we skip any recent objects that have already been marked
      SEEN (i.e., that we know are already reachable for prune, or
      included in the pack for a repack). So there is a slight
      waste of time in opening the alternate packs at all, only to
      notice that we have already considered each object. But much
      worse, the alternate repository may have a large number of
      objects that are not reachable from the local repository at
      all, and we end up adding them to the traversal.
      
      We can fix this by considering only local unseen objects.
      Signed-off-by: Jeff King <peff@peff.net>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
  18. 19 Oct, 2014 1 commit
  19. 16 Oct, 2014 5 commits
    • pack-objects: match prune logic for discarding objects · abcb8655
      Jeff King authored
      A recent commit taught git-prune to keep non-recent objects
      that are reachable from recent ones. However, pack-objects,
      when loosening unreachable objects, tries to optimize out
      the write in the case that the object will be immediately
      pruned. It now gets this wrong, since its rule does not
      reflect the new prune code (and this can be seen by running
      t6501 with a strategically placed repack).
      
      Let's teach pack-objects similar logic.
      Signed-off-by: Jeff King <peff@peff.net>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
    • prune: keep objects reachable from recent objects · d3038d22
      Jeff King authored
      Our current strategy with prune is that an object falls into
      one of three categories:
      
        1. Reachable (from ref tips, reflogs, index, etc).
      
        2. Not reachable, but recent (based on the --expire time).
      
        3. Not reachable and not recent.
      
      We keep objects from (1) and (2), but prune objects in (3).
      The point of (2) is that these objects may be part of an
      in-progress operation that has not yet updated any refs.
      
      However, it is not always the case that objects for an
      in-progress operation will have a recent mtime. For example,
      the object database may have an old copy of a blob (from an
      abandoned operation, a branch that was deleted, etc). If we
      create a new tree that points to it, a simultaneous prune
      will leave our tree, but delete the blob. Referencing that
      tree with a commit will then work (we check that the tree is
      in the object database, but not that all of its referred
      objects are), as will mentioning the commit in a ref. But
      the resulting repo is corrupt; we are missing the blob
      reachable from a ref.
      
      One way to solve this is to be more thorough when
      referencing a sha1: make sure that not only do we have that
      sha1, but that we have objects it refers to, and so forth
      recursively. The problem is that this is very expensive.
      Creating a parent link would require traversing the entire
      object graph!
      
      Instead, this patch pushes the extra work onto prune, which
      runs less frequently (and has to look at the whole object
      graph anyway). It creates a new category of objects: objects
      which are not recent, but which are reachable from a recent
      object. We do not prune these objects, just like the
      reachable and recent ones.
      
      This lets us avoid the recursive check above, because if we
      have an object, even if it is unreachable, we should have
      its referent. We can make a simple inductive argument that
      with this patch, this property holds (that there are no
      objects with missing referents in the repository):
      
        0. When we have no objects, we have nothing to refer or be
           referred to, so the property holds.
      
        1. If we add objects to the repository, their direct
           referents must generally exist (e.g., if you create a
           tree, the blobs it references must exist; if you create
           a commit to point at the tree, the tree must exist).
           This is already the case before this patch. And it is
           not 100% foolproof (you can make bogus objects using
           `git hash-object`, for example), but it should be the
           case for normal usage.
      
           Therefore for any sequence of object additions, the
           property will continue to hold.
      
        2. If we remove objects from the repository, then we will
           not remove a child object (like a blob) if an object
           that refers to it is being kept. That is the part
           implemented by this patch.
      
           Note, however, that our reachability check and the
           actual pruning are not atomic. So it _is_ still
           possible to violate the property (e.g., an object
           becomes referenced just as we are deleting it). This
           patch is shooting for eliminating problems where the
           mtimes of dependent objects differ by hours or days,
           and one is dropped without the other. It does nothing
           to help with short races.
      
      Naively, the simplest way to implement this would be to add
      all recent objects as tips to the reachability traversal.
      However, this does not perform well. In a recently-packed
      repository, all reachable objects will also be recent, and
      therefore we have to look at each object twice. This patch
      instead performs the reachability traversal, then follows up
      with a second traversal for recent objects, skipping any
      that have already been marked.
      Signed-off-by: Jeff King <peff@peff.net>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
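The two-pass scheme from the final paragraph can be sketched on a toy object graph. Everything here is hypothetical (the struct, the flag, the index-based references, the function names); it only illustrates "walk from the real tips first, then walk from recent objects that are still unmarked".

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SEEN 1u
#define TOY_MAX_REFS 2
#define TOY_NONE (-1)

struct toy_object {
	int refs[TOY_MAX_REFS];  /* indices of referenced objects, or TOY_NONE */
	long mtime;
	unsigned flags;
};

/* Depth-first walk that marks everything reachable from objs[idx]. */
void toy_mark_reachable(struct toy_object *objs, int idx)
{
	int i;

	if (idx == TOY_NONE || (objs[idx].flags & TOY_SEEN))
		return;
	objs[idx].flags |= TOY_SEEN;
	for (i = 0; i < TOY_MAX_REFS; i++)
		toy_mark_reachable(objs, objs[idx].refs[i]);
}

/*
 * Pass 1 walks from the ref tips. Pass 2 walks from recent objects that
 * pass 1 did not mark, so nothing is visited twice. Returns the number of
 * objects that remain unmarked, i.e. the ones that may be pruned.
 */
size_t toy_count_prunable(struct toy_object *objs, size_t nr,
			  const int *tips, size_t nr_tips, long expire)
{
	size_t i, prunable = 0;

	for (i = 0; i < nr_tips; i++)
		toy_mark_reachable(objs, tips[i]);
	for (i = 0; i < nr; i++)
		if (!(objs[i].flags & TOY_SEEN) && objs[i].mtime > expire)
			toy_mark_reachable(objs, (int)i);
	for (i = 0; i < nr; i++)
		if (!(objs[i].flags & TOY_SEEN))
			prunable++;
	return prunable;
}
```

With a recent, unreferenced tree pointing at an old blob, the blob is marked in pass 2 and survives, which is the scenario the commit message sets out to fix.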
    • reachable: mark index blobs as SEEN · 37254279
      Jeff King authored
      When we mark all reachable objects for pruning, that
      includes blobs mentioned by the index. However, we do not
      mark these with the SEEN flag, as we do for objects that we
      find by traversing (we also do not add them to the pending
      list, but that is because there is nothing further to
      traverse with them).
      
      This doesn't cause any problems with prune, because it
      checks only that the object exists in the global object
      hash, and not its flags. However, let's mark these objects
      to be consistent and avoid any later surprises.
      Signed-off-by: Jeff King <peff@peff.net>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
    • reachable: reuse revision.c "add all reflogs" code · 718ccc97
      Jeff King authored
      We want to add all reflog entries as tips for finding
      reachable objects. The revision machinery can already do
      this (to support "rev-list --reflog"); we can reuse that
      code.
      Signed-off-by: Jeff King <peff@peff.net>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
    • reachable: use traverse_commit_list instead of custom walk · 5f78a431
      Jeff King authored
      To find the set of reachable objects, we add a bunch of
      possible sources to our rev_info, call prepare_revision_walk,
      and then launch into a custom walker that handles each
      object type. This is a subset of what traverse_commit_list
      does, so we can just reuse that code (it can also handle
      more complex cases like UNINTERESTING commits and pathspecs,
      but we don't use those features).
      Signed-off-by: Jeff King <peff@peff.net>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
  20. 03 Sep, 2014 1 commit
  21. 06 Jun, 2013 1 commit
    • clear parsed flag when we free tree buffers · 6e454b9a
      Jeff King authored
      Many code paths will free a tree object's buffer and set it
      to NULL after finishing with it in order to keep memory
      usage down during a traversal. However, out of 8 sites that
      do this, only one actually clears the "parsed" flag again.
      Those sites that don't are setting a trap for later users of
      the tree object; even after calling parse_tree, the buffer
      will remain NULL, causing potential segfaults.
      
      It is not known whether this is triggerable in the current
      code. Most commands do not do an in-memory traversal
      followed by actually using the objects again. However, it
      does not hurt to be safe for future callers.
      
      In most cases, we can abstract this out to a
      "free_tree_buffer" helper. However, there are two
      exceptions:
      
        1. The fsck code relies on the parsed flag to know that we
           were able to parse the object at one point. We can
           switch this to using a flag in the "flags" field.
      
        2. The index-pack code sets the buffer to NULL but does
           not free it (it is freed by a caller). We should still
           unset the parsed flag here, but we cannot use our
           helper, as we do not want to free the buffer.
      Signed-off-by: Jeff King <peff@peff.net>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
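The trap described above, and the shape of a helper that avoids it, can be sketched as follows. The struct and both functions are hypothetical stand-ins; the commit message names only the helper "free_tree_buffer", not its body.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy tree object: a parsed flag plus a lazily loaded buffer. */
struct toy_tree {
	unsigned parsed : 1;
	void *buffer;
	unsigned long size;
};

/* Parse on demand: a no-op when the object claims to be parsed already. */
int toy_parse_tree(struct toy_tree *t, const char *data)
{
	if (t->parsed)
		return 0;
	t->size = strlen(data);
	t->buffer = malloc(t->size + 1);
	if (!t->buffer)
		return -1;
	memcpy(t->buffer, data, t->size + 1);
	t->parsed = 1;
	return 0;
}

/* Release the buffer and clear "parsed" so a later parse reloads it. */
void toy_free_tree_buffer(struct toy_tree *t)
{
	free(t->buffer);
	t->buffer = NULL;
	t->size = 0;
	t->parsed = 0;
}
```

If the last assignment in `toy_free_tree_buffer()` were missing, a second `toy_parse_tree()` call would return early and leave `buffer` as NULL, which is the segfault risk the message warns about.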
  22. 17 Mar, 2013 1 commit
    • use parse_object_or_die instead of die("bad object") · f7892d18
      Jeff King authored
      Some call-sites do:
      
        o = parse_object(sha1);
        if (!o)
                die("bad object %s", some_name);
      
      We can now handle that as a one-liner, and get more
      consistent output.
      
      In the third case of this patch, it looks like we are losing
      information, as the existing message also outputs the sha1
      hex; however, parse_object will already have written a more
      specific complaint about the sha1, so there is no point in
      repeating it here.
      Signed-off-by: Jeff King <peff@peff.net>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
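The "one-liner" amounts to folding the NULL check into a wrapper. A rough sketch, with hypothetical stand-ins throughout: the table lookup replaces real object parsing, and the message text and exit status are assumptions chosen to resemble a fatal error, not a statement of what the real function prints.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

struct toy_object {
	int type;
};

/* Stand-in for parse_object(): NULL means the object could not be parsed. */
struct toy_object *toy_parse_object(struct toy_object *table, size_t nr,
				    size_t id)
{
	return id < nr ? &table[id] : NULL;
}

/* Wrapper that never returns NULL: it reports the failure and exits. */
struct toy_object *toy_parse_object_or_die(struct toy_object *table,
					   size_t nr, size_t id,
					   const char *name)
{
	struct toy_object *o = toy_parse_object(table, nr, id);

	if (!o) {
		/* Assumed message format and exit code, for illustration. */
		fprintf(stderr, "fatal: bad object %s\n", name);
		exit(128);
	}
	return o;
}
```

Call sites can then use the result directly, and every caller reports the failure with the same wording.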
  23. 08 Nov, 2011 2 commits
  24. 22 Mar, 2011 1 commit
  25. 30 Aug, 2010 1 commit
  26. 09 Apr, 2009 1 commit
    • process_{tree,blob}: Remove useless xstrdup calls · de551d47
      Björn Steinbrink authored
      The name of the processed object was duplicated for passing it to
      add_object(), but that already calls path_name, which allocates a new
      string anyway. So the memory allocated by the xstrdup calls just went
      nowhere, leaking memory.
      
      This reduces the RSS usage for a "rev-list --all --objects" by about 10% on
      the gentoo repo (fully packed) as well as linux-2.6.git:
      
          gentoo:
                          | old           | new
          ----------------|-------------------------------
          RSS             |       1537284 |       1388408
          VSZ             |       1816852 |       1667952
          time elapsed    |       1:49.62 |       1:48.99
          min. page faults|        417178 |        379919
      
          linux-2.6.git:
                          | old           | new
          ----------------|-------------------------------
          RSS             |        324452 |        292996
          VSZ             |        491792 |        460376
          time elapsed    |       0:14.53 |       0:14.28
          min. page faults|         89360 |         81613
      Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de>
      Signed-off-by: Junio C Hamano <gitster@pobox.com>
  27. 19 Feb, 2008 2 commits