1. 08 May, 2019 1 commit
  2. 04 Apr, 2019 1 commit
    • Jeff King's avatar
      revision: use a prio_queue to hold rewritten parents · 8320b1db
      Jeff King authored
      This patch fixes a quadratic list insertion in rewrite_one() when
      pathspec limiting is combined with --parents. What happens is something
      like this:
      
        1. We see that some commit X touches the path, so we try to rewrite
           its parents.
      
        2. rewrite_one() loops forever, rewriting parents, until it finds a
           relevant parent (or hits the root and decides there are none). The
           heavy lifting is done by process_parent(), which uses
           try_to_simplify_commit() to drop parents.
      
        3. process_parent() puts any intermediate parents into the
           &revs->commits list, inserting by commit date as usual.
      
      So if commit X is recent, and then there's a large chunk of history that
      doesn't touch the path, we may add a lot of commits to &revs->commits.
      And insertion by commit date is O(n) in the worst case, making the whole
      thing quadratic.
      
      We tried to deal with this long ago in fce87ae5 (Fix quadratic
      performance in rewrite_one., 2008-07-12). In that scheme, we cache the
      oldest commit in the list; if the new commit to be added is older, we
      can start our linear traversal there. This often works well in practice
      because parents are older than their descendants, and thus we tend to
      add older and older commits as we traverse.
      
      But this isn't guaranteed, and in fact there's a simple case where it is
      not: merges. Imagine we look at the first parent of a merge and see a
      very old commit (let's say 3 years old). And on the second parent, as we
      go back 3 years in history, we might have many commits. That one
      first-parent commit has polluted our oldest-commit cache; it will remain
      the oldest while we traverse a huge chunk of history, during which we
      have to fall back to the slow, linear method of adding to the list.
      
      Naively, one might imagine that instead of caching the oldest commit,
      we'd start at the last-added one. But that just makes some cases faster
      while making others slower (and indeed, while it made a real-world test
      case much faster, it does quite poorly in the perf test include here).
      Fundamentally, these are just heuristics; our worst case is still
      quadratic, and some cases will approach that.
      
      Instead, let's use a data structure with better worst-case performance.
      Swapping out revs->commits for something else would have repercussions
      all over the code base, but we can take advantage of one fact: for the
      rewrite_one() case, nobody actually needs to see those commits in
      revs->commits until we've finished generating the whole list.
      
      That leaves us with two obvious options:
      
        1. We can generate the list _unordered_, which should be O(n), and
           then sort it afterwards, which would be O(n log n) total. This is
           "sort-after" below.
      
        2. We can insert the commits into a separate data structure, like a
           priority queue. This is "prio-queue" below.
      
      I expected that sort-after would be the fastest (since it saves us the
      extra step of copying the items into the linked list), but surprisingly
      the prio-queue seems to be a bit faster.
      
      Here are timings for the new p0001.6 for all three techniques across a
      few repositories, as compared to master:
      
      master              cache-last                sort-after              prio-queue
      --------------------------------------------------------------------------------------------
      GIT_PERF_REPO=git.git
      0.52(0.50+0.02)      0.53(0.51+0.02)  +1.9%   0.37(0.33+0.03) -28.8%  0.37(0.32+0.04) -28.8%
      
      GIT_PERF_REPO=linux.git
      20.81(20.74+0.07)   20.31(20.24+0.07) -2.4%   0.94(0.86+0.07) -95.5%  0.91(0.82+0.09) -95.6%
      
      GIT_PERF_REPO=llvm-project.git
      83.67(83.57+0.09)    4.23(4.15+0.08) -94.9%   3.21(3.15+0.06) -96.2%  2.98(2.91+0.07) -96.4%
      
      A few items to note:
      
        - the cache-list tweak does improve the bad case for llvm-project.git
          that started my digging into this problem. But it performs terribly
          on linux.git, barely helping at all.
      
        - the sort-after and prio-queue techniques work well. They approach
          the timing for running without --parents at all, which is what you'd
          expect (see below for more data).
      
        - prio-queue just barely outperforms sort-after. As I said, I'm not
          really sure why this is the case, but it is. You can see it even
          more prominently in this real-world case on llvm-project.git:
      
            git rev-list --parents 07ef786652e7 -- llvm/test/CodeGen/Generic/bswap.ll
      
          where prio-queue routinely outperforms sort-after by about 7%. One
          guess is that the prio-queue may just be more efficient because it
          uses a compact array.
      
      There are three new perf tests:
      
        - "rev-list --parents" gives us a baseline for running with --parents.
          This isn't sped up meaningfully here, because the bad case is
          triggered only with simplification. But it's good to make sure we
          don't screw it up (now, or in the future).
      
        - "rev-list -- dummy" gives us a baseline for just traversing with
          pathspec limiting. This gives a lower bound for the next test (and
          it's also a good thing for us to be checking in general for
          regressions, since we don't seem to have any existing tests).
      
        - "rev-list --parents -- dummy" shows off the problem (and our fix)
      
      Here are the timings for those three on llvm-project.git, before and
      after the fix:
      
      Test                                 master              prio-queue
      ------------------------------------------------------------------------------
      0001.3: rev-list --parents           2.24(2.12+0.12)     2.22(2.11+0.11) -0.9%
      0001.5: rev-list -- dummy            2.89(2.82+0.07)     2.92(2.89+0.03) +1.0%
      0001.6: rev-list --parents -- dummy  83.67(83.57+0.09)   2.98(2.91+0.07) -96.4%
      
      Changes in the first two are basically noise, and you can see we
      approach our lower bound in the final one.
      
      Note that we can't fully get rid of the list argument from
      process_parents(). Other callers do have lists, and it would be hard to
      convert them. They also don't seem to have this problem (probably
      because they actually remove items from the list as they loop, meaning
      it doesn't grow so large in the first place). So this basically just
      drops the "cache_ptr" parameter (which was used only by the one caller
      we're fixing here) and replaces it with a prio_queue. Callers are free
      to use either data structure, depending on what they're prepared to
      handle.
      Reported-by: 's avatarBjörn Pettersson A <bjorn.a.pettersson@ericsson.com>
      Signed-off-by: 's avatarJeff King <peff@peff.net>
      Signed-off-by: 's avatarJunio C Hamano <gitster@pobox.com>
      8320b1db
  3. 20 Mar, 2019 1 commit
  4. 11 Mar, 2019 1 commit
    • Jeff King's avatar
      line-log: detect unsupported formats · 05314efa
      Jeff King authored
      If you use "log -L" with an output format like "--raw" or "--stat",
      we'll silently ignore the format and just output the normal patch.
      Let's detect and complain about this, which at least tells the user
      what's going on.
      
      The tests here aren't exhaustive over the set of all formats, but it
      should at least let us know if somebody breaks the format-checking.
      Signed-off-by: 's avatarJeff King <peff@peff.net>
      Signed-off-by: 's avatarJunio C Hamano <gitster@pobox.com>
      05314efa
  5. 08 Feb, 2019 1 commit
    • Elijah Newren's avatar
      log,diff-tree: add --combined-all-paths option · d76ce4f7
      Elijah Newren authored
      The combined diff format for merges will only list one filename, even if
      rename or copy detection is active.  For example, with raw format one
      might see:
      
        ::100644 100644 100644 fabadb8 cc95eb0 4866510 MM	describe.c
        ::100755 100755 100755 52b7a2d 6d1ac04 d2ac7d7 RM	bar.sh
        ::100644 100644 100644 e07d6c5 9042e82 ee91881 RR	phooey.c
      
      This doesn't let us know what the original name of bar.sh was in the
      first parent, and doesn't let us know what either of the original names
      of phooey.c were in either of the parents.  In contrast, for non-merge
      commits, raw format does provide original filenames (and a rename score
      to boot).  In order to also provide original filenames for merge
      commits, add a --combined-all-paths option (which must be used with
      either -c or --cc, and is likely only useful with rename or copy
      detection active) so that we can print tab-separated filenames when
      renames are involved.  This transforms the above output to:
      
        ::100644 100644 100644 fabadb8 cc95eb0 4866510 MM	desc.c	desc.c	desc.c
        ::100755 100755 100755 52b7a2d 6d1ac04 d2ac7d7 RM	foo.sh	bar.sh	bar.sh
        ::100644 100644 100644 e07d6c5 9042e82 ee91881 RR	fooey.c	fuey.c	phooey.c
      
      Further, in patch format, this changes the from/to headers so that
      instead of just having one "from" header, we get one for each parent.
      For example, instead of having
      
        --- a/phooey.c
        +++ b/phooey.c
      
      we would see
      
        --- a/fooey.c
        --- a/fuey.c
        +++ b/phooey.c
      Signed-off-by: 's avatarElijah Newren <newren@gmail.com>
      Signed-off-by: 's avatarJunio C Hamano <gitster@pobox.com>
      d76ce4f7
  6. 17 Jan, 2019 2 commits
    • Derrick Stolee's avatar
      revision: implement sparse algorithm · d5d2e935
      Derrick Stolee authored
      When enumerating objects to place in a pack-file during 'git
      pack-objects --revs', we discover the "frontier" of commits
      that we care about and the boundary with commit we find
      uninteresting. From that point, we walk trees to discover which
      trees and blobs are uninteresting. Finally, we walk trees from the
      interesting commits to find the interesting objects that are
      placed in the pack.
      
      This commit introduces a new, "sparse" way to discover the
      uninteresting trees. We use the perspective of a single user trying
      to push their topic to a large repository. That user likely changed
      a very small fraction of the paths in their working directory, but
      we spend a lot of time walking all reachable trees.
      
      The way to switch the logic to work in this sparse way is to start
      caring about which paths introduce new trees. While it is not
      possible to generate a diff between the frontier boundary and all
      of the interesting commits, we can simulate that behavior by
      inspecting all of the root trees as a whole, then recursing down
      to the set of trees at each path.
      
      We already had taken the first step by passing an oidset to
      mark_trees_uninteresting_sparse(). We now create a dictionary
      whose keys are paths and values are oidsets. We consider the set
      of trees that appear at each path. While we inspect a tree, we
      add its subtrees to the oidsets corresponding to the tree entry's
      path. We also mark trees as UNINTERESTING if the tree we are
      parsing is UNINTERESTING.
      
      To actually improve the performance, we need to terminate our
      recursion. If the oidset contains only UNINTERESTING trees, then
      we do not continue the recursion. This avoids walking trees that
      are likely to not be reachable from interesting trees. If the
      oidset contains only interesting trees, then we will walk these
      trees in the final stage that collects the intersting objects to
      place in the pack. Thus, we only recurse if the oidset contains
      both interesting and UNINITERESTING trees.
      
      There are a few ways that this is not a universally better option.
      
      First, we can pack extra objects. If someone copies a subtree
      from one tree to another, the first tree will appear UNINTERESTING
      and we will not recurse to see that the subtree should also be
      UNINTERESTING. We will walk the new tree and see the subtree as
      a "new" object and add it to the pack. A test is modified to
      demonstrate this behavior and to verify that the new logic is
      being exercised.
      
      Second, we can have extra memory pressure. If instead of being a
      single user pushing a small topic we are a server sending new
      objects from across the entire working directory, then we will
      gain very little (the recursion will rarely terminate early) but
      will spend extra time maintaining the path-oidset dictionaries.
      
      Despite these potential drawbacks, the benefits of the algorithm
      are clear. By adding a counter to 'add_children_by_path' and
      'mark_tree_contents_uninteresting', I measured the number of
      parsed trees for the two algorithms in a variety of repos.
      
      For git.git, I used the following input:
      
      	v2.19.0
      	^v2.19.0~10
      
       Objects to pack: 550
      Walked (old alg): 282
      Walked (new alg): 130
      
      For the Linux repo, I used the following input:
      
      	v4.18
      	^v4.18~10
      
       Objects to pack:   518
      Walked (old alg): 4,836
      Walked (new alg):   188
      
      The two repos above are rather "wide and flat" compared to
      other repos that I have used in the past. As a comparison,
      I tested an old topic branch in the Azure DevOps repo, which
      has a much deeper folder structure than the Linux repo.
      
       Objects to pack:    220
      Walked (old alg): 22,804
      Walked (new alg):    129
      
      I used the number of walked trees the main metric above because
      it is consistent across multiple runs. When I ran my tests, the
      performance of the pack-objects command with the same options
      could change the end-to-end time by 10x depending on the file
      system being warm. However, by repeating the same test on repeat
      I could get more consistent timing results. The git.git and
      Linux tests were too fast overall (less than 0.5s) to measure
      an end-to-end difference. The Azure DevOps case was slow enough
      to see the time improve from 15s to 1s in the warm case. The
      cold case was 90s to 9s in my testing.
      
      These improvements will have even larger benefits in the super-
      large Windows repository. In our experiments, we see the
      "Enumerate objects" phase of pack-objects taking 60-80% of the
      end-to-end time of non-trivial pushes, taking longer than the
      network time to send the pack and the server time to verify the
      pack.
      Signed-off-by: 's avatarDerrick Stolee <dstolee@microsoft.com>
      Signed-off-by: 's avatarJunio C Hamano <gitster@pobox.com>
      d5d2e935
    • Derrick Stolee's avatar
      revision: add mark_tree_uninteresting_sparse · f1f5de44
      Derrick Stolee authored
      In preparation for a new algorithm that walks fewer trees when
      creating a pack from a set of revisions, create a method that
      takes an oidset of tree oids and marks reachable objects as
      UNINTERESTING.
      
      The current implementation uses the existing
      mark_tree_uninteresting to recursively walk the trees and blobs.
      This will walk the same number of trees as the old mechanism. To
      ensure that mark_tree_uninteresting walks the tree, we need to
      remove the UNINTERESTING flag before calling the method. This
      implementation will be replaced entirely in a later commit.
      
      There is one new assumption in this approach: we are also given
      the oids of the interesting trees. This implementation does not
      use those trees at the moment, but we will use them in a later
      rewrite of this method.
      Signed-off-by: 's avatarDerrick Stolee <dstolee@microsoft.com>
      Signed-off-by: 's avatarJunio C Hamano <gitster@pobox.com>
      f1f5de44
  7. 15 Jan, 2019 1 commit
    • brian m. carlson's avatar
      tree-walk: store object_id in a separate member · ea82b2a0
      brian m. carlson authored
      When parsing a tree, we read the object ID directly out of the tree
      buffer. This is normally fine, but such an object ID cannot be used with
      oidcpy, which copies GIT_MAX_RAWSZ bytes, because if we are using SHA-1,
      there may not be that many bytes to copy.
      
      Instead, store the object ID in a separate struct member. Since we can
      no longer efficiently compute the path length, store that information as
      well in struct name_entry. Ensure we only copy the object ID into the
      new buffer if the path length is nonzero, as some callers will pass us
      an empty path with no object ID following it, and we will not want to
      read past the end of the buffer.
      Signed-off-by: brian m. carlson's avatarbrian m. carlson <sandals@crustytoothpaste.net>
      Signed-off-by: 's avatarJunio C Hamano <gitster@pobox.com>
      ea82b2a0
  8. 14 Jan, 2019 2 commits
  9. 28 Dec, 2018 1 commit
    • Jonathan Tan's avatar
      revision: use commit graph in get_reference() · ec0c5798
      Jonathan Tan authored
      When fetching into a repository, a connectivity check is first made by
      check_exist_and_connected() in builtin/fetch.c that runs:
      
        git rev-list --objects --stdin --not --all --quiet <(list of objects)
      
      If the client repository has many refs, this command can be slow,
      regardless of the nature of the server repository or what is being
      fetched. A profiler reveals that most of the time is spent in
      setup_revisions() (approx. 60/63), and of the time spent in
      setup_revisions(), most of it is spent in parse_object() (approx.
      49/60). This is because setup_revisions() parses the target of every ref
      (from "--all"), and parse_object() reads the buffer of the object.
      
      Reading the buffer is unnecessary if the repository has a commit graph
      and if the ref points to a commit (which is typically the case). This
      patch uses the commit graph wherever possible; on my computer, when I
      run the above command with a list of 1 object on a many-ref repository,
      I get a speedup from 1.8s to 1.0s.
      
      Another way to accomplish this effect would be to modify parse_object()
      to use the commit graph if possible; however, I did not want to change
      parse_object()'s current behavior of always checking the object
      signature of the returned object.
      Signed-off-by: 's avatarJonathan Tan <jonathantanmy@google.com>
      Signed-off-by: 's avatarJunio C Hamano <gitster@pobox.com>
      ec0c5798
  10. 09 Dec, 2018 1 commit
    • Duy Nguyen's avatar
      Indent code with TABs · ec36c42a
      Duy Nguyen authored
      We indent with TABs and sometimes for fine alignment, TABs followed by
      spaces, but never all spaces (unless the indentation is less than 8
      columns). Indenting with spaces slips through in some places. Fix
      them.
      
      Imported code and compat/ are left alone on purpose. The former should
      remain as close as upstream as possible. The latter pretty much has
      separate maintainers, it's up to them to decide.
      Signed-off-by: Duy Nguyen's avatarNguyễn Thái Ngọc Duy <pclouds@gmail.com>
      Signed-off-by: 's avatarJunio C Hamano <gitster@pobox.com>
      ec36c42a
  11. 06 Dec, 2018 2 commits
  12. 19 Nov, 2018 1 commit
    • Duy Nguyen's avatar
      tree-walk.c: make tree_entry_interesting() take an index · 67022e02
      Duy Nguyen authored
      In order to support :(attr) when matching pathspec on a tree,
      tree_entry_interesting() needs to take an index (because
      git_check_attr() needs it). This is the preparation step for it. This
      also makes it clearer what index we fall back to when looking up
      attributes during an unpack-trees operation: the source index.
      
      This also fixes revs->pruning.repo initialization that should have
      been done in 2abf3503 (revision.c: remove implicit dependency on
      the_index - 2018-09-21). Without it, skip_uninteresting() will
      dereference a NULL pointer through this call chain
      
        get_revision(revs)
        get_revision_internal
        get_revision_1
        try_to_simplify_commit
        rev_compare_tree
        diff_tree_oid(..., &revs->pruning)
        ll_diff_tree_oid
        diff_tree_paths
        ll_diff_tree
        skip_uninteresting
      Signed-off-by: Duy Nguyen's avatarNguyễn Thái Ngọc Duy <pclouds@gmail.com>
      Signed-off-by: 's avatarJunio C Hamano <gitster@pobox.com>
      67022e02
  13. 02 Nov, 2018 4 commits
    • Jeff King's avatar
      rev-list: handle flags for --indexed-objects · b4cfcde4
      Jeff King authored
      When a traversal sees the --indexed-objects option, it adds
      all blobs and valid cache-trees from the index to the
      traversal using add_index_objects_to_pending(). But that
      function totally ignores its flags parameter!
      
      That means that doing:
      
        git rev-list --objects --indexed-objects
      
      and
      
        git rev-list --objects --not --indexed-objects
      
      produce the same output, because we ignore the UNINTERESTING
      flag when walking the index in the second example.
      
      Nobody noticed because this feature was added as a way for
      tools like repack to increase their coverage of reachable
      objects, meaning it would only be used like the first
      example above.
      
      But since it's user facing (and because the documentation
      describes it "as if the objects are listed on the command
      line"), we should make sure the negative case behaves
      sensibly.
      Signed-off-by: 's avatarJeff King <peff@peff.net>
      Signed-off-by: 's avatarJunio C Hamano <gitster@pobox.com>
      b4cfcde4
    • Derrick Stolee's avatar
      revision.c: generation-based topo-order algorithm · b4542418
      Derrick Stolee authored
      The current --topo-order algorithm requires walking all
      reachable commits up front, topo-sorting them, all before
      outputting the first value. This patch introduces a new
      algorithm which uses stored generation numbers to
      incrementally walk in topo-order, outputting commits as
      we go. This can dramatically reduce the computation time
      to write a fixed number of commits, such as when limiting
      with "-n <N>" or filling the first page of a pager.
      
      When running a command like 'git rev-list --topo-order HEAD',
      Git performed the following steps:
      
      1. Run limit_list(), which parses all reachable commits,
         adds them to a linked list, and distributes UNINTERESTING
         flags. If all unprocessed commits are UNINTERESTING, then
         it may terminate without walking all reachable commits.
         This does not occur if we do not specify UNINTERESTING
         commits.
      
      2. Run sort_in_topological_order(), which is an implementation
         of Kahn's algorithm. It first iterates through the entire
         set of important commits and computes the in-degree of each
         (plus one, as we use 'zero' as a special value here). Then,
         we walk the commits in priority order, adding them to the
         priority queue if and only if their in-degree is one. As
         we remove commits from this priority queue, we decrement the
         in-degree of their parents.
      
      3. While we are peeling commits for output, get_revision_1()
         uses pop_commit on the full list of commits computed by
         sort_in_topological_order().
      
      In the new algorithm, these three steps correspond to three
      different commit walks. We run these walks simultaneously,
      and advance each only as far as necessary to satisfy the
      requirements of the 'higher order' walk. We know when we can
      pause each walk by using generation numbers from the commit-
      graph feature.
      
      Recall that the generation number of a commit satisfies:
      
      * If the commit has at least one parent, then the generation
        number is one more than the maximum generation number among
        its parents.
      
      * If the commit has no parent, then the generation number is one.
      
      There are two special generation numbers:
      
      * GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
        indicates that the commit is not stored in the commit-graph and
        the generation number was not previously calculated.
      
      * GENERATION_NUMBER_ZERO: this value (0) is a special indicator
        to say that the commit-graph was generated by a version of Git
        that does not compute generation numbers (such as v2.18.0).
      
      Since we use generation_numbers_enabled() before using the new
      algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
      However, the existence of GENERATION_NUMBER_INFINITY implies the
      following weaker statement than the usual we expect from
      generation numbers:
      
          If A and B are commits with generation numbers gen(A) and
          gen(B) and gen(A) < gen(B), then A cannot reach B.
      
      Thus, we will walk in each of our stages until the "maximum
      unexpanded generation number" is strictly lower than the
      generation number of a commit we are about to use.
      
      The walks are as follows:
      
      1. EXPLORE: using the explore_queue priority queue (ordered by
         maximizing the generation number), parse each reachable
         commit until all commits in the queue have generation
         number strictly lower than needed. During this walk, update
         the UNINTERESTING flags as necessary.
      
      2. INDEGREE: using the indegree_queue priority queue (ordered
         by maximizing the generation number), add one to the in-
         degree of each parent for each commit that is walked. Since
         we walk in order of decreasing generation number, we know
         that discovering an in-degree value of 0 means the value for
         that commit was not initialized, so should be initialized to
         two. (Recall that in-degree value "1" is what we use to say a
         commit is ready for output.) As we iterate the parents of a
         commit during this walk, ensure the EXPLORE walk has walked
         beyond their generation numbers.
      
      3. TOPO: using the topo_queue priority queue (ordered based on
         the sort_order given, which could be commit-date, author-
         date, or typical topo-order which treats the queue as a LIFO
         stack), remove a commit from the queue and decrement the
         in-degree of each parent. If a parent has an in-degree of
         one, then we add it to the topo_queue. Before we decrement
         the in-degree, however, ensure the INDEGREE walk has walked
         beyond that generation number.
      
      The implementations of these walks are in the following methods:
      
      * explore_walk_step and explore_to_depth
      * indegree_walk_step and compute_indegrees_to_depth
      * next_topo_commit and expand_topo_walk
      
      These methods have some patterns that may seem strange at first,
      but they are probably carry-overs from their equivalents in
      limit_list and sort_in_topological_order.
      
      One thing that is missing from this implementation is a proper
      way to stop walking when the entire queue is UNINTERESTING, so
      this implementation is not enabled by comparisions, such as in
      'git rev-list --topo-order A..B'. This can be updated in the
      future.
      
      In my local testing, I used the following Git commands on the
      Linux repository in three modes: HEAD~1 with no commit-graph,
      HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
      allows comparing the benefits we get from parsing commits from
      the commit-graph and then again the benefits we get by
      restricting the set of commits we walk.
      
      Test: git rev-list --topo-order -100 HEAD
      HEAD~1, no commit-graph: 6.80 s
      HEAD~1, w/ commit-graph: 0.77 s
        HEAD, w/ commit-graph: 0.02 s
      
      Test: git rev-list --topo-order -100 HEAD -- tools
      HEAD~1, no commit-graph: 9.63 s
      HEAD~1, w/ commit-graph: 6.06 s
        HEAD, w/ commit-graph: 0.06 s
      
      This speedup is due to a few things. First, the new generation-
      number-enabled algorithm walks commits on order of the number of
      results output (subject to some branching structure expectations).
      Since we limit to 100 results, we are running a query similar to
      filling a single page of results. Second, when specifying a path,
      we must parse the root tree object for each commit we walk. The
      previous benefits from the commit-graph are entirely from reading
      the commit-graph instead of parsing commits. Since we need to
      parse trees for the same number of commits as before, we slow
      down significantly from the non-path-based query.
      
      For the test above, I specifically selected a path that is changed
      frequently, including by merge commits. A less-frequently-changed
      path (such as 'README') has similar end-to-end time since we need
      to walk the same number of commits (before determining we do not
      have 100 hits). However, get the benefit that the output is
      presented to the user as it is discovered, much the same as a
      normal 'git log' command (no '--topo-order'). This is an improved
      user experience, even if the command has the same runtime.
      Helped-by: 's avatarJeff King <peff@peff.net>
      Signed-off-by: 's avatarDerrick Stolee <dstolee@microsoft.com>
      Signed-off-by: 's avatarJunio C Hamano <gitster@pobox.com>
      b4542418
    • Derrick Stolee's avatar
      commit/revisions: bookkeeping before refactoring · 5284fc5c
      Derrick Stolee authored
      There are a few things that need to move around a little before
      making a big refactoring in the topo-order logic:
      
      1. We need access to record_author_date() and
         compare_commits_by_author_date() in revision.c. These are used
         currently by sort_in_topological_order() in commit.c.
      
      2. Moving these methods to commit.h requires adding an author_date_slab
         declaration to commit.h. Consumers will need their own implementation.
      
      3. The add_parents_to_list() method in revision.c performs logic
         around the UNINTERESTING flag and other special cases depending
         on the struct rev_info. Allow this method to ignore a NULL 'list'
         parameter, as we will not be populating the list for our walk.
         Also rename the method to the slightly more generic name
         process_parents() to make clear that this method does more than
         add to a list (and no list is required anymore).
      Helped-by: 's avatarJeff King <peff@peff.net>
      Signed-off-by: 's avatarDerrick Stolee <dstolee@microsoft.com>
      Signed-off-by: 's avatarJunio C Hamano <gitster@pobox.com>
      5284fc5c
    • Derrick Stolee's avatar
      revision.c: begin refactoring --topo-order logic · f0d9cc41
      Derrick Stolee authored
      When running 'git rev-list --topo-order' and its kin, the topo_order
      setting in struct rev_info implies the limited setting. This means
      that the following things happen during prepare_revision_walk():
      
      * revs->limited implies we run limit_list() to walk the entire
        reachable set. There are some short-cuts here, such as if we
        perform a range query like 'git rev-list COMPARE..HEAD' and we
        can stop limit_list() when all queued commits are uninteresting.
      
      * revs->topo_order implies we run sort_in_topological_order(). See
        the implementation of that method in commit.c. It implies that
        the full set of commits to order is in the given commit_list.
      
      These two methods imply that a 'git rev-list --topo-order HEAD'
      command must walk the entire reachable set of commits _twice_ before
      returning a single result.
      
      If we have a commit-graph file with generation numbers computed, then
      there is a better way. This patch introduces some necessary logic
      redirection when we are in this situation.
      
      In v2.18.0, the commit-graph file contains zero-valued bytes in the
      positions where the generation number is stored in v2.19.0 and later.
      Thus, we use generation_numbers_enabled() to check if the commit-graph
      is available and has non-zero generation numbers.
      
      When setting revs->limited only because revs->topo_order is true,
      only do so if generation numbers are not available. There is no
      reason to use the new logic as it will behave similarly when all
      generation numbers are INFINITY or ZERO.
      
      In prepare_revision_walk(), if we have revs->topo_order but not
      revs->limited, then we trigger the new logic. It breaks the logic
      into three pieces, to fit with the existing framework:
      
      1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
         struct. We use the presence of this struct as a signal to use the
         new methods during our walk. In this patch, this method simply
         calls limit_list() and sort_in_topological_order(). In the future,
         this method will set up a new data structure to perform that logic
         in-line.
      
      2. next_topo_commit() provides get_revision_1() with the next topo-
         ordered commit in the list. Currently, this simply pops the commit
         from revs->commits.
      
      3. expand_topo_walk() provides get_revision_1() with a way to signal
         walking beyond the latest commit. Currently, this calls
         add_parents_to_list() exactly like the old logic.
      
      While this commit presents method redirection for performing the
      exact same logic as before, it allows the next commit to focus only
      on the new logic.
      Signed-off-by: 's avatarDerrick Stolee <dstolee@microsoft.com>
      Signed-off-by: 's avatarJunio C Hamano <gitster@pobox.com>
      f0d9cc41
  14. 23 Oct, 2018 1 commit
    • Matthew DeVore's avatar
      exclude-promisor-objects: declare when option is allowed · 669b1d2a
      Matthew DeVore authored
      The --exclude-promisor-objects option causes some funny behavior in at
      least two commands: log and blame. It causes a BUG crash:
      
      	$ git log --exclude-promisor-objects
      	BUG: revision.c:2143: exclude_promisor_objects can only be used
      	when fetch_if_missing is 0
      	Aborted
      	[134]
      
      Fix this such that the option is treated like any other unknown option.
      The commands that must support it are limited, so declare in those
      commands that the flag is supported. In particular:
      
      	pack-objects
      	prune
      	rev-list
      
      The commands were found by searching for logic which parses
      --exclude-promisor-objects outside of revision.c. Extra logic outside of
      revision.c is needed because fetch_if_missing must be turned on before
      revision.c sees the option or it will BUG-crash. The above list is
      supported by the fact that no other command is introspectively invoked
      by another command passing --exclude-promisor-object.
      Signed-off-by: 's avatarMatthew DeVore <matvore@google.com>
      Signed-off-by: 's avatarJunio C Hamano <gitster@pobox.com>
      669b1d2a
  15. 22 Oct, 2018 2 commits
  16. 06 Oct, 2018 1 commit
    • Matthew DeVore's avatar
      revision: mark non-user-given objects instead · 99c9aa95
      Matthew DeVore authored
      Currently, list-objects.c incorrectly treats all root trees of commits
      as USER_GIVEN. Also, it would be easier to mark objects that are
      non-user-given instead of user-given, since the places in the code
      where we access an object through a reference are more obvious than
      the places where we access an object that was given by the user.
      
      Resolve these two problems by introducing a flag NOT_USER_GIVEN that
      marks blobs and trees that are non-user-given, replacing USER_GIVEN.
      (Only blobs and trees are marked because this mark is only used when
      filtering objects, and filtering of other types of objects is not
      supported yet.)
      
      This fixes a bug in that git rev-list behaved differently from git
      pack-objects. pack-objects would *not* filter objects given explicitly
      on the command line and rev-list would filter. This was because the two
      commands used a different function to add objects to the rev_info
      struct. This seems to have been an oversight, and pack-objects has the
      correct behavior, so I added a test to make sure that rev-list now
      behaves properly.
      Signed-off-by: 's avatarMatthew DeVore <matvore@google.com>
      Signed-off-by: 's avatarJunio C Hamano <gitster@pobox.com>
      99c9aa95
  17. 21 Sep, 2018 5 commits
  18. 29 Aug, 2018 1 commit
    • Jeff King's avatar
      convert "oidcmp() == 0" to oideq() · 4a7e27e9
      Jeff King authored
      Using the more restrictive oideq() should, in the long run,
      give the compiler more opportunities to optimize these
      callsites. For now, this conversion should be a complete
      noop with respect to the generated code.
      
      The result is also perhaps a little more readable, as it
      avoids the "zero is equal" idiom. Since it's so prevalent in
      C, I think seasoned programmers tend not to even notice it
      anymore, but it can sometimes make for awkward double
      negations (e.g., we can drop a few !!oidcmp() instances
      here).
      
      This patch was generated almost entirely by the included
      coccinelle patch. This mechanical conversion should be
      completely safe, because we check explicitly for cases where
      oidcmp() is compared to 0, which is what oideq() is doing
      under the hood. Note that we don't have to catch "!oidcmp()"
      separately; coccinelle's standard isomorphisms make sure the
      two are treated equivalently.
      
      I say "almost" because I did hand-edit the coccinelle output
      to fix up a few style violations (it mostly keeps the
      original formatting, but sometimes unwraps long lines).
      Signed-off-by: 's avatarJeff King <peff@peff.net>
      Signed-off-by: 's avatarJunio C Hamano <gitster@pobox.com>
      4a7e27e9
  19. 22 Aug, 2018 1 commit
    • Jeff King's avatar
      rev-list: make empty --stdin not an error · a12cbe23
      Jeff King authored
      When we originally did the series that contains 7ba82629
      (revision: add rev_input_given flag, 2017-08-02) the intent
      was that "git rev-list --stdin </dev/null" would similarly
      become a successful noop. However, an attempt at the time to
      do that did not work[1]. The problem is that rev_input_given
      serves two roles:
      
       - it tells rev-list.c that it should not error out
      
       - it tells revision.c that it should not have the "default"
         ref kick (e.g., "HEAD" in "git log")
      
      We want to trigger the former, but not the latter. This is
      technically possible with a single flag, if we set the flag
      only after revision.c's revs->def check. But this introduces
      a rather subtle ordering dependency.
      
      Instead, let's keep two flags: one to denote when we got
      actual input (which triggers both roles) and one for when we
      read stdin (which triggers only the first).
      
      This does mean a caller interested in the first role has to
      check both flags, but there's only one such caller. And any
      future callers might want to make the distinction anyway
      (e.g., if they care less about erroring out, and more about
      whether revision.c soaked up our stdin).
      
      In fact, we already keep such a flag internally in
      revision.c for this purpose, so this is really just exposing
      that to the caller (and the old function-local flag can go
      away in favor of our new one).
      
      [1] https://public-inbox.org/git/20170802223416.gwiezhbuxbdmbjzx@sigill.intra.peff.net/Helped-by: 's avatarJunio C Hamano <gitster@pobox.com>
      Signed-off-by: 's avatarJeff King <peff@peff.net>
      Signed-off-by: 's avatarJunio C Hamano <gitster@pobox.com>
      a12cbe23
  20. 13 Aug, 2018 1 commit
  21. 20 Jul, 2018 1 commit
  22. 16 Jul, 2018 1 commit
    • Jonathan Tan's avatar
      revision: tolerate promised targets of tags · dc0a13f6
      Jonathan Tan authored
      In handle_commit(), it is fatal for an annotated tag to point to a
      non-existent object. --exclude-promisor-objects should relax this rule
      and allow non-existent objects that are promisor objects, but this is
      not the case. Update handle_commit() to tolerate this situation.
      
      This was observed when cloning from a repository with an annotated tag
      pointing to a blob. The test included in this patch demonstrates this
      case.
      Signed-off-by: 's avatarJonathan Tan <jonathantanmy@google.com>
      Signed-off-by: 's avatarJunio C Hamano <gitster@pobox.com>
      dc0a13f6
  23. 09 Jul, 2018 1 commit
    • Jonathan Tan's avatar
      upload-pack: send refs' objects despite "filter" · a0c9016a
      Jonathan Tan authored
      A filter line in a request to upload-pack filters out objects regardless
      of whether they are directly referenced by a "want" line or not. This
      means that cloning with "--filter=blob:none" (or another filter that
      excludes blobs) from a repository with at least one ref pointing to a
      blob (for example, the Git repository itself) results in output like the
      following:
      
          error: missing object referenced by 'refs/tags/junio-gpg-pub'
      
      and if that particular blob is not referenced by a fetched tree, the
      resulting clone fails fsck because there is no object from the remote to
      vouch that the missing object is a promisor object.
      
      Update both the protocol and the upload-pack implementation to include
      all explicitly specified "want" objects in the packfile regardless of
      the filter specification.
      Signed-off-by: 's avatarJonathan Tan <jonathantanmy@google.com>
      Signed-off-by: 's avatarJunio C Hamano <gitster@pobox.com>
      a0c9016a
  24. 29 Jun, 2018 4 commits
  25. 21 May, 2018 1 commit
  26. 16 May, 2018 1 commit
    • Stefan Beller's avatar
      object-store: move object access functions to object-store.h · cbd53a21
      Stefan Beller authored
      This should make these functions easier to find and cache.h less
      overwhelming to read.
      
      In particular, this moves:
      - read_object_file
      - oid_object_info
      - write_object_file
      
      As a result, most of the codebase needs to #include object-store.h.
      In this patch the #include is only added to files that would fail to
      compile otherwise.  It would be better to #include wherever
      identifiers from the header are used.  That can happen later
      when we have better tooling for it.
      Signed-off-by: Stefan Beller's avatarStefan Beller <sbeller@google.com>
      Signed-off-by: 's avatarJunio C Hamano <gitster@pobox.com>
      cbd53a21