• Derrick Stolee's avatar
    revision.c: generation-based topo-order algorithm · b4542418
    Derrick Stolee authored
    The current --topo-order algorithm requires walking all
    reachable commits up front, topo-sorting them, all before
    outputting the first value. This patch introduces a new
    algorithm which uses stored generation numbers to
    incrementally walk in topo-order, outputting commits as
    we go. This can dramatically reduce the computation time
    to write a fixed number of commits, such as when limiting
    with "-n <N>" or filling the first page of a pager.
    
    When running a command like 'git rev-list --topo-order HEAD',
    Git performed the following steps:
    
    1. Run limit_list(), which parses all reachable commits,
       adds them to a linked list, and distributes UNINTERESTING
       flags. If all unprocessed commits are UNINTERESTING, then
       it may terminate without walking all reachable commits.
       This does not occur if we do not specify UNINTERESTING
       commits.
    
    2. Run sort_in_topological_order(), which is an implementation
       of Kahn's algorithm. It first iterates through the entire
       set of important commits and computes the in-degree of each
       (plus one, as we use 'zero' as a special value here). Then,
       we walk the commits in priority order, adding them to the
       priority queue if and only if their in-degree is one. As
       we remove commits from this priority queue, we decrement the
       in-degree of their parents.
    
    3. While we are peeling commits for output, get_revision_1()
       uses pop_commit on the full list of commits computed by
       sort_in_topological_order().
    
    In the new algorithm, these three steps correspond to three
    different commit walks. We run these walks simultaneously,
    and advance each only as far as necessary to satisfy the
    requirements of the 'higher order' walk. We know when we can
    pause each walk by using generation numbers from the commit-
    graph feature.
    
    Recall that the generation number of a commit satisfies:
    
    * If the commit has at least one parent, then the generation
      number is one more than the maximum generation number among
      its parents.
    
    * If the commit has no parent, then the generation number is one.
    
    There are two special generation numbers:
    
    * GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
      indicates that the commit is not stored in the commit-graph and
      the generation number was not previously calculated.
    
    * GENERATION_NUMBER_ZERO: this value (0) is a special indicator
      to say that the commit-graph was generated by a version of Git
      that does not compute generation numbers (such as v2.18.0).
    
    Since we use generation_numbers_enabled() before using the new
    algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
    However, the existence of GENERATION_NUMBER_INFINITY implies the
    following weaker statement than the usual we expect from
    generation numbers:
    
        If A and B are commits with generation numbers gen(A) and
        gen(B) and gen(A) < gen(B), then A cannot reach B.
    
    Thus, we will walk in each of our stages until the "maximum
    unexpanded generation number" is strictly lower than the
    generation number of a commit we are about to use.
    
    The walks are as follows:
    
    1. EXPLORE: using the explore_queue priority queue (ordered by
       maximizing the generation number), parse each reachable
       commit until all commits in the queue have generation
       number strictly lower than needed. During this walk, update
       the UNINTERESTING flags as necessary.
    
    2. INDEGREE: using the indegree_queue priority queue (ordered
       by maximizing the generation number), add one to the in-
       degree of each parent for each commit that is walked. Since
       we walk in order of decreasing generation number, we know
       that discovering an in-degree value of 0 means the value for
       that commit was not initialized, so should be initialized to
       two. (Recall that in-degree value "1" is what we use to say a
       commit is ready for output.) As we iterate the parents of a
       commit during this walk, ensure the EXPLORE walk has walked
       beyond their generation numbers.
    
    3. TOPO: using the topo_queue priority queue (ordered based on
       the sort_order given, which could be commit-date, author-
       date, or typical topo-order which treats the queue as a LIFO
       stack), remove a commit from the queue and decrement the
       in-degree of each parent. If a parent has an in-degree of
       one, then we add it to the topo_queue. Before we decrement
       the in-degree, however, ensure the INDEGREE walk has walked
       beyond that generation number.
    
    The implementations of these walks are in the following methods:
    
    * explore_walk_step and explore_to_depth
    * indegree_walk_step and compute_indegrees_to_depth
    * next_topo_commit and expand_topo_walk
    
    These methods have some patterns that may seem strange at first,
    but they are probably carry-overs from their equivalents in
    limit_list and sort_in_topological_order.
    
    One thing that is missing from this implementation is a proper
    way to stop walking when the entire queue is UNINTERESTING, so
    this implementation is not enabled by comparisions, such as in
    'git rev-list --topo-order A..B'. This can be updated in the
    future.
    
    In my local testing, I used the following Git commands on the
    Linux repository in three modes: HEAD~1 with no commit-graph,
    HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
    allows comparing the benefits we get from parsing commits from
    the commit-graph and then again the benefits we get by
    restricting the set of commits we walk.
    
    Test: git rev-list --topo-order -100 HEAD
    HEAD~1, no commit-graph: 6.80 s
    HEAD~1, w/ commit-graph: 0.77 s
      HEAD, w/ commit-graph: 0.02 s
    
    Test: git rev-list --topo-order -100 HEAD -- tools
    HEAD~1, no commit-graph: 9.63 s
    HEAD~1, w/ commit-graph: 6.06 s
      HEAD, w/ commit-graph: 0.06 s
    
    This speedup is due to a few things. First, the new generation-
    number-enabled algorithm walks commits on order of the number of
    results output (subject to some branching structure expectations).
    Since we limit to 100 results, we are running a query similar to
    filling a single page of results. Second, when specifying a path,
    we must parse the root tree object for each commit we walk. The
    previous benefits from the commit-graph are entirely from reading
    the commit-graph instead of parsing commits. Since we need to
    parse trees for the same number of commits as before, we slow
    down significantly from the non-path-based query.
    
    For the test above, I specifically selected a path that is changed
    frequently, including by merge commits. A less-frequently-changed
    path (such as 'README') has similar end-to-end time since we need
    to walk the same number of commits (before determining we do not
    have 100 hits). However, get the benefit that the output is
    presented to the user as it is discovered, much the same as a
    normal 'git log' command (no '--topo-order'). This is an improved
    user experience, even if the command has the same runtime.
    Helped-by: default avatarJeff King <peff@peff.net>
    Signed-off-by: default avatarDerrick Stolee <dstolee@microsoft.com>
    Signed-off-by: default avatarJunio C Hamano <gitster@pobox.com>
    b4542418
object.h 6.3 KB