    revision: implement sparse algorithm · d5d2e935
    Derrick Stolee authored and Junio C Hamano committed
    
    
    When enumerating objects to place in a pack-file during 'git
    pack-objects --revs', we discover the "frontier" of commits
    that we care about and the boundary with commits we find
    uninteresting. From that point, we walk trees to discover which
    trees and blobs are uninteresting. Finally, we walk trees from the
    interesting commits to find the interesting objects that are
    placed in the pack.
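
    For illustration, a push-like enumeration can be reproduced by
    feeding revision arguments to pack-objects on stdin; the refs
    'topic' and 'origin/master' below are placeholders:

    	printf '%s\n' topic '^origin/master' |
    		git pack-objects --revs --stdout >topic.pack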
    
    This commit introduces a new, "sparse" way to discover the
    uninteresting trees. We use the perspective of a single user trying
    to push their topic to a large repository. That user likely changed
    a very small fraction of the paths in their working directory, but
    we spend a lot of time walking all reachable trees.
    
    The way to switch the logic to work in this sparse way is to start
    caring about which paths introduce new trees. While it is not
    possible to generate a diff between the frontier boundary and all
    of the interesting commits, we can simulate that behavior by
    inspecting all of the root trees as a whole, then recursing down
    to the set of trees at each path.
    
    We had already taken the first step by passing an oidset to
    mark_trees_uninteresting_sparse(). We now create a dictionary
    whose keys are paths and values are oidsets. We consider the set
    of trees that appear at each path. While we inspect a tree, we
    add its subtrees to the oidsets corresponding to the tree entry's
    path. We also mark trees as UNINTERESTING if the tree we are
    parsing is UNINTERESTING.
    
    To actually improve the performance, we need to terminate our
    recursion. If the oidset contains only UNINTERESTING trees, then
    we do not continue the recursion. This avoids walking trees that
    are unlikely to be reachable from interesting trees. If the
    oidset contains only interesting trees, then we will walk these
    trees in the final stage that collects the interesting objects to
    place in the pack. Thus, we only recurse if the oidset contains
    both interesting and UNINTERESTING trees.
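
    To make the shape of the recursion concrete, here is a toy model
    of the idea in the previous two paragraphs. The types and names
    below (struct tree_set, walk_path, the "docs" and "src" paths) are
    illustrative stand-ins, not the identifiers used in revision.c,
    which operates on real tree objects and oidsets:

    	/*
    	 * Toy model of the sparse walk idea, not the code in revision.c:
    	 * real trees are git objects looked up by oid and the per-path
    	 * sets are oidsets; plain structs and arrays stand in for both.
    	 */
    	#include <stdio.h>
    	#include <string.h>

    	#define MAX_ENTRIES 4
    	#define MAX_SET 8

    	struct tree {
    		int uninteresting;                  /* the UNINTERESTING flag */
    		int nr;                             /* number of subtree entries */
    		const char *names[MAX_ENTRIES];     /* entry paths */
    		struct tree *subtrees[MAX_ENTRIES]; /* entry subtrees */
    	};

    	/* The set of trees seen at one path (stands in for an oidset). */
    	struct tree_set {
    		int nr;
    		struct tree *trees[MAX_SET];
    	};

    	static void set_add(struct tree_set *s, struct tree *t)
    	{
    		if (s->nr < MAX_SET)
    			s->trees[s->nr++] = t;
    	}

    	static void walk_path(const char *path, struct tree_set *set)
    	{
    		struct tree_set child_sets[MAX_ENTRIES] = {{ 0 }};
    		const char *child_names[MAX_ENTRIES];
    		int i, j, k, nr_children = 0, interesting = 0, uninteresting = 0;

    		for (i = 0; i < set->nr; i++) {
    			if (set->trees[i]->uninteresting)
    				uninteresting = 1;
    			else
    				interesting = 1;
    		}

    		/*
    		 * Recurse only when the set mixes interesting and
    		 * UNINTERESTING trees; all-UNINTERESTING sets are unlikely
    		 * to be reached from interesting commits, and all-interesting
    		 * sets are walked later when collecting objects to pack.
    		 */
    		if (!interesting || !uninteresting)
    			return;
    		printf("recursing under '%s'\n", path);

    		/* Group subtrees by entry path, propagating UNINTERESTING. */
    		for (i = 0; i < set->nr; i++) {
    			struct tree *t = set->trees[i];
    			for (j = 0; j < t->nr; j++) {
    				if (t->uninteresting)
    					t->subtrees[j]->uninteresting = 1;
    				for (k = 0; k < nr_children; k++)
    					if (!strcmp(child_names[k], t->names[j]))
    						break;
    				if (k == MAX_ENTRIES)
    					continue; /* toy capacity reached */
    				if (k == nr_children)
    					child_names[nr_children++] = t->names[j];
    				set_add(&child_sets[k], t->subtrees[j]);
    			}
    		}

    		for (i = 0; i < nr_children; i++)
    			walk_path(child_names[i], &child_sets[i]);
    	}

    	int main(void)
    	{
    		struct tree docs = { 0 };     /* unchanged, shared by both roots */
    		struct tree src_old = { 0 };  /* src/ at the boundary commit */
    		struct tree src_new = { 0 };  /* src/ at the commit being pushed */
    		struct tree root_old = { 1, 2, { "docs", "src" }, { &docs, &src_old } };
    		struct tree root_new = { 0, 2, { "docs", "src" }, { &docs, &src_new } };
    		struct tree_set roots = { 0 };

    		set_add(&roots, &root_old);
    		set_add(&roots, &root_new);
    		walk_path("(root)", &roots);
    		return 0;
    	}

    Run on these two toy roots, the walk recurses at the root and again
    only under "src", the one path whose set mixes interesting and
    UNINTERESTING trees; the shared "docs" tree terminates immediately
    because its set is entirely UNINTERESTING.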
    
    There are a few ways that this is not a universally better option.
    
    First, we can pack extra objects. If someone copies a subtree
    from one tree to another, the first tree will appear UNINTERESTING
    and we will not recurse to see that the subtree should also be
    UNINTERESTING. We will walk the new tree and see the subtree as
    a "new" object and add it to the pack. A test is modified to
    demonstrate this behavior and to verify that the new logic is
    being exercised.
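
    For instance, a topic like the following (the paths here are
    hypothetical) triggers the extra packing:

    	# lib/util/ already exists in the UNINTERESTING history
    	cp -r lib/util tools/util
    	git add tools/util
    	git commit -m 'copy lib/util under tools/'

    The set of trees at "lib" is entirely UNINTERESTING because that
    path did not change, so the walk never descends to discover that
    the copied util tree is already reachable; the interesting side
    then treats that tree and its blobs as new and adds them to the
    pack.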
    
    Second, we can have extra memory pressure. If, instead of being a
    single user pushing a small topic, we are a server sending new
    objects from across the entire working directory, then we will
    gain very little (the recursion will rarely terminate early) but
    will spend extra time maintaining the path-oidset dictionaries.
    
    Despite these potential drawbacks, the benefits of the algorithm
    are clear. By adding a counter to 'add_children_by_path' and
    'mark_tree_contents_uninteresting', I measured the number of
    parsed trees for the two algorithms in a variety of repos.
    
    For git.git, I used the following input:
    
    	v2.19.0
    	^v2.19.0~10
    
     Objects to pack: 550
    Walked (old alg): 282
    Walked (new alg): 130
    
    For the Linux repo, I used the following input:
    
    	v4.18
    	^v4.18~10
    
     Objects to pack:   518
    Walked (old alg): 4,836
    Walked (new alg):   188
    
    The two repos above are rather "wide and flat" compared to
    other repos that I have used in the past. As a comparison,
    I tested an old topic branch in the Azure DevOps repo, which
    has a much deeper folder structure than the Linux repo.
    
     Objects to pack:    220
    Walked (old alg): 22,804
    Walked (new alg):    129
    
    I used the number of walked trees as the main metric above because
    it is consistent across multiple runs. When I ran my tests, the
    end-to-end time of the pack-objects command with the same options
    could vary by 10x depending on whether the file system was warm.
    However, by running the same test repeatedly I could get more
    consistent timing results. The git.git and
    Linux tests were too fast overall (less than 0.5s) to measure
    an end-to-end difference. The Azure DevOps case was slow enough
    to see the time improve from 15s to 1s in the warm case. The
    cold case was 90s to 9s in my testing.
    
    These improvements will have even larger benefits in the super-
    large Windows repository. In our experiments, we see the
    "Enumerate objects" phase of pack-objects taking 60-80% of the
    end-to-end time of non-trivial pushes, taking longer than the
    network time to send the pack and the server time to verify the
    pack.
    
    Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
    Signed-off-by: Junio C Hamano <gitster@pobox.com>