    pack-objects: implement bitmap writing · 7cc8f971
    Vicent Marti authored and Junio C Hamano committed
    
    This commit further extends the functionality of `pack-objects` by allowing
    it to write out a `.bitmap` index next to any written packs, together
    with the `.idx` index that currently gets written.
    
    If bitmap writing is enabled for a given repository (either by calling
    `pack-objects` with the `--write-bitmap-index` flag or by having
    `pack.writebitmaps` set to `true` in the config) and pack-objects is
    writing a packfile that would normally be indexed (i.e. not piping to
    stdout), we will attempt to write the corresponding bitmap index for the
    packfile.
    
    Bitmap index writing happens after the packfile and its index have been
    successfully written to disk (`finish_tmp_packfile`). The process is
    performed in several steps:
    
        1. `bitmap_writer_set_checksum`: this call stores the partial
           checksum for the packfile being written; the checksum will be
           written in the resulting bitmap index to verify its integrity.
    
        2. `bitmap_writer_build_type_index`: this call uses the array of
           `struct object_entry` that has just been sorted when writing out
           the actual packfile index to disk to generate 4 type-index bitmaps
           (one for each object type).
    
           These bitmaps have their nth bit set if the given object is of
           the bitmap's type. E.g. the nth bit of the Commits bitmap will be
           1 if the nth object in the packfile index is a commit.
    
           This is a very cheap operation because the bitmap writing code has
           access to the metadata stored in the `struct object_entry` array,
           and hence the real type for each object in the packfile.
    
        3. `bitmap_writer_reuse_bitmaps`: if a bitmap index already exists
           for one of the packfiles we're trying to repack, this call
           will efficiently rebuild the existing bitmaps so they can be
           reused on the new index. All the existing bitmaps will be stored
           in a `reuse` hash table, and the commit selection phase will
           prioritize these when selecting, as they can be written directly
           to the new index without having to perform a revision walk to
           fill the bitmap. This can greatly speed up the repack of a
           repository that already has bitmaps.
    
        4. `bitmap_writer_select_commits`: if bitmap writing is enabled for
           a given `pack-objects` run, the sequence of commits generated
           during the Counting Objects phase will be stored in an array.
    
           We then use that array to build up the list of selected commits.
           Writing a bitmap in the index for each object in the repository
           would be cost-prohibitive, so we use a simple heuristic to pick
           the commits that will be indexed with bitmaps.
    
           The current heuristics are a simplified version of JGit's
           original implementation. We select a higher density of commits
           depending on their age: the 100 most recent commits are always
           selected, after that we pick 1 commit out of every 100, and the gap
           increases as the commits grow older. On top of that, we make sure
           that every single branch that has not been merged (all the tips
           that would be required from a clone) gets its own bitmap, and
           when selecting commits between a gap, we tend to prioritize the
           commit with the most parents.
    
           Do note that there is no right/wrong way to perform commit
           selection; different selection algorithms will result in
           different commits being selected, but there's no such thing as
           "missing a commit". The bitmap walker algorithm implemented in
           `prepare_bitmap_walk` is able to adapt to missing bitmaps by
           performing manual walks that complete the bitmap: the ideal
           selection algorithm, however, would select the commits that are
           more likely to be used as roots for a walk in the future (e.g.
           the tips of each branch, and so on) to ensure a bitmap for them
           is always available.
    
        5. `bitmap_writer_build`: this is the computationally expensive part
           of bitmap generation. Based on the list of commits that were
           selected in the previous step, we perform several incremental
           walks to generate the bitmap for each commit.
    
           The walks begin from the oldest commit, and are built up
           incrementally for each branch. E.g. consider this dag where A, B,
           C, D, E, F are the selected commits, and a, b, c, e are a chunk
           of simplified history that will not receive bitmaps.
    
                A---a---B--b--C--c--D
                         \
                          E--e--F
    
           We start by building the bitmap for A, using A as the root for a
           revision walk and marking all the objects that are reachable
           until the walk is over. Once this bitmap is stored, we reuse the
           bitmap walker to perform the walk for B, assuming that once we
           reach A again, the walk will be terminated because A has already
           been SEEN on the previous walk.
    
           This process is repeated for C and D, but when we try to
           generate the bitmap for E, we can reuse neither the current walk
           nor the bitmap we have generated so far.
    
           We now reset the walk and clear the bitmap, then perform the
           walk from scratch using E as the origin. This new walk, however,
           does not need to be completed. Once we hit B, we can look up the
           bitmap we have already stored for that commit and OR it with the
           bitmap we've composed so far, allowing us to end the walk early.
    
           After all the bitmaps have been generated, another iteration
           through the list of commits is performed to find the best XOR
           offsets for compression before writing them to disk. Because of
           the incremental nature of these bitmaps, XORing one of them with
           its predecessor results in a minimal "bitmap delta" most of the
           time. We can write this delta to the on-disk bitmap index, and
           then re-compose the original bitmaps by XORing them again when
           loaded.
    
           This phase is very similar to pack-objects' `find_delta` (using
           bitmaps instead of objects, of course), except the heuristics
           have been greatly simplified: we only check the 10 bitmaps before
           any given one to find the best-compressing one. This gives good
           results in practice, because there is locality in the ordering of
           the objects (and therefore bitmaps) in the packfile.
    
        6. `bitmap_writer_finish`: the last step in the process is
           serializing to disk all the bitmap data that has been generated
           in the two previous steps.
    
           The bitmap is written to a tmp file and then moved atomically to
           its final destination, using the same process as
           `pack-write.c:write_idx_file`.
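
As a rough illustration of steps 2 and 5, here is a minimal Python sketch of the type-index bitmaps and of the XOR-offset search. It is not the C code in this commit. Bitmaps are modelled as plain integers, all function names are hypothetical, and the number of set bits is used as an assumed stand-in for the real on-disk size of a compressed bitmap.

```python
# Illustrative sketch only; not git's implementation.
# A bitmap is a Python int: bit n corresponds to the nth object in the
# packfile index.

OBJECT_TYPES = ("commit", "tree", "blob", "tag")
XOR_WINDOW = 10  # the message above says the 10 preceding bitmaps are checked


def build_type_index(object_types):
    """Return one bitmap per type; bit n is set if object n has that type."""
    bitmaps = {t: 0 for t in OBJECT_TYPES}
    for n, obj_type in enumerate(object_types):
        bitmaps[obj_type] |= 1 << n
    return bitmaps


def popcount(bitmap):
    """Number of set bits (assumed cost measure for this sketch)."""
    return bin(bitmap).count("1")


def choose_xor_offsets(bitmaps, window=XOR_WINDOW):
    """For each bitmap, pick how far back to XOR to get the smallest delta.

    Returns a list of (offset, delta) pairs. Offset 0 means the bitmap is
    stored as is; offset k means it was XORed with the bitmap k slots earlier.
    """
    encoded = []
    for i, bitmap in enumerate(bitmaps):
        best_offset, best_delta = 0, bitmap
        for offset in range(1, min(i, window) + 1):
            delta = bitmap ^ bitmaps[i - offset]
            if popcount(delta) < popcount(best_delta):
                best_offset, best_delta = offset, delta
        encoded.append((best_offset, best_delta))
    return encoded


def restore(encoded):
    """Re-compose the original bitmaps by XORing the deltas again."""
    restored = []
    for i, (offset, delta) in enumerate(encoded):
        restored.append(delta if offset == 0 else delta ^ restored[i - offset])
    return restored
```

Because each bitmap mostly extends the one before it, the chosen delta usually has far fewer set bits than the bitmap it replaces, and `restore` recovers the originals exactly.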
    
    Signed-off-by: Vicent Marti <tanoku@gmail.com>
    Signed-off-by: Jeff King <peff@peff.net>
    Signed-off-by: Junio C Hamano <gitster@pobox.com>