Draft: Compress zero runs

Nir Soffer requested to merge compress-zero-runs into master

Instead of hashing blocks, hash chunks of blocks. A chunk can contain zero or more blocks.

The image is converted to a stream of chunks, compressing consecutive zero blocks into one chunk. The end of the stream is marked with a special last chunk containing no blocks.

This image:

block 0 (data)
block 1 (zero)
block 2 (zero)
block 3 (zero)

Is compressed to:

chunk 0 (data, 1 block)
chunk 1 (zero, 3 blocks)
chunk 2 (last, 0 blocks)

A data chunk always contains one block. A zero chunk can contain one or more zero blocks. Finally, the stream ends with the special last chunk containing no blocks.
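As an illustration of the scheme above, here is a hypothetical Python sketch (not the actual implementation) that compresses a sequence of (index, is_zero) blocks into (index, count, kind) chunks, merging consecutive zero blocks into one chunk and ending with the last chunk:

```python
def compress(blocks):
    """Yield (index, count, kind) chunks from an iterable of
    (index, is_zero) blocks, merging runs of zero blocks."""
    run_start = 0
    run_count = 0
    next_index = 0
    for index, is_zero in blocks:
        if is_zero:
            if run_count == 0:
                run_start = index
            run_count += 1
        else:
            if run_count:
                # Flush the pending zero run as a single chunk.
                yield (run_start, run_count, "zero")
                run_count = 0
            yield (index, 1, "data")
        next_index = index + 1
    if run_count:
        yield (run_start, run_count, "zero")
    # Special last chunk: index is last index + 1, no blocks.
    yield (next_index, 0, "last")
```

Feeding it the example image above yields the three chunks shown: one data chunk, one zero chunk of 3 blocks, and the last chunk.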

Data chunk format:

index - 8 bytes (block index)
count - 8 bytes (always 1)
flags - 1 byte  (always 0)
hash - 32 bytes (for sha256)

Zero chunk format:

index - 8 bytes (index of first block in chunk)
count - 8 bytes (number of zero blocks in chunk)
flags - 1 byte (always 1)

Last chunk format:

index - 8 bytes (last index + 1)
count - 8 bytes (always 0)
flags - 1 byte  (always 255)
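The three formats above share a 17-byte header (index + count + flags), with the data chunk adding the 32-byte block hash. A sketch of the encoding in Python; the byte order is not specified in this description, so the little-endian layout here is an assumption for illustration:

```python
import struct

# index (8 bytes) + count (8 bytes) + flags (1 byte) = 17 bytes.
# Little-endian is an assumed byte order, not taken from the source.
HEADER = struct.Struct("<QQB")

def data_chunk(index, digest):
    # flags=0, count is always 1, followed by the 32-byte block hash.
    return HEADER.pack(index, 1, 0) + digest

def zero_chunk(index, count):
    # flags=1, index of the first block and the number of zero blocks.
    return HEADER.pack(index, count, 1)

def last_chunk(index):
    # flags=255, count=0, index is last index + 1.
    return HEADER.pack(index, 0, 255)
```

These 17-byte headers are what get hashed per chunk, which is where the overhead figures below come from.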

Since hashing a zero run takes constant time regardless of the number of blocks, we no longer need to batch zero blocks and periodically send them to all workers. In fact we cannot do that: the hash would differ depending on when the zero blocks are sent to the workers.

With this change we hash an extra 17 bytes per data block per stream, but this work is negligible, since we already hash 64 KiB for every data block.

For zero block runs we hash 17 bytes per run. In the worst case, when every zero run contains a single block, we hash 17 bytes per block instead of 32 bytes per block. In the best case, an empty image, we hash 17 bytes per stream.

The last chunk adds 17 bytes per stream, also negligible.
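A quick sanity check of the per-data-block overhead, assuming 64 KiB blocks as stated above:

```python
BLOCK_SIZE = 64 * 1024      # 64 KiB data block, as stated above
HEADER_SIZE = 8 + 8 + 1     # index + count + flags = 17 bytes

# Extra hashed bytes per data block, relative to the block payload.
overhead = HEADER_SIZE / BLOCK_SIZE
print(f"{overhead:.4%}")    # about 0.026% extra hashing per data block
```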
