Add rubyzip refinement to speed up entry count

Matthias Käppler requested to merge 345673-rubyzip-fast-entry-count into master

What does this MR do and why?

In https://gitlab.com/gitlab-org/gitlab/-/issues/345673 we identified the need to provide a faster, more efficient way to count entries in a zip file.

This MR adds a small refinement to Zip::File (part of the rubyzip gem) that does just that. It reads only the End of Central Directory (EOCD) record, which contains an integer field holding the total entry count.

For a zip file with 1M entries, this improves performance by several orders of magnitude: reading the EOCD takes constant CPU time and memory, whereas iterating over the Central Directory entries is O(N) in both.
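To illustrate why reading the EOCD is cheap, here is a minimal standalone sketch of the idea. It is not the refinement added by this MR: the module name `EocdEntryCount` and its `count` method are hypothetical, and only the record layout (signature `0x06054b50`, total entry count as a 16-bit little-endian field at byte offset 10, 22-byte minimum size) comes from the zip format itself. It assumes a non-Zip64 archive and that the archive comment does not itself contain the EOCD signature.

```ruby
require 'stringio'

# Hypothetical helper for illustration only; not the code added by this MR.
module EocdEntryCount
  EOCD_SIGNATURE   = [0x06054b50].pack('V') # the bytes "PK\x05\x06"
  EOCD_MIN_SIZE    = 22                     # fixed part of the EOCD record
  MAX_COMMENT_SIZE = 0xFFFF                 # comment length is a 16-bit field

  # Returns the "total number of entries" field of the EOCD record.
  # Only the tail of the archive is read, so the cost does not grow
  # with the number of entries.
  def self.count(io)
    io.seek(0, IO::SEEK_END)
    file_size = io.pos
    tail_size = [file_size, EOCD_MIN_SIZE + MAX_COMMENT_SIZE].min

    io.seek(file_size - tail_size, IO::SEEK_SET)
    tail = io.read(tail_size).to_s.b

    # The EOCD sits at the end of the file, so search backwards.
    offset = tail.rindex(EOCD_SIGNATURE)
    if offset.nil? || offset + EOCD_MIN_SIZE > tail.bytesize
      raise ArgumentError, 'EOCD record not found'
    end

    # Bytes 10..11 of the record: total entries, little-endian uint16.
    # Assumption: for Zip64 archives this field may be 0xFFFF and the
    # real count lives in the Zip64 EOCD, which this sketch ignores.
    tail.byteslice(offset + 10, 2).unpack1('v')
  end
end

# Usage with a hand-built EOCD record that declares 7 entries.
record = [0x06054b50, 0, 0, 7, 7, 0, 0, 0].pack('VvvvvVVv')
puts EocdEntryCount.count(StringIO.new(record)) # prints 7
```

At most 22 + 65535 bytes are read regardless of archive size, which is where the constant CPU and memory cost comes from.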

More details with timing in https://gitlab.com/gitlab-org/gitlab/-/issues/345673#note_733321374

How to set up and validate locally

Run Rails console and paste this snippet:

module M
  using GemExtensions::Rubyzip::Refinements

  def fast_count(archive_path)
    # Use the block form so the file handle is closed after counting.
    File.open(archive_path) { |io| Zip::File.entry_count(io) }
  end

  extend self
end
[8] pry(main)> M.fast_count '/home/git/gitlab/spec/fixtures/safe_zip/valid-simple.zip'
=> 7

Compare with the entry count reported by unzip:

$ unzip -l spec/fixtures/safe_zip/valid-simple.zip
Archive:  spec/fixtures/safe_zip/valid-simple.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2019-01-17 16:30   public/
       12  2019-01-17 16:30   public/index.html
        0  2019-01-17 16:30   public/assets/
        0  2019-01-17 16:30   public/assets/image.png
        6  2019-01-17 16:30   public/images
        0  2019-01-17 16:30   source/
       12  2019-01-17 16:30   source/index.html
---------                     -------
       30                     7 files

Possible follow-ups

  • Write documentation for efficient use of zip files
  • Write a Cop that flags potentially harmful use of rubyzip APIs

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #345673

Edited by Corinna Gogolok
