Storing .zip archives in Object Storage vs. individual resources
Moving this Slack discussion (internal) to an issue so we can track it properly. Edited down to the relevant discussion and formatted for legibility.
vshushlin: kamil, do you remember an argument in favor of using zip archives in object storage vs. just writing individual resources to S3? I know we've discussed it many times, but there were too many discussions. cc Nicole Williams
kamil: It is hard for me to remember the exact argument, but we decided to try zip as a good alternative that does not require us to expand all the individual files, of which a Pages site usually has plenty. Given that Pages content changes quite frequently, expanding files would be a pretty expensive operation to perform. An additional argument: we get GitLab Pages Review Apps for free with the zip.
jacobvosmaer: I was wondering too why we want to use zip files instead of putting each file in object storage as an individual blob. I'm not sure I understand the arguments in favor. Zip files minimize storage at the cost of slower individual file retrieval; blob-per-file optimizes file retrieval speed.
kamil: jacobvosmaer The retrieval time is basically the same (plus the cost of decompression) if you have a mapping: you read linearly from a specific offset to a given size.
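To make the offset argument concrete, here is a minimal Go sketch of such a ranged read, assuming a hypothetical mapping that records each entry's compressed-data offset and size at deploy time. The `Entry` type and `serveEntry` function are illustrative only, not the actual Pages code.

```go
// Package zipread sketches retrieving one file from a zip archive in
// object storage: a single ranged GET plus optional decompression.
package zipread

import (
	"compress/flate"
	"fmt"
	"io"
	"net/http"
)

// Entry is a hypothetical mapping record, built once per deploy, that
// remembers where one file's compressed bytes live inside the archive.
type Entry struct {
	DataOffset     int64 // offset of the compressed data within the zip
	CompressedSize int64 // size of the compressed data in bytes
	Deflated       bool  // true for DEFLATE entries, false for stored ones
}

// serveEntry reads linearly from the entry's offset for exactly its
// compressed size, then inflates the bytes if the entry was deflated.
func serveEntry(w io.Writer, archiveURL string, e Entry) error {
	req, err := http.NewRequest(http.MethodGet, archiveURL, nil)
	if err != nil {
		return err
	}
	req.Header.Set("Range",
		fmt.Sprintf("bytes=%d-%d", e.DataOffset, e.DataOffset+e.CompressedSize-1))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusPartialContent {
		return fmt.Errorf("expected 206 Partial Content, got %s", resp.Status)
	}

	var r io.Reader = resp.Body
	if e.Deflated {
		fr := flate.NewReader(resp.Body)
		defer fr.Close()
		r = fr
	}
	_, err = io.Copy(w, r)
	return err
}
```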
jacobvosmaer: So now you need a mapping; with blob-per-file you don't need a mapping.
kamil: Yes, but at the cost of managing a large number of very small files.
jacobvosmaer: That cost is paid once per deploy, and then every time a user accesses the site they benefit.
kamil: Yes, if you assume that you want to have a single page deployed per project (then maybe it is manageable), but there's a very long-outstanding issue about GitLab Pages Review Apps 🙂
jacobvosmaer: So we're making the migration of GitLab.com to Kubernetes more complex (zip files) because of Pages Review Apps?
kamil: I don't think that extracting individual files and managing them is simpler than serving from zip directly. Currently we do it on the filesystem (FS), which is "fairly" fast due to it being a filesystem, but doing that on object storage (OS), and having a consistent implementation that works in two styles (FS and OS), is not simpler in my view. The current approach is clumsy on many layers; changing some aspects could probably make it slightly better. So I don't see zip being more complex, but rather a fundamentally different approach that solves the problem differently.
Now, if we could use the existing artifact, it would remove a ton of complexity while adding only a very moderately sized component to Pages to serve from ZIP. There is also a problem with the data structure on NFS: I don't see an easy way to migrate the existing NFS anyway (mostly due to symlink remapping, for which we would have to write our own tool).
Anyway, this is a multidimensional problem, and serving directly from OS vs. ZIP is a very small fraction of the bigger problem of handling data migration and process complexity 🙂
kamil: jacobvosmaer This could be a way to release it. After step one we could pretty quickly (I hope) disconnect Pages NFS 🙂
1. Add a ZIP handler to Pages (see the sketch below).
2. Ensure that the ZIP handler works properly; serve from it using existing artifacts; remove the pages:deploy job, as we no longer need to deploy Pages on Sidekiq.
3. Disconnect NFS storage from Rails/Sidekiq and consider the NFS read-only.
4. Give people 6 months to migrate their pages, but continue serving Pages from ZIP or NFS (as a last resort).
5. Disconnect NFS storage.
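For step one, a rough sketch of what a ZIP handler could look like in Go: archive/zip works over any io.ReaderAt, so a small adapter that turns ReadAt calls into HTTP Range requests is enough to serve files without expanding the archive. The `rangeReaderAt` and `NewZipHandler` names are illustrative, not the actual Pages implementation, and caching of the central directory and ranged reads is omitted.

```go
// Package zipserve sketches serving Pages content straight from a zip
// archive held in object storage, without expanding individual files.
package zipserve

import (
	"archive/zip"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// rangeReaderAt satisfies io.ReaderAt by translating each ReadAt call
// into a ranged GET against the archive's object-storage URL.
type rangeReaderAt struct {
	url  string
	size int64
}

func (r *rangeReaderAt) ReadAt(p []byte, off int64) (int, error) {
	req, err := http.NewRequest(http.MethodGet, r.url, nil)
	if err != nil {
		return 0, err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", off, off+int64(len(p))-1))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return io.ReadFull(resp.Body, p)
}

// NewZipHandler reads the archive's central directory once, then serves
// each request by resolving the URL path to an entry in the zip.
func NewZipHandler(archiveURL string, archiveSize int64) (http.Handler, error) {
	zr, err := zip.NewReader(&rangeReaderAt{archiveURL, archiveSize}, archiveSize)
	if err != nil {
		return nil, err
	}
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		name := strings.TrimPrefix(req.URL.Path, "/")
		if name == "" {
			name = "index.html" // minimal index resolution for the sketch
		}
		f, err := zr.Open(name)
		if err != nil {
			http.NotFound(w, req)
			return
		}
		defer f.Close()
		io.Copy(w, f) // content-type detection etc. omitted
	}), nil
}
```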
@jacobvosmaer-gitlab @ayufan @vshushlin
Conclusion
After multiple PoCs, discussions, and meetings, it was decided to move forward with zip archives over individual resources. A summary of the decision can be found in #437 (comment 398847743).