squashfs becomes preferred image distribution method
The Tar Method
Currently, Charliecloud's preferred method of working with containers is to create an image tarball, then untar the image directly into tmpfs on every compute node. This is a simple and well-understood method, as most users are comfortable working with tarballs, but it presents some challenges: distributing container images at scale can be quite lengthy, and the entire container image must be held in memory on each node.
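As a concrete sketch of this workflow (paths and image contents are made up for illustration), the tarball is created once, then unpacked into memory-backed storage on every node:

```shell
# Hypothetical tar workflow: pack an unpacked image directory once,
# then untar it on each compute node. /tmp/tar-demo/node stands in
# for a node's memory-backed tmpfs here.
set -e
mkdir -p /tmp/tar-demo/img/bin
echo 'hello' > /tmp/tar-demo/img/bin/hello
tar -C /tmp/tar-demo -czf /tmp/tar-demo/img.tar.gz img

# Repeated on every node -- the whole image ends up in memory:
mkdir -p /tmp/tar-demo/node
tar -C /tmp/tar-demo/node -xzf /tmp/tar-demo/img.tar.gz
```

The untar step is the part that grows with both image size and node count, which is where the time and memory costs come from.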
The Squash Filesystem
The squash filesystem (squashfs) provides a network-mounted container image for use with Charliecloud. Its primary advantages are reduced container image distribution time and more free memory on compute nodes.
Squashfs appears to be a clear winner for distributing Charliecloud container images. There are several approaches to using it:
- Kernel-mounted squashfs
- FUSE-mounted squashfs
In testing, the kernel-mounted squashfs made the container available to all nodes and had the fastest execution time of the squashfs methods, typically within seconds of tmpfs. Because the squashfs does not need to be copied onto each node, the total run time was always lower than with the current tar-based method: the overhead of mounting is less than the overhead of untarring the image onto every node.
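A minimal sketch of the kernel-mount approach (paths are hypothetical, and the mount step needs root, so it is only echoed here rather than run):

```shell
# Build a squashfs image if squashfs-tools is available; the single
# image file can live on a network filesystem shared by all nodes.
mkdir -p /tmp/sqfs-demo/img /tmp/sqfs-demo/mnt
echo demo > /tmp/sqfs-demo/img/hello.txt
if command -v mksquashfs >/dev/null; then
    mksquashfs /tmp/sqfs-demo/img /tmp/sqfs-demo/img.sqfs -noappend
fi

# The kernel mount requires root on each node (echoed, not run):
echo "mount -t squashfs -o loop,ro /tmp/sqfs-demo/img.sqfs /tmp/sqfs-demo/mnt"
```

The need for root at mount time is exactly the investment and risk discussed below.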
FUSE-mounted squashfs via squashfuse has also been tested. Initially only the high-level API was tested, and it proved competitive with tmpfs for a 'typical' HPC container workflow: the combined time of untar plus execution was greater than the execution time with the FUSE-mounted squashfs. It is noteworthy that squashfuse suffers on metadata-intensive operations such as filesystem walks and reading the entire contents of the container; the kernel-mounted squashfs was much less affected in these extreme scenarios. Additionally, I learned that squashfuse also offers a low-level FUSE API, built as squashfuse_ll when the right tools are installed when squashfuse is compiled. squashfuse_ll takes a smaller performance hit than squashfuse: it executed at nearly the speed of the kernel-mounted squashfs and handles metadata operations much faster than squashfuse. This could be an ideal middle ground, giving almost kernel-mount performance without the kernel mount.
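The FUSE variants need no privilege. A sketch of how a job could pick the faster low-level binary when it exists (the helper name and paths are hypothetical; the actual mounts need /dev/fuse, so they are shown as comments):

```shell
# Hypothetical helper: prefer squashfuse_ll when squashfuse was built
# against the low-level FUSE API, fall back to plain squashfuse.
mount_sqfs() {
    img=$1; mnt=$2
    mkdir -p "$mnt"
    if command -v squashfuse_ll >/dev/null; then
        squashfuse_ll "$img" "$mnt"    # low-level API, faster metadata
    else
        squashfuse "$img" "$mnt"       # high-level API fallback
    fi
}

# Usage on each node (needs /dev/fuse, so not run here):
#   mount_sqfs /lustre/images/img.sqfs /var/tmp/mnt
#   ch-run /var/tmp/mnt -- ls /       # run the container as usual
#   fusermount -u /var/tmp/mnt        # unmount when the job ends
```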
What now
Each mount method has advantages and disadvantages, but I think moving to a squashfs solution, in some capacity, as the new preferred method for Charliecloud is the way to go.
Support kernel-mounted squashfs
The top performer is the kernel-mounted squashfs. Adopting this method would also require the most investment and carries the most risk: Charliecloud would need some mechanism, likely built in, to mount the squashfs as root. The kernel-mounted squashfs has little need for optimization; it works well with little or no effort on the user's behalf. Building such a mechanism would also make it easier to support other filesystems in the future if need be: squashfs is king today, but might not be in five years.
Don't support kernel-mounted squashfs
The low-level FUSE API through squashfuse performs very well without escalation to root, and the overhead of squashfuse_ll is lower than the overhead of untarring a whole image into memory. squashfuse_ll scaled very well in testing, up to 1024 nodes. Existing tools would make integrating squashfs into the current Charliecloud workflow rather easy, though it could mean adding those tools as dependencies for Charliecloud. With the default squashfs block size and Lustre's default 1 MiB stripe size, squashfuse_ll performs almost as well as the 'optimized' configuration, which is only narrowly faster; at scales beyond 1024 nodes those tunings may become more important. The high-level FUSE API saw a more notable improvement from tuning. It is worth noting that I did not try a squashfs block size smaller than the default of 128 KiB, which could provide further speed improvements. One thing that does provide clear gains as you scale up is the Lustre stripe pattern.
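For reference, the two tuning knobs involved can be set as below (the flags are standard `lfs setstripe` and `mksquashfs` options, but the values shown are just the defaults discussed above, and the paths are hypothetical; the commands are echoed rather than run, since they need a Lustre filesystem and squashfs-tools):

```shell
STRIPE_SIZE=1M    # Lustre default stripe size
BLOCK=128K        # mksquashfs default block size

# Set the stripe layout on the directory before writing the image;
# a stripe count of -1 stripes across all available OSTs.
echo "lfs setstripe -S $STRIPE_SIZE -c -1 /lustre/images"

# Choose the squashfs block size at image build time.
echo "mksquashfs ./img /lustre/images/img.sqfs -b $BLOCK"
```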
Overall, squashfs appears to be an improvement for container use with Charliecloud. Both squashfuse and the kernel-mounted squashfs reduce the total job time for executing containers. As scale increases, the time it takes to untar increases, but squashfuse_ll and the kernel-mounted squashfs show no noteworthy degradation of execution time.
The following graphs show the total execution times for the Pynamic benchmark, with the squashfs stored on Lustre using different combinations of squashfs block size, Lustre stripe pattern, and Lustre stripe size.
Note: I've temporarily taken the graphs down.
Questions
I would like input on this proposal and am happy to answer any questions as best I can.