ml_model package type does not support directories. However, models artifacts often have a file structure, and tooling used to generate the packages not always allow the package structure to be changed.
@10io By working on this I could make this feature package-type agnostic and make directory structure available for generic packages as well (#342854 (closed)). Any technical concerns on going for this approach?
A problem I can think of: since there's another table, we will need to query that when fetching PackageFiles, so there is a potential hit on performance. If we had something like #417983 (closed) in place, we could make a solution where only packages that actually want/need this feature use it.
@10io I will start working on this issue, my approach for now is creating a side table that holds the original name of the file, while PackageFile will hold a nickname to the file (similar to how Maven does, perhaps using the md5 as the file name or something on the lines). When fetching package files, if the package is ml_model, we would then replace . This would mean no changes on the frontend of how the files are displayed, only backend. Goal would be to allow files with file name Directory/My Subirectory/file.txt. WDYT? Any other recommended approach?
Similar but Maven does it with a constraint, it uses a single unique path. In other words, Maven clients will upload files to the same "path": My/Awesome/Package/1.2.3/<file_name>. Hence, the path is stored at the package level and not the package file level.
Here, things are slightly different as we can have different paths in the same package = this "path" information should be available at the package file level.
If I understand correctly, the package file file_name would stay the same but we would record the "path" on a side table? Is that accurate?
We could have issues with the package cleanup policies. Basically, that rule will use the file_name in the package files table to detect duplicates and remove those that are unwanted.
If my understanding above is correct, the backend could detect incorrect duplicates for files that are similarly named but have all different "paths".
Not sure what would be the alternative here. Storing the path directly in the file_name field, will not work, right?
If I understand correctly, the package file file_name would stay the same but we would record the "path" on a side table? Is that accurate?
In my mind the file_name in the package file table would be an md5 or something like that, while the side table would hold the entire path+file_name. Another alternative is to set the file_name as a enconded version of path+file_name, eg path/to/My File.py would be path%2Fto%2FMy%20File.py. This would avoid the needed of a secondary table, and in turn would allow every other package type to use the feature as well.
It doesn't change anything in the data structure, and should still be friendly to clean up policies since the name of the file is still unique. I need to properly fix the endpoint as to not force users to encode the file name themselves (http://localhost:3000/api/v4/projects/19/packages/ml_models/model_1/1.0.0/path/to/my_file.md should work)
What happens if you go to the package registry UI and try to download the file from the browser?
It downloads the urlified name of the package file. I don't think we have support for downloading folders do we? We could download with just the file_name and ignore the rest
Perhaps, this could work with a small update: split the file_name with / (after decoding) and use the last term as the file name for the download link in the UI.
I like the simplicity of having the entire path in the file_name. Although this is another argument for: "why, why carrierwave doesn't use two distinct columns?". One for the object storage key and one for the actual filename we want to use (yes, exactly as ActiveStorage does it)
Bonus: current cleanup background jobs will still work as expected.
Still, I think we should seek confirmation from Product that the UX is ok. In this part, I think we need @trizzi 's point of view.
@10io how complex would it be to allow everything to be downloaded as a folder (add files to a zipped folder), I assume we have nothing currently in place right?
The closest feature would those package formats that work with git tags (golang or composer for example). In those cases, their package manager CLIs could ask for an archive file. In those cases, we leverage existing helper functions in workhorse. If I'm not wrong, it's basically $ git archive.
What are the requirements that you have? Is it that users should be able to download the entire set of files below a generic package?
Is it that users should be able to download the entire set of files below a generic package?
Yep, more specifically for the ml_model packages. So instead of downloading at packages/ml_models/package_name/package_version/package_file I would just do one call for packages/ml_models/package_name/package_version and pull everything
@eduardobonet I would say it's not impossible to have a background job enqueued when a package file is added or removed. That job would create or update the package_name+package_version zip file.
Main concern is memory usage. We need to download the existing zip file (if any) to add a new entry and upload it (overwriting the old one). Again, it's not impossible but we need to be extra cautious. The other concern is could we run into large archive issues (imagine a package with thousands of files and each is 5GB). Perhaps, we need to have a guard here. We could read all package files sizes to sum them and if it's over a threshold, we don't generate the zip file.
@trizzi, yeah I would figure. I think downloading separate files is fine for now. What do you think of the approach in this MR? Any concerns product-wise?