Add support for Git LFS data upload
Problem to solve
As a user, I want to upload data or binaries to data repositories (ML Projects) using Git LFS instead of plain Git.
We need to add an option in the settings tab that lets users activate Git LFS for their repositories. The pipelines also need to change: if LFS is opted in, they must include the corresponding Git commands for pushing and retrieving branches.
When a repository contains large files and/or many binaries, it is advisable to use Git LFS. For files or file types marked as LFS files, Git stores small pointer files in the repository instead of the actual content. When a Git LFS file is pulled to your local repository, it is sent through a filter (the smudge filter) that replaces the pointer with the actual file. The actual files are located on the remote server, and the ones you have pulled are kept in a cache inside your local repository. This keeps the local repository small, while the remote repository contains all the actual files and every version of them.
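To make the pointer mechanism concrete, here is a minimal Python sketch of what a Git LFS pointer file contains: a spec version line, the SHA-256 object ID of the real content, and its size in bytes. The helper names `build_lfs_pointer` and `parse_lfs_pointer` are hypothetical and only for illustration; in practice Git LFS writes and reads these pointers itself.

```python
import hashlib

# Version line used by Git LFS pointer files.
LFS_SPEC_URL = "https://git-lfs.github.com/spec/v1"


def build_lfs_pointer(content: bytes) -> str:
    """Return the pointer text that would stand in for `content` (illustrative helper)."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        f"version {LFS_SPEC_URL}\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a pointer file into a dict (illustrative helper)."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


if __name__ == "__main__":
    # The pointer stays tiny no matter how large the real file is.
    print(build_lfs_pointer(b"example binary payload"))
```

The pointer is what gets committed to the repository; the content identified by `oid` is what lives on the LFS server and in the local cache.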
Link to XD for settings will follow:
Link to XD for empty data repository will follow:
Proposal for Technical Solution
Description of the changes needed for the pipelines:
- Implement Git LFS methods (merge, push, commit, checkout) in the backend, very similar to what was done to support DVC.
- Change the pipeline workflow to add the Git LFS commands for saving data. For example, data operations will track new files with Git LFS instead of plain Git.
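As a rough sketch of the pipeline change, the command sequence could be assembled conditionally on the LFS opt-in. The function names (`build_save_commands`, `build_fetch_commands`, `run_commands`), the `use_lfs` flag, and the remote name `origin` are assumptions for illustration, not the existing backend API; the `git` and `git lfs` subcommands themselves are standard.

```python
import subprocess
from typing import List


def build_save_commands(paths: List[str], message: str, branch: str,
                        use_lfs: bool) -> List[List[str]]:
    """Commands for committing and pushing new data files (hypothetical helper)."""
    commands: List[List[str]] = []
    if use_lfs:
        # Install the LFS hooks for this repository only.
        commands.append(["git", "lfs", "install", "--local"])
        # Track each new file with Git LFS; this updates .gitattributes.
        for path in paths:
            commands.append(["git", "lfs", "track", path])
        commands.append(["git", "add", ".gitattributes"])
    commands.append(["git", "add", "--", *paths])
    commands.append(["git", "commit", "-m", message])
    # With the hooks installed, pushing also uploads the LFS objects.
    commands.append(["git", "push", "origin", branch])
    return commands


def build_fetch_commands(branch: str, use_lfs: bool) -> List[List[str]]:
    """Commands for retrieving a branch (hypothetical helper)."""
    commands = [["git", "fetch", "origin", branch], ["git", "checkout", branch]]
    if use_lfs:
        # Download the LFS objects for the checked-out ref and replace pointers.
        commands.append(["git", "lfs", "pull", "origin"])
    return commands


def run_commands(commands: List[List[str]], repo_dir: str) -> None:
    """Run each command in the repository, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=repo_dir, check=True)
```

Keeping command construction separate from execution means the LFS and non-LFS paths can be compared and tested without touching a real repository.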
Description of the changes in the frontend:
- Add a folder selector to the ML Project repository view for choosing which files to add to Git LFS.
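One possible way for the backend to act on the folder selection is to translate each selected folder into a `.gitattributes` entry, which is the same kind of line that `git lfs track` writes. The sketch below assumes the frontend sends a list of repository-relative folder paths; the function name `lfs_attribute_lines` is hypothetical, and folder names containing spaces are not handled.

```python
from typing import List


def lfs_attribute_lines(folders: List[str]) -> List[str]:
    """Map selected folders to .gitattributes lines (hypothetical helper).

    Assumes repository-relative folder paths without spaces.
    """
    lines = []
    for folder in folders:
        # Match every file below the folder, at any depth.
        pattern = folder.strip("/") + "/**"
        lines.append(f"{pattern} filter=lfs diff=lfs merge=lfs -text")
    return lines


if __name__ == "__main__":
    for line in lfs_attribute_lines(["data/raw/", "models"]):
        print(line)
```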