Add a data registry for MLOps and arbitrary blob storage
Proposal
I use DVC to manage machine learning projects, including dataset management, workflow orchestration, and experiment management. DVC's data management is in essence similar to Git LFS (for storage) and Git submodules (for cross-repository referencing and importing), but it is tightly integrated into DVC's workflow orchestration and experiment management, which leads to a smooth UX across the ML lifecycle without* additional infrastructure.
*Many DVC users rely on S3-compatible remote storage, which requires either a service account or complex identity federation to authenticate across system boundaries (e.g. between GitLab CI and the DVC remote storage). But DVC also supports a generic HTTP backend, which can be configured to leverage (or rather "abuse") GitLab's generic packages repository:
The DVC cache structure is created locally in the .dvc/cache directory and replicated in the DVC remote when data is pushed there. With this in mind, the cache structure is mapped to the generic packages repository API URL path template
/projects/:id/packages/generic/:package_name/:package_version/:file_name
as follows:
- :package_name: dvc (or any other arbitrary name)
- :package_version: The first 2 characters of a file's content hash
- :file_name: The remaining 30 characters of a file's content hash
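The mapping can be sketched in a few lines of Python. This is only an illustration of the scheme described above, not code taken from DVC or GitLab; the API root `https://gitlab.com/api/v4`, the function name `generic_package_url`, and the project ID used in the example are assumptions made for the sketch.

```python
from urllib.parse import quote

# Assumption: the gitlab.com API root; a self-managed instance would differ.
GITLAB_API = "https://gitlab.com/api/v4"


def generic_package_url(project_id: int, content_hash: str,
                        package_name: str = "dvc") -> str:
    """Map a 32-character content hash to a generic packages URL (hypothetical helper)."""
    if len(content_hash) != 32:
        raise ValueError("expected a 32-character content hash")
    package_version = content_hash[:2]  # first 2 characters of the hash
    file_name = content_hash[2:]        # remaining 30 characters of the hash
    return (
        f"{GITLAB_API}/projects/{project_id}/packages/generic/"
        f"{quote(package_name, safe='')}/{package_version}/{file_name}"
    )


# Example with a made-up project ID and the MD5 hash of an empty file.
print(generic_package_url(42, "d41d8cd98f00b204e9800998ecf8427e"))
```

With these inputs, the hash is split into the package version `d4` and the file name `1d8cd98f00b204e9800998ecf8427e`, mirroring the two-level layout of the local DVC cache.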
This way, source code on GitLab (including *.dvc metadata files) and the actual data can be colocated in the same GitLab project. This is beneficial for user and access management (via GitLab project membership management, deploy tokens, group/project access tokens, or $CI_JOB_TOKEN during CI) because it avoids authentication across system boundaries.
However, there are a few problems with this approach:
- It's an abuse of the generic packages repository. I can live with that, but there are some real downsides.
- The size limit of the generic packages repository is designed for packages. For instance, gitlab.com's per-file size limit is 5GB, which is quite strict for data. Although large files should be avoided with DVC to facilitate data deduplication, this limit may be prohibitive.
- The rate limit of the generic packages repository is quite strict for uploading or downloading a large number of files in parallel within a short time. For instance, gitlab.com's rate limit is 2000 requests per minute for authenticated API requests and 1000 requests per minute for unauthenticated API requests. With a large number of small files, e.g. a typical image classification dataset structure, this rate limit is prohibitive because the effective bandwidth is bounded by the rate limit.
- The per-project size limit is prohibitive for real-world projects that include machine learning datasets (including their revisions). For instance, gitlab.com's project size limit is 10GB. Resolving this problem might not (only) be a technical topic but (probably predominantly) a business topic, e.g. gitlab.com could have a different quota for data storage and charge for it independently instead of enforcing a hard limit.
- GitLab's generic packages repository is polluted with DVC's internal cache structure (a content-addressable structure). When browsing the generic packages repository via the GUI, mostly unreadable packages/files are shown and actual packages are difficult to find.
  This problem might be addressed by extending the filter capabilities in the GUI.
- DVC's run-cache cannot be pushed to the GitLab-based DVC remote because its cache layout has an additional nesting level compared to DVC's main cache, whereas GitLab's generic packages repository allows only a fixed number of nesting levels (:package_name/:package_version/:file_name).
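To make the rate-limit point above concrete, here is a rough back-of-the-envelope sketch. It assumes that every file costs exactly one API request and that the average file size is 100 kB; both are assumptions for illustration, and any additional requests a client issues (e.g. existence checks) would lower the bound further. The function names are hypothetical.

```python
def max_throughput_mb_per_s(requests_per_minute: int, avg_file_size_kb: float) -> float:
    """Upper bound on throughput if each file costs exactly one request (assumption)."""
    files_per_second = requests_per_minute / 60
    return files_per_second * avg_file_size_kb / 1000  # kB/s -> MB/s


def min_transfer_minutes(num_files: int, requests_per_minute: int) -> float:
    """Lower bound on transfer time, again assuming one request per file."""
    return num_files / requests_per_minute


# gitlab.com's authenticated limit from the list above: 2000 requests per minute.
# The 100 kB average file size and the 1,000,000-file dataset are made-up examples.
print(f"{max_throughput_mb_per_s(2000, 100):.2f} MB/s at most")
print(f"{min_transfer_minutes(1_000_000, 2000):.0f} minutes at least")
```

Under these assumptions, throughput is capped at roughly 3.3 MB/s regardless of the available network bandwidth, and a dataset of one million small files needs at least 500 minutes to push or pull.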
I suggest adding a new registry: "Data Registry". It would be similar to the generic packages repository but with arbitrary nesting levels (i.e. an arbitrary number of virtual subdirectories). As mentioned above, different (less strict) file size limits should apply, and the storage consumption should have a separate quota which doesn't contribute to the project size limit (10GB on gitlab.com). A new menu item would be added to the Deploy menu.
Since the "Data Registry" would be technically similar to the generic packages repository, it should be relatively straightforward to add; perhaps the implementation of the generic packages repository could be reused.
By adding a "Data Registry", GitLab would take another step towards MLOps and support a highly integrated and Git-driven ML development workflow using DVC.
What do you think?

