I've been experimenting with mirroring from CE to EE and noticed that LFS objects were not mirrored between the two GitLab instances. This makes mirroring partially useless: if server A is gone or simply unreachable from some network, the repo can no longer be pulled in full from its mirror on server B. It would be nice if the mirroring repo synced LFS objects into its own storage. If this is unsupported by design, IMHO there should be quite a loud warning sign on top of the mirror saying that the repo contains content that depends on the availability of the original external LFS storage.
A similar process would be expected for mirroring with push behaviour, i.e. an EE instance that mirrors a repo to some other CE/EE-hosted repo should sync LFS objects to the remote storage on each push (see the rough sketch below).
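For illustration, here is a rough sketch of what that missing step amounts to once the regular git mirroring has run. This is not existing GitLab behaviour; the path and the "mirror" remote name are made up:

```sh
# Hypothetical sketch only: after the usual git fetch/push of the mirror,
# also transfer the LFS objects.
# "origin" is the mirrored source; "mirror" is an assumed push-mirror remote.
git -C /path/to/mirror.git lfs fetch --all origin   # pull mirroring: copy every LFS object locally
git -C /path/to/mirror.git lfs push --all mirror    # push mirroring: forward them to the remote
```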
I believe that this issue is somewhat related to gitlab-org/gitlab-ce#19177, but they are not quite the same.
@pcarranza @stanhu this is definitely something that Geo will address. We know we don't replicate LFS objects at the moment, nor any other assets saved on disk for that matter.
The mirror repository functionality is relatively simple at the moment: it mirrors everything git-related. With LFS, we are talking about replicating things saved on disk. This is a (very) hard problem, as we can see on #846 (closed), and I don't think we will add LFS objects to this mirroring feature in the short term. Perhaps we can revisit this once we're done with the Geo project and see if we can support the LFS use case for mirroring repositories.
In the short term, we can display a warning message in the mirror configuration page when LFS is on for this project.
A warning message on the repo's main page would be helpful indeed. It could be triggered on git pull whenever the pulled tree shows signs of LFS usage (a sketch of such a check follows). LFS may be introduced in a source repo only after a while, so this is not just a one-off task. If the warning is only shown on the preferences page, nobody will really see it.
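A minimal sketch of such a check, purely illustrative (the path is an assumption and this is not existing GitLab code), would be to look at the .gitattributes of the freshly updated mirror:

```sh
# Hypothetical check: warn when a bare mirror's default branch tracks files with LFS.
# Only inspects the top-level .gitattributes, which is enough for a first signal.
if git --git-dir=/path/to/mirror.git show HEAD:.gitattributes 2>/dev/null |
     grep -q 'filter=lfs'; then
  echo "WARNING: this mirror references LFS objects that are not copied" >&2
fi
```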
If it feels difficult to implement a temporary warning as a new UI element, the existing repo description may work for that:
Mirrored from https://*****:*****@gitlab.example.com/group/repo.git. Updated about 42 hours ago. WARNING: LFS objects are not copied.
@regisF to me, mirroring LFS objects in a repo is not the same as full mirroring of all the files managed by GitLab for Disaster Recovery or Geo instances. Moreover, Git LFS already manages this with its own protocol. The work to be done here is "merely" adding LFS support to the gitlab-shell task that handles repository mirroring.
The side benefit that I see is that it would make LFS objects in a Gitswarm repository sync to Perforce on the backend. As it is right now, I've started writing a daemon standing between Gitswarm and the Git Fusion servers to clone and then push. A bit stupid, if you ask me.
I looked at the code, and if I understand things correctly, some modifications are necessary in gitlab-shell to have a local copy of the LFS objects in the repo. Couldn't that be done in parallel to #846 (closed)?
When you do 'git lfs pull' you only download the LFS objects that your current Git working directory knows about. But you may have just pulled commits that reference other LFS objects.
This is why we are not considering, at this point, mirroring LFS objects with git.
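To make the difference concrete (the repository URL is a placeholder):

```sh
git clone https://gitlab.example.com/group/repo.git
cd repo

# Downloads only the LFS objects referenced by the currently checked-out tree:
git lfs pull

# Downloads the LFS objects referenced anywhere in the fetched refs' history,
# which is what a complete mirror would actually need:
git lfs fetch --all
```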
Thanks @jacobvosmaer-gitlab! You're right that on a bare repo things are totally different, because there are no checked-out files for the smudge filters to work on... The solution might be to use non-bare repos when there's LFS data, but that opens another can of worms.
The way we implemented LFS in GitLab, it is strictly speaking possible to extract a list of all LFS objects a project has access to, and one could then try to copy those files one by one to the other GitLab server. But this would not scale well (building that list requires scanning the entire, global table of LFS objects), and it would only work if the sending end is a GitLab server, because 'give me a list of LFS objects associated with this project' is not part of the LFS protocol. (And having said that, I am not sure it would be wise to add this to GitLab because of the massive table scan involved.)
Just for the record, our GitLab Geo product (still under development) will be able to replicate the global set of LFS objects from one GitLab server to another. But this is a different thing from mirroring a single Git+LFS repository from one GitLab (or GitHub or BitBucket) server to another. I may be wrong but it seems like the design of LFS is making it hard to do this in an efficient and comprehensive manner for a single repository. (Because 'comprehensive' forces you to scan all commits that are mirrored individually.)
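As a rough client-side illustration of why 'comprehensive' is expensive: without a server-side index, enumerating the LFS objects of a single repository means inspecting every blob reachable from every ref, along these lines (a sketch, not production code):

```sh
# List the distinct LFS object IDs referenced anywhere in the repo's history by
# scanning every reachable blob and keeping the ones that look like LFS pointer
# files. Touching every object is the expensive part.
git rev-list --objects --all |
  cut -d' ' -f1 |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize)' |
  awk '$1 == "blob" && $3 < 200 { print $2 }' |   # LFS pointers are ~130-byte text files
  git cat-file --batch |
  grep -a 'oid sha256:' |
  sort -u
```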
How difficult would it be to add a table that stores a list of per-project LFS objects? The initial full-table scan would have to be done once, then operations would be much faster, right? If that makes any difference, we will be Gitswarm EE customers in a few weeks.
In any case, for the few projects that require it, according to the discussion on GH I should be fine with a daemon pushing & pulling, as long as it does it on non-bare repos, right?
```sh
#!/bin/sh
so_repo=$1
si_repo=$2
reponame=$(basename "$so_repo")

git clone --mirror "$so_repo"
cd "$reponame"
git lfs fetch --all
git remote add sink "$si_repo"

echo "# Non-LFS objects to push to $si_repo:"
git push --dry-run sink '*:*'
echo "# Tags to push to $si_repo:"
git push --dry-run sink '*:*' --tags

echo "# Pushing non-LFS objects to $si_repo..."
git push sink '*:*'
git push sink '*:*' --tags

echo "# LFS objects to push to $si_repo:"
git lfs push --dry-run --all sink
echo "# Pushing LFS objects to $si_repo..."
git lfs push --all sink
```
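Invocation looks something like this (the script name and both URLs are placeholders; the sink repository must already exist and accept pushes):

```sh
./mirror-lfs.sh \
  https://gitlab-a.example.com/group/repo.git \
  https://gitlab-b.example.com/group/repo.git
```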
The script works fine in my admittedly limited testing. I see two approaches here:
1. Per my comment just above, why not store the list of LFS objects that pertain to a project? This way you can avoid the table scan: it only happens once for the initial migration, and new LFS objects can then be added to the index at upload time, right?
2. Create a daemon that's connected to GitLab using per-project webhooks, pulling & pushing whenever there's an update.
The first approach would make it more integrated, but I understand that it might be too complicated.
I don't think implementing the second approach would be complicated, but the problem is that it requires an external daemon and increases storage usage, since the mirrored data is duplicated.