I've been experimenting with mirroring from CE to EE and noticed that LFS objects were not mirrored between the two GitLab instances. This makes mirroring partially useless: if server A is gone or simply unreachable from some network, the repo can no longer be pulled in full from its mirror on server B. It would be nice if the mirroring repo synced LFS objects into its own storage. If this is unsupported by design, IMHO there should be quite a loud warning sign on top of the mirror saying that the repo contains content that depends on the availability of the original external LFS storage.
A similar process would be expected for mirroring with push behaviour, i.e. an EE instance that mirrors a repo to some other CE/EE-hosted repo should sync LFS objects to the remote storage on each push (see the rough sketch below).
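For illustration, here is a rough sketch of what that missing step amounts to once the regular git mirroring has run. This is not existing GitLab behaviour; the path and the "mirror" remote name are made up:

```sh
# Hypothetical sketch only: after the usual git fetch/push of the mirror,
# also transfer the LFS objects.
# "origin" is the mirrored source; "mirror" is an assumed push-mirror remote.
git -C /path/to/mirror.git lfs fetch --all origin   # pull mirroring: copy every LFS object locally
git -C /path/to/mirror.git lfs push --all mirror    # push mirroring: forward them to the remote
```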
I believe that this issue is somewhat related to gitlab-org/gitlab-ce#19177, but they are not quite the same.
@pcarranza @stanhu this is definitely something that Geo will address. We know we don't replicate LFS objects at the moment, nor any other assets saved on disk for that matter.
The mirror repository functionality is relatively simple at the moment: it mirrors everything git-related. With LFS, we are talking about replicating things saved on disk. This is a (very) hard problem, as we can see on #846 (closed), and I don't think we will add LFS objects to this mirroring feature in the short term. Perhaps we can revisit this once we're done with the Geo project and see if we can support the LFS use case for mirroring repositories.
In the short term, we can display a warning message in the mirror configuration page when LFS is on for this project.
A warning message on the repo's main page would be helpful indeed. It could be triggered on git pull whenever the pulled tree shows signs of LFS usage (a sketch of such a check follows). LFS may be introduced in a source repo only after a while, so this is not just a one-off task. If the warning is only shown on the preferences page, nobody will really see it.
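A minimal sketch of such a check, purely illustrative (the path is an assumption and this is not existing GitLab code), would be to look at the .gitattributes of the freshly updated mirror:

```sh
# Hypothetical check: warn when a bare mirror's default branch tracks files with LFS.
# Only inspects the top-level .gitattributes, which is enough for a first signal.
if git --git-dir=/path/to/mirror.git show HEAD:.gitattributes 2>/dev/null |
     grep -q 'filter=lfs'; then
  echo "WARNING: this mirror references LFS objects that are not copied" >&2
fi
```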
If it feels difficult to implement a temporary warning as a new UI element, the existing repo description may work for that:
Mirrored from https://*****:*****@gitlab.example.com/group/repo.git. Updated about 42 hours ago. WARNING: LFS objects are not copied.
@regisF to me, mirroring LFS objects in a repo is not the same as full mirroring of all the files managed by GitLab for Disaster Recovery or Geo instances. Moreover, Git LFS already manages this with its own protocol. The work to be done here is "merely" adding LFS support to the gitlab-shell task that handles repository mirroring.
The side benefit that I see is that it would make LFS objects in a Gitswarm repository sync to Perforce on the backend. As it is right now, I've started writing a daemon standing between Gitswarm and the Git Fusion servers to clone and then push. A bit stupid, if you ask me.
I looked at the code, and if I understand things correctly, some modifications are necessary in gitlab-shell to have a local copy of the LFS objects in the repo. Couldn't that be done in parallel to #846 (closed)?
When you do 'git lfs pull' you only download the LFS objects that your current Git working directory knows about. But you may have just pulled commits that reference other LFS objects.
This is why we are not considering, at this point, mirroring LFS objects with git.
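To make the difference concrete (the repository URL is a placeholder):

```sh
git clone https://gitlab.example.com/group/repo.git
cd repo

# Downloads only the LFS objects referenced by the currently checked-out tree:
git lfs pull

# Downloads the LFS objects referenced anywhere in the fetched refs' history,
# which is what a complete mirror would actually need:
git lfs fetch --all
```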
Thanks @jacobvosmaer-gitlab! You're right that on a bare repo things are totally different, because there are no checked-out files for the smudge filters to work on... The solution might be to use non-bare repos when there's LFS data, but that opens another can of worms.
The way we implemented LFS in GitLab, it is strictly speaking possible to extract a list of all LFS objects a project has access to, and one could then try to copy those files one by one to the other GitLab server. But this would not scale well (building that list requires scanning the entire, global table of LFS objects), and it would only work if the sending end is a GitLab server, because 'give me a list of LFS objects associated with this project' is not part of the LFS protocol. (And having said that, I am not sure it would be wise to add this to GitLab because of the massive table scan involved.)
Just for the record, our GitLab Geo product (still under development) will be able to replicate the global set of LFS objects from one GitLab server to another. But this is a different thing from mirroring a single Git+LFS repository from one GitLab (or GitHub or BitBucket) server to another. I may be wrong but it seems like the design of LFS is making it hard to do this in an efficient and comprehensive manner for a single repository. (Because 'comprehensive' forces you to scan all commits that are mirrored individually.)
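As a rough client-side illustration of why 'comprehensive' is expensive: without a server-side index, enumerating the LFS objects of a single repository means inspecting every blob reachable from every ref, along these lines (a sketch, not production code):

```sh
# List the distinct LFS object IDs referenced anywhere in the repo's history by
# scanning every reachable blob and keeping the ones that look like LFS pointer
# files. Touching every object is the expensive part.
git rev-list --objects --all |
  cut -d' ' -f1 |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize)' |
  awk '$1 == "blob" && $3 < 200 { print $2 }' |   # LFS pointers are ~130-byte text files
  git cat-file --batch |
  grep -a 'oid sha256:' |
  sort -u
```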
How difficult would it be to add a table that stores a list of per-project LFS objects? The initial full-table scan would have to be done once, then operations would be much faster, right? If that makes any difference, we will be Gitswarm EE customers in a few weeks.
In any case, for the few projects that require it, according to the discussion on GH I should be fine with a daemon pushing & pulling, as long as it does it on non-bare repos, right?
```sh
#!/bin/sh
so_repo=$1
si_repo=$2
reponame=$(basename "$so_repo")

git clone --mirror "$so_repo"
cd "$reponame"
git lfs fetch --all
git remote add sink "$si_repo"

echo "# Non-LFS objects to push to $si_repo:"
git push --dry-run sink '*:*'
echo "# Tags to push to $si_repo:"
git push --dry-run sink '*:*' --tags

echo "# Pushing non-LFS objects to $si_repo..."
git push sink '*:*'
git push sink '*:*' --tags

echo "# LFS objects to push to $si_repo:"
git lfs push --dry-run --all sink
echo "# Pushing LFS objects to $si_repo..."
git lfs push --all sink
```
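Invocation looks something like this (the script name and both URLs are placeholders; the sink repository must already exist and accept pushes):

```sh
./mirror-lfs.sh \
  https://gitlab-a.example.com/group/repo.git \
  https://gitlab-b.example.com/group/repo.git
```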
The script works fine in my admittedly limited testing. I see two approaches here:
1. Per my comment just above, why not store the list of LFS objects that pertain to a project? This way you can avoid the table scan: it only happens once for the initial migration, and new LFS objects can then be added to the index at upload time, right?
2. Create a daemon that's connected to GitLab using per-project webhooks, pulling & pushing whenever there's an update.
The first approach would make it more integrated, but I understand that it might be too complicated.
I don't think implementing the second approach would be complicated, but the problem is that it requires an external daemon and increases storage usage, since the mirrored data is duplicated.