Purge large files and sensitive data from Git repository
GitLab uses `refs/keep-around` to prevent data from being garbage collected, so that commits pushed into a merge request are always accessible. This provides an audit trail and ensures that discussions on commits are never left dangling. The downside is that removing large files becomes almost impossible.
Proposal
BFG repo cleaner is the best way to remove large objects from a repository.
After large files have been removed from the repository using commands like `bfg -b 10M`, Project Owners will be able to:
- access an interface to upload a text file containing the objects to be removed
  - the text file should be in the format `<old object id> <new object id>\n`, where the old object will be removed (this is the structure of the `object-id-map.old-new.txt` file generated by `bfg`)

After the refs have been removed, GitLab should:
- run garbage collection to remove the data
- notify the Administrator of success or failure
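The upload format described above could be checked before any refs are touched. The following is only a minimal sketch: the function name `validate_object_map` is hypothetical, and it assumes SHA-1 object ids (40 lowercase hexadecimal characters), as in the example file of this proposal.

```shell
#!/bin/sh
# Hypothetical helper (not part of GitLab or BFG): check that every line of a
# map file has the form "<old object id> <new object id>".
# Assumption: ids are 40-character lowercase hex SHA-1 values.
# Prints each malformed line with its line number.
# Returns 0 if the file is well formed, 1 if not, 2 if it cannot be read.
validate_object_map() {
    [ -r "$1" ] || return 2
    # grep -v selects the lines that do NOT match; -n prefixes line numbers.
    if grep -Evn '^[0-9a-f]{40} [0-9a-f]{40}$' "$1"; then
        return 1    # at least one malformed line was printed
    fi
    return 0
}

# Usage (the path is a placeholder):
#   validate_object_map object-id-map.old-new.txt || echo "rejecting upload"
```

Rejecting a malformed file up front keeps a typo in the upload from being interpreted as an object id later on.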
Example `object-id-map.old-new.txt`:
01f544006314b5ffcbac4b48bcda2bdaf42a0e94 4be23effff51e6ddae7f43ad5fd92d9b71227f78
15da2c6688edd43deb406b6d92b22b5972c945f9 a9ad5e9134af6b334832e5c37a30225249a386b5
330dcd697c1eb11f1b19d5f5ac1362a8ed5a04ce 8b19d5be8e5db1d236917b23ad5bb93038c9d009
GitLab will then remove all the refs associated with the old object ids.
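To illustrate what removing those refs might look like at the Git level, here is a rough sketch using plain `git` commands. It is not GitLab's implementation: the function names are hypothetical, the sketch only looks at the internal namespaces mentioned in this issue (`refs/keep-around`, `refs/merge-requests`, `refs/tmp`), and it only matches refs whose direct target is an old object id listed in the map file.

```shell
#!/bin/sh
# Hypothetical helpers, run from inside the repository to be cleaned.
# $1 is a map file with "<old object id> <new object id>" per line.

# Print every internal ref whose target is one of the old object ids.
refs_pointing_at_old_ids() {
    git for-each-ref --format='%(objectname) %(refname)' \
        refs/keep-around refs/merge-requests refs/tmp |
        # First input (the map file) fills the lookup table of old ids;
        # second input (stdin, "-") is the ref listing to filter.
        awk 'NR == FNR { old[$1] = 1; next } ($1 in old) { print $2 }' "$1" -
}

# Delete those refs in a single update-ref transaction.
delete_refs_for_old_ids() {
    refs_pointing_at_old_ids "$1" |
        sed 's/^/delete /' |
        git update-ref --stdin
}
```

Once no ref reaches the old objects any more, a subsequent garbage collection run is able to drop them.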
Designs
Design mockups (images not preserved): Project settings → Repository, and the same page on file upload.
A notification email is sent to the project owner upon completion of the cleanup process.
Email (on success)
Repository cleanup completed on https://gitlab.com/gitlab-org/gitlab-ce.
Repository size before cleanup: XX
Repository size after cleanup: XX
Email (on failure)
Repository cleanup failed on https://gitlab.com/gitlab-org/gitlab-ce
Repository size before cleanup: XX
Repository size after cleanup: XX
--
// Error log (if any)
Links / references
Original report
Sadly, it happens that people push big files into Git repos. Usually I tell them to rewrite their Git history with tutorials such as http://stevelorek.com/how-to-shrink-a-git-repository.html. Afterwards they force-push to our GitLab server, and after a while garbage collection eventually reduces the repo size on the server as well.

As an admin of our group's GitLab server, I can SSH into the server, go to `/var/opt/gitlab/git-data/repositories/<user>/<repo>` and issue the following commands to speed this reduction up:
$ git reflog expire --expire=now --all
$ git gc --prune=now
$ git gc --aggressive --prune=now
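To see whether these commands actually shrink the repository, its on-disk size can be compared before and after. A small sketch, assuming a hypothetical helper name `repo_size_kib`; it sums the `size:` (loose objects) and `size-pack:` (packs) values, both in KiB, from `git count-objects -v`.

```shell
#!/bin/sh
# Hypothetical helper: print the size in KiB of loose objects plus packs
# for the repository at path $1.
repo_size_kib() {
    git -C "$1" count-objects -v |
        awk '$1 == "size:" || $1 == "size-pack:" { total += $2 }
             END { print total + 0 }'
}

# Usage (the path is a placeholder):
#   before=$(repo_size_kib /path/to/repo.git)
#   ... run the reflog expire / gc commands ...
#   after=$(repo_size_kib /path/to/repo.git)
#   echo "before: ${before} KiB, after: ${after} KiB"
```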
While doing this, however, I noticed that it does not reduce the size of a repository on the GitLab server if the repository had merge requests on children of the large-file commit.
Let's assume the repository including the huge file looks like this (you can see something similar in the above directory with `git log --graph --decorate --all`):
* commit Z (HEAD, master)
|
| some change Z
|
* commit M
|\ Merge: L P
| |
| | Merge branch 'Patch-1' into 'master'
| |
| | See merge request !1
| |
| * commit P (refs/merge-requests/1/head) <----- here is the problem
|/
|
| Some patch...
|
* commit L
some changes and a huge file "HF"
Now, after rewriting history and deleting `HF` from `L`, everything in the client's repo seems fine. After force-pushing, however, the server's Git repo state will be something like this:
* commit Z' (HEAD, master)
|
| some change Z
|
* commit M'
|\ Merge: L' P'
| |
| | Merge branch 'Patch-1' into 'master'
| |
| | See merge request !1
| |
| * commit P'
|/
| Some patch...
|
* commit L'
some changes, now without "HF"
* commit P (refs/merge-requests/1/head) <----- here is the problem
|
| Some patch...
|
* commit L
some changes and a huge file "HF"
As you can see, the merge requests' references are still there, which prevents `git gc` from ever forgetting `L`. Without SSH access to the server, clients can never see these references, and even as an admin I'm not sure: should I remove them or move them to the corresponding rewritten commits? Currently, my guess is that the issue / merge request UI uses them.
There are further references like `refs/tmp/<hash>/head` that I left out of the above examples for clarity. Same question here: delete or move? How?
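For diagnosis on the server, the refs that still keep an old commit such as `L` alive can be listed with `git for-each-ref --contains`. This is only a sketch: the function name `refs_containing` is hypothetical, and it answers which refs reach the commit, not whether they should be deleted or moved.

```shell
#!/bin/sh
# Hypothetical helper: list every ref in the repository at path $1 whose
# history still contains commit $2 (for example the large-file commit L).
refs_containing() {
    git -C "$1" for-each-ref --contains "$2" --format='%(refname)'
}

# Usage (path and commit id are placeholders):
#   refs_containing /path/to/repo.git <commit id of L>
```

In the example graph above, this would be expected to report `refs/merge-requests/1/head` for `L` after the force push, since `master` now only reaches the rewritten commits.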
Also, is there maybe some less manual approach that the users could trigger themselves?
MRs in progress / finished
- gitaly-proto!242 (merged)
- gitaly!990 (merged)
- gitlab-org/gitlab-ce!23189
- gitlab-org/gitlab-ee!8712