# Some thoughts about 'stage 2'
(Responding to https://gitlab.com/gitlab-org/git-access-daemon/blob/0ed34d2f29ec55a225bee472d3ac47abe51a67c8/design/README.md)
> Once we have availability taken care of we will need to start working on the performance side.
I don't understand how anything written before this point solves the availability problems we have had.
> When a Rugged::Repository object is created what happens is that all the refs are loaded in memory.
I am not sure if this is true:

```
irb(main):004:0> start = Time.now; puts Rugged::Repository.new('/path/to/gitlab-org/gitlab-ce.git'); puts Time.now - start
#<Rugged::Repository:0x007fdd4c973ea8>
0.001868102
=> nil
```
Creating the `Rugged::Repository` object is not that expensive, it seems.
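If refs really were loaded eagerly at construction time, the cost above would scale with the number of refs. A quick way to check (a sketch, assuming the rugged gem and a local bare repository; the path is hypothetical) is to time construction and ref enumeration separately:

```ruby
require 'benchmark'
require 'rugged'

# Hypothetical path; point this at any bare repository with many refs.
path = '/path/to/gitlab-org/gitlab-ce.git'

# Constructing the object should be cheap if refs are loaded lazily.
puts Benchmark.realtime { Rugged::Repository.new(path) }

# Enumerating the references is where the filesystem work actually happens.
repo = Rugged::Repository.new(path)
puts Benchmark.realtime { repo.references.count }
```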
> For performance reasons these refs could be packed all in one file or they could be spread through multiple files.
Note to the reader who is not familiar with Git internals: this is a description of how Git stores refs.
> A single file is not managed nicely by either NFS or CephFS and can create lock contention given enough concurrent access.
I am assuming this is about the `packed-refs` file. This file is not updated often. Prior to 8.14 we ran `git gc` too often in GitLab (once an hour or once every 10 pushes), and each `git gc` run updates the `packed-refs` file. Since 8.14 we only update the `packed-refs` file once every 200 pushes.
I think the bigger issue is that `packed-refs` is not authoritative: each entry in `packed-refs` needs to be checked against the corresponding loose ref file, and if the loose file is present it takes precedence. So if you have a `packed-refs` file with 50,000 entries, and you want to know the current state of each of those entries, you need to check for the existence of the 50,000 corresponding loose ref files and read the contents of each one that is present.
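To make the precedence rule concrete, here is a minimal sketch of resolving a single ref (this is not how libgit2 implements it; paths are hypothetical and symbolic refs are ignored):

```ruby
# A loose ref file, if present, overrides the packed-refs entry.
def resolve_ref(repo_path, ref_name)
  loose = File.join(repo_path, ref_name)
  return File.read(loose).strip if File.exist?(loose)

  packed = File.join(repo_path, 'packed-refs')
  return nil unless File.exist?(packed)

  File.foreach(packed) do |line|
    next if line.start_with?('#', '^') # skip header and peeled-tag lines
    sha, name = line.chomp.split(' ', 2)
    return sha if name == ref_name
  end
  nil
end
```

Answering "what is the current state of all refs?" means running the loose-file check once per packed entry, hence 50,000 filesystem lookups for a 50,000-entry `packed-refs` file.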
> So, I propose that we use this caching layer to load the refs into a memory hashmap.
Side note: I think libgit2 already does this internally.
> Serve the refs list from this cache, preventing calls for 'advertise refs' from git clients to hit the filesystem at all.
I took a brief look at the current output of `git upload-pack --stateless-rpc --advertise-refs .` in the gitlab-org/gitlab-ce.git directory on gitlab.com. I saw two things we can work on independently of having some sort of central access point:
- GitLab leaks references in the `refs/tmp` namespace: we can address this by adding a timestamp to each tmp ref name, and deleting old tmp references during garbage collection.
- `GET /info/refs` (AKA `git upload-pack --stateless-rpc --advertise-refs`) returns too many results: entries under `refs/tmp` and `refs/keep-around` do not need to be sent to the client. We could address this by writing our own implementation, as sketched below.
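To illustrate the second point, a minimal sketch of such a filter (the helper and the ref names are hypothetical):

```ruby
# Hypothetical filter for an advertised ref list: drop application-internal
# namespaces before the response is sent to the client.
HIDDEN_NAMESPACES = %w[refs/tmp/ refs/keep-around/].freeze

def visible_refs(refs)
  refs.reject do |name, _sha|
    HIDDEN_NAMESPACES.any? { |prefix| name.start_with?(prefix) }
  end
end

refs = {
  'refs/heads/master'      => '1e29ab...',
  'refs/tmp/1480000000-42' => 'b00f12...', # timestamped tmp ref, per the first point
  'refs/keep-around/ab12'  => 'ab12cd...',
}
puts visible_refs(refs).keys # prints only refs/heads/master
```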
> Serve branches, tags and last commits through an HTTP API that can be consumed by the workers.
I am skeptical about using HTTP for this; it has relatively high per-request overhead.
## Keeping Memory Down
This section goes into a lot of detail about implementing an LRU cache. Why not use a standard LRU cache solution such as memcached, or a separate Redis instance configured for LRU?
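For the Redis option, a dedicated instance becomes an LRU cache with two configuration directives (the memory limit below is only an example):

```
# redis.conf: evict least-recently-used keys once the memory limit is hit
maxmemory 2gb
maxmemory-policy allkeys-lru
```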
## Overall impression
I am not convinced we need to create a new 'git access daemon'; I think this is a premature design decision. It would be more effective and less work to:
- Take more ownership of how we retrieve refs, e.g. filter out application-internal refs when generating `GET /info/refs` responses.
- Add an optional LRU cache for immutable 'objects' (raw blobs, diffs) we retrieve from Git repositories.
- Improve caching within the Rails application: we have already decided to integrate gitlab_git into gitlab-ce/ee, and the Rails application code in GitLab has full access to Redis 2.8+. If we want to do better ref caching then I think all the pieces are there (see the sketch after this list).
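As a sketch of what that ref caching could look like with the pieces we already have (the class, key names and TTL are all hypothetical):

```ruby
require 'json'
require 'redis'

# Hypothetical ref cache for the Rails side: one key per repository,
# invalidated on push (e.g. from the post-receive handler), with a TTL
# as a safety net in case an invalidation is missed.
class RefCache
  TTL = 600 # seconds

  def initialize(redis = Redis.new)
    @redis = redis
  end

  def branch_names(repo_key)
    key = "refcache:#{repo_key}"
    cached = @redis.get(key)
    return JSON.parse(cached) if cached

    names = yield # compute from the repository, e.g. repo.branches.each_name(:local).to_a
    @redis.set(key, JSON.generate(names), ex: TTL)
    names
  end

  def invalidate(repo_key)
    @redis.del("refcache:#{repo_key}")
  end
end
```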
The only thing we don't have a good solution for in our current architecture is an LRU cache for large chunks of data. For this we should consider using a standard solution (a separate Redis instance configured for LRU, or memcached).