Store a list of all Git references for a repository in the database, instead of Redis

Summary

Per discussion in https://gitlab.com/gitlab-org/gitlab-ce/issues/65323#note_211518183 and https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/32412

Currently, we store a cache of all branch names and tag names for a repository in Redis. I think we could introduce a repository_refs table to store this information instead.

Improvements

This is a large and frequently accessed piece of data: the contents of the Redis cache key for gitlab-ce, for instance, come to 100KiB.

Redis is our most expensive data store. We should only put things in it if it's the best place for them.

We're currently storing branch names and tag names as an ActiveSupport-serialized array. There is no way to apply sorting, pagination, or filtering to the data in this form, and we need all three for the branches API at least. Redis is not the best choice of data store for values that require these kinds of operations.
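To make the limitation concrete, here is a minimal sketch of what a filtered, sorted, paginated listing has to do with the current representation. Plain `Marshal` stands in for the ActiveSupport cache serializer (an assumption), and `cached_blob` and `page_of_branches` are hypothetical names used only for illustration.

```ruby
# Stand-in for the value stored under the Redis cache key: a serialized
# array of every branch name. Marshal is an assumption here; the real
# cache goes through ActiveSupport's serializer.
branch_names = %w[master feature/search fix/typo feature/api stable-12-1]
cached_blob = Marshal.dump(branch_names)

# Hypothetical helper: every call must deserialize the full list before it
# can filter, sort, or slice a single page out of it.
def page_of_branches(blob, search:, page:, per_page:)
  names = Marshal.load(blob) # whole list, every time
  names.select { |name| name.include?(search) }
       .sort
       .each_slice(per_page)
       .to_a
       .fetch(page - 1, [])
end

first_page = page_of_branches(cached_blob, search: "feature", page: 1, per_page: 1)
```

With a table, the same listing would presumably become a single WHERE / ORDER BY / LIMIT query that the database can answer without shipping the full list to the application.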

At the moment, checking whether a branch exists requires us to download the entire list of names. https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/32412 attempts to solve that by converting the data to a Redis set, which fixes the existence check but might be overly complicated.

I'd suggest a table like:

create_table :repository_refs do |t|
  t.references :project, null: false # assumed: refs are scoped per project
  t.boolean :exists # sometimes we need to delete a ref in the repo but keep it in the db
  t.integer :ref_type, limit: 2 # smallint; allows us to filter tags vs branches easily
  t.text :ref
  t.binary :sha # we may not need this, but it might be good to track it up front

  # no timestamps
end

We can refresh the database contents with the full list of refs proactively, on push. We can modify all existing code that queries refs to only ever look them up from the database.
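A rough sketch of what a refresh on push might compute, assuming rows are keyed by ref type and name. `RefRow`, `plan_refresh`, and the `BRANCH`/`TAG` values are hypothetical and exist only for this illustration; nothing here talks to a real database or to Gitaly.

```ruby
# Hypothetical in-memory row mirroring the proposed repository_refs columns.
RefRow = Struct.new(:ref_type, :ref, :sha, :exists, keyword_init: true)

# Computes the writes needed to bring stored rows in line with the refs the
# repository reports after a push. Both arguments are keyed by
# [ref_type, ref]; `current_refs` maps that key to a SHA.
def plan_refresh(stored_rows, current_refs)
  plan = { insert: [], update: [], mark_missing: [] }

  current_refs.each do |(ref_type, ref), sha|
    row = stored_rows[[ref_type, ref]]
    if row.nil?
      plan[:insert] << RefRow.new(ref_type: ref_type, ref: ref, sha: sha, exists: true)
    elsif row.sha != sha || !row.exists
      plan[:update] << RefRow.new(ref_type: ref_type, ref: ref, sha: sha, exists: true)
    end
  end

  # Refs that vanished from the repository are flagged, not deleted,
  # matching the intent of the `exists` column.
  stored_rows.each do |key, row|
    plan[:mark_missing] << key if row.exists && !current_refs.key?(key)
  end

  plan
end

BRANCH = 0 # assumed ref_type values
TAG = 1

stored = {
  [BRANCH, "master"]  => RefRow.new(ref_type: BRANCH, ref: "master", sha: "aaa", exists: true),
  [BRANCH, "old-fix"] => RefRow.new(ref_type: BRANCH, ref: "old-fix", sha: "bbb", exists: true)
}
current = {
  [BRANCH, "master"] => "ccc", # moved by the push
  [TAG, "v1.0"]      => "ddd"  # newly created
}

plan = plan_refresh(stored, current)
```

Computing a plan first keeps the number of writes proportional to what the push changed, which matters if the table ends up as large as the Risks section fears.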

The merge_requests.source_branch, merge_requests.target_branch and merge_trains.target_branch columns can be made foreign keys referencing this table.

I thought we could modify protected_branches and protected_tags to do so as well, but we'd still need to support wildcards somehow. We could introduce an M:N relationship for that if we felt it was a really good idea.
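One way the M:N idea could work is to materialize, for each wildcard rule, the refs it currently matches. The sketch below is only an illustration: `wildcard_to_regex` and `matching_pairs` are hypothetical helpers, the wildcard translation is an assumption and not a copy of the existing matcher, and the ids are made up.

```ruby
# Translate a wildcard ("*" matches any run of characters) into an anchored
# regular expression. Assumed semantics, for illustration only.
def wildcard_to_regex(pattern)
  escaped = pattern.split("*", -1).map { |part| Regexp.escape(part) }
  Regexp.new("\\A#{escaped.join('.*')}\\z")
end

# Hypothetical: build the rows of a join table linking each protection rule
# (rule_id => pattern) to every ref (ref_id => name) it currently matches.
def matching_pairs(rules, refs)
  rules.flat_map do |rule_id, pattern|
    regex = wildcard_to_regex(pattern)
    refs.select { |_ref_id, name| regex.match?(name) }
        .map { |ref_id, _name| [rule_id, ref_id] }
  end
end

rules = { 1 => "master", 2 => "*-stable" }
refs  = { 10 => "master", 11 => "12-1-stable", 12 => "12-2-stable", 13 => "feature/x" }

pairs = matching_pairs(rules, refs)
```

The join rows would have to be recomputed whenever a ref is created or a rule changes, which is part of why this may not be worth the complexity.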

Risks

Cache invalidation 🤷. The change needs to be zero-downtime-friendly, which, it turns out, is quite difficult.

This would be a large and frequently updated table. Perhaps it's just not a good idea to do it in Postgres? Maybe we need to shard it, or do other clever things to prevent it from getting too big? We have millions of projects, and each project can have thousands or millions of refs, so it could easily become the largest single table we have.

Involved components

  • app/finders/branches_finder.rb
  • app/models/repository.rb
  • lib/gitlab/git/repository.rb
  • lib/gitlab/repository_cache.rb
  • lib/gitlab/repository_cache_adapter.rb

Any thoughts @kerrizor @DouweM @toon @stanhu ?
