Add logging to understand impact of SetRepositoryCache#fetch

Why

Gitlab::RepositorySetCache#fetch is not atomic (#506430) describes an issue with SetRepositoryCache#fetch

To better understand the impact, we want to add logging to give visibility to the connection between the fetch and any errors that we're currently more or less guessing are related.

If we log the correlation_id on cache hit, that might let us validate that whether it's related to the error in in question seen here: https://log.gprd.gitlab.net/app/r/s/PaEVx

The current theory we want to validate is if that that error is the result of a cache hit to stale cache (the whole caching issue we looked at).

I took one of those errors' correlation_id and searched logs for it here . If we add this logging and only ever see the error happening on cache hits, it would show up here right? I think it gets a closer to knowing this is really the issue. If we see it also happen on cache misses though, then we may be on the wrong track.

If all that is correct, then the result of adding this logging would be that we'd have more confidence that fixing the caching issue would actually resolve the problem.

The code area we're looking at is in RepositorySeCache#fetch here

    def fetch(key, &block)
      full_key = cache_key(key)

      smembers, exists = with do |redis|
        redis.multi do |multi|
          multi.smembers(full_key)
          multi.exists?(full_key) # rubocop:disable CodeReuse/ActiveRecord
        end
      end

      return smembers if exists

      write(key, yield)
    end

What

  • read up on our logging
  • add structured logging for correlation_id on cache miss in RepositorySeCache#fetch
  • put a feature flag around this

Related links

original issue

spike to resolve issue

error hypothesized to be related

correlation_id related to this error

RepositorySetCache

Edited by Hunter Stewart