High rate of config.lock file errors on Geo testbed

I believe we recently added a new Sidekiq worker, RepositoryRemoveRemoteWorker. On the Geo testbed, we now appear to have a high number of stale config.lock files:

https://sentry.gitlap.com/gitlab/geo1/issues/118898/

First of all, why are there this many remote removals on an instance that is basically idle? I think it's because we call Repository#fetch_remote without a remote name in https://gitlab.com/gitlab-org/gitlab-ee/blob/1efb8287d29b08086fe2719c6ef5b9b2e30dba8a/ee/app/services/geo/base_sync_service.rb#L85, which then falls through to https://gitlab.com/gitlab-org/gitlab-ee/blob/1efb8287d29b08086fe2719c6ef5b9b2e30dba8a/app/models/repository.rb#L1000.
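For context on why add/remove churn produces this error: git serializes edits to `.git/config` through a `config.lock` file — it creates the lock exclusively, writes the new contents, then renames the lock over the real file. If a process dies between the create and the rename, the lock is left behind and every subsequent config edit fails. A minimal Ruby sketch of that lockfile protocol (illustrative only, not GitLab or git source code):

```ruby
require "fileutils"
require "tmpdir"

# Simulate git's lockfile protocol for .git/config: take the lock by
# creating config.lock exclusively, write the new contents, then
# atomically rename the lock over the real file.
def update_config(dir, contents)
  config = File.join(dir, "config")
  lock   = File.join(dir, "config.lock")
  # File::EXCL makes open fail if the lock already exists -- the same
  # condition behind the "could not lock config file" errors in Sentry.
  File.open(lock, File::WRONLY | File::CREAT | File::EXCL) do |f|
    f.write(contents)
  end
  File.rename(lock, config) # commit: the lock becomes the new config
end

Dir.mktmpdir do |dir|
  update_config(dir, "[core]\n") # a normal edit succeeds
  # Simulate a crashed writer that never renamed its lock:
  FileUtils.touch(File.join(dir, "config.lock"))
  begin
    update_config(dir, "[remote]\n")
  rescue Errno::EEXIST
    puts "stale config.lock blocks all further edits"
  end
end
```

Every add_remote/remove_remote pair rewrites the config twice, so frequent sync cycles multiply the chances of a crash leaving a lock behind.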

It seems to me that for Geo we should never have to add or remove remotes at all, since there should be a fixed remote for the primary and secondary.
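If the remote is fixed, each sync only needs to set its URL idempotently — no add followed by remove, and therefore no config rewrite (and no config.lock) per sync. A sketch of the idea, where the remote name and the hash standing in for the repository's remote configuration are illustrative assumptions, not GitLab's actual implementation:

```ruby
# Hypothetical fixed remote name for Geo syncs; not an actual GitLab constant.
GEO_REMOTE_NAME = "geo".freeze

# Idempotently ensure the fixed remote points at the primary. `config`
# stands in for the repository's remote configuration; updating the one
# entry in place avoids the add/remove cycle that rewrites .git/config
# on every sync.
def ensure_geo_remote(config, url)
  config[GEO_REMOTE_NAME] = url
  config
end

remotes = {}
ensure_geo_remote(remotes, "https://primary.example.com/repo.git")
ensure_geo_remote(remotes, "https://primary.example.com/repo.git")
puts remotes.size # still one remote, however many syncs have run
```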

I consider this high priority to fix because I can foresee a lot of Geo replication failing as a result.

If we do hit this error, it would be nice to log it somewhere that makes it easy to find and clean up the stale files.

/cc: @tiagonbotelho, @DouweM

Edited by Stan Hu