Geo: Optimize replication of project repo keep around refs
Summary
Some common actions trigger repository replication even when the repository has not changed. For example, commenting on an issue calls Repository#keep_around, which writes a keep-around ref if one doesn't already exist for the current commit. We trigger a repository update event on every call of Repository#keep_around, whether or not a ref was written, which causes secondaries to replicate the repo.
This is valid behavior, but it adds a lot of unnecessary overhead, so there is an opportunity for performance optimization.
Steps to reproduce
- On the secondary site, tail geo.log
- Comment on an issue
- Observe logs like:
{"severity":"INFO","time":"2022-05-19T00:50:32.648Z","correlation_id":null,"pid":50126,"host":"127.0.0.1","class":"Gitlab::Geo::LogCursor::Daemon","message":"Repository update","project_id":6,"source":"repository","resync_repository":true,"resync_wiki":false,"scheduled_at":"2022-05-18T17:50:32.629-07:00","replicable_project":true,"job_id":"d7d2a28e31b6840ff61a69fe","event_id":2,"cursor_delay_s":0.51}
Proposal
At the very least, instead of enqueuing Geo::CreateRepositoryUpdatedEventWorker for every call of keep_around, we should enqueue it only if keep_around generated a write_ref call.
For example, if you comment 5 times in a row and the project repo hasn't been updated during that time, then Geo::CreateRepositoryUpdatedEventWorker should be enqueued once instead of 5 times.
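The proposed behavior can be sketched roughly as follows. This is a minimal, self-contained illustration, not GitLab's actual code: SketchRepository and FakeRepositoryUpdatedEventWorker are hypothetical stand-ins for Repository and Geo::CreateRepositoryUpdatedEventWorker, refs are held in an in-memory set instead of being written through Gitaly, and the refs/keep-around/ prefix is an assumption about the ref naming.

```ruby
require 'set'

# Hypothetical stand-in for Geo::CreateRepositoryUpdatedEventWorker.
# It only records which project IDs were enqueued.
class FakeRepositoryUpdatedEventWorker
  @jobs = []

  class << self
    attr_reader :jobs

    def perform_async(project_id)
      @jobs << project_id
    end
  end
end

# Hypothetical, simplified model of a project repository.
class SketchRepository
  def initialize(project_id)
    @project_id = project_id
    @refs = Set.new # in-memory substitute for refs stored in Git
  end

  # Returns true only when a new keep-around ref was actually written.
  def keep_around(sha)
    ref = "refs/keep-around/#{sha}" # assumed ref naming
    return false if @refs.include?(ref)

    write_ref(ref)
    # Enqueue the Geo event only because write_ref was called.
    FakeRepositoryUpdatedEventWorker.perform_async(@project_id)
    true
  end

  private

  def write_ref(ref)
    @refs.add(ref)
  end
end

# Five comments on the same commit should enqueue a single event.
repo = SketchRepository.new(6)
5.times { repo.keep_around('abc123') }
puts FakeRepositoryUpdatedEventWorker.jobs.length
```

In this sketch the enqueue sits directly next to write_ref, so a call that finds an existing ref returns early and never produces an event. Where the real check belongs (inside keep_around or in its callers) is left open here.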
Edited by Michael Kozono