Lacking appropriate logging information for ProjectUpdateRepositoryStorageWorker
Problem to solve
There is no visibility into the ongoing operation of ProjectUpdateRepositoryStorageWorker.
Target audience
- Sidney, Systems Administrator, https://design.gitlab.com/research/personas#persona-sidney
Further details
When migrating repository data from one storage location to another, there are no logs or metrics showing that the job was queued, picked up, or is actively being worked on. The best visibility available is to SSH into the storage server and hunt for Gitaly processes that were spawned for the move. As a result, we cannot see the progress of a move, cannot extract timestamps for when moves started and ended, and cannot tell how busy this worker is.
We also lack error details. While doing this work for GitLab.com, we've encountered numerous failures that leave repos in a few different states when all is said and done, but there are no useful errors in the logs or in Sentry. "Useful" is the key word here: the only error I get comes from Gitaly, as a gRPC timeout.
Proposal
- When someone kicks off a storage migration, let's log it.
- When Sidekiq pulls the job to perform the storage migration, let's log it.
- When we hit a failure of any sort, let's log it, specifically:
  - Did we fail to move the storage for X reason?
  - Did the storage migration even start successfully?
  - Why did the repo move, but not the wiki?
- When the migration completes, let's log it.
For all of the above, we should also expose Prometheus metrics so that appropriate monitoring and alerting can be set up.
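To make the proposal concrete, here is a minimal sketch of what the lifecycle logging could look like. It is not the worker's actual code: the `StorageMoveEventLog` class, its event names, the field names, and the storage names in the demo are all hypothetical, and the in-memory counters merely stand in for real Prometheus counters. It uses only the Ruby standard library so it can be run in isolation.

```ruby
require "json"
require "logger"
require "stringio"
require "time"

# Hypothetical helper (not part of GitLab): writes one JSON log line per
# lifecycle event and keeps in-memory counters that stand in for
# Prometheus counters.
class StorageMoveEventLog
  EVENTS = %i[requested started failed finished].freeze

  attr_reader :counters

  def initialize(io = $stdout)
    @logger = Logger.new(io)
    # Emit the message as-is so each line is a single JSON document.
    @logger.formatter = proc { |_severity, _time, _progname, msg| "#{msg}\n" }
    @counters = Hash.new(0)
  end

  # Logs a single event; extra keyword arguments are added to the payload.
  def record(event, project_id:, source:, destination:, **details)
    raise ArgumentError, "unknown event: #{event}" unless EVENTS.include?(event)

    @counters[event] += 1
    payload = {
      time: Time.now.utc.iso8601(3),
      worker: "ProjectUpdateRepositoryStorageWorker",
      event: event,
      project_id: project_id,
      source_storage: source,
      destination_storage: destination
    }.merge(details)
    severity = event == :failed ? Logger::ERROR : Logger::INFO
    @logger.add(severity, JSON.generate(payload))
    payload
  end

  # Wraps the move itself: logs the start, then either the finish (with a
  # duration) or the failure (with error details), and re-raises on failure.
  def track(project_id:, source:, destination:)
    ids = { project_id: project_id, source: source, destination: destination }
    record(:started, **ids)
    began = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - began
    record(:finished, **ids, duration_s: elapsed.round(3))
    result
  rescue StandardError => e
    record(:failed, **ids, error_class: e.class.name, error_message: e.message)
    raise
  end
end

if __FILE__ == $PROGRAM_NAME
  log = StorageMoveEventLog.new
  # Storage names below are made up for the demo.
  log.record(:requested, project_id: 42, source: "storage-a", destination: "storage-b")
  begin
    log.track(project_id: 42, source: "storage-a", destination: "storage-b") do
      raise "wiki move timed out" # simulated failure
    end
  rescue RuntimeError
    # The failure has already been logged by #track.
  end
end
```

With something along these lines, each move leaves a "requested", "started", and "finished" or "failed" line that can be searched by project, and the per-event counts give a rough view of how busy the worker is. A `component` detail (for example "repository" or "wiki") could be passed through `record` to distinguish which part of a move failed.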
What does success look like, and how can we measure that?
Success will be visible. Right now this worker leaves us flying blind: it is extremely difficult to determine the status of what it is doing.