Alert User when Online Garbage Collection is not Working Properly (#12534) · Epics · GitLab.org

Alert User when Online Garbage Collection is not Working Properly

## Context

The online garbage collector uses agents running asynchronously from the main goroutine to perform garbage collection. Currently, these agents log failures, but the parent registry process never fails deliberately because of failures related to online garbage collection. On `GitLab.com`, we have monitoring and alerts set up so that the registry team can take action if there are long-standing or critical issues with online garbage collection.

## Problem

For self-managed installs, we may not be able to rely on admins' self-direction in setting up monitoring and configuring alerts. This could lead to a large backlog of queued items ready to be reviewed, or to an incorrect configuration being ignored from the start.

For large deployments with proper alerts and many registry instances, we may not wish to tie the health of the overall process to the health of online garbage collection, because serving API requests takes precedence. While online garbage collection is a crucial operation of the database-enabled registry, it is not strictly time critical: a reasonable backlog of reviews can be worked through without damage to the integrity of the process. Therefore, we should not compromise the availability and reliability of the API over online GC.

## Solutions

### Direct Admins to Monitoring Documentation in Migration Instructions

Monitoring and alerts are useful in general, not only for the specific problems we seek to address in this issue. We may not be able to count on 100% compliance, so while this is the approach that has served us well for `GitLab.com`, we must also consider approaches that do not require user compliance.

Additionally, we should be sure to document that there is an expected delay between the completion of a database import and online GC starting. In particular, we should mention that for multistep imports, step three will import blobs at a faster rate than online GC will (by default) remove them, so an upward trend in reviews is normal.
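To illustrate why an upward trend in reviews during step three is expected and not alarming, here is a minimal sketch of the backlog arithmetic. The function name and all rates are hypothetical, chosen only for illustration. The model is deliberately simplified: it ignores the `review_after` delay and assumes constant hourly rates.

```python
def backlog_over_time(import_rate, gc_rate, import_hours, total_hours, start=0):
    """Return the review backlog at the end of each hour.

    import_rate and gc_rate are hypothetical per-hour figures: reviews queued
    by the import and reviews processed by online GC, respectively.
    """
    backlog = start
    series = []
    for hour in range(total_hours):
        if hour < import_hours:
            # The import is still running and queues new reviews.
            backlog += import_rate
        # Online GC works through the queue; the backlog cannot go negative.
        backlog = max(0, backlog - gc_rate)
        series.append(backlog)
    return series


# Import outpaces GC for three hours, then GC drains the queue.
print(backlog_over_time(import_rate=500, gc_rate=200, import_hours=3, total_hours=8))
# → [300, 600, 900, 700, 500, 300, 100, 0]
```

Under these assumptions, the backlog climbs only while the import runs and then shrinks steadily, which is the pattern the documentation should describe as normal.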
### Investigate how Sidekiq Job failures are Reported to Self-Managed Users

Conceptually, Sidekiq jobs and the online GC workers are similar, even though the implementations are quite different. It's likely that the problem of how to report failing background jobs has already been solved elsewhere within the GitLab application. Ideally, we can implement a similar solution for this issue. This would take advantage of an alert that is already familiar to self-managed users, rather than a new alert delivered via a new method.

### Create a New Registry CLI Command to Query Online GC Stats

This solution would combine some of the health check queries that we run against the database to determine the status of online GC, such as:

```sql
select count(*) from gc_blob_review_queue where review_after < NOW();
select count(*) from gc_manifest_review_queue where review_after < NOW();
```

This would be a relatively light lift, and a natural fit, since the user is already interacting with the CLI to perform the migration (at least on omnibus). We should document the new command as a simple post-migration health check that can be performed without knowledge of the implementation details of the registry database. By default, the output should explain what each number represents (and whether it indicates a problem), and include a call to action to set up proper alerts for production environments. As a bonus, this is a useful quality-of-life improvement for developers as well.
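
As a rough sketch of what such a health check could do, the following Python example runs the two queries above and prints a short explanation of each count. It is not the registry's implementation. The function names, the `warn_above` threshold, and the report wording are all hypothetical, and an in-memory SQLite database (with a user-defined `NOW()` function) stands in for the registry's database so the example is self-contained.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Queries as given in the epic; only the trailing semicolons are dropped.
QUERIES = {
    "blobs": "select count(*) from gc_blob_review_queue where review_after < NOW()",
    "manifests": "select count(*) from gc_manifest_review_queue where review_after < NOW()",
}


def overdue_reviews(conn):
    """Return the number of overdue review tasks for each queue."""
    counts = {}
    cur = conn.cursor()
    for name, query in QUERIES.items():
        cur.execute(query)
        counts[name] = cur.fetchone()[0]
    cur.close()
    return counts


def render_report(counts, warn_above=1000):
    """Explain each count; warn_above is an arbitrary, hypothetical threshold."""
    lines = []
    for name, overdue in counts.items():
        status = "check online GC" if overdue > warn_above else "ok"
        lines.append(f"{name}: {overdue} reviews overdue ({status})")
    lines.append("Set up monitoring and alerts for production environments.")
    return "\n".join(lines)


def demo_connection():
    """Build an in-memory SQLite stand-in for the registry database."""
    conn = sqlite3.connect(":memory:")
    # SQLite has no NOW(); register one so the queries run unchanged.
    conn.create_function("NOW", 0, lambda: datetime.now(timezone.utc).isoformat())
    now = datetime.now(timezone.utc)
    past = (now - timedelta(hours=1)).isoformat()
    future = (now + timedelta(hours=1)).isoformat()
    conn.execute("create table gc_blob_review_queue (review_after text)")
    conn.execute("create table gc_manifest_review_queue (review_after text)")
    conn.executemany(
        "insert into gc_blob_review_queue values (?)", [(past,), (past,), (future,)]
    )
    conn.executemany("insert into gc_manifest_review_queue values (?)", [(future,)])
    return conn


if __name__ == "__main__":
    print(render_report(overdue_reviews(demo_connection())))
```

Keeping the query step separate from the reporting step means the explanatory text and thresholds can change without touching the database access.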
