Modify error behaviour or metrics for orphaned files when removing
Summary
When Geo::FileRegistryRemovalService on a Geo secondary is unable to find a (potentially orphaned) object, it emits an error and appears to finish the cleanup task. The error appears to be non-actionable but increases the Sidekiq error ratio and can trigger alerts or pages.
Steps to reproduce
We haven't reproduced this behaviour yet, but we are observing it with a Dedicated customer where a large number of these errors occur, for example 16503 within a 3-hour window. Our theory is that this stems from a pattern of rapidly 1) creating or copying a project, 2) running a series of tests, and 3) quickly deleting the project. This could leave a large number of non-replicated files, as Geo is able to synchronise the database but not the objects within that time window.
What is the current behaviour?
When Geo::FileRegistryRemovalService attempts to clean up a file, we see the following log messages and statuses:
1. INFO: Executing
2. INFO: Lease obtained
3. ERROR: error.message: "Could not build uploader"
4. ERROR: error.message: "Unable to unlink file because file path is unknown. A file may be orphaned."
5. INFO: Removing file registry
6. INFO: File & registry removed
(Note: steps 3 and 4 might be the same log event, displayed separately in our platform.)
Because steps 3 and 4 are logged as errors, this appears to cause a burst in Sidekiq errors and results in alerts/pages for a non-actionable event. The task appears to be considered complete, as it is not retried.
What would good look like?
Emit the events as a WARN; this (presumably) should not impact error ratios and should avoid breaching thresholds or triggering alerts for this event.
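To make the proposal concrete, here is a minimal sketch of the idea using only Ruby's standard Logger. It is not the actual Geo::FileRegistryRemovalService code: the class name OrphanTolerantRemoval, the orphan_severity parameter, and the FakeRegistry struct are all hypothetical and exist only for illustration. The sketch assumes that the orphaned-file case can be detected as "file path unknown" and that the registry row should still be removed afterwards, as the log sequence above suggests.

```ruby
require "logger"
require "stringio"

# Hypothetical stand-in for the removal flow described in this issue.
# This is NOT GitLab's implementation; names and structure are assumptions.
class OrphanTolerantRemoval
  ORPHAN_MESSAGE = "Unable to unlink file because file path is unknown. " \
                   "A file may be orphaned."

  # orphan_severity is a hypothetical knob: Logger::WARN is the proposed
  # behaviour, Logger::ERROR mirrors what we currently observe.
  def initialize(logger:, orphan_severity: Logger::WARN)
    @logger = logger
    @orphan_severity = orphan_severity
  end

  # registry: any object responding to #destroy (hypothetical).
  # file_path: nil when the path cannot be determined (possibly orphaned).
  def execute(registry, file_path)
    @logger.info("Executing")

    if file_path.nil?
      # Non-actionable case: log at the configured severity and carry on.
      @logger.add(@orphan_severity, ORPHAN_MESSAGE)
    elsif File.exist?(file_path)
      File.delete(file_path)
    end

    @logger.info("Removing file registry")
    registry.destroy
    @logger.info("File & registry removed")
    true
  end
end

# Hypothetical registry double used only for this example.
FakeRegistry = Struct.new(:destroyed) do
  def destroy
    self.destroyed = true
  end
end

# Usage: an orphaned file (unknown path) is logged as WARN, not ERROR.
log_output = StringIO.new
service = OrphanTolerantRemoval.new(logger: Logger.new(log_output))
service.execute(FakeRegistry.new(false), nil)
puts log_output.string
```

With this shape, alerting that counts only ERROR-level entries would no longer see the orphaned-file case, while the message itself stays available for investigation. Whether the Sidekiq error ratio is derived from log severity or from raised exceptions is something we have not confirmed, so this may only be part of the fix.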