Job artifacts registry rows are not being removed when artifacts are migrated to object storage

Summary

On GitLab.com, we currently have Geo, object storage, and "Background upload of artifacts to object storage" enabled all at once. This causes us to receive a steady stream of new artifacts to local storage on the primary. These are replicated to the secondary via Geo, which makes a copy on disk and adds a Geo::JobArtifactRegistry entry tracking it.

On the primary, the background uploader puts the artifact into object storage, then removes it from disk.

On the secondary, there should be a cronjob running that notices the file has been moved to object storage, and removes both the copy on disk and the tracking entry.
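
To make the intended secondary-side behaviour concrete, here is a minimal sketch of one cleanup pass. It is plain Ruby with hypothetical stand-ins (`Artifact`, `RegistryEntry`, `cleanup_migrated`, and a `Set` standing in for the disk); none of these are the real GitLab models or worker, and the actual cronjob may well be structured differently.

```ruby
require 'set'

# Hypothetical stand-ins for illustration only; not the real GitLab classes.
Artifact = Struct.new(:id, :store)        # store is :local or :remote (object storage)
RegistryEntry = Struct.new(:artifact_id)  # stands in for a Geo::JobArtifactRegistry row

# One cleanup pass on the secondary: for every artifact the primary has moved
# to object storage, remove the on-disk copy and drop the tracking entry.
# Returns the registry entries that should remain.
def cleanup_migrated(artifacts, registry, disk)
  migrated_ids = artifacts.select { |a| a.store == :remote }.map(&:id).to_set

  stale, kept = registry.partition { |entry| migrated_ids.include?(entry.artifact_id) }
  stale.each { |entry| disk.delete(entry.artifact_id) }

  kept
end
```

If a pass like this runs correctly and often enough, the number of remaining entries should stay close to the number of artifacts still stored locally.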

What is the current bug behavior?

As noticed in gitlab-com/migration#313 (comment 73069471), we saw half a million rows in the tracking table following a few weeks of operation. This seems much higher than I'd expect if the background worker were operating correctly.

What is the expected correct behavior?

The number of tracking entries should track the current number of local artifacts.

Possible fixes

Two possibilities. Either the cronjob isn't working at all, or the cronjob is working, but is running too infrequently.

There were two possible approaches to this problem: an asynchronous background job on the secondary, or a Geo event log entry emitted on the primary at migration time (following the model of hashed storage migrations). The event log option can be much more responsive than the async cronjob option, so it might be worth reconsidering if it's the latter possibility.

Or we could just increase the frequency at which the cronjob runs.
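
For comparison, the event log option might look roughly like the sketch below. Everything here is an assumption for illustration: `MigratedEvent`, `EventLogConsumer`, and the cursor handling are hypothetical names, not the existing Geo event log code. The point is only that the secondary reacts to each migration as its event arrives, rather than waiting for the next cron run.

```ruby
require 'set'

# Hypothetical event the primary would emit when an artifact is migrated
# to object storage. Not an existing Geo event type.
MigratedEvent = Struct.new(:id, :artifact_id)

# Hypothetical secondary-side consumer. It processes events newer than its
# cursor, removing the on-disk copy and the tracking entry for each one.
class EventLogConsumer
  attr_reader :cursor

  def initialize(registry, disk)
    @registry = registry # Hash of artifact_id => tracking entry
    @disk = disk         # Set of artifact ids with a copy on disk
    @cursor = 0          # id of the last event processed
  end

  def process(events)
    events.select { |e| e.id > @cursor }.sort_by(&:id).each do |event|
      @disk.delete(event.artifact_id)
      @registry.delete(event.artifact_id)
      @cursor = event.id
    end
  end
end
```

Because the cursor only moves forward, replaying the same events is harmless, which matters if the consumer is restarted partway through a batch.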

/cc @toon @ash.mckenzie

Edited May 16, 2018 by Nick Thomas