Geo::EventWorker fails on delete event if the object doesn't exist
Overview
Geo requires the `s3:ListBucket` permission in order for a HEAD request to reliably check whether a particular file exists in an S3 bucket. Geo does this check in order to:
- Determine if deletion will work as expected
- Determine if it can skip downloading a file on first-time setup
- Determine if it is able to checksum the file
Problem
Without the ListBucket permission, AWS responds 200 when the file exists, but responds 403 instead of the expected 404 when it doesn't. The check code treats the 403 as an error and raises, which is too severe a reaction for the way we use these HEAD requests.
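To illustrate the semantics (a hypothetical model for this issue, not GitLab or AWS code): HeadObject has no response body, so S3 signals "not found" via the status code, but only when the caller is allowed to list the bucket:

```ruby
# Hypothetical model of S3 HeadObject status codes (illustration only).
def head_object_status(object_exists:, can_list_bucket:)
  return 200 if object_exists

  # Without s3:ListBucket, S3 refuses to reveal that the key is absent,
  # so the caller sees 403 Forbidden instead of 404 Not Found.
  can_list_bucket ? 404 : 403
end

head_object_status(object_exists: false, can_list_bucket: true)  # => 404
head_object_status(object_exists: false, can_list_bucket: false) # => 403
```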
Example error during a file deletion job
When setting up Geo replication on a new secondary site, from an existing primary site with non-trivial data and history, the Geo::EventWorker worker can experience a lot of failures when it receives a "delete" operation for an object (like an MR diff or a Job Artifact) that has expired on the primary but has not yet been replicated to the secondary's object storage.
This comes from the attempt to do a HeadObject in `ee/app/services/geo/file_registry_removal_service.rb:155:in `object_file'`, where if the object doesn't exist it gets a 404 (or, on AWS and lacking the ListBucket permission, a surprising 403).
The error is:
```
Expected(200) <=> Actual(403 Forbidden)
excon.error.response
:body => ""
:cookies => [
]
:headers => {
"Content-Type" => "application/xml"
"Date" => "Thu, 11 Jan 2024 21:59:05 GMT"
"Server" => "AmazonS3"
"x-amz-id-2" => "<REDACTED>"
"x-amz-request-id" => "<REDACTED>"
}
:host => "<BUCKET_ENDPOINT>"
:local_address => "SOMETHING"
:local_port => 47480
:method => "HEAD"
:omit_default_port => false
:path => "/merge_request_diffs/mr-XXXXXX/diff-YYYYYY"
:port => 443
:query => nil
:reason_phrase => "Forbidden"
:remote_ip => "SOMETHING"
:scheme => "https"
:status => 403
:status_line => "HTTP/1.1 403 Forbidden\r\n"
```
AFAICT this error comes from the fog gem, which uses excon, expects a 200, and is not getting one.
It would be nice if the delete service could either check first and skip the HEAD if the object doesn't exist, or detect and swallow this particular error when it occurs. The failures cause some alarm when first seen, and if monitoring and alerting are configured, a large percentage of Sidekiq jobs can fail during backfill, forcing a choice between alert silences (risky) and alert fatigue.
Observed in practice in Dedicated; see internal issue
Proposal
Swallow the error. Treat the file as "does not exist". But track and log the error.
Before:
- In a secondary site: Skip backfill sync when data already exists: Currently, the sync fails, which, after correcting the permissions, ends up retrying later and overwriting the file.
- In a secondary site: Replicate a blob deletion: Currently, the EventWorker job errors and retries 3 times, and then the deletion is dropped. The registry row eventually gets deleted and orphans the file.
- In a secondary site: Avoid attempting to checksum something that doesn't exist: Currently, you get a different checksumming error
- In a primary site: Geo API retrieve endpoint: Currently, the primary site would 500, I think
After:
- In a secondary site: Skip backfill sync when data already exists: Sync probably overwrites the existing file.
- In a secondary site: Replicate a blob deletion: The file gets orphaned on the secondary site.
- In a secondary site: Avoid attempting to checksum something that doesn't exist: Checksumming error `File is not checksummable`
- In a primary site: Geo API retrieve endpoint: The secondary site marks the sync as failed due to the file missing on the primary, and retries later.
Summary:
Overall this seems like the same or better behavior. The blob deletion case (item 2) is the biggest problem, both before and after. Either way, we'd have to redesign deletion somewhat to avoid orphaning the file.
For example, in the case of the delete job error:
The code in question is trying to delete the file, so if it already doesn't exist that's success not failure and it shouldn't result in a failed Sidekiq job with all the associated potential impact on metrics/alerting.
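A sketch of that idea, with hypothetical names (`FakeClient`, `ObjectNotFound`) standing in for the real storage client and its error class:

```ruby
# Hypothetical storage error; stands in for the 404/403 seen on a
# HEAD or DELETE of a missing object.
class ObjectNotFound < StandardError; end

# Idempotent delete: a missing object means the desired end state is
# already reached, so report success instead of failing the Sidekiq job.
def delete_object(client, key)
  client.delete(key)
  :deleted
rescue ObjectNotFound
  :already_gone
end

# Minimal fake client for demonstration.
FakeClient = Struct.new(:keys) do
  def delete(key)
    raise ObjectNotFound, key unless keys.delete(key)
  end
end

client = FakeClient.new(["mr-diff-1"])
delete_object(client, "mr-diff-1") # => :deleted
delete_object(client, "mr-diff-1") # => :already_gone
```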
Implementation Guide
- Document this AWS S3 404/403 problem in https://docs.gitlab.com/administration/geo/replication/troubleshooting/synchronization_verification/.
- Track in Sentry, log (`track_exception` does both at once), and swallow the `Excon::Error::Forbidden` error in:
  - `BlobReplicatorStrategy#resource_exists?` and return `false`
  - `FileRegistryRemovalService#object_file` and return `nil`
- In the log extra data, include a link to the troubleshooting doc section.
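A self-contained sketch of the proposed rescue-and-swallow shape (an assumption, not the actual patch). The `Excon::Error::Forbidden` class and the tracking call are stubbed here so the control flow runs in isolation; `fetch` stands in for the real object-storage lookup in `object_file`:

```ruby
# Stub of excon's error class (the real one lives in the excon gem).
module Excon
  module Error
    class Forbidden < StandardError; end
  end
end

# Stand-in for Gitlab::ErrorTracking, whose track_exception reports to
# Sentry and writes a structured log entry in one call.
module ErrorTracker
  class << self
    attr_reader :last_tracked

    def track_exception(error, extra = {})
      @last_tracked = [error, extra]
    end
  end
end

TROUBLESHOOTING_URL =
  "https://docs.gitlab.com/administration/geo/replication/troubleshooting/synchronization_verification/"

# Proposed shape: swallow the 403, track it with a pointer to the docs,
# and report the file as absent by returning nil.
def object_file(fetch)
  fetch.call
rescue Excon::Error::Forbidden => e
  ErrorTracker.track_exception(e, troubleshooting_doc: TROUBLESHOOTING_URL)
  nil
end

forbidden = -> { raise Excon::Error::Forbidden, "Expected(200) <=> Actual(403 Forbidden)" }
object_file(forbidden) # => nil (error tracked, not raised)
```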