High error rate from 'SetGroupSecretPushProtectionWorker'
Background:
The recently introduced SetGroupSecretPushProtectionWorker has been failing at a high rate, as shown in these logs.
Findings:
- There are 23 worker failures between the 20th and 25th of December.
- The error pattern resembles Sidekiq's exponential backoff between retries:
  - 20th - 16 errors.
  - 21st - 3 errors.
  - 22nd - 2 errors.
  - 23rd - 1 error.
  - 24th - 1 error.
- All the failing workers share the same two major parameters: group_id and current_user_id.
  - From the JSON document found here:
      "args": [
        "12568027", #group_id
        "[FILTERED]",
        "12742232", #current_user_id
        "[FILTERED]"
      ]
- The failing workers' logs share the same PG timeout and a duration of ~15 seconds:
    PG::QueryCanceled: ERROR: canceling statement due to statement timeout
- The failing PG query is:
    SELECT "projects"."id"
    FROM "projects"
    WHERE "projects"."namespace_id" IN (
      SELECT "namespaces"."id"
      FROM "namespaces"
      WHERE "namespaces"."type" = $1 AND (traversal_ids @> ($2))
    )
    ORDER BY "projects"."id" ASC
    LIMIT $3
  This query corresponds to projects_scope in SetGroupSecretPushProtectionService and, judging by the call stack, was called from each_batch in SetSecretPushProtectionBaseService.
- When running a similar query (without LIMIT) on a production replica using the same group_id, the execution time is relatively short: ~35ms.
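
To make the retry-pattern finding above easier to check, here is a minimal Ruby sketch of how the attempts of a single retried job would spread over time. It assumes Sidekiq's documented default backoff of roughly retry_count**4 + 15 seconds and ignores the random jitter term and the job's own run time. The helper names and the max_retries value are hypothetical; the retry limit actually configured for this worker is not known from the logs, and the 23 failures may belong to more than one job.

```ruby
SECONDS_PER_DAY = 24 * 60 * 60

# Approximate Sidekiq default delay (in seconds) before a given retry.
# retry_count is 0 for the first retry. The random jitter that Sidekiq
# adds on top is deliberately left out (assumption for readability).
def retry_delay(retry_count)
  (retry_count**4) + 15
end

# Count how many attempts (initial run plus retries) of one job fall into
# each 24-hour window, measured from the first failure.
# max_retries is a hypothetical parameter, not the worker's real setting.
def attempts_per_day(max_retries)
  elapsed = 0
  counts = Hash.new(0)
  counts[0] += 1 # the initial attempt

  max_retries.times do |retry_count|
    elapsed += retry_delay(retry_count)
    counts[elapsed / SECONDS_PER_DAY] += 1
  end

  counts
end

attempts_per_day(20).sort.each do |day, count|
  puts "day #{day + 1}: #{count} attempt(s)"
end
```

Under these assumptions most attempts land within the first 24 hours and only a thin tail follows on later days, which has the same front-loaded shape as the per-day error counts listed above. The windows here start at the first failure rather than at midnight, so the exact per-day numbers are not expected to match the logs.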