2022-09-19 Service Ping failed due to single non-hardened metric
While generating the Service Ping report for the week of 2022-09-19, the team found that a single non-hardened metric with performance issues blocked the report from being generated.
@mikolaj_wawrzyniak investigated and had this to report:
The Service Ping run launched yesterday failed to send its payload. The console returned the following output:
[ gprd ] production> [start_at = Time.now.to_i, resp = GitlabServicePingWorker.new.perform('triggered_from_cron'=>false), end_at = Time.now.to_i]
Sending event c2cd070259234ae9be49f736c802ae70 to Sentry
Sending event f64f2f938df14b8abcfebced0fbdefa7 to Sentry
Sending event 949398a2a6b143399266346b0ae9e165 to Sentry
Sending event 10b1d6aa2e3e47ccad02c1898ed61e73 to Sentry
Sending event 2d5e5d48e1d242e6b8ef4279e66c3a0f to Sentry
Sending event e25e202426134b59aea00134699b03d1 to Sentry
Sending event 5be379109a46449da5fc8fe9b274f4d1 to Sentry
Traceback (most recent call last):
7: from (irb):1
6: from (irb):2:in `rescue in irb_binding'
5: from app/workers/gitlab_service_ping_worker.rb:31:in `perform'
4: from lib/gitlab/exclusive_lease_helpers.rb:38:in `in_lock'
3: from app/workers/gitlab_service_ping_worker.rb:35:in `block in perform'
2: from app/services/service_ping/submit_service.rb:24:in `execute'
1: from app/services/service_ping/submit_service.rb:73:in `submit_usage_data_payload'
ServicePing::SubmitService::SubmissionError (Usage data payload is blank)
Upon inspecting the last event sent to Sentry, I found https://sentry.gitlab.net/gitlab/gitlabcom/issues/3453558/events/112934744/?environment=gprd, which pointed to https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/utils/usage_data.rb#L297. That method has no database-layer error handling, although such handling is present for every other database metric. As a result, if this metric hits a timeout, it can fail the whole Service Ping report.
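To illustrate the failure mode described above, here is a minimal, self-contained Ruby sketch. It is not the code from `usage_data.rb`: `QueryTimeout`, `FALLBACK`, `guarded_metric`, and the metric names are hypothetical stand-ins. It only shows how one unguarded metric can abort the whole payload, while a guarded one degrades to a fallback value.

```ruby
# Hypothetical stand-in for a database statement timeout.
class QueryTimeout < StandardError; end

# Assumed sentinel meaning "this metric could not be computed".
FALLBACK = -1

# Hypothetical metrics: one healthy, one that always times out.
METRICS = {
  fast_count: -> { 42 },
  slow_count: -> { raise QueryTimeout, 'statement timeout' }
}.freeze

# Runs the metric block; on a timeout, returns FALLBACK instead of raising.
def guarded_metric
  yield
rescue QueryTimeout
  FALLBACK
end

# No error handling: the first failing metric aborts the whole payload.
def unguarded_payload
  METRICS.transform_values(&:call)
end

# Each metric is isolated, so one timeout costs only that one value.
def guarded_payload
  METRICS.transform_values { |metric| guarded_metric(&metric) }
end
```

In this sketch, `unguarded_payload` raises `QueryTimeout` and produces nothing at all, which mirrors the blank payload in the traceback. `guarded_payload` still returns the healthy metric alongside the fallback value for the failing one.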
Steps to remediate
- Hot patch the code to skip this metric and retry the payload -> short-term, one-time-only solution
- Move the broken metric to an instrumentation class to provide error handling -> stable approach #374601 (closed)
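The stable approach above relies on a shared class providing the error handling, so that individual metrics cannot forget it. The following Ruby sketch shows that general pattern under stated assumptions: `BaseMetric`, `SlowCountMetric`, `FastCountMetric`, `QueryTimeout`, and `build_payload` are hypothetical names for illustration and do not reproduce GitLab's actual instrumentation classes.

```ruby
# Hypothetical stand-in for a database statement timeout.
class QueryTimeout < StandardError; end

# Hypothetical base class: subclasses implement #compute only, and the
# shared #value wrapper supplies the error handling for all of them.
class BaseMetric
  FALLBACK = -1 # assumed sentinel for "metric could not be computed"

  def value
    compute
  rescue QueryTimeout => e
    warn "#{self.class.name} failed: #{e.message}"
    FALLBACK
  end

  private

  def compute
    raise NotImplementedError, "#{self.class.name} must implement #compute"
  end
end

# Hypothetical metric whose query always times out.
class SlowCountMetric < BaseMetric
  private

  def compute
    raise QueryTimeout, 'statement timeout'
  end
end

# Hypothetical healthy metric.
class FastCountMetric < BaseMetric
  private

  def compute
    42
  end
end

# Builds the payload through the guarded #value interface only.
def build_payload
  {
    fast_count: FastCountMetric.new.value,
    slow_count: SlowCountMetric.new.value
  }
end
```

Putting the `rescue` in the base class means a new metric gets the handling by inheritance instead of by convention, which is the property the non-hardened metric lacked.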