Retry sending Usage Ping in case of network errors

Alishan Ladhani requested to merge 216155-retry-sending-the-usage-ping into master

What does this MR do?

In #216155 (closed), it was noted that Usage Ping can be made more reliable by retrying in case of a network error.

There are two types of errors that we expect:

  • The Version app (version.gitlab.com) is unavailable
  • The usage ping request returns an unsuccessful status code
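Both failure modes can be surfaced the same way: raise an exception so that whatever runs the job sees the attempt as failed. The following is a minimal sketch in plain Ruby, not the code in this MR. `UsagePingError`, `post_usage_ping`, and the `FakeResponse` struct are hypothetical names used only for illustration, and the list of network exceptions is an assumption rather than an exhaustive set.

```ruby
require 'socket'
require 'timeout'

# Hypothetical error class; not taken from the GitLab codebase.
class UsagePingError < StandardError; end

# Network-level failures suggesting the Version app is unreachable (assumed list).
NETWORK_ERRORS = [
  SocketError, Errno::ECONNREFUSED, Errno::ECONNRESET,
  Errno::EHOSTUNREACH, Timeout::Error
].freeze

# Stand-in for an HTTP response object (assumption: it exposes #code).
FakeResponse = Struct.new(:code) do
  def success?
    (200..299).cover?(code)
  end
end

# Yields to a block that performs the request and returns a response.
# Raises UsagePingError for both error types so the caller can retry.
def post_usage_ping
  response = yield
  raise UsagePingError, "Unsuccessful response code: #{response.code}" unless response.success?

  response
rescue *NETWORK_ERRORS => e
  raise UsagePingError, "Version app unavailable: #{e.class}"
end
```

Taking the request as a block keeps the sketch independent of any particular HTTP client.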

GitlabUsagePingWorker will retry 3 times over ~24 hours when it encounters an error.
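To make the "3 times over ~24 hours" figure concrete, here is one schedule that would satisfy it. The fixed 8-hour interval is an assumption chosen so the numbers add up; the description does not state the actual delays. In a Sidekiq worker, a function like this would typically be supplied through `sidekiq_retry_in`, with the limit set via `sidekiq_options retry: 3`.

```ruby
MAX_RETRIES = 3
RETRY_INTERVAL_SECONDS = 8 * 60 * 60 # assumption: 8 hours between attempts

# Delay in seconds before the next attempt. The argument is the number of
# retries that have already happened; it is unused here because the
# assumed interval is fixed.
def retry_delay(_retry_count)
  RETRY_INTERVAL_SECONDS
end

# Cumulative offsets (in seconds) from the first failure at which each retry runs.
def retry_offsets
  elapsed = 0
  (0...MAX_RETRIES).map { |count| elapsed += retry_delay(count) }
end

retry_offsets.map { |seconds| seconds / 3600 } # => [8, 16, 24]
```

With these assumed values, the last retry lands 24 hours after the first failure, after which the job would give up until the next scheduled run.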

Approaches considered

Approach: One worker - compute and send data
Retry mechanism: Sidekiq
Considerations: Computing usage data is somewhat expensive, since it generates hundreds (or even thousands) of DB queries. Network errors are assumed to be relatively infrequent, so this should not be a problem.

Approach: One worker - compute and send data
Retry mechanism: Cron
Considerations: Currently, the worker is scheduled to run once a week, on a random day/hour/minute. This ensures that requests to the Version app are distributed evenly and the load is predictable. Scheduling the worker to run more than once a week could make the load less evenly distributed, and more difficult to predict.

For example, if the worker is scheduled to run once a day, the schedule of the worker (daily) is no longer tied to the schedule of usage ping (weekly). We could keep timestamps in Redis to indicate when we last computed/sent data, and check those to ensure a weekly cadence. But we lose control over which day of the week usage pings are sent. For example, if people are more likely to set up self-managed instances on Monday, we will see a continuously increasing number of usage ping requests every Monday.

The consideration for a single worker retrying via Sidekiq also applies here.

Approach: Two workers, one computes data, other sends data
Retry mechanism: Sidekiq
Considerations: In this scenario, GitlabUsagePingWorker computes data, and schedules GitLabUsagePingRequestWorker to send data. Each worker can have its own retry policy. The two workers would be quite coupled.

Approach: Two workers, one computes data, other sends data
Retry mechanism: Cron
Considerations: Similar to one worker retrying via cron.

We need to ensure an even distribution of requests to the Version app. A possible approach could be scheduling GitLabUsagePingRequestWorker every hour at a random minute.
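
The cron-based approaches rest on two ideas: a randomized schedule to spread load on the Version app, and a stored timestamp to keep the weekly cadence when the worker runs more often than weekly. The following is a rough sketch of both. The function names are hypothetical, and a plain `Time` argument stands in for the Redis-stored timestamp mentioned above.

```ruby
ONE_WEEK_SECONDS = 7 * 24 * 60 * 60

# Cron expression for once a week at a random day/hour/minute.
# Passing a seeded Random makes the result reproducible.
def random_weekly_cron(rng = Random.new)
  minute = rng.rand(60)
  hour = rng.rand(24)
  day_of_week = rng.rand(7)
  "#{minute} #{hour} * * #{day_of_week}"
end

# Cron expression for every hour at a random minute.
def random_hourly_cron(rng = Random.new)
  "#{rng.rand(60)} * * * *"
end

# True when no ping has been sent yet, or the last one is at least a week old.
def usage_ping_due?(last_sent_at, now = Time.now)
  last_sent_at.nil? || (now - last_sent_at) >= ONE_WEEK_SECONDS
end
```

A guard like `usage_ping_due?` preserves the weekly cadence under a daily or hourly schedule, but, as noted above, it does not restore control over which day of the week the ping is sent.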

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • [-] Label as security and @ mention @gitlab-com/gl-security/appsec
  • [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • [-] Security reports checked/validated by a reviewer from the AppSec team
Edited by Alishan Ladhani
