Changes from `sum` to `rate` for sidekiq exceptions

John Skarbek requested to merge jts/tweak-sidekiq-alert into master
  • Currently we chart and alert on the total sum of exceptions, which never resets until a process dies
    • From this we can gather that the RepositoryUpdateMirrorWorker and ProjectServiceWorker never die unless we upgrade GitLab
    • Other controllers frequently die for X reason, which is why their metrics appear never to climb above a certain threshold
    • The above is not useful: we will always throw some errors, so alerting on the total count proves nothing, as the problem behind them may have occurred long ago
  • This changes the alert to a rate, as the sum is pretty much non-actionable
  • Our baseline appears to be below 0.2 over the last day, with spikes up to 1
  • This alert may need tweaking in the future, but at least this provides us with a more actionable alert
  • This change MUST also be reflected in the graph
    • This will be done manually
  • Closes: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5203
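
To illustrate why a rate is more actionable than a running total, here is a minimal Python sketch. It is not the alert rule from this merge request and not Prometheus's exact `rate` algorithm (which also extrapolates to the window edges); the function name `per_second_rate` and all sample values are hypothetical. It assumes each sample is a `(timestamp_seconds, cumulative_exception_count)` pair and treats any drop in the counter as a process restart.

```python
from typing import List, Tuple

# Hypothetical sample shape: (unix timestamp in seconds, cumulative exception count)
Sample = Tuple[float, float]


def per_second_rate(samples: List[Sample]) -> float:
    """Average per-second increase over the window.

    Simplified sketch: any drop in the counter is treated as a reset
    (the process died and the counter restarted from zero).
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # Counter reset: everything counted since the restart is new.
            increase += curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Made-up data: a long-lived worker with a large total but few new errors.
long_lived = [(0, 5000), (60, 5003), (120, 5006)]
# Made-up data: a worker with a small total that is erroring right now.
spiking = [(0, 10), (60, 70), (120, 130)]
# Made-up data: a worker that restarted inside the window.
restarted = [(0, 100), (60, 4), (120, 10)]

for name, series in [("long_lived", long_lived), ("spiking", spiking), ("restarted", restarted)]:
    print(name, "total:", series[-1][1], "rate:", round(per_second_rate(series), 3))
```

In this made-up data the long-lived worker has by far the largest total yet the lowest rate, while the spiking worker has a small total and the highest rate. An alert on the total would fire for the wrong worker.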