Set up alerts when Repository X-Ray worker errors exceed a certain threshold
Context
In #476177 (closed), we introduced the `Ai::Context::Dependencies::ConfigFiles::Base` class, with the intention that each dependency manager config file type is represented by a child class. Each child class contains the parsing logic in `extract_libs`, which returns a list of libraries and their versions from the file content. It is executed when the config file parser (`Ai::Context::Dependencies::ConfigFileParser`) runs `.parse!` on each config file object.
The Sidekiq worker `Ai::RepositoryXray::ScanDependenciesWorker` runs `Ai::RepositoryXray::ScanDependenciesService`, which executes the config file parser.
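To illustrate the structure, here is a minimal sketch of what a child class might look like. It is a simplified stand-in only: the class name, the `content` handling, and the return shape are assumptions for illustration, not the actual implementation.

```ruby
# Hypothetical sketch (not the actual GitLab code): a simplified stand-in for a
# child class of ConfigFiles::Base, here for a Python `requirements.txt` file.
class RequirementsTxt
  def initialize(content)
    @content = content
  end

  # The parsing logic a real child class would put in `extract_libs`: return a
  # list of libraries and their versions found in the file content.
  def extract_libs
    @content.each_line.filter_map do |line|
      line = line.strip
      next if line.empty? || line.start_with?('#')

      name, version = line.split('==', 2)
      { name: name, version: version }
    end
  end
end

# Example usage:
# RequirementsTxt.new("requests==2.31.0\nflask==3.0.0\n").extract_libs
# # => [{ name: "requests", version: "2.31.0" }, { name: "flask", version: "3.0.0" }]
```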
Problem
The parsing logic in `extract_libs` sometimes misses certain edge cases in the file content. When an unexpected data type or value is encountered, it raises an exception that bubbles up as a Sidekiq job error. These are unhandled exceptions and should either be fixed or caught and re-raised as a known `ParsingError` instead.
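A minimal sketch of the catch-and-re-raise approach, assuming a simplified parsing method and a `ParsingError` defined as a plain `StandardError` subclass (the real error class and rescue granularity in the codebase may differ):

```ruby
# Hypothetical sketch only (not the actual GitLab code): catch unexpected
# failures inside the parsing logic and re-raise them as a known ParsingError,
# so they no longer bubble up as unhandled Sidekiq job errors.
ParsingError = Class.new(StandardError)

def extract_libs(content)
  content.each_line.map { |line| { name: line.split('==', 2).first.strip } }
rescue StandardError => e
  # Anything unexpected (nil content, odd data types, malformed lines, ...)
  # becomes a known, categorized error instead of an unhandled exception.
  raise ParsingError, "Failed to parse config file content: #{e.message}"
end

# extract_libs("requests==2.31.0\n") # => [{ name: "requests" }]
# extract_libs(nil)                  # => raises ParsingError instead of NoMethodError
```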
Typically, the rate of these unhandled errors is quite low compared to the success rate (see the Grafana Worker Detail dashboard), so they are considered low-priority "bugs". They should still be addressed for code completeness and to avoid impacting our error budget.
References
- Kibana failed job logs for `Ai::RepositoryXray::ScanDependenciesWorker`: https://log.gprd.gitlab.net/app/r/s/NbOQ9
- Grafana Worker Detail: https://dashboards.gitlab.net/goto/P1PeuukHg?orgId=1
- Error budget for group::code creation: https://dashboards.gitlab.net/goto/Xrt3xRZHR?orgId=1
Proposal
First, investigate whether there is a way to set up Grafana alerts and/or Slack notifications when the error rate in the worker graph exceeds a certain threshold (0.1% or lower recommended).
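For illustration only, a Prometheus-style alert rule expressing the 0.1% threshold could look roughly like the sketch below. The metric names, labels, and time windows are assumptions and would need to be matched to the metrics and alerting setup behind our dashboards:

```yaml
groups:
  - name: repository_xray
    rules:
      - alert: RepositoryXrayScanDependenciesErrorRateHigh
        # Hypothetical metric names: ratio of failed to completed jobs for the
        # worker over the last hour, alerting when it exceeds 0.1%.
        expr: >
          sum(rate(sidekiq_jobs_failed_total{worker="Ai::RepositoryXray::ScanDependenciesWorker"}[1h]))
          /
          sum(rate(sidekiq_jobs_completion_count{worker="Ai::RepositoryXray::ScanDependenciesWorker"}[1h]))
          > 0.001
        for: 1h
        labels:
          team: code_creation
        annotations:
          summary: "Repository X-Ray ScanDependenciesWorker error rate above 0.1%"
```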
When these notifications are triggered, consider automatically opening an actionable issue similar to #517173 (closed) (with priority labels) so that our team can take action on it as soon as possible.
If notifications cannot be implemented with the available tools, then we could consider setting up a recurring team task to periodically monitor the Kibana logs. However, this should be a last resort.
Further details / references
On possible Grafana implementation (ref: #500575 (comment 2303465333)):
Alerts would be a good idea though. It looks like Grafana has a feature for this purpose (screenshot below), but I haven't tried it before. There could also be a way to send alerts to our group Slack channel (or to a new channel named `#g_create_code-creation-alerts`). So we could open up an issue to investigate setting up these notifications. We might even be able to have it automatically open a new issue for us at a certain threshold.
If a recurring task is necessary (ref: #500575 (comment 2302789226)):
I think we should set up a recurring task (perhaps quarterly) to review the logs and address any new unhandled exceptions that have appeared.