Set up alerts when Repository X-Ray worker errors exceed a certain threshold
Context
In #476177 (closed), we introduced the `Ai::Context::Dependencies::ConfigFiles::Base` class, with the intention that each dependency manager config file type is represented by a child class. Each child class contains the parsing logic in `extract_libs`, which returns a list of libraries and their versions from the file content. It is executed when the config file parser (`Ai::Context::Dependencies::ConfigFileParser`) runs `.parse!` on each config file object.
The Sidekiq worker `Ai::RepositoryXray::ScanDependenciesWorker` runs `Ai::RepositoryXray::ScanDependenciesService`, which executes the config file parser.
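To illustrate the structure, here is a minimal sketch of what a child class might look like. It is a simplified stand-in only: the class name, the `content` handling, and the return shape are assumptions for illustration, not the actual implementation.

```ruby
# Hypothetical sketch (not the actual GitLab code): a simplified stand-in for a
# child class of ConfigFiles::Base, here for a Python `requirements.txt` file.
class RequirementsTxt
  def initialize(content)
    @content = content
  end

  # The parsing logic a real child class would put in `extract_libs`: return a
  # list of libraries and their versions found in the file content.
  def extract_libs
    @content.each_line.filter_map do |line|
      line = line.strip
      next if line.empty? || line.start_with?('#')

      name, version = line.split('==', 2)
      { name: name, version: version }
    end
  end
end

# Example usage:
# RequirementsTxt.new("requests==2.31.0\nflask==3.0.0\n").extract_libs
# # => [{ name: "requests", version: "2.31.0" }, { name: "flask", version: "3.0.0" }]
```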
Problem
The parsing logic in `extract_libs` sometimes misses certain edge cases in the file content. When an unexpected data type or value is encountered, it raises an exception that bubbles up as a Sidekiq job error. These are unhandled exceptions and should either be fixed or caught and re-raised as a known `ParsingError` instead.
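A minimal sketch of the catch-and-re-raise approach, assuming a simplified parsing method and a `ParsingError` defined as a plain `StandardError` subclass (the real error class and rescue granularity in the codebase may differ):

```ruby
# Hypothetical sketch only (not the actual GitLab code): catch unexpected
# failures inside the parsing logic and re-raise them as a known ParsingError,
# so they no longer bubble up as unhandled Sidekiq job errors.
ParsingError = Class.new(StandardError)

def extract_libs(content)
  content.each_line.map { |line| { name: line.split('==', 2).first.strip } }
rescue StandardError => e
  # Anything unexpected (nil content, odd data types, malformed lines, ...)
  # becomes a known, categorized error instead of an unhandled exception.
  raise ParsingError, "Failed to parse config file content: #{e.message}"
end

# extract_libs("requests==2.31.0\n") # => [{ name: "requests" }]
# extract_libs(nil)                  # => raises ParsingError instead of NoMethodError
```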
Typically, the rate of these unhandled errors is quite low compared to the success rate (see the Grafana Worker Detail dashboard), so they are considered low-priority "bugs". They should still be addressed for code completeness and to avoid impacting our error budget.
References
- Kibana failed job logs for `Ai::RepositoryXray::ScanDependenciesWorker`: https://log.gprd.gitlab.net/app/r/s/NbOQ9
- Grafana Worker Detail: https://dashboards.gitlab.net/goto/P1PeuukHg?orgId=1
- Error budget for group::code creation: https://dashboards.gitlab.net/goto/Xrt3xRZHR?orgId=1
Proposal
First, investigate whether there is a way to set up Grafana alerts and/or Slack notifications when the error rate in the worker graph exceeds a certain threshold (0.1% or lower recommended).
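For illustration only, a Prometheus-style alert rule expressing the 0.1% threshold could look roughly like the sketch below. The metric names, labels, and time windows are assumptions and would need to be matched to the metrics and alerting setup behind our dashboards:

```yaml
groups:
  - name: repository_xray
    rules:
      - alert: RepositoryXrayScanDependenciesErrorRateHigh
        # Hypothetical metric names: ratio of failed to completed jobs for the
        # worker over the last hour, alerting when it exceeds 0.1%.
        expr: >
          sum(rate(sidekiq_jobs_failed_total{worker="Ai::RepositoryXray::ScanDependenciesWorker"}[1h]))
          /
          sum(rate(sidekiq_jobs_completion_count{worker="Ai::RepositoryXray::ScanDependenciesWorker"}[1h]))
          > 0.001
        for: 1h
        labels:
          team: code_creation
        annotations:
          summary: "Repository X-Ray ScanDependenciesWorker error rate above 0.1%"
```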
When these notifications are triggered, consider automatically opening an actionable issue similar to #517173 (closed) (with priority labels) so that our team can take action on it as soon as possible.
If notifications cannot be implemented with the available tools, then we could consider setting up a recurring team task to periodically monitor the Kibana logs. However, this should be a last resort.
Further details / references
On possible Grafana implementation (ref: #500575 (comment 2303465333)):
Alerts would be a good idea though. It looks like Grafana has a feature for this purpose (screenshot below), but I haven't tried it before. There could also be a way to send alerts to our group Slack channel (or to a new channel named `#g_create_code-creation-alerts`). So we could open up an issue to investigate setting up these notifications. We might even be able to have it automatically open a new issue for us at a certain threshold.
If a recurring task is necessary (ref: #500575 (comment 2302789226)):
I think we should set up a recurring task (perhaps quarterly) to review the logs and address any new unhandled exceptions that have appeared.