Monitor and fix unhandled Sidekiq job errors (Ai::RepositoryXray::ScanDependenciesWorker)
Context
In #476177 (closed), we introduced the Ai::Context::Dependencies::ConfigFiles::Base class, where the intention is for each dependency manager config file type to be represented by a child class. Each child class contains the parsing logic in extract_libs, which returns a list of libraries and their versions from the file content. It's executed when the config file parser (Ai::Context::Dependencies::ConfigFileParser) runs .parse! on each config file object.
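The relationship described above can be sketched as follows. This is a simplified illustration, not the actual GitLab code: ConfigFileBase, NodePackageFile, and Lib are hypothetical stand-ins, and the real Base class and parser likely carry more responsibilities.

```ruby
require 'json'

# Hypothetical stand-in for the Base class; not the actual GitLab class.
class ConfigFileBase
  # Hypothetical value object for one parsed library.
  Lib = Struct.new(:name, :version)

  attr_reader :libs

  def initialize(content)
    @content = content
    @libs = []
  end

  # Called by the parser on each config file object.
  def parse!
    @libs = extract_libs
    self
  end

  private

  attr_reader :content

  # Each child class implements its own parsing logic here.
  def extract_libs
    raise NotImplementedError
  end
end

# Hypothetical child class for a simplified package.json-style file.
class NodePackageFile < ConfigFileBase
  private

  def extract_libs
    deps = JSON.parse(content).fetch('dependencies', {})
    deps.map { |name, version| Lib.new(name, version) }
  end
end

file = NodePackageFile.new('{"dependencies": {"lodash": "4.17.21"}}').parse!
file.libs.each { |lib| puts "#{lib.name} #{lib.version}" }
```

Keeping extract_libs private and exposing only parse! means the parser treats every config file type uniformly, while format-specific logic stays in the child class.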
The Sidekiq worker Ai::RepositoryXray::ScanDependenciesWorker runs Ai::RepositoryXray::ScanDependenciesService, which executes the config file parser.
Problem
The parsing logic in extract_libs sometimes misses certain edge cases in the file content. When an unexpected data type or value is encountered, it throws an exception that bubbles up as a Sidekiq job error. These are unhandled exceptions and should either be fixed, or caught and re-raised as a known ParsingError instead.
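A minimal sketch of the "catch and re-raise" option, under stated assumptions: the ParsingError class and the extract_libs_safely helper below are hypothetical and only illustrate the pattern; the real ParsingError and the set of exceptions worth rescuing depend on each config file class.

```ruby
require 'json'

# Hypothetical error class standing in for the known ParsingError.
class ParsingError < StandardError; end

# Hypothetical helper: parses a simplified package.json-style document and
# converts unexpected exceptions into ParsingError.
def extract_libs_safely(content)
  deps = JSON.parse(content).fetch('dependencies', {})
  # Raises NoMethodError when `deps` is a String rather than a Hash.
  deps.map { |name, version| { name: name, version: version } }
rescue JSON::ParserError, NoMethodError, TypeError => e
  # Ruby records the original exception as `cause` on the new one.
  raise ParsingError, "unexpected dependencies format (#{e.class})"
end

begin
  # Edge case: "dependencies" is a String instead of an object.
  extract_libs_safely('{"dependencies": "none"}')
rescue ParsingError => e
  puts "#{e.class}: #{e.message}"
end
```

Rescuing a narrow list of exception classes, rather than StandardError, keeps genuinely unexpected failures visible in the job logs so they can still be investigated and fixed.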
Typically the rate of these unhandled errors is quite low compared to the success rate (see Grafana worker detail). So these are considered low priority "bugs", but they should still be addressed for code completeness and to avoid impacting our error budget.
References
- Kibana failed job logs for Ai::RepositoryXray::ScanDependenciesWorker: https://log.gprd.gitlab.net/app/r/s/NbOQ9
- Grafana Worker Detail: https://dashboards.gitlab.net/goto/P1PeuukHg?orgId=1
- Error budget for groupcode creation: https://dashboards.gitlab.net/goto/Xrt3xRZHR?orgId=1
- Many of these errors were fixed during the initial roll out (#483928 (closed)). See related MRs.
Proposal
We should periodically monitor Kibana logs for these failed Sidekiq job errors and fix or handle them as needed. This needs to be a continuous process throughout the lifetime of the X-Ray service because new config file classes may be added over time. New edge cases may also be uncovered as the service becomes more widely used.
UPDATE [2024-02-04]
Per #500575 (comment 2303465333), we have promoted this to an epic (&16680 (closed)) and will create child issues as needed.