Optimize when X-Ray dependency scanning jobs run to minimize redundancy
Context
Currently, the Ai::RepositoryXray::ScanDependenciesWorker is triggered whenever there is a new commit pushed to the default branch of a Duo-enabled project. The Ai::RepositoryXray::ScanDependenciesService runs Ai::Context::Dependencies::ConfigFileParser, which pulls the entire worktree from the project default branch and finds/parses the supported dependency config files.
This process happens every time the Ai::RepositoryXray::ScanDependenciesWorker runs, which could be up to a few thousand times a day on very active projects. The Sidekiq service currently handles this job frequency easily (moreover, only one scan job per project can run at a time), so this is a relatively "cheap" trade-off for a simpler first-iteration implementation.
However, this means that we are needlessly re-running a full scan numerous times even when the dependency config files don't change. (See job frequency per project #483928 (comment 2159959995).)
Possible approaches
1. Only process dependency config files when they change
- Only run an initial full scan once. A "full scan" is what we're currently doing for every job: it involves pulling the entire worktree and finding/parsing all config files.
- To support this, we would need a way to tell if the project has had a full scan before.
- When a new commit is pushed to the default branch, only search for config files in the commit's modified file paths.
- In other words, only run the parser when the config files actually change. This would significantly reduce redundant processing.
- This process would also be a lot quicker, since normally only a small fraction of a project's files are modified in a commit.
- Note, however, that this technically just reduces the amount of processing we do; the job frequency would remain the same.
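The decision described in this approach could look roughly like the sketch below. It is only an illustration: `SUPPORTED_CONFIG_FILE_NAMES`, `changed_config_paths`, `scan_plan`, and the `full_scan_done` flag are hypothetical names, and the file list is a placeholder rather than the list that Ai::Context::Dependencies::ConfigFileParser actually supports.

```ruby
# Placeholder list of config file names (assumption for illustration only;
# the real list is defined by Ai::Context::Dependencies::ConfigFileParser).
SUPPORTED_CONFIG_FILE_NAMES = %w[Gemfile.lock package.json go.mod requirements.txt].freeze

# Returns the modified paths whose basename matches a supported config file.
def changed_config_paths(modified_paths)
  modified_paths.select do |path|
    SUPPORTED_CONFIG_FILE_NAMES.include?(File.basename(path))
  end
end

# Decides what a push to the default branch requires.
# `full_scan_done` is an assumed per-project flag recording whether the
# initial full scan has already happened.
def scan_plan(full_scan_done:, modified_paths:)
  return { type: :full } unless full_scan_done

  paths = changed_config_paths(modified_paths)
  paths.empty? ? { type: :skip } : { type: :incremental, paths: paths }
end
```

With this shape, a commit that only touches `README.md` yields `{ type: :skip }`. The job is still enqueued, but it exits before any parsing, which matches the note that job frequency stays the same while processing drops.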
Potential complications
- Currently, only one X-Ray scan job can run at a time per project. This workflow does not adapt easily to the above approach because it would miss changes that are committed while another scan job is running. So we need to consider how we can limit the number of simultaneous jobs while ensuring that no relevant commits are missed.
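One possible way to avoid missing commits while keeping a single job per project is to record the last commit that was scanned and have each job process everything between that commit and the current head, rather than only the commit that triggered it. The sketch below models this in memory. `Commit`, `ScanCursor`, and `paths_to_scan` are hypothetical names, and in practice the history walk would be a repository diff between two SHAs.

```ruby
# In-memory stand-in for commits on the default branch (hypothetical model):
# each commit has a SHA and the list of file paths it modified.
Commit = Struct.new(:sha, :paths)

# Tracks the last scanned commit for one project (assumed to be persisted
# per project in a real implementation).
class ScanCursor
  attr_reader :last_scanned_sha

  def initialize
    @last_scanned_sha = nil
  end

  # Returns the unique paths modified by all commits after the cursor,
  # then advances the cursor to the head of the history.
  def paths_to_scan(history)
    pending = pending_commits(history)
    @last_scanned_sha = history.last&.sha
    pending.flat_map(&:paths).uniq
  end

  private

  def pending_commits(history)
    index = history.index { |commit| commit.sha == @last_scanned_sha }
    # No cursor yet, or the cursor is not in the history (for example after
    # a history rewrite): fall back to processing every commit.
    index ? history.drop(index + 1) : history
  end
end
```

Under this scheme, a push that arrives while a scan is running does not need its own concurrent job. The next job to run picks up every commit since the cursor, so skipped or deduplicated jobs do not lose changes.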
2. Only run the scan at a set time interval
- We could keep the "full scan" process and run the X-Ray job as a cron job, at some interval between 10 minutes and an hour.
- The key here is that we expect customers would be okay with a delay before code generation contexts are updated.
Potential complications
- There may be challenges with scheduling a job for every Duo-enabled project at the same cadence.
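One way to soften the scheduling problem is to stagger projects across the interval instead of scanning them all at the same moment. The sketch below assumes a 60-minute cadence and derives a slot from the project ID. `INTERVAL_MINUTES`, `scan_slot`, and `due?` are hypothetical names, not existing code.

```ruby
# Assumed scan cadence in minutes (illustrative value only).
INTERVAL_MINUTES = 60

# Maps a project to a fixed minute slot within the interval.
def scan_slot(project_id, interval_minutes: INTERVAL_MINUTES)
  project_id % interval_minutes
end

# True when the given minute falls on the project's slot, so a cron tick
# that runs every minute only enqueues a fraction of the projects.
def due?(project_id, minute, interval_minutes: INTERVAL_MINUTES)
  minute % interval_minutes == scan_slot(project_id, interval_minutes: interval_minutes)
end
```

If project IDs are spread roughly evenly, each per-minute tick would enqueue about one sixtieth of the Duo-enabled projects, while each project is still scanned once per interval.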