In #474306 (comment 2025085630), we decided to migrate the Repository X-Ray functionality into the GitLab Rails monolith. This will allow us to run the service outside of the CI pipeline.
Proposal
Reproduce the functionality in Repository X-Ray as a new service inside the GitLab monolith.
MR Implementation

| Description | MR/Link |
| --- | --- |
| Create the base config file parser classes that can easily support future languages | |
Hey @emeraldjayde! So there's a rough MR implementation table in the description right now. I figure I'll start with creating a base service, then we can collaborate on migrating all the languages over.
Hi @derekferguson! Here is the issue for migrating X-Ray into the monolith. Could you help connect us to groupsecret detection so they can share their experience?
I'm also wondering if we should connect with groupgitaly. We will have to utilize Gitaly to scan the repository files. I wonder if the extra load on Gitaly would be a concern.
I can definitely connect you to groupsecret detection, however, I was looking over what they did and it is implemented as a pre-receive hook in Gitaly to scan commits for possible secrets. I'm not completely sure that their experience would translate to what we are trying to accomplish, since that is a part of Gitaly that we wouldn't need to touch. Also, the pre-receive hooks are a well-established way of doing things, so it would have been fairly straightforward to add one. That being said, I'm happy to make the introductions if you'd like to ask them about it. groupsource code is also very familiar with adding pre-receive hooks. With us trying to scan the repo on a regular basis, rather than triggering some functionality on push, I think that groupgitaly would be the best place to start.
@mjwood @jcaigitlab FYI, it would be great to get your opinion here. You can see our current goals in this issue, but I'm happy to jump on a call to provide more context and explain the situation, if it isn't clear. The basic gist is that we need to find a way to scan a repository and gather dependency information without doing it in a CI job.
@derekferguson Thank you for that context. Yes, in that case, I agree that the work from groupsecret detection probably wouldn't be directly applicable to what we're doing here.
With us trying to scan the repo on a regular basis, rather than triggering some functionality on push
To confirm, is the plan to regularly schedule the scan (e.g. once daily or hourly)? I was originally thinking to trigger it as a Sidekiq job on a new commit push to default branch, similar to what we did in !159818 (closed) but without the CI pipeline integration. So it would also have a self-limiting behaviour in that if another scan job is already running, then it defers the job such that only one scan execution happens per project at any given time.
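The self-limiting behaviour could be sketched roughly like this (a toy illustration only, not GitLab's actual Sidekiq worker; `ScanScheduler` is a hypothetical name standing in for Sidekiq's job deduplication):

```ruby
require 'set'

# Toy model of "only one scan execution per project at any given time":
# a scan request is deferred when one is already in flight for that project.
class ScanScheduler
  def initialize
    @running = Set.new
  end

  # Returns true if the scan was started, false if deferred.
  def schedule(project_id)
    return false if @running.include?(project_id)

    @running.add(project_id)
    true
  end

  def finish(project_id)
    @running.delete(project_id)
  end
end
```

In the real worker this would be handled by Sidekiq-level deduplication rather than hand-rolled state, but the observable behaviour is the same: duplicate scans for a project are dropped until the current one finishes.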
The basic gist is that we need to find a way to scan a repository and gather dependency information without doing it in a CI job.
Yes, and further to this, my initial idea for the technical approach is the following:
Proposal
The new service in the monolith will utilize Gitaly to obtain the list of file paths in a given repository.
With this list, we would then do glob pattern matching in Ruby to find a dependency lock file within the first 3 directories (root + 2 nested directories deep).
FYI: @derekferguson The limit of 2 nested directories deep was decided based on groupcomposition analysis's dependency scanning doc, which indicates they only search up to 2 levels deep. This was likely due to performance reasons/limitations, and also the fact that most projects don't have a lock file any deeper than that. So I think we should follow suit, wdyt?
Upon fetching the dependency lock file blob with Gitaly, we would then parse its contents and extract a list of dependency names.
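As a rough sketch of the proposed flow, assuming hypothetical names (`XrayDependencyScanner`, `LOCK_FILE_GLOBS`) and with the Gitaly-backed path listing passed in rather than fetched:

```ruby
# Hypothetical sketch of the proposed service. The paths array stands in
# for the result of Gitaly's ListFiles RPC (e.g. project.repository.ls_files).
class XrayDependencyScanner
  # Lock file within root + 2 nested directories (Gemfile.lock used as the example)
  LOCK_FILE_GLOBS = ['Gemfile.lock', '*/Gemfile.lock', '*/*/Gemfile.lock'].freeze

  def initialize(paths)
    @paths = paths
  end

  # Glob matching done in Ruby, per the proposal above.
  def lock_file_path
    @paths.find do |path|
      LOCK_FILE_GLOBS.any? { |glob| File.fnmatch?(glob, path, File::FNM_PATHNAME) }
    end
  end
end

scanner = XrayDependencyScanner.new(['README.md', 'backend/Gemfile.lock', 'a/b/c/Gemfile.lock'])
scanner.lock_file_path # => "backend/Gemfile.lock"
```

The next step (fetching the matched blob and parsing out dependency names) would hang off this result.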
Gitaly questions
Based on the above proposal, my questions for groupgitaly are:
Are there any concerns with fetching the entire list of file paths from a repository? We appear to already do this with the CI rules:exists feature, where the list is fetched with project.repository.ls_files(sha), which ultimately calls Gitaly's ListFilesRequest.
So I'm assuming ListFilesRequest is relatively fast (or fast enough) even on large repositories, but please correct me if I'm wrong.
Is there a Gitaly method like ListFilesRequest that allows us to specify the maximum depth? We likely won't need to pull paths deeper than 2 nested directories.
As mentioned earlier, I'd like to follow an approach similar to CI rules:exists where we utilize File.fnmatch? to find the file. So the pattern matching would be done within Ruby; however, is there perhaps already a Gitaly file search method that would be more efficient to use?
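For reference, the depth limiting hinges on `File::FNM_PATHNAME`, which keeps `*` from crossing directory separators, so each `*/` segment corresponds to exactly one directory level (a quick illustration, not necessarily the exact flags rules:exists uses):

```ruby
# With FNM_PATHNAME, '*' does not match '/', so patterns encode an exact depth.
File.fnmatch?('*/go.sum', 'pkg/go.sum', File::FNM_PATHNAME)  # => true
File.fnmatch?('*/go.sum', 'a/b/go.sum', File::FNM_PATHNAME)  # => false
# Without the flag, '*' happily matches across directory separators:
File.fnmatch?('*/go.sum', 'a/b/go.sum')                      # => true
```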
@lma-git - I'm going to defer to the Gitaly experts here, as I'm unclear on the performance implications of this proposal.
@justintobler @knayakgl @pks-gitlab - I hate to ping you directly, but when it comes to performance implications, I'd rather ensure the correct set of eyes sees this.
Are there any concerns with fetching the entire list of file paths from a repository?
@lma-git @mjwood I don't think there should be too much of a performance concern with using the ListFiles Gitaly RPC. This RPC uses git-ls-tree(1) and is configured to recursively walk the provided tree listing all contents. This scales with the number of files in the repository, but not the repository history itself.
Is there a Gitaly method like ListFilesRequest that allows us to specify the maximum depth? We likely won't need to pull paths deeper than 2 nested directories.
The git-ls-tree(1) command accepts an optional set of path arguments, which are really a set of glob patterns to filter the results. If the list of glob patterns is known beforehand, instead of matching in Rails, the ListFiles RPC could be extended to accept an optional list of patterns. This could also be leveraged to limit the depth of results in general. I can raise an issue in Gitaly if this is something we want to add to the RPC.
So the pattern matching would be done within Ruby; however, is there perhaps already a Gitaly file search method that would be more efficient to use?
As previously mentioned, I think git-ls-tree(1) has what we need, but we would just need to extend the RPC to support it.
Thank you for that valuable information @justintobler!
I don't think there should be too much of a performance concern with using the ListFiles Gitaly RPC. This RPC uses git-ls-tree(1) and is configured to recursively walk the provided tree listing all contents.
Great! That's good to know. I figured it has okay performance even with a repository as large as GitLab's, since we use rules:exists quite a bit in our own pipeline config.
This can be leveraged to limit the depth of results in general as well. I can raise an issue in Gitaly if this is something we want to add to the RPC.
I think we'll go forward with doing the filtering in Ruby for now. If the performance becomes a problem later, then that might be when we should test if git-ls-tree is more performant, at which time we'll ping groupgitaly.
The git-ls-tree(1) command accepts an optional set of path arguments, which are really a set of glob patterns to filter the results. If the list of glob patterns is known beforehand, instead of matching in Rails, the ListFiles RPC could be extended to accept an optional list of patterns. This could also be leveraged to limit the depth of results in general. I can raise an issue in Gitaly if this is something we want to add to the RPC.
@justintobler I don't think patterns would be quite fitting in this context, mostly because git-ls-tree(1) would still have to recursively walk all trees. So while it would reduce the number of entries we return to the client, it doesn't help to avoid the underlying performance penalty in Git itself.
If we want to limit the tree depth I think we have no other choice but to upstream a new option into Git that limits the traversal depth.
Other than that I agree with what you are saying, thanks!
Leaminn Ma changed title from "Move Repository X-Ray functionality into GitLab Rails monolith" to "Reproduce Repository X-Ray functionality in GitLab Rails monolith"
@lma-git I was taking a look at X-Ray, and unless I'm missing something, it doesn't look like we sanitize the versions. For example, for a PHP Composer lock file, the dependency can look something like this
"symfony/console": "^5.4.11 || ^6.0.11 || ^7",
But here, it looks like we just create the Dependency struct directly with version: "^5.4.11 || ^6.0.11 || ^7". Is that the expected behaviour?
@emeraldjayde I think we should keep in mind that Repository X-Ray was originally built with a focus on velocity. It's likely that the current logic is missing optimizations or parsing considerations for the sake of delivery speed.
With this migration to the monolith, we can be more careful with how we parse the files. I've been using Repository X-Ray as a reference for the logic but not as an SSoT. For example, I've noticed that it doesn't fully parse out go.mod files that have multiple single-line or sectioned requirements; it only extracts the first block. I've modified the logic in !162898 (merged) so that it parses the entirety of the file.
So my advice is to use your best judgment for how we should parse the files. We don't need to replicate the X-Ray logic exactly. We can improve on it.
In this case, I'd recommend sanitizing the version so that it only has the semantic version as a value (e.g. 5.4.11). I'd also look into whether the composer file has a list of actual versions installed. ^5.4.11 || ^6.0.11 || ^7 looks like a requirement to me, not the actual version in use. Edit: Actually, we might keep it this way for now per discussion in #476177 (comment 2073897081).
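If we did go the sanitization route, a minimal sketch could look like the following (hypothetical `sanitize_version` helper; it naively takes the first version-looking token from the requirement string, whereas real constraint parsing is more involved):

```ruby
# Hypothetical sketch: pull the first semantic-version-looking token out of
# a requirement string. Constraint operators and alternatives are discarded.
def sanitize_version(requirement)
  requirement[/\d+(?:\.\d+){0,2}/]
end

sanitize_version('^5.4.11 || ^6.0.11 || ^7')  # => "5.4.11"
sanitize_version('~> 1.8')                    # => "1.8"
```

Note this drops information (the rest of the range), which is part of why keeping the raw requirement string is also defensible for now.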
@lma-git I'm not very familiar with Python or Java. When I was looking into dependency files for these languages, I noticed that many of the random repositories I checked didn't have lock files. As a result, I was parsing the dependency files. I think we can extend support for both types of files: parse the lock file if it exists, and otherwise fall back to the dependency file.
Hmm, the issue is that according to the poetry documentation, it's best practice to commit the lock file:
You should commit the poetry.lock file to your project repo so that all people working on the project are locked to the same versions of dependencies (more below).
Similar to Gemfile.lock, it's expected to be in the repository. So if we decide to extend support for both types of files, we'd also have to take into account repos that only have a Gemfile and no Gemfile.lock.
So every lock file type would need to have parsing logic for both the lock file and non-lock file. I'm not sure if we should go down this route.
@lma-git What if we just went with the lock files for the first iteration, and then we can add a new issue to go back and revisit this and add support for the non-lock scenario in the future? That way we cover the best practice use case, but we can also come back and handle the edge cases later.
@lma-git Is it fair to say that if we don't parse the non-lock file, we would be reducing functionality from the current X-Ray implementation? As in, would fewer projects have their dependencies added than when it was running as a CI job?
If so, then I'd like to know what the additional effort would be to make sure that we aren't reducing coverage from the previous version, even if that means parsing the dependency file if a lock file doesn't exist. Given the currently low usage of X-Ray, it might not be a problem to change the functionality, but I'd still like to know what the effort is and if there are any risks to going down this path.
If not, then I agree with @mnohr, we can just get the lock files and cover best practices in this iteration, but open a follow-up issue to look at the effort for adding the other use cases.
we would be reducing functionality from the current X-Ray implementation? As in, would fewer projects have their dependencies added than when it was running as a CI job?
Correct.
If so, then I'd like to know what the additional effort would be to make sure that we aren't reducing coverage from the previous version, even if that means parsing the dependency file if a lock file doesn't exist.
Fortunately, since we're still early in building out this feature, it's not too bad to pivot. I'll have to adjust the parser logic to prioritize lock files if available, and also refactor the existing code a bit.
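The prioritization could look something like this (hypothetical names throughout; the real mapping would need to cover every supported language):

```ruby
# Hypothetical sketch: prefer the lock file when present, otherwise
# fall back to the human-edited dependency file for that ecosystem.
LOCK_TO_FALLBACK = {
  'Gemfile.lock'  => 'Gemfile',
  'poetry.lock'   => 'pyproject.toml',
  'composer.lock' => 'composer.json'
}.freeze

def file_to_parse(repo_paths, lock_file)
  return lock_file if repo_paths.include?(lock_file)

  fallback = LOCK_TO_FALLBACK[lock_file]
  fallback if fallback && repo_paths.include?(fallback)
end

file_to_parse(['pyproject.toml', 'README.md'], 'poetry.lock') # => "pyproject.toml"
```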
The main concern we have with parsing non-lock files is the question of best practice and consistency. Normally, lock files specify the exact semantic version of the dependency installed. We then list the dependency version along with its name inside the code generation prompt like so: my_dependency (1.2.0). The premise is that the LLM should make code recommendations based on the specific version of the dependency.
However, non-lock files usually specify a range of versions/requirements, e.g.:
In Gemfile: gem 'doorkeeper-openid_connect', '~> 1.8', '>= 1.8.7'
In pyproject.toml: langchain-anthropic = "^0.1.15"
In PHP composer: "symfony/console": "^5.4.11 || ^6.0.11 || ^7"
Each dependency manager uses a different syntax for specifying these requirements. So we have to make a decision on what version we actually record and inject into the prompt.
Currently, X-Ray just parses out the entire range and considers it the "version". So in the PHP composer example above, it would be listed in the prompt as: symfony/console (^5.4.11 || ^6.0.11 || ^7).
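Concretely, the current behaviour amounts to something like the following (hypothetical `Dependency`/`format_for_prompt` names; X-Ray's actual struct is in Go):

```ruby
# Hypothetical sketch of how a parsed dependency ends up in the prompt.
# X-Ray currently treats the whole requirement string as the "version".
Dependency = Struct.new(:name, :version)

def format_for_prompt(dep)
  "#{dep.name} (#{dep.version})"
end

dep = Dependency.new('symfony/console', '^5.4.11 || ^6.0.11 || ^7')
format_for_prompt(dep) # => "symfony/console (^5.4.11 || ^6.0.11 || ^7)"
```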
We can try to locate a lock file for Python projects and fall back to a non-lock file if we don’t find one.
Although it’s recommended to commit lock files for Python projects, it doesn’t seem like many developers follow that advice. I couldn’t easily find projects with lock files. In the Ruby world, on the other hand, the situation appears to be the opposite. The Gemfile.lock file is usually present. I assume this difference reflects the practices within the Python and Ruby communities.
If we rely only on lock files for Python, many projects might be left without X-Ray support.
@lma-git Since that is the way that it currently works, I'm good with continuing to do that, unless there is an alternative that is better. Would just using the latest version listed be a better solution?
Since that is the way that it currently works, I'm good with continuing to do that, unless there is an alternative that is better. Would just using the latest version listed be a better solution?
I was thinking that the oldest version might be better in terms of code suggestions from the LLM. However, we haven't yet confirmed if the LLM really takes into account the version number anyway.
I was mostly considering whether other domains, like groupcomposition analysis's, may require a specific version number for each dependency. But I suppose we can cross that bridge later if they decide to hook into this new parser service.
So for now, I agree we can just continue with what Repository X-Ray currently does.
I was mostly considering whether other domains, like groupcomposition analysis's, may require a specific version number for each dependency. But I suppose we can cross that bridge later if they decide to hook into this new parser service.
@lma-git FYI, we do indeed look for the exact version; otherwise, this carries too high a risk of reporting false positives.
That's why most of our support is based on lockfiles; for the rest, we try to hook into the build system when there is just a manifest file. See the "How analyzers obtain dependency information" section in the documentation for details.
@lma-git that's perfectly fine indeed! Additionally, we would like to be more granular in how we support various features from the same source of data. The SBOM report feeds many different features; for instance, for the dependency list to work, it is not required to know the exact version. Knowing that a given component is a dependency of the project is already valuable information, and the declared requirement can give additional hints. However, in order to run a dependency scan on that component, we require the exact version.
Following this philosophy would provide a degraded experience when mandatory data is not available, but that's better than a binary "all or nothing" approach.
I imagine we set a max to avoid performance issues of searching infinitely deep (which might cause n+1 problems?), but just curious, why 2?
We are setting the limit to 2 for two reasons:
groupcomposition analysis's Dependency Scanning (DS) tool also has this limit per docs. As we are potentially moving towards using DS features in the future, it makes sense to have the same limit.
More importantly, both DS and this ConfigFileParser have the same root performance issue related to the use of Ruby's File.fnmatch, which we use in the base config file class here.
In the case of DS, it relies on the CI config keyword rules:exists which also uses File.fnmatch and is limited to 10,000 glob comparisons. This is because glob pattern matching is a relatively "expensive" operation when executed many times. As we increase the number of config files we support, the number of times File.fnmatch runs will grow quickly. If we limit it to a depth of 2 directories, it will help curb the amount of time it takes to search through all the files.
It's also unlikely that a dependency config file would be deeper than 2 directories, so this should have minimal impact on the feature's coverage.
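To make the growth concrete (illustrative numbers only, not measurements): the comparison count is roughly files × patterns, and each supported config file type contributes one pattern per allowed depth.

```ruby
# Illustrative arithmetic only: worst-case glob comparisons scale with
# (number of files) x (patterns per config type) x (config types).
files        = 10_000
depths       = 3   # root + 2 nested levels => 3 patterns per config type
config_types = 20  # hypothetical number of supported lock/config files

files * depths * config_types # => 600_000 File.fnmatch calls
```

Capping the depth keeps the `depths` factor fixed at 3, so the total grows only with the repository size and the number of supported config file types.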