When using the new version of the license scanning feature, the licenses for some dependent packages are often marked as unknown.
Steps to reproduce
Add dependencies to the pom file of the Maven project, such as com.github.jsqlparser/jsqlparser and org.jvnet.staxex/stax-ex.
Include the dependency scanning template in the project pipeline.
After the pipeline has completed its run, review the license information in the security center.
Cause
Reviewing the logic, it appears that the process involves generating an SBOM (Software Bill of Materials) file through dependency scanning. Subsequently, GitLab retrieves the corresponding software package license information based on this file.
After reviewing the documentation and pulling the database from GCP (prod-export-license-bucket-1a6c642fc4de57d4) using the gsutil tool, it was discovered that the "unknown" status occurs because the license information for these dependencies is not fully marked in the database. For example, the default license marked for com.github.jsqlparser/jsqlparser is unknown and Apache 2.0.
Delving further, we found that if the URL in the <license><url>...</url></license> section of a package's pom is not in the interfacer's URL cache, that license is recorded as unknown.
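The cache lookup described above can be sketched as follows. The pom structure is standard Maven metadata, but the cache contents, type names, and classify function are hypothetical stand-ins for the interfacer's internals:

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Minimal model of a pom's <licenses> section.
type pomLicense struct {
	Name string `xml:"name"`
	URL  string `xml:"url"`
}
type pomProject struct {
	Licenses []pomLicense `xml:"licenses>license"`
}

// urlCache stands in for the interfacer's internal license URL cache.
// Note that the LGPL-2.1 URL is deliberately absent, reproducing the bug.
var urlCache = map[string]string{
	"http://www.apache.org/licenses/LICENSE-2.0.txt": "Apache-2.0",
}

// classify maps each license URL in the pom to an SPDX identifier,
// falling back to "unknown" on a cache miss.
func classify(pomXML []byte) []string {
	var p pomProject
	if err := xml.Unmarshal(pomXML, &p); err != nil {
		return nil
	}
	var out []string
	for _, lic := range p.Licenses {
		if id, ok := urlCache[lic.URL]; ok {
			out = append(out, id)
		} else {
			out = append(out, "unknown")
		}
	}
	return out
}

func main() {
	pom := []byte(`<project>
  <licenses>
    <license><name>Apache 2.0</name><url>http://www.apache.org/licenses/LICENSE-2.0.txt</url></license>
    <license><name>LGPL 2.1</name><url>http://www.gnu.org/licenses/lgpl-2.1.html</url></license>
  </licenses>
</project>`)
	fmt.Println(classify(pom))
}
```

With the LGPL URL absent from the cache, the second license resolves to unknown even though the pom names it explicitly, which matches the observed behavior.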
Possible fixes
Update the interfacer's maven package to classify licenses by URL lookup.
Because packages affected by this fix will continue to have unknown licenses until the package is updated upstream, a one-time migration is necessary.
Implementation plan
1. Update classification
Update maven license identification in the license-interfacer.
Update license-interfacer to add a cmd (i.e. classify_maven_packages.go) which uses the interfacer to process a list of packages supplied via CLI argument.
Takes a list of CLI arguments similar to the interfacer command, plus an extra comma-delimited pom_urls argument.
Create an Action (e.g. interfacer) which uses pom_urls to instantiate a list of data.Package instances with data.Package.Package pointing to a pom url.
Invoke maven.Handle for each data.Package above to initiate processing.
Update deployment project with a script and job that can invoke the interfacer command above.
Add a run_maven_interfacer.sh script, using run_feeder.sh as an example, with a MAVEN_POM_URLS environment variable to specify the pom_urls above.
Add a run maven interfacer job, similar to run feeder, with a rule to only execute the script above if a MAVEN_POM_URLS variable is set.
Run the job above via the deployment project's Build > Pipeline > Run Pipeline, supplying the MAVEN_POM_URLS argument.
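The command in step 1 above could look roughly like this; all names besides pom_urls are hypothetical, and data.Package and maven.Handle are internal to the interfacer, so minimal stand-ins are defined here to keep the sketch self-contained:

```go
// Hypothetical skeleton for cmd/classify_maven_packages.go.
package main

import (
	"flag"
	"fmt"
	"strings"
)

// dataPackage stands in for the interfacer's data.Package type.
type dataPackage struct{ Package string }

// handle stands in for maven.Handle, which performs the actual classification.
func handle(p dataPackage) { fmt.Println("processing", p.Package) }

// parsePomURLs splits the comma-delimited pom_urls argument, dropping
// empty entries and surrounding whitespace.
func parsePomURLs(s string) []string {
	var urls []string
	for _, u := range strings.Split(s, ",") {
		if u = strings.TrimSpace(u); u != "" {
			urls = append(urls, u)
		}
	}
	return urls
}

func main() {
	pomURLs := flag.String("pom_urls", "", "comma-delimited list of pom URLs to (re)classify")
	flag.Parse()
	for _, u := range parsePomURLs(*pomURLs) {
		handle(dataPackage{Package: u}) // one data.Package per pom URL
	}
}
```

The deployment job would then only need to pass the contents of MAVEN_POM_URLS through to -pom_urls.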
This issue was automatically tagged with the label groupcomposition analysis by TanukiStan, a machine learning classification model, with a probability of 0.97.
If this label is incorrect, please tag this issue with the correct group label as well as automation:ml wrong to help TanukiStan learn from its mistakes.
To set expectations, GitLab product managers or team members can't promise that they will proceed with this. However, we believe everyone can contribute, and we welcome you to work on this proposed change, feature, or bug fix. There is a bias for action, so you don't need to wait: try and spin up that merge request yourself. If you need help doing so, we're always open to mentoring you to drive this change.
Taking a look at the SPDX list, there's no exact match of EDL-1.0 or LGPL-2.1. So per the license scanning doc, they are reported as unknown. Is that correct? And if that's the case the way to contribute to the SPDX license list would need to follow the relevant guide then.
Sorry I had overlooked the entry for LGPL-2.1 at SPDX list and it IS in the list. Now I am wondering how the license DB is built and why com.github.jsqlparser/jsqlparser(4.7) does not show the proper licenses? @brytannia
@xiaogang_cn in the past we had situations when a license wasn't processed because there was no alias set up. But for LGPL-2.1 it looks like we have one. For the moment, I don't have any other ideas why this can happen.
@xiaogang_cn FYI the extracts from the license db export you've shown above are from the V1 dataset using CSV files that we have replaced since gitlab 16.3 with the V2 dataset using JSON files (NDJson exactly).
@dcroft, I'm not unless we revisit the triage for this issue as we have 20+ P3 bugs in the backlog competing in priority with this one.
I can see that a community contributor is discussing the problem with one of our engineers, so maybe they will end up contributing a fix before we get to it.
Is there anything in particular that drew your attention to this issue, Daniel?
Yes, both @zjkyz8 and @leotusss are willing to contribute to this. They need to understand how the license DB is built and the process to contribute to it.
@xiaogang_cn here is our development guide for license-db, and this is the document about the design of this component. Is this something that you were searching for?
There is not much to refine here as this bug needs to be investigated to understand the root cause. I'm timeboxing this to a day for now and we'll revisit then.
@xiaogang_cn I am not familiar with the implementation of the license classification and will have to defer to the engineer that picks this issue up to figure out why it is being classified as both Apache-2.0 and unknown.
That said, as both licenses are returned, the UI should show the dependency under both Apache-2.0 and Unknown for these versions.
@gonzoyumo you marked this issue as ready for development, but reading the description it's not clear for me what should be done. Do you have an implementation plan in mind?
@thiagocsf @gonzoyumo this issue is needed for JiHu to make progress on their go-to-market licensing. Could we have clarity on who will do this and when it would be completed? Thank you.
Refining this, I found that the license for some of the affected packages is missed by the license-interfacer. The interfacer tries to find the license in two ways. The first queries the internal license cache using the license.url field in the library's pom.xml. The second tries classifying the text in the license.comments field.
As an example, the pom.xml for com.github.jsqlparser/jsqlparser:4.8.0 has Apache 2.0 and LGPL 2.1 under the licenses entry. But the url for the latter (http://www.gnu.org/licenses/lgpl-2.1.html) is not present in the internal license cache.
The logs record these misses via unknown license detected events, which can be seen in this logs explorer screenshot. The events correspond to 373 unique entries: unique_licenses.txt (not all are valid).
To resolve this bug, one option available is to use the log entries above to add all the valid missed licenses to the internal license cache. This would increase accuracy as a one-shot fix, but would require us to operationalize the update so it can be done on a regular basis.
Another option is to switch to an automatic step by using URL classification to query the license url and classify the contents on a url cache miss. A downside to using this approach is an increase in complexity, processing latency, and a potential issue with rate limiting.
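A minimal sketch of that second option, assuming an injected fetch function so the network call (and its timeout and rate-limit concerns) stays swappable. The text classifier here is a toy keyword match, not the interfacer's real one, and all names are hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// urlCache stands in for the interfacer's internal license URL cache.
var urlCache = map[string]string{
	"http://www.apache.org/licenses/LICENSE-2.0.txt": "Apache-2.0",
}

// classifyText is a toy stand-in for a real license text classifier.
func classifyText(text string) string {
	if strings.Contains(text, "GNU LESSER GENERAL PUBLIC LICENSE") {
		return "LGPL-2.1"
	}
	return "unknown"
}

// resolve returns the license for a URL: cache hit on the fast path,
// fetch-and-classify on a cache miss. Taking fetch as a parameter keeps
// the network dependency (and its failure modes) out of this function.
func resolve(url string, fetch func(string) (string, error)) string {
	if id, ok := urlCache[url]; ok {
		return id // fast path: cache hit
	}
	body, err := fetch(url) // slow path: network fetch on cache miss
	if err != nil {
		return "unknown"
	}
	return classifyText(body)
}

func main() {
	stub := func(string) (string, error) {
		return "GNU LESSER GENERAL PUBLIC LICENSE Version 2.1", nil
	}
	fmt.Println(resolve("http://www.gnu.org/licenses/lgpl-2.1.html", stub)) // prints "LGPL-2.1"
}
```

In production the fetch function would wrap an HTTP client with a timeout and backoff, which is where the latency and rate-limiting downsides mentioned above come in.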
@ifrenkel NuGet URL license resolution was something that was delivered by VR as part of the License DB MVC because license URLs are the most common way of communicating licenses for NuGet packages. I made some amendments that allowed it to classify more licenses when I discovered some limitations with it. I think it might be beneficial to add as a fallback for other package registries and may have the side-effect of improving NuGet license classification if it's being exercised by more heavily used registries.
@ifrenkel Thanks for all the information. The data extracted from the logs looks awesome. The implementation plan sounds solid. I trust Philip's opinion on this one since he is the most experienced with the interfacer code base.
This groupcomposition analysis bug has at most 50% of the SLO duration remaining and is an SLO Near Miss breach. Please consider taking action before this becomes an SLO Missed in 27 days (2024-02-10).
@philipcunningham @nilieskou could you take a look at the implementation plan for this issue, please? Specifically the migration to update existing packages. This is looking like a large amount of work and I'm not certain that it's worth doing just for maven. Do you know what we've done in the past in cases where we needed to go back and update existing licenses?
Another option is to create a SQL migration with the necessary data, but it seems a bit cumbersome to make this one-off update part of the codebase. WDYT?
@ifrenkel Thanks for refining this issue. I am not very familiar with the interfacer part of the license-db but let me write down some questions
Indeed, let's not make an SQL migration. A migration should mainly be about database schema changes. However, if needed, we can just write a script that we can run on the database without storing it in the Schema project.
Finding the affected packages from the logs sounds a bit painful. I am not sure how well it works.
Did you try it? Can you perhaps give us a snapshot of the results?
Aren't we worried that we do this only for the last 90 days? And why 90 days? Is it because that is the log retention period?
If I understand correctly you want to extract from the logs packages, versions and pom urls. Then you want to extend the interfacer with a new command which will take the pom urls and fix all those packages that have an unknown version. So this command will be fixing the problem that we have. Is this the idea? And if yes, will we need to run this again in the future? Probably not according to this.
My apologies for answering with more questions Igor.
@ifrenkel I haven't done much of the thinking required to understand updating existing license classifications, so I also have a few questions:
When the Rails application encounters a package with unknown and Apache 2.0, does it treat the overall license classification as unknown?
Is the unknown license classification written to the database used by Rails? If so, if that package was updated to have MIT and Apache 2.0 in the License DB, would that change carry over to Rails?
Under Update already classified packages, would deleting the cursor and re-running the Feeder from scratch have the same effect? If so, would it be simpler to periodically delete/ignore cursors? Please see discussion in this thread.
{"insertId":"657fc8bc0004ca08fc257e94","jsonPayload":{"package_registry":"maven","version":"1.1.18","message":"unknown license detected","license":"GNU Affero General Public License v3.0","time":1702873276,"package":"top.focess/focess-util"...}
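For illustration, the fields a migration would need (package, version, license) can be pulled out of such an entry like this. The struct is a sketch covering only the fields shown above, and the sample entry is closed using just those fields:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// logEntry mirrors the fields of interest in the sample entry above.
type logEntry struct {
	JSONPayload struct {
		PackageRegistry string `json:"package_registry"`
		Package         string `json:"package"`
		Version         string `json:"version"`
		License         string `json:"license"`
		Message         string `json:"message"`
	} `json:"jsonPayload"`
}

// parseEntry decodes a single exported log line.
func parseEntry(raw []byte) (logEntry, error) {
	var e logEntry
	err := json.Unmarshal(raw, &e)
	return e, err
}

func main() {
	raw := []byte(`{"insertId":"657fc8bc0004ca08fc257e94","jsonPayload":{"package_registry":"maven","version":"1.1.18","message":"unknown license detected","license":"GNU Affero General Public License v3.0","time":1702873276,"package":"top.focess/focess-util"}}`)
	e, err := parseEntry(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s %s@%s: %s\n", e.JSONPayload.PackageRegistry,
		e.JSONPayload.Package, e.JSONPayload.Version, e.JSONPayload.License)
}
```

Running this over the exported entries would yield the package/version/URL list the new interfacer command needs as input.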
90 days was an arbitrary number, but there is a limit to the number of entries you can export easily. After that, you have to build something to consume the rest.
If I understand correctly you want to extract from the logs packages, versions and pom urls. Then you want to extend the interfacer with a new command which will take the pom urls and fix all those packages that have an unknown version. So this command will be fixing the problem that we have. Is this the idea? And if yes, will we need to run this again in the future? Probably not according to this.
That's right @nilieskou, step 1 of the implementation plan fixes classification going forward. Step 2 fixes already (incorrectly) classified entries.
When the Rails application encounters a package with unknown and Apache 2.0, does it treat the overall license classification as unknown?
Is the unknown license classification written to the database used by Rails? If so, if that package was updated to have MIT and Apache 2.0 in the License DB, would that change carry over to Rails?
Under Update already classified packages, would deleting the cursor and re-running the Feeder from scratch have the same effect? If so, would it be simpler to periodically delete/ignore cursors? Please see discussion in this thread.
It's going to show unknown and Apache 2.0 as far as I recall, but I can check further if you recall some edge cases.
The change would carry over, because we will re-export all records for a package once it's updated.
I thought that the lastUpdated would be changed on all of them. If that's not the case then an export all for maven would definitely work better. I will double check this. Thanks!
I thought that the lastUpdated would be changed on all of them. If that's not the case then an export all for maven would definitely work better. I will double check this. Thanks!
@philipcunningham have you been able to check whether this is true? From the code I can see that all packages emitted by the feeder and then interfacer are updated in the database by the processor, but I did this by examining the individual components. Do you have a tip as to how I can check this "live" by running the whole system? Do you normally use dev for these questions?
@philipcunningham if lastUpdated changed then the exporter would re-export the entire dataset into the gcp bucket. The GitLab instance sync is idempotent so there would be no functional change, but all that data would have to be re-processed. And dedicated instances would have their package_metadata/licenses/v2/maven directory double on disk.
Are we missing a trick by extending URL classification to Maven only? I wonder if we could improve classification across other package registries by moving classification to the License Processor. Perhaps in a follow-up or as part of the precision/recall improvement epic.
Ultimate Customer here, adding a comment based on our issue with license scanning for the com.vaadin.addon/vaadin-touchkit-agpl maven plugin, which returns the license as unknown instead of the expected AGPL 3.0.
We're interested in a solution and here's some additional data for this request:
In your organization, what do you consider to be the highest priority for this feature proposal? On a scale of 1-10, where 1 is the lowest priority and 10 is the highest. --> 7
Why are you interested in this feature? --> it blocks our Customer from handling merge requests based on the applied policy for the AGPL license
What is the problem you are trying to solve? --> to get the license of the com.vaadin.addon/vaadin-touchkit-agpl plugin resolved correctly
Do you have any workarounds? --> no
What is the impact to your organization of not having this feature? --> it prevents our Customer from performing DevOps tasks in the described way
We have an MR (internal link) open but are testing whether a change in classification will create a cascade of data updates for a significant subset of maven packages.
The change (internal link) to maven classification has been merged and tested on the staging server (dev). It has identified the licenses of significantly more dependencies (including many of the ones in the original report).
We will wait for Monday (2024-02-05) to push the change to prod.
@xiaogang_cn we had to do an infrastructure update to account for regenerating updates for older packages. We have been testing this. The push to prod is scheduled to happen today. I will post an update.
Today we promoted the maven classification change to the production environment. The full update ran successfully and we expect data to start making it to GitLab instances once the scheduled export updates the gcp bucket.
This change was delayed because a test run (last week) on the staging environment (dev) revealed an issue with messages on the license-processor topic getting dropped due to the subscriber retention policy and a large backlog of new messages. This was a full update, so the number of new messages on the topic was expected. The unexpected part was that some messages were getting dropped.
The resolution for this was to increase the retention in the gcp pub/sub subscriber configuration from 12 hours to 3 days. Re-running the test in staging showed no dropped messages.
The following errors were present in the logs (most of them new since the interfacer change). Some are to be expected since we're calling out to the URL specified in pom.xml and there are many malformed or misconfigured entries, but some need to be investigated further (e.g. i/o timeout, connection refused) to ensure that these network errors aren't caused by the interfacer itself through rate limiting, etc.
num records examined: 236405
num 500s: 118223
num errors: 118182
num network errs: 118180 (pctg of total: 99.99830769491123)
  unexpected HTTP response code: 86169
  connection refused: 8980
  unsupported protocol scheme: 7945
  no such host: 5602
  invalid request format: parse: 1873
  i/o timeout: 1557
  x509: certificate is valid for: 1462
  context canceled: 1203
  x509: certificate signed by unknown authority: 730
  server misbehaving: 630
  http: no Host in request: 325
  lame referral: 311
  x509: certificate has expired: 311
  tls: no renegotiation: 264
  read: connection reset by peer: 188
  EOF: 154
  no route to host: 122
  read: connection timed out: 119
  context deadline exceeded: 116
  tls: internal error: 62
  tls: handshake failure: 37
  stopped after \d+ redirects: 14
  connect: network is unreachable: 6
  unknown: 2 ["dead letter failure", "dead letter failure"]
Note: these errors are coming from the maven feeder being run with ignore_cursor (so this is going over all the available maven dataset).
I tracked the unknowns before and after the new interfacer release and we can see that the total number of package licenses increased by ~10% while the ratio of unknowns was reduced by ~1%:
with a as (select count(*) c from maven_license where license_ids = '{1}'),
     b as (select count(*) c from maven_license where license_ids != '{1}'),
     c as (select count(*) c from maven_license where 1 = any(license_ids))
select a.c as all_unknowns,
       b.c as no_unknowns,
       c.c as some_unknowns,
       round(cast(a.c / b.c::float * 100 as numeric), 2) as pct_unknown,
       round(cast(c.c / b.c::float * 100 as numeric), 2) as pct_some_unknown
from a, b, c;