When using the new version of the license scanning feature, the licenses for some dependent packages are often marked as unknown.
Steps to reproduce
Add dependencies to the pom file of the Maven project, such as com.github.jsqlparser/jsqlparser and org.jvnet.staxex/stax-ex.
Include the dependency scanning template in the project pipeline.
After the pipeline has completed its run, review the license information in the security center.
Cause
Reviewing the logic, it appears that the process involves generating an SBOM (Software Bill of Materials) file through dependency scanning. Subsequently, GitLab retrieves the corresponding software package license information based on this file.
After reviewing the documentation and pulling the database from GCP (prod-export-license-bucket-1a6c642fc4de57d4) using the gsutil tool, it was discovered that the "unknown" status occurs because the license information for these dependencies is not fully marked in the database. For example, the default license marked for com.github.jsqlparser/jsqlparser is unknown and Apache 2.0.
Delving further, we found that if the URL in the <license><url>...</url></license> section of a package's pom is not in the interfacer's URL cache, that license is recorded as unknown.
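The cache lookup described above can be sketched as follows. The pom structure is standard Maven metadata, but the cache contents, type names, and classify function are hypothetical stand-ins for the interfacer's internals:

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Minimal model of a pom's <licenses> section.
type pomLicense struct {
	Name string `xml:"name"`
	URL  string `xml:"url"`
}
type pomProject struct {
	Licenses []pomLicense `xml:"licenses>license"`
}

// urlCache stands in for the interfacer's internal license URL cache.
// Note that the LGPL-2.1 URL is deliberately absent, reproducing the bug.
var urlCache = map[string]string{
	"http://www.apache.org/licenses/LICENSE-2.0.txt": "Apache-2.0",
}

// classify maps each license URL in the pom to an SPDX identifier,
// falling back to "unknown" on a cache miss.
func classify(pomXML []byte) []string {
	var p pomProject
	if err := xml.Unmarshal(pomXML, &p); err != nil {
		return nil
	}
	var out []string
	for _, lic := range p.Licenses {
		if id, ok := urlCache[lic.URL]; ok {
			out = append(out, id)
		} else {
			out = append(out, "unknown")
		}
	}
	return out
}

func main() {
	pom := []byte(`<project>
  <licenses>
    <license><name>Apache 2.0</name><url>http://www.apache.org/licenses/LICENSE-2.0.txt</url></license>
    <license><name>LGPL 2.1</name><url>http://www.gnu.org/licenses/lgpl-2.1.html</url></license>
  </licenses>
</project>`)
	fmt.Println(classify(pom))
}
```

With the LGPL URL absent from the cache, the second license resolves to unknown even though the pom names it explicitly, which matches the observed behavior.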
Possible fixes
Update the interfacer's maven package to classify licenses by URL lookup.
Because packages affected by this fix will continue to have unknown licenses until the package is updated upstream, a one-time migration is necessary.
Implementation plan
1. Update classification
Update maven license identification in the license-interfacer.
Update license-interfacer to add a cmd (i.e. classify_maven_packages.go) which uses the interfacer to process a list of packages supplied via CLI argument.
Takes a list of CLI arguments similar to the interfacer command, plus an extra comma-delimited pom_urls argument.
Create an Action (e.g. interfacer) which uses pom_urls to instantiate a list of data.Package instances with data.Package.Package pointing to a pom url.
Invoke maven.Handle for each data.Package above to initiate processing.
Update deployment project with a script and job that can invoke the interfacer command above.
Add a run_maven_interfacer.sh script, using run_feeder.sh as an example, with a MAVEN_POM_URLS environment variable to specify the pom_urls above.
Add a run maven interfacer job, similar to run feeder, with a rule to only execute the script above if a MAVEN_POM_URLS variable is set.
Run the job above via the deployment project's Build > Pipeline > Run Pipeline, supplying the MAVEN_POM_URLS argument.
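The command in step 1 above could look roughly like this; all names besides pom_urls are hypothetical, and data.Package and maven.Handle are internal to the interfacer, so minimal stand-ins are defined here to keep the sketch self-contained:

```go
// Hypothetical skeleton for cmd/classify_maven_packages.go.
package main

import (
	"flag"
	"fmt"
	"strings"
)

// dataPackage stands in for the interfacer's data.Package type.
type dataPackage struct{ Package string }

// handle stands in for maven.Handle, which performs the actual classification.
func handle(p dataPackage) { fmt.Println("processing", p.Package) }

// parsePomURLs splits the comma-delimited pom_urls argument, dropping
// empty entries and surrounding whitespace.
func parsePomURLs(s string) []string {
	var urls []string
	for _, u := range strings.Split(s, ",") {
		if u = strings.TrimSpace(u); u != "" {
			urls = append(urls, u)
		}
	}
	return urls
}

func main() {
	pomURLs := flag.String("pom_urls", "", "comma-delimited list of pom URLs to (re)classify")
	flag.Parse()
	for _, u := range parsePomURLs(*pomURLs) {
		handle(dataPackage{Package: u}) // one data.Package per pom URL
	}
}
```

The deployment job would then only need to pass the contents of MAVEN_POM_URLS through to -pom_urls.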
This issue was automatically tagged with the label groupcomposition analysis by TanukiStan, a machine learning classification model, with a probability of 0.97.
If this label is incorrect, please tag this issue with the correct group label as well as automation:ml wrong to help TanukiStan learn from its mistakes.
To set expectations, GitLab product managers or team members can't promise that they will proceed with this. However, we believe everyone can contribute, and we welcome you to work on this proposed change, feature, or bug fix. There is a bias for action, so you don't need to wait: try and spin up that merge request yourself. If you need help doing so, we're always open to mentoring you to drive this change.
Taking a look at the SPDX list, there's no exact match of EDL-1.0 or LGPL-2.1. So per the license scanning doc, they are reported as unknown. Is that correct? And if that's the case the way to contribute to the SPDX license list would need to follow the relevant guide then.
Sorry I had overlooked the entry for LGPL-2.1 at SPDX list and it IS in the list. Now I am wondering how the license DB is built and why com.github.jsqlparser/jsqlparser(4.7) does not show the proper licenses? @brytannia
@xiaogang_cn in the past we had situations when a license wasn't processed because there was no alias set up. But for LGPL-2.1 it looks like we have one. For the moment, I don't have any other ideas why this can happen.
@xiaogang_cn FYI the extracts from the license db export you've shown above are from the V1 dataset using CSV files that we have replaced since gitlab 16.3 with the V2 dataset using JSON files (NDJson exactly).
@dcroft, I'm not unless we revisit the triage for this issue as we have 20+ P3 bugs in the backlog competing in priority with this one.
I can see that a community contributor is discussing the problem with one of our engineers, so maybe they will end up contributing a fix before we get to it.
Is there anything in particular that drew your attention to this issue, Daniel?
Yes, both @zjkyz8 and @leotusss are willing to contribute to this. They need to understand how the license DB is built and the process to contribute to it.
@xiaogang_cn here is our development guide for license-db, and this is the document about the design of this component. Is this something that you were searching for?
There is not much to refine here as this bug needs to be investigated to understand the root cause. I'm timeboxing this to a day for now and we'll revisit then.
@xiaogang_cn I am not familiar with the implementation of the license classification and will have to defer to the engineer that picks this issue up to figure out why it is being classified as both Apache-2.0 and unknown.
That said, as both licenses are returned, the UI should show the dependency under both Apache-2.0 and Unknown for these versions.
@gonzoyumo you marked this issue as ready for development, but reading the description it's not clear for me what should be done. Do you have an implementation plan in mind?
@thiagocsf @gonzoyumo this issue is needed for JiHu to make progress on their go-to-market licensing. Could we have clarity on who will do this and when it would be completed? Thank you.
Refining this, I found that the license for some of the affected packages is missed by the license-interfacer. The interfacer tries to find the license in two ways. The first queries the internal license cache using the license.url field in the library's pom.xml. The second tries classifying the text in the license.comments field.
As an example, the pom.xml for com.github.jsqlparser/jsqlparser:4.8.0 has Apache 2.0 and LGPL 2.1 under the licenses entry. But the url for the latter (http://www.gnu.org/licenses/lgpl-2.1.html) is not present in the internal license cache.
The logs record these misses via unknown license detected events, which can be seen in this logs explorer screenshot. The events correspond to 373 unique entries: unique_licenses.txt (not all are valid).
To resolve this bug, one option available is to use the log entries above to add all the valid missed licenses to the internal license cache. This would increase accuracy as a one-shot fix, but would require us to operationalize the update so it can be done on a regular basis.
Another option is to switch to an automatic step by using URL classification to query the license url and classify the contents on a url cache miss. A downside to using this approach is an increase in complexity, processing latency, and a potential issue with rate limiting.
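A minimal sketch of that second option, assuming an injected fetch function so the network call (and its timeout and rate-limit concerns) stays swappable. The text classifier here is a toy keyword match, not the interfacer's real one, and all names are hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// urlCache stands in for the interfacer's internal license URL cache.
var urlCache = map[string]string{
	"http://www.apache.org/licenses/LICENSE-2.0.txt": "Apache-2.0",
}

// classifyText is a toy stand-in for a real license text classifier.
func classifyText(text string) string {
	if strings.Contains(text, "GNU LESSER GENERAL PUBLIC LICENSE") {
		return "LGPL-2.1"
	}
	return "unknown"
}

// resolve returns the license for a URL: cache hit on the fast path,
// fetch-and-classify on a cache miss. Taking fetch as a parameter keeps
// the network dependency (and its failure modes) out of this function.
func resolve(url string, fetch func(string) (string, error)) string {
	if id, ok := urlCache[url]; ok {
		return id // fast path: cache hit
	}
	body, err := fetch(url) // slow path: network fetch on cache miss
	if err != nil {
		return "unknown"
	}
	return classifyText(body)
}

func main() {
	stub := func(string) (string, error) {
		return "GNU LESSER GENERAL PUBLIC LICENSE Version 2.1", nil
	}
	fmt.Println(resolve("http://www.gnu.org/licenses/lgpl-2.1.html", stub)) // prints "LGPL-2.1"
}
```

In production the fetch function would wrap an HTTP client with a timeout and backoff, which is where the latency and rate-limiting downsides mentioned above come in.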
@ifrenkel NuGet URL license resolution was something that was delivered by VR as part of the License DB MVC because license URLs are the most common way of communicating licenses for NuGet packages. I made some amendments that allowed it to classify more licenses when I discovered some limitations with it. I think it might be beneficial to add as a fallback for other package registries and may have the side-effect of improving NuGet license classification if it's being exercised by more heavily used registries.
@ifrenkel Thanks for all the information. The data extracted from the logs looks awesome. The implementation plan sounds solid. I trust Philip's opinion on this one since he is the most experienced with the interfacer code base.
This groupcomposition analysis bug has at most 50% of the SLO duration remaining and is an SLO Near Miss breach. Please consider taking action before this becomes an SLO Missed in 27 days (2024-02-10).
@philipcunningham @nilieskou could you take a look at the implementation plan for this issue, please? Specifically the migration to update existing packages. This is looking like a large amount of work and I'm not certain that it's worth doing just for maven. Do you know what we've done in the past in cases where we needed to go back and update existing licenses?
Another option is to create a SQL migration with the necessary data, but it seems a bit cumbersome to make this one-off update part of the codebase. WDYT?
@ifrenkel Thanks for refining this issue. I am not very familiar with the interfacer part of the license-db but let me write down some questions
Indeed, let's not make an SQL migration. A migration should mainly be about database schema changes. However, if needed, we can just write a script that we can run on the database without storing it in the Schema project.
Finding the affected packages from the logs sounds a bit painful. I am not sure how well it works.
Did you try it? Can you perhaps give us a snapshot of the results?
Aren't we worried that we do this only for the last 90 days? And why 90 days? Is it because that is the log retention period?
If I understand correctly you want to extract from the logs packages, versions and pom urls. Then you want to extend the interfacer with a new command which will take the pom urls and fix all those packages that have an unknown version. So this command will be fixing the problem that we have. Is this the idea? And if yes, will we need to run this again in the future? Probably not according to this.
My apologies for answering with more questions Igor.
@ifrenkel I haven't done much of the thinking required to understand updating existing license classifications, so I also have a few questions:
When the Rails application encounters a package with unknown and Apache 2.0, does it treat the overall license classification as unknown?
Is the unknown license classification written to the database used by Rails? If so, if that package was updated to have MIT and Apache 2.0 in the License DB, would that change carry over to Rails?
Under Update already classified packages, would deleting the cursor and re-running the Feeder from scratch have the same effect? If so, would it be simpler to periodically delete/ignore cursors? Please see discussion in this thread.
{"insertId":"657fc8bc0004ca08fc257e94","jsonPayload":{"package_registry":"maven","version":"1.1.18","message":"unknown license detected","license":"GNU Affero General Public License v3.0","time":1702873276,"package":"top.focess/focess-util"...}
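For illustration, the fields a migration would need (package, version, license) can be pulled out of such an entry like this. The struct is a sketch covering only the fields shown above, and the sample entry is closed using just those fields:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// logEntry mirrors the fields of interest in the sample entry above.
type logEntry struct {
	JSONPayload struct {
		PackageRegistry string `json:"package_registry"`
		Package         string `json:"package"`
		Version         string `json:"version"`
		License         string `json:"license"`
		Message         string `json:"message"`
	} `json:"jsonPayload"`
}

// parseEntry decodes a single exported log line.
func parseEntry(raw []byte) (logEntry, error) {
	var e logEntry
	err := json.Unmarshal(raw, &e)
	return e, err
}

func main() {
	raw := []byte(`{"insertId":"657fc8bc0004ca08fc257e94","jsonPayload":{"package_registry":"maven","version":"1.1.18","message":"unknown license detected","license":"GNU Affero General Public License v3.0","time":1702873276,"package":"top.focess/focess-util"}}`)
	e, err := parseEntry(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s %s@%s: %s\n", e.JSONPayload.PackageRegistry,
		e.JSONPayload.Package, e.JSONPayload.Version, e.JSONPayload.License)
}
```

Running this over the exported entries would yield the package/version/URL list the new interfacer command needs as input.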
90 days was an arbitrary number, but there is a limit to the number of entries you can export easily. After that, you have to build something to consume the rest.
If I understand correctly you want to extract from the logs packages, versions and pom urls. Then you want to extend the interfacer with a new command which will take the pom urls and fix all those packages that have an unknown version. So this command will be fixing the problem that we have. Is this the idea? And if yes, will we need to run this again in the future? Probably not according to this.
That's right @nilieskou, step 1 of the implementation plan fixes classification going forward. Step 2 fixes already (incorrectly) classified entries.
When the Rails application encounters a package with unknown and Apache 2.0, does it treat the overall license classification as unknown?
Is the unknown license classification written to the database used by Rails? If so, if that package was updated to have MIT and Apache 2.0 in the License DB, would that change carry over to Rails?
Under Update already classified packages, would deleting the cursor and re-running the Feeder from scratch have the same effect? If so, would it be simpler to periodically delete/ignore cursors? Please see discussion in this thread.
It's going to show unknown and Apache 2.0 as far as I recall, but I can check further if you recall some edge cases.
The change would carry over, because we will re-export all records for a package once it's updated.
I thought that the lastUpdated would be changed on all of them. If that's not the case then an export all for maven would definitely work better. I will double check this. Thanks!
I thought that the lastUpdated would be changed on all of them. If that's not the case then an export all for maven would definitely work better. I will double check this. Thanks!
@philipcunningham have you been able to check whether this is true? From the code I can see that all packages emitted by the feeder and then interfacer are updated in the database by the processor, but I did this by examining the individual components. Do you have a tip as to how I can check this "live" by running the whole system? Do you normally use dev for these questions?
@philipcunningham if lastUpdated changed then the exporter would re-export the entire dataset into the gcp bucket. The GitLab instance sync is idempotent so there would be no functional change, but all that data would have to be re-processed. And dedicated instances would have their package_metadata/licenses/v2/maven directory double on disk.
Are we missing a trick by extending URL classification to Maven only? I wonder if we could improve classification across other package registries by moving classification to the License Processor. Perhaps in a follow-up or as part of the precision/recall improvement epic.
Ultimate Customer here, adding a comment based on our issue with license scanning for the com.vaadin.addon/vaadin-touchkit-agpl maven plugin, which returns the license as unknown instead of the expected AGPL 3.0.
We're interested in a solution and here's some additional data for this request:
In your organization, what do you consider to be the highest priority for this feature proposal? On a scale of 1-10, where 1 is the lowest priority and 10 is the highest. --> 7
Why are you interested in this feature? --> it blocks our Customer from handling merge requests based on the applied policy for the AGPL license
What is the problem you are trying to solve? --> to get the license of the com.vaadin.addon/vaadin-touchkit-agpl plugin resolved correctly
Do you have any workarounds? --> no
What is the impact to your organization of not having this feature? --> it prevents our Customer from performing DevOps tasks in the described way
We have an MR (internal link) open but are testing whether a change in classification will create a cascade of data updates for a significant subset of maven packages.
The change (internal link) to maven classification has been merged and tested on the staging server (dev). It has identified the licenses of significantly more dependencies (including many of the ones in the original report).
We will wait for Monday (2024-02-05) to push the change to prod.
@xiaogang_cn we had to do an infrastructure update to account for regenerating updates for older packages. We have been testing this. The push to prod is scheduled to happen today. I will post an update.
Today we promoted the maven classification change to the production environment. The full update ran successfully and we expect data to start making it to GitLab instances once the scheduled export updates the gcp bucket.
This change was delayed because a test run (last week) on the staging environment (dev) revealed an issue with messages on the license-processor topic getting dropped due to the subscriber retention policy and a large backlog of new messages. This was a full update, so the number of new messages on the topic was expected. The unexpected part was that some messages were getting dropped.
The resolution for this was to increase the retention in the gcp pub/sub subscriber configuration from 12 hours to 3 days. Re-running the test in staging showed no dropped messages.
The following errors were present in the logs (most of them new since the interfacer change). Some are to be expected since we're calling out to the URL specified in pom.xml and there are many malformed or misconfigured entries, but some need to be investigated further (e.g. i/o timeout, connection refused) to ensure that these network errors aren't caused by the interfacer itself through rate limiting, etc.
num records examined: 236405
num 500s: 118223
num errors: 118182
num network errs: 118180 (pctg of total: 99.99830769491123)
  unexpected HTTP response code: 86169
  connection refused: 8980
  unsupported protocol scheme: 7945
  no such host: 5602
  invalid request format: parse: 1873
  i/o timeout: 1557
  x509: certificate is valid for: 1462
  context canceled: 1203
  x509: certificate signed by unknown authority: 730
  server misbehaving: 630
  http: no Host in request: 325
  lame referral: 311
  x509: certificate has expired: 311
  tls: no renegotiation: 264
  read: connection reset by peer: 188
  EOF: 154
  no route to host: 122
  read: connection timed out: 119
  context deadline exceeded: 116
  tls: internal error: 62
  tls: handshake failure: 37
  stopped after \d+ redirects: 14
  connect: network is unreachable: 6
  unknown: 2 ["dead letter failure", "dead letter failure"]
Note: these errors are coming from the maven feeder being run with ignore_cursor (so this is going over all the available maven dataset).
I tracked the unknowns before and after the new interfacer release and we can see that the total number of package licenses increased by ~10% while the ratio of unknowns was reduced by ~1%:
with a as (select count(*) c from maven_license where license_ids = '{1}'),
     b as (select count(*) c from maven_license where license_ids != '{1}'),
     c as (select count(*) c from maven_license where 1 = any(license_ids))
select a.c as all_unknowns,
       b.c as no_unknowns,
       c.c as some_unknowns,
       round(cast(a.c / b.c::float * 100 as numeric), 2) as pct_unknown,
       round(cast(c.c / b.c::float * 100 as numeric), 2) as pct_some_unknown
from a, b, c;