The License Compliance page for some of our projects shows a very large number of Unknown licenses, even though the license is unambiguous in the package manager (npmjs). For instance, most Angular and GraphQL packages turn up as Unknown. This makes it impossible to implement license compliance, as we would constantly need to re-review up to a few hundred Unknown licenses.
One such example is the "word-wrap" package and an almost identical fork, "@aashutoshrathi/word-wrap". One is listed as Unknown while the other shows up as MIT. There are no significant differences between the two projects, in particular regarding the license.
A minimal project was created using the GitLab template for Node.js Express. The security template from GitLab.com was added and the YAML changed to include a test stage. References to word-wrap were added in app.js, and the lock file was updated and committed.
Note that the project is not functional; it is only used to demonstrate the issue with minimal changes to the template.
What is the current bug behavior?
The License Compliance and Dependency List tabs show word-wrap as Unknown and @aashutoshrathi/word-wrap as MIT.
What is the expected correct behavior?
Both should show MIT. And hopefully, by extension, all Angular, GraphQL, and other such packages in our projects should be flagged with their corresponding licenses (provided their licensing is clear in the package manager).
Relevant logs and/or screenshots
No relevant logs were found in job artifacts.
Output of checks
This bug happens on GitLab.com
Results of GitLab environment info
See the example project based on the GitLab template; there is no customer-specific environment setup.
This issue was automatically tagged with the label group::vulnerability research by TanukiStan, a machine learning classification model, with a probability of 0.85.
If this label is incorrect, please tag this issue with the correct group label as well as automation:ml wrong to help TanukiStan learn from its mistakes.
To set expectations, GitLab product managers or team members can't make any promise that they will proceed with this. However, we believe everyone can contribute, and we welcome you to work on this proposed change, feature, or bug fix. There is a bias for action, so you don't need to wait: try and spin up that merge request yourself. If you need help doing so, we're always open to mentoring you to drive this change.
Thoughts on this, @mhenriksen @thiagocsf? Is this more for VR or for composition analysis, to determine which team should be responsible for looking into this?
@wayne it sounds to me like it's in composition analysis' domain. The VR team has not been involved much in the license database after the handoff of the initial version. I remember we designed it to fall back to Unknown license if there was any doubt, and not try to be too clever about guessing it, as showing a wrong license would be worse than showing Unknown.
We will of course gladly assist with any debugging if needed!
I'm not refining this issue further, but I suggest an investigation timeboxed to 1 day to verify the DB content and whether the problem is related to the other npm issue.
@martin.levesque I do not have access to the dependency list and license compliance pages in your existing project, so I can't confirm it's all fixed there. Please reopen this issue if that's not the case.
One gotcha that we need to follow up on is the fact that group::threat insights has started to store detected licenses along with components in their own DB tables. This acts as a caching mechanism for the results provided by group::composition analysis features, and we need to further clarify how that stored contextual data can be refreshed. This should not yet be a problem for the project-level dependency list and license compliance page, but it will soon be with the completion of Use database for project dependency list (&8293 - closed). It should impact the group-level dependency list, though.
@gonzoyumo, I could not check with the test project as my trial expired, but looking at projects we have internally, the number of unknown licenses has gone down to a manageable level.
However, there are still a couple of unknown licenses apart from our internal dependencies (which we expect to be listed as unknown anyway).
Most of them have alternate licensing, so that's probably the reason for "unknown", but a few packages with an obvious license still pop up:
I've checked webpack, and indeed the last known version for this package in our DB is 5.85.0 while the latest on the registry is 5.90.0. I've checked with the team and there is actually a pending decision on a resync of the full npm data, similar to the Maven discussion in #433541 (comment 1750040839).
I'm reopening this issue and will close it once this is addressed.
This group::composition analysis bug has at most 50% of the SLO duration remaining and is an SLO::Near Miss breach. Please consider taking action before this becomes an SLO::Missed breach in 3 days (2024-02-13).
I've spent time today diving into the Package Metadata DB projects and here is a summary (all actions were executed on the DEV environment):
I've sent a test message on the pubsub topic dev-package-interfacer-topic-dev-npm-interfacer-cloud-run with the following payload to trigger the interfacer logic:
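Purely as an illustration of this step (the project ID and the payload fields below are hypothetical placeholders, not the values actually used), publishing such a test message could look like:

```shell
# Hypothetical sketch: publish a test message to the interfacer Pub/Sub topic.
# PROJECT_ID and the payload shape are assumptions for illustration only.
gcloud pubsub topics publish dev-package-interfacer-topic-dev-npm-interfacer-cloud-run \
  --project="PROJECT_ID" \
  --message='{"name": "webpack"}'
```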
Thank you, @gonzoyumo. I think you did a good job of isolating the issue, and I agree with your assessment that it points to a replication problem. I can see from the scheduled pipeline page that the verification job associated with the NPM scheduled pipeline is failing:
```
$ ./test/end-to-end-npm-license-feeder.sh
Official registry count (2678172) and replica count (2624454) differ by 53718 documents.
Delta is not within tolerance (1500). Please check replication is running as expected
```
I will take a look today to see if there is anything obvious.
The issue appears to be that the credentials were rotated but the replication task still relies on the old credentials. I've confirmed this by fixing replication between prod and dev, and will do the same for dev.
@gonzoyumo I've confirmed that replication has now resumed. It will take some time to resync but, as soon as I see it has completed, I will manually trigger the NPM Feeder. This will also give some indication regarding the delta.
```
$ ./test/end-to-end-npm-license-feeder.sh
Official registry count (2679032) and replica count (2679287) differ by 255 documents.
Delta is within tolerance (1500).
```
@philipcunningham I've tried today to set up some local replication, focusing on syncing a single document (webpack), but without success. I managed to set up a full replication, but I'm not really keen to go that route. I was hoping to leverage the selector object as documented, but I keep getting timeouts from replicate.npmjs.com (even when setting a crazy timeout) when using this method:
```
[error] 2024-02-19T19:02:37.290639Z nonode@nohost <0.19278.4> -------- Replicator, request POST to "https://replicate.npmjs.com/registry/_changes?filter=_selector&feed=normal&style=all_docs&since=0&timeout=1666666" failed due to error {connection_closed,mid_stream}
[error] 2024-02-19T19:03:07.007785Z nonode@nohost <0.19176.4> -------- ChangesReader process died with reason: {changes_reader_died,{timeout,ibrowse_stream_cleanup}}
[error] 2024-02-19T19:03:07.008240Z nonode@nohost <0.19176.4> -------- Replication `484b9f7d51ebaf72ad24a59c79dc249f` (`https://replicate.npmjs.com/registry/` -> `http://172.17.0.3:5984/test_replication/`) failed: {changes_reader_died,{timeout,ibrowse_stream_cleanup}}
```
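For context, the selector-based attempt was roughly of the following shape (a sketch only; the admin credentials and the local endpoint are placeholders taken from the log above):

```shell
# One-off replication filtered with a selector: the variant that kept timing out.
curl -s -X POST 'http://admin:PASSWORD@172.17.0.3:5984/_replicate' \
  -H 'Content-Type: application/json' \
  -d '{
        "source": "https://replicate.npmjs.com/registry",
        "target": "http://admin:PASSWORD@172.17.0.3:5984/test_replication",
        "create_target": true,
        "selector": { "_id": "webpack" }
      }'
```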
@gonzoyumo have you tried testing out replication of another package with our replica? It would let you test out the process with a fast DB instance first before switching it out to try with webpack on the public registry. It might help identify if there's something particular about the webpack package in the public registry (e.g. a checkpoint issue).
Thanks @philipcunningham. Unfortunately none of the above made a difference :/
When looking one last time at the documentation after these failures, I luckily found the doc_ids option, which unfortunately was not mentioned in the replication documentation.
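For comparison with the selector attempt, a sketch of the doc_ids variant (endpoints and credentials are again placeholders):

```shell
# Replicate only the named documents instead of filtering the changes feed.
curl -s -X POST 'http://admin:PASSWORD@172.17.0.3:5984/_replicate' \
  -H 'Content-Type: application/json' \
  -d '{
        "source": "https://replicate.npmjs.com/registry",
        "target": "http://admin:PASSWORD@172.17.0.3:5984/test_replication",
        "create_target": true,
        "doc_ids": ["webpack"]
      }'
```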
It worked like a charm and I was finally able to see an error for the webpack sync.
So it seems the webpack document is just too big for the replication to handle! It's a bit annoying that the replication considers it went fine and silences this problem :/
I'll wait for the next feeder execution to see if it picks up the webpack package and adds it to our Cloud SQL DB. Then, after the next exporter run, I'll check that it's in our GCP bucket data and that the GitLab instance correctly detects the license for webpack.
@gonzoyumo it makes me wonder if it would be beneficial to re-run replication from the beginning now that you've adjusted this setting. I think it could offer improved precision for our customers on one of the most popular package ecosystems. What do you think?
@philipcunningham that's a good idea. I was wondering how we could figure out which npm documents are above that 8MB threshold, or find any other way to identify the ~255 currently missing documents.
But simply re-running a full replication might indeed just do it.
Before triggering a full re-sync, we might try to figure out what the appropriate value for max_document_size would be. I've put 16MB so far, but checking some other problematic packages I can see we can get far above this:
@graphql-codegen_cli: 26MB
vite: 38MB
@philipcunningham before going that route, do you know if there is any risk or constraint on the infrastructure side if we increase the couchDB disk usage?
Considering that we are currently missing 255 documents, the worst-case scenario would require 255 × max_document_size of additional storage. Let's take some margin and say we pick 64MB: that's ~16GB more.
On the other hand, the best-case scenario would be 255 × 8MB (the current limit), so ~2GB.
The current DB size is 47.4GB, so we would be looking at an increase of roughly 4% to 34%, for a final size between 50GB and 64GB.
I recall storage is marginal in our overall operational cost, so I'd go for it, but @thiagocsf feel free to correct me on this.
About how to trigger a full re-sync, we could probably just trigger a one-time replication without any filter on doc_ids. This replication would run concurrently with the continuous one that is already in place and thus would not prevent updates emitted during the full re-sync. I'm still checking, but I haven't found any specific guidance on that topic in the documentation.
> @philipcunningham before going that route, do you know if there is any risk or constraint on the infrastructure side if we increase the couchDB disk usage?
@gonzoyumo I'm not aware of any issues and we are OK for disk utilization on both instances:
> About how to trigger a full re-sync, we could probably just trigger a one-time replication without any filter on doc_ids. This replication would run concurrently with the continuous one that is already in place and thus would not prevent updates emitted during the full re-sync. I'm still checking, but I haven't found any specific guidance on that topic in the documentation.
I think this sounds like a good suggestion. Two things spring to mind:
The increased traffic from the same machine will mean we'll start hitting the public registry's rate limiting sooner. This is probably OK but we should keep an eye on it to make sure that it isn't resulting in the document count delta increasing.
It might be worth identifying a "known-bad" package to validate that the replication was successful.
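One cheap way to do that validation, as a sketch (the replica URL is a placeholder), is to compare the current revision of the package document on both sides without downloading its body:

```shell
# HEAD returns the current revision in the ETag header, so even very large
# documents such as webpack can be compared without fetching them.
curl -sI 'https://replicate.npmjs.com/registry/webpack' | grep -i etag
curl -sI "$REPLICA_URL/registry/webpack" | grep -i etag
```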
I've looked at the couchDB logs on the VM and was able to extract a list of 450 packages for which the Too large error was raised (see below).
So on top of completely missing documents, we also have a growing number of documents for which the sync no longer works because they go over the limit.
So instead of a full sync, I can now focus the replication on these docs only.
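The extraction itself was essentially a log grep; as a rough sketch (the log path and the exact wording of the error are assumptions and will differ per install):

```shell
# List documents that failed replication because they exceed max_document_size.
grep -i 'too large' /opt/couchdb/var/log/couch.log | sort -u > too_large_docs.txt
wc -l too_large_docs.txt
```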
I had to sync in batches of ~50 to ~100 documents, otherwise the replication queries timed out :/ I've done 6 batches on the DEV environment, and PROD picked them up from it automatically:
dev: 2685958 total docs
prod: 2685956 total docs
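A sketch of how such batches can be driven from the extracted package list (file name, chunk size, endpoints and credentials are assumptions):

```shell
# Split the list of affected package ids into chunks of 50 and replicate each
# chunk as a one-off doc_ids replication against the replica.
split -l 50 too_large_docs.txt batch_
for f in batch_*; do
  ids=$(jq -R -s 'split("\n") | map(select(length > 0))' "$f")
  curl -s -X POST 'http://admin:PASSWORD@127.0.0.1:5984/_replicate' \
    -H 'Content-Type: application/json' \
    -d "{\"source\": \"https://replicate.npmjs.com/registry\",
         \"target\": \"http://admin:PASSWORD@127.0.0.1:5984/registry\",
         \"doc_ids\": $ids}"
done
```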
The size of the DB has reached 50GB, so we're actually close to the best-case scenario here. That said, a few packages are still above that 64MB limit. So far, 14 documents have caused the Too large error in the logs:
| package | size |
| --- | --- |
| @primer/react | 116 MB |
| sfdx-hardis | 106 MB |
| @redwoodjs/cli | 105 MB |
| @carbon/ibmdotcom-web-components | 103 MB |
| @c8y/ngx-components | 98.5 MB |
| @salesforce/cli | 91.9 MB |
| @thirdweb-dev/react | 87.4 MB |
| binaryen | 86.6 MB |
| @typescript-eslint/eslint-plugin | 75.0 MB |
| renovate | 71.6 MB |
| hls.js | 70.1 MB |
| rubic-sdk | 67 MB |
| nocodb-daily | 63.6 MB |
| quiz-api-client | 16.3 MB |
I've raised the limit to 128MB and will try to resync these.
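For reference, a sketch of how that limit can be changed through the configuration API (node name, credentials and value are placeholders; the value is in bytes, 134217728 = 128MB):

```shell
# Raise couchdb/max_document_size on the local node.
curl -s -X PUT 'http://admin:PASSWORD@127.0.0.1:5984/_node/_local/_config/couchdb/max_document_size' \
  -H 'Content-Type: application/json' \
  -d '"134217728"'
```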
I also just realized that the initial diff of 255 documents with the official npm registry that @philipcunningham reported above was actually not missing documents on our end; it looks like we have more documents on our replica.
Querying the replicate.npmjs.com couchdb at the same time as our instances I get:
"doc_count":2685030, "doc_del_count":1534160
So we indeed have ~1000 more docs than the official registry, while missing fewer than 10 deleted documents...
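Those counters come straight from the database info endpoint; the same check against our replica looks like this (the replica URL is a placeholder):

```shell
# GET on the database root returns doc_count and doc_del_count, among other stats.
curl -s 'https://replicate.npmjs.com/registry' | jq '{doc_count, doc_del_count}'
curl -s "$REPLICA_URL/registry" | jq '{doc_count, doc_del_count}'
```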
> I also just realized that the initial diff of 255 documents with the official npm registry that @philipcunningham reported above was actually not missing documents on our end; it looks like we have more documents on our replica.
@gonzoyumo that is interesting. Perhaps we could improve the communication in that test. WDYT?
@philipcunningham it's probably just my fault for not paying enough attention and assuming we were behind, as the whole context of this issue is about missing documents.
Olivier Gonzalez changed the title from "Many npm package licenses reported as Unknown even if license type is unambiguous in npmjs" to "Many npm package licenses reported as Unknown even if license type is unambiguous in npmjs (couchDB replication issue)".
Though, it seems these docs are below the current limit, and trying to re-sync them actually highlighted other timeout errors:
```
[error] 2024-02-26T17:58:56.595369Z couchdb@127.0.0.1 <0.5644.1403> -------- Replicator, request GET to "https://replicate.npmjs.com/registry/%40thirdweb-dev%2Freact?atts_since=%5B%221932-7e79d8eb63462dccf6d27927f96a7733%22%5D&revs=true&open_revs=%5B%221943-ed3efcc3b488f5c9fae6412384a35cdc%22%5D&latest=true" failed due to error timeout
OS Process Error <0.11431.1403> :: {os_process_error,"OS process timed out."}
```
I went ahead and set [couchdb] os_process_timeout to 180000 (3 minutes) too. However, the error is still raised. I'll open a follow-up issue to investigate this further, as there are several other packages that raise this error.
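For completeness, a sketch of that timeout change via the configuration API (node name and credentials are placeholders; the value is in milliseconds):

```shell
# Bump couchdb/os_process_timeout from the 5000 ms default to 3 minutes.
curl -s -X PUT 'http://admin:PASSWORD@127.0.0.1:5984/_node/_local/_config/couchdb/os_process_timeout' \
  -H 'Content-Type: application/json' \
  -d '"180000"'
```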