@tkopel - I am updating the milestone for this refinement to %17.7. This is part of an escalation. We can determine implementation timelines once this spike issue is complete.
@johncrowley are we looking to support both Swift Package Manager and CocoaPods? The epic doesn't make this clear. Should the effort be included in this spike?
@johncrowley If I understand correctly, the goal of this issue is to update PMDB so that it ingests the Swift advisories that are available in GLAD, and then update the GitLab Rails app to ingest those advisories and be able to create vulnerabilities for Swift packages. Is that correct?
I am also wondering why this is a Spike. Or is the Spike meant to refine the issues required?
@nilieskou this is not about Security Advisories; that part is already done. This issue and epic are about License Scanning support instead. Sorry about that, I did not realize there was some misleading info in the issue description.
Sorry @nilieskou - I should have specified - this is only focused on gathering licenses so that we return licenses for Swift dependencies, instead of unknown - which is the current display for any Swift dependency in the Dependency list.
SwiftPackageIndex provides information about Swift packages. A list of all available Swift packages can be found in the SwiftPackageIndex/PackageList repo; this is also the data source used by SwiftPackageIndex, and it is updated multiple times a day. You can view information regarding a package at swiftpackageindex.com/<github-owner>/<repo>; for example, the Moya/Moya package maps to https://swiftpackageindex.com/Moya/Moya.
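For reference, consuming that list comes down to something like the following Go sketch (it assumes the list lives at packages.json in the PackageList repo as a flat JSON array of GitHub repo URLs; fetchPackageList is just an illustrative name):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// packageListURL is the raw JSON file in the SwiftPackageIndex/PackageList
// repo: a flat array of GitHub repository URLs.
const packageListURL = "https://raw.githubusercontent.com/SwiftPackageIndex/PackageList/main/packages.json"

func fetchPackageList() ([]string, error) {
	resp, err := http.Get(packageListURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var repos []string
	if err := json.NewDecoder(resp.Body).Decode(&repos); err != nil {
		return nil, err
	}
	return repos, nil
}

func main() {
	repos, err := fetchPackageList()
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d Swift packages, first: %s\n", len(repos), repos[0])
}
```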
Fetching package versions
Possible solutions
We can fetch release versions from SwiftPackageIndex. For example, for the Moya/Moya package we can call https://swiftpackageindex.com/Moya/Moya/releases and then scrape the release versions from the HTML response --> The issue with this approach is that not all tags are releases.
Clone the project and list all tags --> This would take too much time.
Fetch github.com/<user>/<repo>/tags and scrape the HTML using Colly --> This can be flaky: if the GitHub URL structure changes, we have an issue. This approach takes 1h with a single thread; multiple threads cause 429s, which then require introducing delays. It is also challenging to get all the versions, since the tags page is paginated.
Use git ls-remote --tags to list all tags --> This is really fast: it takes 2m25s using 20 threads.
We can use git ls-remote --tags to fetch all tags.
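A minimal Go sketch of this approach, shelling out to git (listTags and the peeled-ref handling are illustrative):

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// listTags runs `git ls-remote --tags <url>` and returns the tag names.
// Output lines look like "<sha>\trefs/tags/15.0.0"; peeled "^{}" entries
// are skipped.
func listTags(repoURL string) ([]string, error) {
	out, err := exec.Command("git", "ls-remote", "--tags", repoURL).Output()
	if err != nil {
		return nil, err
	}
	var tags []string
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 || strings.HasSuffix(fields[1], "^{}") {
			continue
		}
		tags = append(tags, strings.TrimPrefix(fields[1], "refs/tags/"))
	}
	return tags, nil
}

func main() {
	tags, err := listTags("https://github.com/Moya/Moya.git")
	if err != nil {
		panic(err)
	}
	fmt.Println(tags)
}
```

Since this speaks the Git protocol rather than scraping HTML, it avoids both the pagination issue and the dependency on the GitHub-specific tags URL.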
Architecture
Update license-feeder to gather data about packages and existing versions. Since we do not have a way to fetch only new version updates, we need to get all versions for all packages. To avoid sending all versions to the license-interfacer, we can persist a ledger with all the packages and versions and then send only new versions to the license-interfacer (see the sketch after this list).
License-interfacer matches licenses to package versions. We can do that with license classifiers like we do for other package managers.
License-processor stores Swift package license data in the database. We will probably need some new tables in the PMDB database.
License-exporter exports Swift license data.
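A rough sketch of the ledger diffing mentioned in the feeder step above (names are illustrative, and the in-memory map stands in for whatever persistence we end up using, not the actual PMDB schema):

```go
package main

import "fmt"

// diffAgainstLedger returns only the versions we have not seen before and
// records them in the ledger, so the license-interfacer receives deltas only.
func diffAgainstLedger(ledger map[string]map[string]bool, pkg string, versions []string) []string {
	known := ledger[pkg]
	if known == nil {
		known = map[string]bool{}
		ledger[pkg] = known
	}
	var fresh []string
	for _, v := range versions {
		if !known[v] {
			known[v] = true
			fresh = append(fresh, v)
		}
	}
	return fresh
}

func main() {
	ledger := map[string]map[string]bool{}
	fmt.Println(diffAgainstLedger(ledger, "Moya/Moya", []string{"14.0.0", "15.0.0"})) // [14.0.0 15.0.0]
	fmt.Println(diffAgainstLedger(ledger, "Moya/Moya", []string{"15.0.0", "15.0.1"})) // [15.0.1]
}
```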
Limitations
It seems that all Swift packages are currently hosted on GitHub. The limitation for now is that we can only process Swift packages that are hosted on GitHub. If we need to extend this in the future, that will be possible.
We do not implement removals. The list of Swift packages is updated to include package removals, but we won't be handling those. The same applies to all PURL types.
Risks
We will need to change the code if GitHub changes the tags URL --> mitigated by using git ls-remote --tags.
We might get flooded with 429s when we deploy the license-interfacer to Cloud Run and run it for the first time. I have been testing the license-interfacer part locally on my laptop and it seems to work, although it takes multiple hours to complete. Cloud Run will scale fast, which might cause 429s. We could solve this either by not scaling the Swift interfacer, or by manually importing all data into the database once and then only handling deltas, which will be much smaller.
@dbolkensteyn I am working on introducing Swift license data. Currently, AFAIU, all Swift packages are hosted on GitHub and we have a list of all the packages here. When it comes to licenses, I am trying to fetch the license by requesting https://raw.githubusercontent.com/<USER>/<REPO>/refs/tags/<TAG>/<LICENSE_FILE>, where <LICENSE_FILE> can be one of the following: LICENSE, license, LICENSE.md, license.md, license.txt, licence, licence.md, licence.txt.
So in the worst-case scenario I might need to make up to 8 requests. I saw that for Conan you clone the repo and find the license in there. I am wondering:
Isn't it more time-consuming to clone the repo? Why did you choose to clone the repo rather than do what I described? Am I missing something?
For fetching the tags we do something very similar: we call https://github.com/<USER>/<REPO>/tags. I noticed that fetching the tags for all the repos using multiple threads hits the rate limit, so we basically get 429s. I haven't tested the license part yet, but I expect similar behaviour there. Was this an issue back when you developed the license-interfacer? Was cloning a better approach with respect to rate limits?
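Concretely, the probing I have in mind looks roughly like this (a sketch; fetchLicense is an illustrative name and the real code would need retries/backoff):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// candidateNames are the license file names to probe, in order.
var candidateNames = []string{
	"LICENSE", "license", "LICENSE.md", "license.md", "license.txt",
	"licence", "licence.md", "licence.txt",
}

// fetchLicense tries each candidate path on raw.githubusercontent.com for
// the given owner/repo/tag and returns the first file that exists.
func fetchLicense(owner, repo, tag string) ([]byte, error) {
	for _, name := range candidateNames {
		url := fmt.Sprintf("https://raw.githubusercontent.com/%s/%s/refs/tags/%s/%s",
			owner, repo, tag, name)
		resp, err := http.Get(url)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode == http.StatusOK {
			defer resp.Body.Close()
			return io.ReadAll(resp.Body)
		}
		resp.Body.Close()
	}
	return nil, fmt.Errorf("no license file found for %s/%s@%s", owner, repo, tag)
}

func main() {
	data, err := fetchLicense("Moya", "Moya", "15.0.0")
	if err != nil {
		panic(err)
	}
	fmt.Printf("fetched %d bytes of license text\n", len(data))
}
```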
@nilieskou indeed GitHub API calls are rate limited, and require API keys. This is likely one of the reasons we opted for fetching information from Git repos (which GitHub serves very well) rather than through APIs. That being said, Conan support was implemented by @julianthome, who might be able to provide more context.
Technically speaking, Git supports partial clones, which can make fetching a few files incredibly fast.
Similarly, ls-remote over the Git protocol should not lead to any 429s even if done massively in parallel
API keys need to be rotated and managed, so that's another consideration to keep in mind.
Thanks @dbolkensteyn for the quick response and the great suggestions. I am not intending to use the GitHub API. Awesome suggestions, both 1 and 2; I will start looking into them. @julianthome if you have anything to add, that would be extremely useful.
Circumventing the rate limit was one motivation. In addition, I think one special aspect of Conan is its size: it is a fairly small database, and a full clone of the conan-center-index does not take longer than ~10s, which is probably small/good enough for the time being. A side benefit is the simplicity of the implementation.
I am not sure I fully understand what the denylist is. It looks like a list of packages that are set for removal from the SwiftPackageIndex. However, the majority of the packages in the denylist are not in the allowlist.
@nilieskou we should probably dig into it a little bit. If they are just packages that were removed from the index, I think we should scan them too, because they might still exist as components in our customers' already-generated SBOMs. Adding additional logic to filter them out might be counterproductive.
@nilieskou I really like your plan, and kudos for bringing it to such clarity so early on.
@tkopel I think I have a better understanding of how the denylist works. It is a list of packages that are proposed for removal from the official list of Swift packages. The problem with going through the denylist is that we would be trying to process repos that are deleted. Moreover, we do not support package removals in PMDB: once a package is in, we do not delete it, at least not in an automated way. So if a package is ingested into PMDB and is later removed via the denylist, we will still have that info, and any customer still using a clone of that repo will be able to see the license info. That being said, I will not process the denylist.
@idawson I would like your feedback on the following risk that I have identified. On my laptop it takes around 1 hour to scrape all Swift package versions. This is basically the amount of time the feeder will need every time it is executed. I tried to perform the whole process with multiple threads, but I quite quickly start getting 429s. Honestly, spending 1h on this process is not that bad, meaning the performance of the feeder is not that important. Edit: it takes 3 mins using git ls-remote.
The risk comes when the license-interfacer comes into the picture. With one thread on my laptop it takes several hours to get the licenses for all Swift packages for all versions. Of course, this only happens the first time we run the pipeline for Swift, as the feeder will subsequently send only deltas to the license-interfacer.
I am afraid that since the packages with their versions will arrive as Pub/Sub messages, the license-interfacer will scale out (it runs on Cloud Run), resulting in 429s. My assumption here is that most of these requests to GitHub will share the same IP and will therefore get rate limited. That means the first time we run the Swift pipeline, the license-interfacer will probably fail to process all the messages. WDYT?
A possible solution to this problem is to run the whole process locally and store the data in the database. At the same time I can create a first state, so that when the feeder runs for the first time it will have a cursor. In other words, it won't need to send all packages to the license-interfacer, only the ones that have changed. The downside is that this might be difficult to reproduce; I could write a script for that purpose, though.
In the license-interfacer I see that for Conan we clone the whole repo. This seems like a more time-consuming process than directly fetching the license file, which we assume is in the root dir of the repo. I am wondering, though, why that process is not rate limited. Is that approach better?
Tags can be requested using the GitHub API for fetching references. The problem with the GitHub API is rate limiting. The rate limits differ between unauthenticated and authenticated requests: unauthenticated requests are limited to 60 per hour per IP, while authenticated requests get 5,000 per hour.
In order to request licenses for packages we used the following strategy:
Perform a partial git clone in order to avoid cloning big blobs and to improve performance.
We took the design decision of processing multiple versions of a package in one go: the feeder is expected to group multiple versions of the same package into one message. This way we can clone the repo once and check out the various versions, as sketched below.
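For reference, the core of this strategy comes down to something like the following sketch (illustrative, not the actual interfacer code; it probes only the root LICENSE name for brevity, whereas the real flow tries the candidate file names listed earlier):

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// licensesForVersions clones the repo once with a blob filter (blobs are
// fetched lazily), then checks out the license file for each tag.
func licensesForVersions(repoURL string, tags []string) (map[string]string, error) {
	dir, err := os.MkdirTemp("", "swift-license-")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)

	// Partial clone: no blobs are downloaded up front.
	clone := exec.Command("git", "clone", "--filter=blob:none", "--no-checkout", repoURL, dir)
	if err := clone.Run(); err != nil {
		return nil, err
	}

	licenses := map[string]string{}
	for _, tag := range tags {
		// Checking out a single path fetches just that blob on demand.
		co := exec.Command("git", "-C", dir, "checkout", tag, "--", "LICENSE")
		if err := co.Run(); err != nil {
			continue // this version has no LICENSE file at the root
		}
		data, err := os.ReadFile(filepath.Join(dir, "LICENSE"))
		if err != nil {
			return nil, err
		}
		licenses[tag] = string(data)
	}
	return licenses, nil
}

func main() {
	out, err := licensesForVersions("https://github.com/Moya/Moya.git", []string{"14.0.0", "15.0.0"})
	if err != nil {
		panic(err)
	}
	for tag := range out {
		fmt.Println("got license for", tag)
	}
}
```

Because blobs are fetched lazily, each checkout downloads only the license blob for that tag, which is what keeps this fast despite cloning a repo per package.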
Result
Using 20 threads on a MacBook, it took 2h31m to process all Swift packages and versions. This number is not too far from other PURL types. Please also take into account that we will not be performing this process for all packages every time: the Swift feeder will feed only deltas daily, and we will feed all Swift packages (IGNORE_CURSOR) only once per month. In theory the license-interfacer should scale well since it runs on Cloud Run instances.