@tkopel - I am updating the milestone for this refinement to %17.7. This is part of an escalation. We can determine implementation timelines once this spike issue is complete.
@johncrowley are we looking to support both Swift Package Manager and CocoaPods? The epic doesn't make this clear. Should the effort be included in this spike?
@johncrowley If I understand correctly, the goal of this issue is to update PMDB so that it ingests the Swift advisories that are available in GLAD, and then update the GitLab Rails app to ingest those advisories and be able to create vulnerabilities for Swift packages. Is that correct?
I am also wondering why this is a Spike. Or is the Spike meant to refine the issues required?
@nilieskou this is not about Security Advisories; that part is already done. This issue and epic are about License Scanning support instead. Sorry about that, I did not realize there was some misleading info in the issue description.
Sorry @nilieskou - I should have specified - this is only focused on gathering licenses so that we return licenses for Swift dependencies, instead of unknown - which is the current display for any Swift dependency in the Dependency list.
SwiftPackageIndex provides information about Swift packages. A list of all available Swift packages can be found in the SwiftPackageIndex/PackageList repo; this is also the data source used by SwiftPackageIndex, and it is updated multiple times a day. You can view information regarding a package at swiftpackageindex.com/<github-owner>/<repo>; for example, the Moya/Moya package maps to https://swiftpackageindex.com/Moya/Moya.
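For reference, consuming that list comes down to something like the following Go sketch (it assumes the list lives at packages.json in the PackageList repo as a flat JSON array of GitHub repo URLs; fetchPackageList is just an illustrative name):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// packageListURL is the raw JSON file in the SwiftPackageIndex/PackageList
// repo: a flat array of GitHub repository URLs.
const packageListURL = "https://raw.githubusercontent.com/SwiftPackageIndex/PackageList/main/packages.json"

func fetchPackageList() ([]string, error) {
	resp, err := http.Get(packageListURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var repos []string
	if err := json.NewDecoder(resp.Body).Decode(&repos); err != nil {
		return nil, err
	}
	return repos, nil
}

func main() {
	repos, err := fetchPackageList()
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d Swift packages, first: %s\n", len(repos), repos[0])
}
```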
Fetching package versions
Possible solutions
We can fetch release versions from SwiftPackageIndex. For example, for the Moya/Moya package we can call https://swiftpackageindex.com/Moya/Moya/releases and then scrape the release versions from the HTML response --> The issue with this approach is that not all tags are releases.
Clone the project and list all tags --> This would take too much time.
Fetch github.com/<user>/<repo>/tags and scrape the HTML using Colly --> This can be flaky: if the GitHub URL structure changes, we have an issue. This approach takes 1h with a single thread; multiple threads cause 429s, which then require introducing delays. It is also challenging to get all the versions, since the tags page is paginated.
Use git ls-remote --tags to list all tags --> This is really fast: it takes 2m25s using 20 threads.
We can use git ls-remote --tags to fetch all tags.
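A minimal Go sketch of this approach, shelling out to git (listTags and the peeled-ref handling are illustrative):

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// listTags runs `git ls-remote --tags <url>` and returns the tag names.
// Output lines look like "<sha>\trefs/tags/15.0.0"; peeled "^{}" entries
// are skipped.
func listTags(repoURL string) ([]string, error) {
	out, err := exec.Command("git", "ls-remote", "--tags", repoURL).Output()
	if err != nil {
		return nil, err
	}
	var tags []string
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 || strings.HasSuffix(fields[1], "^{}") {
			continue
		}
		tags = append(tags, strings.TrimPrefix(fields[1], "refs/tags/"))
	}
	return tags, nil
}

func main() {
	tags, err := listTags("https://github.com/Moya/Moya.git")
	if err != nil {
		panic(err)
	}
	fmt.Println(tags)
}
```

Since this speaks the Git protocol rather than scraping HTML, it avoids both the pagination issue and the dependency on the GitHub-specific tags URL.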
Architecture
Update license-feeder to gather data about packages and existing versions. Since we do not have a way to fetch only new version updates, we need to get all versions for all packages. To avoid sending all versions to the license-interfacer, we can persist a ledger with all the packages and versions and then send only new versions to the license-interfacer (see the sketch after this list).
License-interfacer matches licenses to package versions. We can do that with license classifiers like we do for other package managers.
License-processor stores Swift package license data in the database. We will probably need some new tables in the PMDB database.
License-exporter exports Swift license data.
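A rough sketch of the ledger diffing mentioned in the feeder step above (names are illustrative, and the in-memory map stands in for whatever persistence we end up using, not the actual PMDB schema):

```go
package main

import "fmt"

// diffAgainstLedger returns only the versions we have not seen before and
// records them in the ledger, so the license-interfacer receives deltas only.
func diffAgainstLedger(ledger map[string]map[string]bool, pkg string, versions []string) []string {
	known := ledger[pkg]
	if known == nil {
		known = map[string]bool{}
		ledger[pkg] = known
	}
	var fresh []string
	for _, v := range versions {
		if !known[v] {
			known[v] = true
			fresh = append(fresh, v)
		}
	}
	return fresh
}

func main() {
	ledger := map[string]map[string]bool{}
	fmt.Println(diffAgainstLedger(ledger, "Moya/Moya", []string{"14.0.0", "15.0.0"})) // [14.0.0 15.0.0]
	fmt.Println(diffAgainstLedger(ledger, "Moya/Moya", []string{"15.0.0", "15.0.1"})) // [15.0.1]
}
```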
Limitations
It seems that all Swift packages are currently hosted on GitHub. The limitation for now is that we can only process Swift packages that are hosted on GitHub. If we need to extend this in the future, that will be possible.
We do not implement removals. The list of Swift packages is updated to include package removals, but we won't be handling those. The same applies to all PURL types.
Risks
We will need to change the code if GitHub changes the tags URL --> mitigated by using git ls-remote --tags.
We might get flooded with 429s when we deploy the license-interfacer to Cloud Run and run it for the first time. I have been testing the license-interfacer part locally on my laptop and it seems to work, although it takes multiple hours to complete. Cloud Run will scale fast, which might cause 429s. We could solve this either by not scaling the Swift interfacer, or by manually importing all data into the database once and then only handling deltas, which will be much smaller.
@dbolkensteyn I am working on introducing Swift license data. Currently, AFAIU, all Swift packages are hosted on GitHub and we have a list of all the packages here. When it comes to licenses, I am trying to fetch the license by requesting https://raw.githubusercontent.com/<USER>/<REPO>/refs/tags/<TAG>/<LICENSE_FILE>, where <LICENSE_FILE> can be one of the following: LICENSE, license, LICENSE.md, license.md, license.txt, licence, licence.md, licence.txt.
So in the worst-case scenario I might need to make up to 8 requests. I saw that for Conan you clone the repo and find the license in there. I am wondering:
Isn't it more time-consuming to clone the repo? Why did you choose to clone the repo rather than do what I described? Am I missing something?
For fetching the tags we do something very similar: we call https://github.com/<USER>/<REPO>/tags. I noticed that fetching the tags for all the repos using multiple threads hits the rate limit, so we basically get 429s. I haven't tested the license part yet, but I expect similar behaviour there. Was this an issue back when you developed the license-interfacer? Was cloning a better approach with respect to rate limits?
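Concretely, the probing I have in mind looks roughly like this (a sketch; fetchLicense is an illustrative name and the real code would need retries/backoff):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// candidateNames are the license file names to probe, in order.
var candidateNames = []string{
	"LICENSE", "license", "LICENSE.md", "license.md", "license.txt",
	"licence", "licence.md", "licence.txt",
}

// fetchLicense tries each candidate path on raw.githubusercontent.com for
// the given owner/repo/tag and returns the first file that exists.
func fetchLicense(owner, repo, tag string) ([]byte, error) {
	for _, name := range candidateNames {
		url := fmt.Sprintf("https://raw.githubusercontent.com/%s/%s/refs/tags/%s/%s",
			owner, repo, tag, name)
		resp, err := http.Get(url)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode == http.StatusOK {
			defer resp.Body.Close()
			return io.ReadAll(resp.Body)
		}
		resp.Body.Close()
	}
	return nil, fmt.Errorf("no license file found for %s/%s@%s", owner, repo, tag)
}

func main() {
	data, err := fetchLicense("Moya", "Moya", "15.0.0")
	if err != nil {
		panic(err)
	}
	fmt.Printf("fetched %d bytes of license text\n", len(data))
}
```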
@nilieskou indeed GitHub API calls are rate limited, and require API keys. This is likely one of the reasons we opted for fetching information from Git repos (which GitHub serves very well) rather than through APIs. That being said, Conan support was implemented by @julianthome, who might be able to provide more context.
Technically speaking, Git supports partial clones, which can make fetching a few files incredibly fast.
Similarly, ls-remote over the Git protocol should not lead to any 429s even if done massively in parallel
API keys need to be rotated and managed, so that's another consideration to keep in mind.
Thanks @dbolkensteyn for the quick response and the great suggestions. I am not intending to use the GitHub API. Awesome suggestions, both 1 and 2; I will start looking into them. @julianthome if you have anything to add, that would be extremely useful.
Circumventing the rate limit was one motivation. In addition, I think one special aspect of Conan is its size: it is a fairly small database, and a full clone of the conan-center-index does not take longer than ~10s, which is probably small/good enough for the time being. A side benefit is the simplicity of the implementation.
I am not sure I fully understand what the denylist is. It looks like a list of packages that are set for removal from the SwiftPackageIndex. However, the majority of the packages in the denylist are not in the allowlist.
@nilieskou we should probably dig into it a little bit. If they are just packages that were removed from the index, I think we should scan them too, because they might still exist as components in our customers' already-generated SBOMs. Adding additional logic to filter them out might be counterproductive.
@nilieskou I really like your plan, and kudos for bringing it to such clarity so early on.
@tkopel I think I have a better understanding of how the denylist works. It is a list of packages that are proposed for removal from the official list of Swift packages. The problem with going through the denylist is that we would be trying to process repos that are deleted. Moreover, we do not support package removals in PMDB: once a package is in, we do not delete it, at least not in an automated way. So if a package is ingested into PMDB and is later removed via the denylist, we will still have that info, and any customer still using a clone of that repo will be able to see the license info. That being said, I will not process the denylist.
@idawson I would like your feedback on the following risk that I have identified. On my laptop it takes around 1 hour to scrape all Swift package versions. This is basically the amount of time the feeder will need every time it is executed. I tried to perform the whole process with multiple threads, but I quite quickly start getting 429s. Honestly, spending 1h on this process is not that bad, meaning the performance of the feeder is not that important. Edit: it takes 3 mins using git ls-remote.
The risk comes when the license-interfacer comes into the picture. With one thread on my laptop it takes several hours to get the licenses for all Swift packages for all versions. Of course, this only happens the first time we run the pipeline for Swift, as the feeder will subsequently send only deltas to the license-interfacer.
I am afraid that since the packages with their versions will arrive as Pub/Sub messages, the license-interfacer will scale out (it runs on Cloud Run), resulting in 429s. My assumption here is that most of these requests to GitHub will share the same IP and will therefore get rate limited. That means the first time we run the Swift pipeline, the license-interfacer will probably fail to process all the messages. WDYT?
A possible solution to this problem is to run the whole process locally and store the data in the database. At the same time I can create a first state, so that when the feeder runs for the first time it will have a cursor. In other words, it won't need to send all packages to the license-interfacer, only the ones that have changed. The downside is that this might be difficult to reproduce; I could write a script for that purpose, though.
In the license-interfacer I see that for Conan we clone the whole repo. This seems like a more time-consuming process than directly fetching the license file, which we assume is in the root dir of the repo. I am wondering, though, why that process is not rate limited. Is that approach better?
Tags can be requested using the GitHub API for fetching references. The problem with the GitHub API is rate limiting. The rate limits differ between unauthenticated and authenticated requests: unauthenticated requests are limited to 60 per hour per IP, while authenticated requests get 5,000 per hour.
In order to request licenses for packages we used the following strategy:
Perform a partial git clone in order to avoid cloning big blobs and to improve performance.
We took the design decision of processing multiple versions of a package in one go: the feeder is expected to group multiple versions of the same package into one message. This way we can clone the repo once and check out the various versions, as sketched below.
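For reference, the core of this strategy comes down to something like the following sketch (illustrative, not the actual interfacer code; it probes only the root LICENSE name for brevity, whereas the real flow tries the candidate file names listed earlier):

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// licensesForVersions clones the repo once with a blob filter (blobs are
// fetched lazily), then checks out the license file for each tag.
func licensesForVersions(repoURL string, tags []string) (map[string]string, error) {
	dir, err := os.MkdirTemp("", "swift-license-")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)

	// Partial clone: no blobs are downloaded up front.
	clone := exec.Command("git", "clone", "--filter=blob:none", "--no-checkout", repoURL, dir)
	if err := clone.Run(); err != nil {
		return nil, err
	}

	licenses := map[string]string{}
	for _, tag := range tags {
		// Checking out a single path fetches just that blob on demand.
		co := exec.Command("git", "-C", dir, "checkout", tag, "--", "LICENSE")
		if err := co.Run(); err != nil {
			continue // this version has no LICENSE file at the root
		}
		data, err := os.ReadFile(filepath.Join(dir, "LICENSE"))
		if err != nil {
			return nil, err
		}
		licenses[tag] = string(data)
	}
	return licenses, nil
}

func main() {
	out, err := licensesForVersions("https://github.com/Moya/Moya.git", []string{"14.0.0", "15.0.0"})
	if err != nil {
		panic(err)
	}
	for tag := range out {
		fmt.Println("got license for", tag)
	}
}
```

Because blobs are fetched lazily, each checkout downloads only the license blob for that tag, which is what keeps this fast despite cloning a repo per package.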
Result
Using 20 threads on a MacBook, it took 2h31m to process all Swift packages and versions. This number is not too far from other PURL types. Please also take into account that we will not be performing this process for all packages every time: the Swift feeder will feed only deltas daily, and we will feed all Swift packages (IGNORE_CURSOR) only once per month. In theory the license-interfacer should scale well since it runs on Cloud Run instances.