Static Reachability - Java Support: Create Maven metadata scraper for popular + vulnerable packages
Overview
For Java static reachability, we need to map import paths (FQNs) to Maven coordinates. For example: com.google.common.collect → com.google.guava/guava. This mapping varies by version as packages can add, remove, or relocate classes between releases.
We need a scraper that extracts these mappings from JAR files for popular and vulnerable Maven packages, storing them for use during static reachability analysis. We should leverage the existing SR scraper for this purpose.
Proposal and Implementation Plan
Target Packages
- Top 5-10K popular Maven packages (from BigQuery deps.dev)
- All Maven packages with vulnerabilities in GLAD (~2K packages)
- All versions of these packages
- (These two groups probably overlap)
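Because the popular and vulnerable lists probably overlap, the target set can be deduplicated before scraping. A minimal sketch, assuming both lists use the same `groupId:artifactId` naming; `merge_targets` is a hypothetical helper, not part of the existing scraper:

```python
from typing import Iterable, List, Tuple


def merge_targets(popular: Iterable[str], vulnerable: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (all packages to scrape, packages present in both lists)."""
    popular_set, vulnerable_set = set(popular), set(vulnerable)
    # Union: every package is scraped once, even if it appears in both lists.
    targets = sorted(popular_set | vulnerable_set)
    # Intersection: useful for reporting how much the two groups overlap.
    overlap = sorted(popular_set & vulnerable_set)
    return targets, overlap
```

The size of the overlap gives a quick estimate of how many packages the final scrape really has to cover.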
Implementation Steps
Collect packages
- Use BigQuery to fetch popular packages. The following query counts the distinct packages that depend on each Maven package, which is a good indicator of popularity:
SELECT Name, COUNT(DISTINCT Dependent.Name) as DependentCount
FROM `bigquery-public-data.deps_dev_v1.DependentsLatest`
WHERE System = 'MAVEN'
GROUP BY Name
ORDER BY DependentCount DESC
LIMIT 5000
- Extract the vulnerable Maven packages from GLAD.
For each package:
- Fetch all versions from Maven Central (maven-metadata.xml / directory structure)
- For each version, use HTTP range requests to read the JAR structure without downloading the full file
- Extract package paths, assuming that the package name of a class corresponds directly to its directory path inside the JAR. (For example, the class com.myco.calc.CalculatorDemo would be stored within the calculator.jar file as com/myco/calc/CalculatorDemo.class.)
- Handle edge cases: fat JARs, sources JARs, Android AARs
- Skip packages that cannot be imported (e.g., JavaScript packages, Maven archetypes)
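The per-package steps above can be sketched as a few small helpers. This is an illustration under stated assumptions, not the scraper's actual code: all function names are hypothetical, `fetch_range(start, end)` stands in for an HTTP request with a `Range: bytes=start-end` header, ZIP64 archives are not handled, and the repository base URL is assumed to be `https://repo1.maven.org/maven2`.

```python
import struct
import xml.etree.ElementTree as ET
from typing import Callable, List

MAVEN_CENTRAL = "https://repo1.maven.org/maven2"  # assumed repository base URL
EOCD_SIG = b"PK\x05\x06"  # ZIP end-of-central-directory signature
CDH_SIG = b"PK\x01\x02"   # ZIP central-directory file header signature
EOCD_LEN = 22             # fixed EOCD size, excluding the archive comment


def metadata_url(group_id: str, artifact_id: str) -> str:
    """URL of the version listing for one package."""
    return f"{MAVEN_CENTRAL}/{group_id.replace('.', '/')}/{artifact_id}/maven-metadata.xml"


def parse_versions(metadata_xml: str) -> List[str]:
    """Extract all <version> values from a maven-metadata.xml document."""
    root = ET.fromstring(metadata_xml)
    return [v.text for v in root.findall("./versioning/versions/version") if v.text]


def list_jar_entries(size: int, fetch_range: Callable[[int, int], bytes],
                     tail_size: int = 65535 + EOCD_LEN) -> List[str]:
    """List entry names by reading only the ZIP central directory.

    fetch_range(start, end) must return bytes start..end inclusive.
    """
    tail_start = max(0, size - tail_size)
    tail = fetch_range(tail_start, size - 1)
    pos = tail.rfind(EOCD_SIG)
    if pos < 0:
        raise ValueError("end of central directory not found")
    _, _, _, _, total, cd_size, cd_offset, _ = struct.unpack(
        "<4sHHHHIIH", tail[pos:pos + EOCD_LEN])
    if cd_offset >= tail_start:
        # The central directory is already inside the tail we fetched.
        cd = tail[cd_offset - tail_start:cd_offset - tail_start + cd_size]
    else:
        cd = fetch_range(cd_offset, cd_offset + cd_size - 1)
    names, p = [], 0
    for _ in range(total):
        if cd[p:p + 4] != CDH_SIG:
            raise ValueError("corrupt central directory")
        # Name, extra-field and comment lengths sit at offset 28 of the header.
        name_len, extra_len, comment_len = struct.unpack("<HHH", cd[p + 28:p + 34])
        names.append(cd[p + 46:p + 46 + name_len].decode("utf-8", errors="replace"))
        p += 46 + name_len + extra_len + comment_len
    return names


def packages_from_entries(entries: List[str]) -> List[str]:
    """Map .class entry paths to importable package names."""
    packages = set()
    for name in entries:
        if not name.endswith(".class") or name.startswith("META-INF/"):
            continue
        directory = name.rpartition("/")[0]
        if directory:  # classes in the default package cannot be imported
            packages.add(directory.replace("/", "."))
    return sorted(packages)
```

With these helpers, `com/myco/calc/CalculatorDemo.class` yields the package `com.myco.calc`. Fat JARs, sources JARs and AARs would need extra filtering on top of this; the sketch only skips `META-INF/` and default-package classes.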
Build output structure
- Build Radix tree structure mapping import paths → package:version
- Compress and write to java-metadata.json
- Integrate with existing Python scraper infrastructure where applicable
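To illustrate the lookup this structure is meant to support, here is a minimal sketch. It uses a plain trie keyed by package segment rather than a compressed radix tree, and the `#coords` terminal key, function names and version strings are illustrative assumptions, not the real `java-metadata.json` layout:

```python
import json
from typing import Dict, List

COORDS_KEY = "#coords"  # hypothetical terminal key; '#' cannot occur in a Java identifier


def build_trie(mapping: Dict[str, List[str]]) -> dict:
    """Build a nested-dict trie from import path -> list of package:version strings."""
    root: dict = {}
    for import_path, coords in mapping.items():
        node = root
        for segment in import_path.split("."):
            node = node.setdefault(segment, {})
        node.setdefault(COORDS_KEY, []).extend(coords)
    return root


def lookup(trie: dict, fqn: str) -> List[str]:
    """Return the coordinates stored at the longest matching package prefix of fqn."""
    node, best = trie, []
    for segment in fqn.split("."):
        node = node.get(segment)
        if node is None:
            break
        if COORDS_KEY in node:
            best = node[COORDS_KEY]
    return best


# Illustrative data only; the versions are placeholders, not scraped results.
trie = build_trie({
    "com.google.common.collect": ["com.google.guava/guava:1.0", "com.google.guava/guava:2.0"],
})
serialized = json.dumps(trie, separators=(",", ":"))  # compact form before compression
```

Here `lookup(trie, "com.google.common.collect.ImmutableList")` resolves to the two guava entries, while an unknown import such as `org.example.Foo` resolves to an empty list. A real radix tree would additionally merge single-child chains such as `com` → `google` → `common` to reduce file size.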
Validate coverage
- Cross-reference the top 5K popular packages from BigQuery with SaaS usage data (ignore private SaaS packages, identified by a usage count smaller than 50)
- Calculate coverage percentage of real GitLab usage
- Adjust package count (5K vs 10K) based on coverage results
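The coverage check can be sketched as follows. This assumes coverage is weighted by usage count (it could equally be computed per distinct package), and the function name and input shapes are hypothetical:

```python
from typing import Dict, Set


def usage_coverage(usage_counts: Dict[str, int], scraped: Set[str], min_count: int = 50) -> float:
    """Percentage of observed usage that falls on scraped packages.

    Packages used fewer than min_count times are treated as private and ignored.
    """
    public = {name: count for name, count in usage_counts.items() if count >= min_count}
    total = sum(public.values())
    if total == 0:
        return 0.0
    covered = sum(count for name, count in public.items() if name in scraped)
    return 100.0 * covered / total
```

Running this once with the top 5K list and once with the top 10K list as `scraped` would show whether the larger list adds meaningful coverage.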
Suggested MRs / Tasks breakdown:
- Refactor https://gitlab.com/gitlab-org/security-products/license-db/static-reachability-modules-scraper/-/merge_requests/11+s
- Package list collection (BigQuery popular packages + GLAD packages extraction)
- JAR scraping implementation
- Radix tree output generation & static-reachability-metadata repo integration
- Implement the scraper itself:
  - https://gitlab.com/gitlab-org/security-products/license-db/static-reachability-modules-scraper/-/merge_requests/21+s
  - https://gitlab.com/gitlab-org/security-products/license-db/static-reachability-modules-scraper/-/merge_requests/20+s
  - Tasks:
    - Implement scraper/maven
    - Add a CLI flag
    - Update run_scraper.sh
    - Update Readme.md
    - Add a changelog entry
- Create a scheduled pipeline on the scraper that generates Maven data.
→ Use this spike as a reference.
Edited by Orin Naaman