Static Reachability - Java Support: Create Maven metadata scraper for popular + vulnerable packages

Overview

For Java static reachability, we need to map import paths (FQNs) to Maven coordinates. For example: com.google.common.collect → com.google.guava/guava. This mapping varies by version, as packages can add, remove, or relocate classes between releases.

We need a scraper that extracts these mappings from JAR files for popular and vulnerable Maven packages, storing them for use during static reachability analysis. We should leverage the existing SR scraper for this purpose.

Proposal and Implementation Plan

Target Packages

  • Top 5-10K popular Maven packages (from the deps.dev BigQuery dataset)
  • All Maven packages with vulnerabilities in GLAD (~2K packages)
  • All versions of these packages
  • (These two groups probably overlap)
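
Since the two groups likely overlap, the effective target list is their union. A minimal sketch, assuming both sources yield package names as `groupId:artifactId` strings; the function name and the sample values in the test are illustrative, not part of the existing scraper:

```python
def merge_targets(popular, vulnerable):
    """Return (targets, overlap) for two iterables of package names.

    Both inputs are assumed to hold "groupId:artifactId" strings.
    `targets` is the deduplicated union, `overlap` the packages in both.
    """
    popular_set = set(popular)
    vulnerable_set = set(vulnerable)
    targets = sorted(popular_set | vulnerable_set)
    overlap = sorted(popular_set & vulnerable_set)
    return targets, overlap
```

The size of `overlap` gives a quick estimate of how many packages the vulnerable set adds on top of the popular one.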

Implementation Steps

  1. Collect packages
  • Use BigQuery to fetch popular packages. The following query counts the distinct packages that depend on each Maven package, which is a good indicator of popularity:
SELECT Name, COUNT(DISTINCT Dependent.Name) as DependentCount
FROM `bigquery-public-data.deps_dev_v1.DependentsLatest`
WHERE System = 'MAVEN'
GROUP BY Name
ORDER BY DependentCount DESC
LIMIT 5000
  • Extract GLAD Maven vulnerable packages.
  2. For each package:

    • Fetch all versions from Maven Central (maven-metadata.xml / directory structure)
    • For each version, use HTTP range requests to read the JAR structure without downloading the full file
    • Extract package paths, assuming that a class's package name corresponds directly to its directory path inside the JAR. (For example, the class com.myco.calc.CalculatorDemo would be stored in calculator.jar as com/myco/calc/CalculatorDemo.class.)
    • Handle edge cases: fat JARs, sources JARs, Android AARs
    • Skip packages that cannot be imported (e.g., JavaScript packages, Maven archetypes)
  3. Build output structure

    • Build a radix tree mapping import paths → package:version
    • Compress and write to java-metadata.json
    • Integrate with existing Python scraper infrastructure where applicable
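
The range-request and path-extraction steps above could be sketched as follows. This is an illustration under stated assumptions, not the existing scraper's code: `RangeReader`, `http_range_fetch`, and `packages_in_jar` are hypothetical names, the server is assumed to honour `Range` headers, and entries under `META-INF/` (including multi-release class copies) are skipped as a simplification. A JAR is a ZIP archive, so `zipfile` only needs the central directory at the end of the file to list entries.

```python
import io
import urllib.request
import zipfile


class RangeReader:
    """Minimal seekable, read-only file object backed by byte-range fetches.

    `fetch(start, end)` must return bytes start..end inclusive, the same
    convention as the HTTP `Range: bytes=start-end` header.
    """

    def __init__(self, size, fetch):
        self._size = size
        self._fetch = fetch
        self._pos = 0

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        base = {io.SEEK_SET: 0, io.SEEK_CUR: self._pos, io.SEEK_END: self._size}[whence]
        if base + offset < 0:
            raise OSError("negative seek position")
        self._pos = base + offset
        return self._pos

    def read(self, n=-1):
        if n is None or n < 0:
            n = self._size - self._pos
        end = min(self._pos + n, self._size)
        if end <= self._pos:
            return b""
        data = self._fetch(self._pos, end - 1)
        self._pos += len(data)
        return data


def http_range_fetch(url):
    """Build a fetch callable for `url` (assumes the server supports Range)."""
    def fetch(start, end):
        req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
        with urllib.request.urlopen(req) as resp:
            if resp.status != 206:
                raise OSError("server ignored the Range header")
            return resp.read()
    return fetch


def packages_in_jar(fileobj):
    """Return the sorted Java packages that contain at least one .class entry."""
    packages = set()
    with zipfile.ZipFile(fileobj) as jar:
        for name in jar.namelist():
            # Assumption: ignore META-INF (manifests, multi-release copies).
            if not name.endswith(".class") or name.startswith("META-INF/"):
                continue
            directory, _, _ = name.rpartition("/")
            if directory:  # root-level classes (e.g. module-info) are skipped
                packages.add(directory.replace("/", "."))
    return sorted(packages)
```

For a remote JAR, the file size would first be taken from the `Content-Length` of a HEAD request and then passed as `RangeReader(size, http_range_fetch(url))`. Injecting `fetch` keeps the reader testable against in-memory bytes.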
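
To illustrate the "Build output structure" step, here is a rough sketch of a lookup structure for import paths. It uses a plain trie keyed by package segments rather than a compressed radix tree, and it stores coordinates as `group:artifact:version` strings; the class name, the node layout, and the coordinate format are all assumptions made for the example.

```python
class PackageTrie:
    """Trie keyed by dot-separated package segments.

    Each node is a dict with a "children" dict and a "coords" set holding
    the coordinate strings of every release that ships that exact package.
    """

    def __init__(self):
        self.root = {"children": {}, "coords": set()}

    def insert(self, java_package, coordinate):
        node = self.root
        for segment in java_package.split("."):
            node = node["children"].setdefault(
                segment, {"children": {}, "coords": set()}
            )
        node["coords"].add(coordinate)

    def lookup(self, import_path):
        """Return coordinates for the longest known package prefix."""
        node = self.root
        best = set()
        for segment in import_path.split("."):
            node = node["children"].get(segment)
            if node is None:
                break
            if node["coords"]:
                best = node["coords"]
        return sorted(best)


# Usage with sample values (the versions are only illustrative):
trie = PackageTrie()
trie.insert("com.google.common.collect", "com.google.guava:guava:32.1.3-jre")
trie.insert("com.google.common.collect", "com.google.guava:guava:33.0.0-jre")
matches = trie.lookup("com.google.common.collect.ImmutableList")
```

Longest-prefix matching lets an import of a class resolve to the package that contains it. Keeping a set of coordinates per node reflects the fact that the same package can appear in many versions.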

Validate coverage

  • Cross-reference the top 5K popular packages from BigQuery with SaaS usage data, ignoring packages with a usage count below 50, which are treated as private SaaS packages
  • Calculate coverage percentage of real GitLab usage
  • Adjust package count (5K vs 10K) based on coverage results
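
One possible way to compute the coverage figure is sketched below. It assumes usage data arrives as a mapping from package name to usage count and weights coverage by that count. Weighting by distinct packages would be an equally valid reading; the function name and the weighting are assumptions.

```python
def coverage_percent(scraped, usage_counts, min_count=50):
    """Percentage of observed usage that falls on scraped packages.

    `scraped` is a set of package names; `usage_counts` maps package
    name -> usage count. Entries below `min_count` are dropped as likely
    private packages.
    """
    public = {name: n for name, n in usage_counts.items() if n >= min_count}
    total = sum(public.values())
    if total == 0:
        return 0.0
    covered = sum(n for name, n in public.items() if name in scraped)
    return 100.0 * covered / total
```

Running this once with the top 5K list and once with the top 10K list would show whether the larger list improves coverage enough to justify the extra scraping.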

Suggested MRs / Tasks breakdown:


→ Use this spike as a reference.

Edited by Orin Naaman