Skip to content

Identifying which commits belong to the same project

For several research projects, we need to be able to tell whether two commits belong to the same project. For now, the definition that we have in mind for "same project" is whether the two commits have the same set of initial commits (most of the time, that's a single initial commit).

Here are some ideas that we have discussed with various people to achieve this:

  • An idea by Zack was that it is enough to identify the connected components of the subgraph corresponding to the history layer (only the commit nodes and the parent-child edges). On this subgraph, it would be possible to run the existing algorithms for computing strongly connected components (https://docs.rs/webgraph-algo/latest/webgraph_algo/sccs/index.html) after having converted the subgraph into a webgraph (which we can do using https://docs.rs/swh-graph/latest/swh_graph/views/struct.WebgraphAdapter.html) and converted this (directed) graph into a symmetric (undirected) one. Having a connected component identifier is actually not sufficient to identify projects because anyone can create a commit that merges two unrelated histories, thus merging their connected components. However, if we have done the first step of identifying the SCC, we can then check how many initial commits (commits without parents) there are in the SCC. If there is a single initial commit, then we can consider all the commits from the SCC to belong to the same project. If there are several initial commits, then we can go through the SCC by generation to identify which commits have a single ancestor initial commits and which ones have several (and which ones if there are more than two).

  • A possibly simpler idea that we have discussed with Antonin Delpeuch during the CONGRATS days on community health in Lyon was to simply start from the set of initial commits and (in parallel) explore their descendants and "tag" them with the initial commit. This is not a visit because commits which have several ancestor initial commits will be visited and tagged several times. Possible solutions for tagging:

    • a DashMap from commit to set of initial commits
    • a table mapping initial commits to descendant commits that we can build in parallel for each initial commit, reduce into a single table at the end, and then transform into a table mapping commits to the set of their ancestor initial commits.