Tool to mine merge commits

I don't know if it would be in scope for this repository, but it's something I'm interested in working on, so I thought I'd give you a heads-up.

Git merge drivers are algorithms used by git to merge two diverging versions of the same file. They take three versions of the file (base, left branch and right branch) and are tasked with producing the merged version (potentially with merge conflict markers in it). When developing such merge drivers, it's useful to have large corpora of examples of how people actually do those merges in the wild.
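To make the three-way contract concrete, here is a minimal sketch of the simplest possible "driver", working on whole files rather than on hunks. The function name `trivial_merge` and the `left`/`right` marker labels are made up for illustration; real drivers merge at a much finer granularity.

```python
from typing import Tuple


def _with_newline(text: str) -> str:
    """Terminate non-empty text with a newline so markers start on their own line."""
    return text if not text or text.endswith("\n") else text + "\n"


def trivial_merge(base: str, left: str, right: str) -> Tuple[str, bool]:
    """Whole-file three-way merge. Returns (merged_text, has_conflict)."""
    if left == right:
        # Both sides agree (or neither changed anything).
        return left, False
    if left == base:
        # Only the right side changed the file.
        return right, False
    if right == base:
        # Only the left side changed the file.
        return left, False
    # Both sides changed the file differently: emit conflict markers.
    merged = (
        "<<<<<<< left\n"
        + _with_newline(left)
        + "=======\n"
        + _with_newline(right)
        + ">>>>>>> right\n"
    )
    return merged, True
```

Only the last branch is interesting for a corpus: those are the cases where a smarter driver can do better than giving up.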

Such corpora are often built by crawling git repositories for merge commits (restricted to those with exactly two parents, for the sake of simplicity). For each merge commit, we can then check, for each file it touches, whether that file also existed in the parent commits. [1] If the file differs on the two sides, we can gather those three versions (base, left and right) as an example of a real merge. The output of a merge driver can then be compared against those test cases. This has been done by crawling GitHub (for instance, see this paper), and it would perhaps be interesting to do it more widely.
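The per-commit check could be sketched roughly as follows. This is only an illustration under a few assumptions of mine: each commit's tree is represented as a mapping from path to blob id (for instance parsed from `git ls-tree -r <commit>`), the base is the merge base of the two parents, and cases where both sides ended up with identical content are skipped as uninteresting. The names `Tree`, `MergeCase`, `parse_ls_tree` and `find_merge_cases` are hypothetical.

```python
from typing import Dict, List, NamedTuple

# Hypothetical representation of a commit's tree: file path -> blob id.
Tree = Dict[str, str]


class MergeCase(NamedTuple):
    path: str
    base: str
    left: str
    right: str


def parse_ls_tree(output: str) -> Tree:
    """Parse `git ls-tree -r` output, one '<mode> <type> <object>\\t<path>' per line.

    Simplification: git quotes unusual paths unless -z is used; this sketch
    does not undo that quoting.
    """
    tree: Tree = {}
    for line in output.splitlines():
        meta, _, path = line.partition("\t")
        _mode, kind, obj = meta.split()
        if kind == "blob":  # skip submodule entries ("commit")
            tree[path] = obj
    return tree


def find_merge_cases(base: Tree, left: Tree, right: Tree) -> List[MergeCase]:
    """Return the files present in all three trees that both sides changed differently."""
    cases = []
    for path in sorted(base.keys() & left.keys() & right.keys()):
        b, l, r = base[path], left[path], right[path]
        if l != b and r != b and l != r:
            cases.append(MergeCase(path, b, l, r))
    return cases
```

Since the comparison is by path only, a file renamed on one side simply drops out of the intersection, which matches the simplification described in the footnote.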

In terms of specification, I could imagine something like this:

Arguments:

  • A pattern for merged files of interest (for instance *.java)
  • Various limits (such as the maximum number of cases to gather)

Output: a table with the following columns:

  • commit ids of all three commits involved
  • filename of the file being merged
  • SWHIDs of the three file versions
  • if it's not too hard to compute, some info about an origin in which this merge commit appears would be nice

[1] This ignores the cases where the file was moved… I guess running rename detection to handle those cases could theoretically be done, but it's probably not worth it.
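To illustrate the specification above, here is a rough sketch of the output row and of how the two arguments could be applied. Everything in it is an assumption on my side rather than an existing API: the column names, the reading of "all three commits" as base plus the two parents, CSV as the table format, and the helper names `MergeRow`, `select_rows`, `write_table` and `content_swhid`. The content SWHID is computed the same way as a git blob hash.

```python
import csv
import hashlib
from dataclasses import astuple, dataclass, fields
from fnmatch import fnmatchcase
from itertools import islice
from typing import Iterable, Iterator, Optional, TextIO


@dataclass(frozen=True)
class MergeRow:
    # Assumed reading of "all three commits": merge base and both parents.
    base_commit: str
    left_commit: str
    right_commit: str
    filename: str
    base_swhid: str
    left_swhid: str
    right_swhid: str
    origin: Optional[str] = None  # optional, only if cheap to compute


def content_swhid(data: bytes) -> str:
    """SWHID of a file's content, based on the git blob hash of its bytes."""
    header = b"blob %d\x00" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


def select_rows(
    rows: Iterable[MergeRow], pattern: str, max_cases: Optional[int] = None
) -> Iterator[MergeRow]:
    """Keep rows whose filename matches the pattern, up to max_cases rows."""
    matching = (row for row in rows if fnmatchcase(row.filename, pattern))
    return islice(matching, max_cases)  # max_cases=None means no limit


def write_table(rows: Iterable[MergeRow], out: TextIO) -> None:
    """Write a header line followed by one CSV line per row."""
    writer = csv.writer(out)
    writer.writerow([f.name for f in fields(MergeRow)])
    for row in rows:
        writer.writerow(astuple(row))
```

Note that `fnmatchcase` lets `*` match across directory separators, so `*.java` also selects `src/A.java`.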

Edited by Antonin Delpeuch