Decide how to measure changes in lines and line modifications
Problem
Need a single number to represent "lines changed": For Code Hotspots, Productivity Analytics, and possibly other features, we plan to measure changes in lines as a single number. But that's not as simple as it sounds, because that is not how git measures changes.
git tracks 2 things separately: lines added and lines removed. To combine those two numbers into a single number we need an agreed approach. Yes, we could simply add the two numbers together. But that would mean all line modifications would count as 2 changes. And humans clearly disagree with that - the human who modified this line would say only 1 line changed, despite git saying 1 line removed and 1 line added:
    -Status: Original
    +Status: Modified
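To make the double-counting concrete, here is a minimal sketch that totals the per-file counts reported by `git diff --numstat` (tab-separated added count, removed count, and path, with `-` in place of counts for binary files). The function name `sum_numstat` and the sample text are illustrative assumptions, not part of any existing codebase.

```python
def sum_numstat(numstat_text):
    """Return (added, removed) totals from `git diff --numstat` output."""
    added = removed = 0
    for line in numstat_text.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # not a numstat row
        a, r = parts[0], parts[1]
        if a == "-" or r == "-":
            continue  # binary file: git reports no line counts
        added += int(a)
        removed += int(r)
    return added, removed


# Hypothetical numstat row for the one-line modification shown above.
sample = "1\t1\tstatus.txt\n"
added, removed = sum_numstat(sample)
print(added, removed, added + removed)  # prints "1 1 2"
```

Naively summing the two columns reports 2 changes for what a human would call 1 modified line.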
Need a measure of "modified lines": For Pre-defined Type of Work, and possibly other features, we plan to measure line modifications separately from lines added and removed. But again, git only tracks lines added and removed, not lines modified. Here we definitely need some way of determining which combinations of added and removed lines to classify as modified lines.
There are multiple approaches we could use. So prepare yourself for a wild intellectual ride through the fascinating world of measuring changes in lines!
Terminology note
The common term "lines of code" (abbreviated "LOC") is not accurate for our purposes. We don't seem to have any intention of excluding non-code lines like comments and empty lines. So we should probably say "lines", not "lines of code".
Some ways to measure changes by lines
- "Show separate sums": Assume nothing and always show additions and deletions separately; Modifying 1 line counts as 1 line removed and 1 line added. (This is not a solution for us, as it does not allow us to calculate a single number of lines changed, or modifications.)
- "Assume no modifications": Assume additions and deletions in the same place are separate; Modifying 1 line counts as 2 lines changed.
- "Assume modifications": Assume removals and additions in the same place are modifications; Modifying 1 line counts as 1 line changed. (This requires evaluating diffs by change groups)
- "Calculate modifications": Assume removals and additions in the same place are modifications if analysis using something like git diff --word-diff, similar_text, or diff-so-fancy determines the lines are sufficiently similar.
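The first three approaches can be compared side by side with a rough sketch like the one below. It assumes a "change group" is a contiguous run of `-` and `+` lines inside a unified diff hunk, and that under "assume modifications" each group counts max(removed, added) lines changed, of which min(removed, added) are modifications. Both definitions are assumptions made for illustration, as are the names `change_groups` and `measure`.

```python
def change_groups(diff_text):
    """Return a list of (removed_lines, added_lines), one per change group."""
    groups = []
    removed, added = [], []
    in_hunk = False

    def flush():
        nonlocal removed, added
        if removed or added:
            groups.append((removed, added))
        removed, added = [], []

    for line in diff_text.splitlines():
        if line.startswith("@@"):        # hunk header
            flush()
            in_hunk = True
        elif line.startswith("diff "):   # next file: skip its header lines
            flush()
            in_hunk = False
        elif not in_hunk or line.startswith("\\"):
            continue                     # file headers, "\ No newline..." marker
        elif line.startswith("-"):
            if added:
                flush()                  # a removal after additions starts a new group
            removed.append(line[1:])
        elif line.startswith("+"):
            added.append(line[1:])
        else:
            flush()                      # a context line ends the current group
    flush()
    return groups


def measure(diff_text):
    """Compute the candidate measures for one unified diff."""
    added = removed = changed = modified = 0
    for rem, add in change_groups(diff_text):
        removed += len(rem)
        added += len(add)
        modified += min(len(rem), len(add))  # assumed pairing rule
        changed += max(len(rem), len(add))
    return {
        "separate_sums": (added, removed),
        "assume_no_modifications": added + removed,
        "assume_modifications": changed,
        "modified_lines": modified,
    }


# Illustrative diff: one modified line plus one appended line.
sample_diff = "\n".join([
    "diff --git a/status.txt b/status.txt",
    "--- a/status.txt",
    "+++ b/status.txt",
    "@@ -1,3 +1,4 @@",
    " Title",
    "-Status: Original",
    "+Status: Modified",
    " Body",
    "+Footer",
])
print(measure(sample_diff))
```

For this sample, "assume no modifications" yields 3 lines changed, while "assume modifications" yields 2 lines changed, 1 of which is a modification.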
Comparables
All these ways are in use in the wild:
- git diff --stat and git log --stat use both "assume no modifications" and "show separate sums"
- git diff --word-diff uses "calculate modifications"
- GitClear uses something even more than "assume modifications" ("We look at changes like additions, updates, and removals")
- GitHub Code Frequency and Contributors charts use "show separate sums"
- Microsoft TFS 2010 uses "assume modifications" or "calculate modifications"
Alternatives to making this decision
We could modify all charts and features to display added and removed separately, and never talk about "modified". Aside from being challenging, this seems like a lost opportunity to provide value, because it leaves users guessing how many added lines are actually double-counted line modifications.
Discussion
I think "show separate sums" and "assume no modifications" are both badly flawed for the same reason: they double-count modifications. I really don't think we should use either.
I think "calculate modifications" is best for accuracy, but is it technically feasible to word-diff every commit? What if we did something crazy like write an extension for git?
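As a rough feasibility probe for "calculate modifications", per-line similarity can be computed in-process rather than by word-diffing through git. The sketch below uses Python's standard `difflib.SequenceMatcher` as a stand-in for word-diff or similar_text. It pairs removed and added lines of one change group in order and counts a pair as a modification when its similarity ratio reaches a threshold. The in-order pairing, the 0.5 threshold, and the function name are all assumptions that would need tuning against real commits.

```python
from difflib import SequenceMatcher


def count_modifications(removed, added, threshold=0.5):
    """Classify one change group.

    Returns (modified, added_only, removed_only), where a removed/added
    pair counts as modified if its similarity ratio is >= threshold.
    """
    modified = 0
    for old, new in zip(removed, added):  # assumed pairing: same position
        if SequenceMatcher(None, old, new).ratio() >= threshold:
            modified += 1
    return modified, len(added) - modified, len(removed) - modified


# The earlier example pair is similar enough to count as one modification.
print(count_modifications(["Status: Original"], ["Status: Modified"]))
# A completely different replacement counts as one removal plus one addition.
print(count_modifications(["aaaa"], ["zzzz"]))
```

Since this runs only on lines already inside a change group, the extra cost scales with the number of changed lines rather than with file size.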
I think "assume modifications" is second-best for accuracy. It's more realistic, because when lines are removed and added in the same place they are more often modifications than separately removed and added lines (what "assume no modifications" assumes). As a result it's more fair, because modifying a line is as difficult as adding a line, not 2x more difficult (as "assume no modifications" implies). But is it technically feasible to analyze every diff by change group as necessary?
Which approach is best for us? Please debate!