Measure Code Churn

As part of https://gitlab.com/gitlab-com/www-gitlab-com/issues/3931 we need to settle on a repeatable method of measuring code churn.

I've gone down two paths that we can look at.

Using Code-maat

Code-maat is a tool for analyzing VCS data, such as churn in a git repository 🎉

To use it we first run:

$ git log --all --numstat --date=short --pretty=format:'--%h--%ad--%aN' --no-renames --after=2018-01-01 > changes.log

Run this inside the repository to get a log file with the relevant changes.

We then compile code-maat following the instructions in its README and run it with:

$ java -jar ./target/code-maat-1.1-SNAPSHOT-standalone.jar -l /Users/erushton/Projects/gitlab-com/gitlab-ce/changes.log -c git2 -a entity-churn

This gives us output like:

entity,added,deleted,commits
spec/fixtures/git-cheat-sheet.pdf,260846,0,2
.yarn/releases/yarn-1.15.2.js,136204,0,1
yarn.lock,42433,31175,401

[snip]

qa/qa.rb,1194,731,187
spec/models/event_spec.rb,1193,515,22
app/assets/javascripts/diffs/store/actions.js,1192,317,92
spec/features/issues/gfm_autocomplete_spec.rb,1192,755,40
app/uploaders/object_storage.rb,1190,707,45

Then we need to filter out irrelevant file types like PNGs. We can do this by saving the output of the code-maat command above to a CSV, grepping for the file types we want to keep, and summing the added and deleted columns:

$ cat entity_churn.csv | grep -E "(\.js|\.vue|\.rb|\.erb|\.rake|\.yml|\.yaml|\.html|\.haml|\.css)" | awk '{split($0,a,","); adds += a[2]; dels += a[3]} END {print adds, dels}'

Results: 2741651 2097018
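
One caveat with the grep above is that it matches the extension anywhere in the line, so `\.js` also keeps `.json` files. A minimal sketch of a stricter filter follows; it skips the CSV header and only keeps rows whose path ends in one of the extensions. The function name `sum_entity_churn` is made up for illustration, it assumes no file path contains a comma, and its totals would likely differ from the results above.

```shell
# Hypothetical helper: sum the added/deleted columns of a code-maat
# entity-churn CSV read from stdin, keeping only wanted file types.
sum_entity_churn() {
  awk -F, '
    NR == 1 { next }   # skip the "entity,added,deleted,commits" header
    # keep rows whose path ENDS in one of these extensions
    $1 ~ /\.(js|vue|rb|erb|rake|yml|yaml|html|haml|css)$/ {
      adds += $2; dels += $3
    }
    END { printf("%d %d\n", adds, dels) }
  '
}

# Usage: sum_entity_churn < entity_churn.csv
```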

Using plain git log

We can get similar information from plain git log by running:

$ git log --numstat --pretty="%H" --after=2018-01-01

We can then grep and awk it to count the plus/minus:

$ git log --numstat --pretty="%H" --after=2018-01-01 | grep -E "(\.js|\.vue|\.rb|\.erb|\.rake|\.yml|\.yaml|\.html|\.haml|\.css)" | awk 'NF==3 {plus+=$1; minus+=$2} END {printf("+%d, -%d\n", plus, minus)}'

+696946, -417645

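Keen observers will note that these approaches show very different results even though I ran them on the same repo. One visible difference is that the code-maat log was generated with --all and --no-renames, while the plain git log command used neither. We need to dive in and see what's going on, then decide which approach to take.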
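
One way to start digging could be to aggregate the git log numstat output per file, in roughly the same shape as the code-maat CSV, so the two can be compared file by file. This is only a sketch: `numstat_per_file` is a made-up name, and it assumes the tab-separated `added<TAB>deleted<TAB>path` lines that `--numstat` prints (binary files show `-` and count as 0 here).

```shell
# Hypothetical helper: turn `git log --numstat` output on stdin into
# "path,added,deleted" lines, one per file, sorted by path.
numstat_per_file() {
  awk -F'\t' '
    # numstat lines have exactly three tab-separated fields;
    # commit hash lines and blank lines do not.
    NF == 3 { added[$3] += $1; deleted[$3] += $2 }
    END {
      for (f in added) printf("%s,%d,%d\n", f, added[f], deleted[f])
    }
  ' | LC_ALL=C sort
}

# Usage (same flags as the code-maat log, to compare like with like):
#   git log --all --numstat --no-renames --pretty="%H" --after=2018-01-01 \
#     | numstat_per_file > gitlog_per_file.csv
```

Comparing that file against the first three columns of entity_churn.csv should show whether the gap comes from a handful of files or is spread across the board.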

The CE repo is presumably our biggest repo, and this is fairly fast to run, so I don't see why we couldn't use it and avoid a sampling approach. Once we settle on a method, we can count the lines of code at either the start or the end of the period and express churn as a percentage of that.
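
As a rough sketch of that last step, the percentage could be computed like this. The formula is an assumption on my part (added plus deleted lines, divided by total lines of code), since we haven't agreed on a definition yet, and `churn_percent` is a made-up name.

```shell
# Hypothetical helper: churn_percent ADDED DELETED TOTAL_LINES
# Assumed definition: (added + deleted) / total lines of code * 100.
churn_percent() {
  awk -v a="$1" -v d="$2" -v t="$3" 'BEGIN {
    if (t <= 0) exit 1              # refuse to divide by zero
    printf("%.1f\n", (a + d) / t * 100)
  }'
}

# Usage: churn_percent <added> <deleted> <total lines at start or end of period>
```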

Additionally, we could probably add a few more file types (.toml, .go, etc.), turn this into a generic script we can use on any repo, and then create a CI job that takes a project as a variable and calculates the churn. Or we just accept that CE dwarfs the work done on the other repos, so it's probably an accurate representation by itself.

CC @jhampton @darbyfrey @clefelhocz1