Determine high churn and commit count files, and investigate if they can be refactored to improve the development
We recently introduced various guidelines for code reuse and similar practices, and applied them in a few places. Instead of applying these techniques to existing files at random, I want to take a more analytical approach.
GitLab is a large application, and while a lot of files will change over time, like most applications there will be a core set of files that changes (much) more often than the others. We measure changes in two ways:
- The number of commits per file
- The number of changed lines, known as "churn"
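As a rough sketch of how both numbers could be collected, the snippet below parses the text produced by `git log --numstat --format=%H`. The function name `file_stats` and the sample paths are made up for illustration. It assumes renames are not tracked specially (for example by passing `--no-renames`), so a renamed file is counted as two separate paths.

```python
from collections import defaultdict


def file_stats(numstat_log: str) -> dict:
    """Aggregate commit count and added/removed lines per file.

    Expects the output of `git log --numstat --format=%H`: commit hash
    lines mixed with tab-separated "added<TAB>removed<TAB>path" lines.
    """
    stats = defaultdict(lambda: {"commits": 0, "added": 0, "removed": 0})
    for line in numstat_log.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # commit hash or blank line, not a numstat row
        added, removed, path = parts
        entry = stats[path]
        # A path appears at most once per commit, so each row is one commit.
        entry["commits"] += 1
        if added != "-":  # git prints "-" for binary files
            entry["added"] += int(added)
            entry["removed"] += int(removed)
    return dict(stats)


# Hypothetical sample input, not real GitLab history.
sample = (
    "abc123\n\n"
    "3\t1\tapp/models/user.rb\n"
    "10\t0\tlib/foo.rb\n"
    "def456\n\n"
    "2\t2\tapp/models/user.rb\n"
    "-\t-\tlogo.png\n"
)

for path, s in sorted(file_stats(sample).items()):
    churn = s["added"] + s["removed"]
    print(path, s["commits"], churn)
```

Sorting the result by commits or by `added + removed` would then give the two rankings described above.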
Any changes made to these files are likely to have a big impact, as they affect a lot of developers (we assume not all changes are made by the same people). This means that any refactoring improvements made can positively impact developers, possibly even reducing the amount of time necessary to work on these files.
The goal would be to identify high churn files, and try to figure out which specific regions of those files change a lot. For example, I expect that in many of these files only small areas usually change (e.g. a method is added), whereas total rewrites are very rare. With these files identified we should then be able to determine the "frustration" involved in making changes.
For example, if the exact same method changes many times, it is usually an indication of a method that does way too much. For very young startups this could also happen when business requirements radically change, but GitLab is long past that point. If we find such code, we can figure out how to split it up, spreading churn around more, possibly even reducing it.
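One possible way to approximate "which region changed" is to count the context that git prints after the second `@@` in each hunk header of `git log -p -- <file>`. This is only a sketch: `hunk_contexts` is a made-up name, and the context is whatever git's function-name heuristic picks, which may not always be the enclosing method (configuring a language-specific diff driver in `.gitattributes` tends to improve this).

```python
import re
from collections import Counter

# Unified diff hunk header: "@@ -start,len +start,len @@ optional context"
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@ ?(.*)$")


def hunk_contexts(patch_text: str) -> Counter:
    """Count how many hunks were reported under each context heading."""
    counts = Counter()
    for line in patch_text.splitlines():
        match = HUNK_HEADER.match(line)
        if match:
            # Hunks without a heading are grouped under a placeholder.
            counts[match.group(1).strip() or "(no context)"] += 1
    return counts


# Hypothetical patch text, not real GitLab history.
sample_patch = """\
diff --git a/app/models/user.rb b/app/models/user.rb
@@ -10,7 +10,8 @@ def create
 context
+added
@@ -40,6 +41,6 @@ def destroy
-old
+new
@@ -12,3 +12,4 @@ def create
+more
@@ -1 +1 @@
-x
+y
"""

for context, count in hunk_contexts(sample_patch).most_common():
    print(count, context)
```

A heading that shows up far more often than the others in a high churn file would be a candidate for the "method that does too much" case.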
The first step in this entire process is to get some data on which files change often, how many lines we add versus remove, etc. With that data in mind I also want to see if we can correlate changes in these files to bugs. This could be done by retrieving all MRs that changed a particular file, then checking whether there are any associated bug issues. The more bugs per file, the more important it becomes to investigate refactoring it.
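The correlation step could look roughly like the sketch below. It assumes the MR data has already been fetched and flattened into plain dictionaries; the keys `files`, `issues`, `id` and `labels`, the function name `bugs_per_file`, and the `bug` label are all assumptions made for illustration, not an existing API shape.

```python
from collections import defaultdict


def bugs_per_file(merge_requests, bug_label="bug"):
    """Count distinct bug issues linked to MRs that touched each file."""
    bug_issues = defaultdict(set)
    for mr in merge_requests:
        bug_ids = {
            issue["id"]
            for issue in mr["issues"]
            if bug_label in issue["labels"]
        }
        if not bug_ids:
            continue  # this MR is not associated with any bug issue
        for path in mr["files"]:
            # A set avoids double counting a bug fixed across several MRs.
            bug_issues[path] |= bug_ids
    return {path: len(ids) for path, ids in bug_issues.items()}


# Hypothetical data, not real GitLab MRs or issues.
sample_mrs = [
    {"files": ["app/models/user.rb", "lib/foo.rb"],
     "issues": [{"id": 1, "labels": ["bug"]}]},
    {"files": ["app/models/user.rb"],
     "issues": [{"id": 1, "labels": ["bug"]},
                {"id": 2, "labels": ["bug", "regression"]}]},
    {"files": ["lib/bar.rb"],
     "issues": [{"id": 3, "labels": ["feature"]}]},
]

for path, count in sorted(bugs_per_file(sample_mrs).items()):
    print(path, count)
```

Combining this count with the per-file commit and churn numbers would give a simple priority list for which files to investigate first.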