[BE] Productivity Analytics - Type of Work (pre-defined) - data harvesting

Problem to solve

This issue covers only the harvesting of Type of Work data. See the related issues for the other parts.

This issue adds a new chart to the Productivity Analytics page as described in https://gitlab.com/gitlab-org/gitlab-ee/issues/12246

BE Requirements

We should classify the lines of code (LOC) in each MR as:

  • New (number of LOC added),
  • Churn (number of LOC changed or deleted that were last modified less than 1 month ago),
  • Refactoring (number of LOC changed or deleted that were last modified more than 1 month ago).

We can process MR diffs block by block: a block counts as New if it consists only of added lines, and as Modified if it contains at least 1 deleted line. The Modified part is then split into Refactoring or Churn.

Every MR against the repo's default branch should have 3 new metrics, based on the lines it adds, modifies, or removes:

  • new_loc
  • churned_loc
  • refactored_loc
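
The block rule above could be sketched roughly as follows. This is only an illustration, not the planned implementation: `DiffBlock`, `DeletedLine`, `classify_loc` and `CHURN_WINDOW_SECONDS` are hypothetical names, "1 month" is assumed to mean 30 days, lines that are exactly 30 days old are counted as Refactoring, and lines added inside a Modified block are assumed not to count towards any metric.

```ruby
# Hypothetical value objects for illustration; not part of the GitLab codebase.
# deleted_lines holds one entry per deleted/changed line, each carrying the
# time that line was last modified (e.g. taken from blame data).
DeletedLine = Struct.new(:last_modified_at)
DiffBlock = Struct.new(:added_count, :deleted_lines)

# Assumption: "1 month" is treated as 30 days.
CHURN_WINDOW_SECONDS = 30 * 24 * 60 * 60

def classify_loc(blocks, merged_at:)
  totals = { new_loc: 0, churned_loc: 0, refactored_loc: 0 }

  blocks.each do |block|
    # A block with no deleted lines is purely New code.
    if block.deleted_lines.empty?
      totals[:new_loc] += block.added_count
      next
    end

    # Otherwise the block is Modified: split its deleted/changed lines
    # into Churn (recently touched) and Refactoring (older code).
    block.deleted_lines.each do |line|
      age = merged_at - line.last_modified_at # Float seconds
      key = age < CHURN_WINDOW_SECONDS ? :churned_loc : :refactored_loc
      totals[key] += 1
    end
  end

  totals
end
```

For example, an MR with one block of 3 added lines and one block that deletes a week-old line and a months-old line would yield 3 new, 1 churned and 1 refactored LOC.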

Gitaly enhancements

  • Need to modify Gitaly to support git diff --color-moved and to parse its ANSI color encoding. Purpose: Detect cut-and-paste line movements.
  • Need to modify Gitaly to support git diff --word-diff (we can modify/subclass Gitlab::Diff::Parser to handle parsing it). Purpose: Detect line modifications and whitespace changes.
  • Want to modify Gitaly to handle git blame with multiple -L options. Purpose: Improve performance; this is about 3x faster than an ordinary git blame.
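
To make the multiple -L idea concrete, here is a minimal sketch of how the blame arguments might be assembled for several line ranges at once. `blame_args` is a hypothetical helper, not an existing Gitaly or GitLab method; it only builds an argument list and does not run git. The ranges are assumed to be the old-side line ranges of the Modified blocks in the MR diff.

```ruby
# Hypothetical helper for illustration only. Builds the argument list for a
# single `git blame` call covering several line ranges; it does not execute
# anything.
#
# ranges is an array of [first_line, last_line] pairs (1-based, inclusive).
def blame_args(revision, path, ranges)
  raise ArgumentError, 'at least one range is required' if ranges.empty?

  range_flags = ranges.flat_map do |first, last|
    valid = first.is_a?(Integer) && last.is_a?(Integer) && first >= 1 && last >= first
    raise ArgumentError, "invalid range: #{[first, last].inspect}" unless valid

    ['-L', "#{first},#{last}"]
  end

  # `--` separates the revision from the path.
  ['blame', '--porcelain', *range_flags, revision, '--', path]
end

blame_args('master', 'app/models/user.rb', [[1, 5], [20, 30]])
# => ["blame", "--porcelain", "-L", "1,5", "-L", "20,30", "master", "--", "app/models/user.rb"]
```

Passing the arguments as an array rather than as an interpolated string keeps file paths and revisions from being interpreted by a shell.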

Note: Shelling out to git diff is currently possible, but not viable due to security, availability and versioning concerns.
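
As a rough illustration of what parsing --word-diff output could involve, the sketch below extracts removed and added word runs from a single line of git's default (plain) word-diff format, where removals are wrapped in `[-...-]` and additions in `{+...+}`. The method names are hypothetical and this is not how Gitlab::Diff::Parser works today. Plain mode is ambiguous when the source text itself contains these markers, so a real implementation might prefer --word-diff=porcelain.

```ruby
# Markers used by git's default (plain) word-diff mode.
REMOVED_WORDS = /\[-(.*?)-\]/
ADDED_WORDS = /\{\+(.*?)\+\}/

# Hypothetical helper: returns the removed and added word runs found in
# one line of plain word-diff output.
def word_diff_changes(line)
  {
    removed: line.scan(REMOVED_WORDS).flatten,
    added: line.scan(ADDED_WORDS).flatten
  }
end

# Hypothetical helper: treats a line as modified (rather than purely added
# or removed) when it contains both removed and added words.
def modified_line?(line)
  changes = word_diff_changes(line)
  !changes[:removed].empty? && !changes[:added].empty?
end

word_diff_changes('total = [-count-]{+items.size+} + 1')
# => {:removed=>["count"], :added=>["items.size"]}
```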

Technical notes

  • For details on determining the "default branch", see def default_branch? in app/services/git/branch_push_service.rb.
Edited by Dan Jensen