Language stats in GitLab.com repos do not honor .gitattributes annotations
Summary
Within the last 2-3 weeks, the language statistics for one of my private repos have been borked. It appears that the annotations in the .gitattributes file are no longer honored, so generated and documentation files are erroneously included in the language stats. I've been able to reproduce this with a minimal example; details are below.
Steps to reproduce
- Examine the language statistics for a repo with a substantial amount of linguist-generated and/or linguist-vendored code, as indicated in the .gitattributes file. If the repo hasn't been updated recently, you may need to trigger a stats recompute by pushing a new commit.
- Compare to the language stats reported by running github-linguist locally.
Example Project
https://gitlab.com/standage/gitlab-linguist-bug
What is the current bug behavior?
The language stats for the project include 13.6% labeled as "Makefile", although the corresponding code has been labeled as documentation in the .gitattributes file. This matches what is reported when running linguist locally after deleting the .gitattributes file and committing.
$ docker run --rm -v $(pwd):$(pwd) -w $(pwd) -t linguist github-linguist --breakdown
82.61% 9301 Python
13.60% 1531 Makefile
3.79% 427 Perl
Makefile:
Makefile
Perl:
code/get_ce_iso.pl
Python:
code/reader.py
What is the expected correct behavior?
Files annotated as vendored, documentation, or generated should not be included in language stats, as is the case when running linguist locally with the .gitattributes file present.
$ docker run --rm -v $(pwd):$(pwd) -w $(pwd) -t linguist github-linguist --breakdown
95.61% 9301 Python
4.39% 427 Perl
Perl:
code/get_ce_iso.pl
Python:
code/reader.py
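For reference, the two breakdowns differ only in whether the Makefile's bytes are counted in the total. The sketch below recomputes both sets of percentages from the byte counts shown above; the `language_shares` helper and its `excluded` parameter are hypothetical names used for illustration, not part of linguist or GitLab.

```python
# Byte counts per file, taken from the two --breakdown outputs above.
FILES = {
    "code/reader.py": ("Python", 9301),
    "Makefile": ("Makefile", 1531),
    "code/get_ce_iso.pl": ("Perl", 427),
}


def language_shares(files, excluded=()):
    """Return {language: percent} over all files whose path is not excluded."""
    totals = {}
    for path, (language, size) in files.items():
        if path in excluded:
            continue  # e.g. annotated as documentation in .gitattributes
        totals[language] = totals.get(language, 0) + size
    grand_total = sum(totals.values())
    return {lang: round(100 * size / grand_total, 2) for lang, size in totals.items()}


# Current (buggy) behavior: nothing is excluded.
print(language_shares(FILES))
# Expected behavior: the Makefile is excluded from the stats.
print(language_shares(FILES, excluded={"Makefile"}))
```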
Output of checks
This bug happens on GitLab.com.
Proposal
Gitaly should support the overrides as defined by Linguist: https://github.com/github/linguist/blob/master/docs/overrides.md
The go-enry package does not support this out of the box (yet): https://github.com/src-d/enry/issues/18. Even if it did, it probably wouldn't help us much, because go-enry does not support bare Git repositories.
So we should implement handling of gitattributes(5) ourselves. I don't suggest building the parsing ourselves; instead, we can use git-check-attr(1). It has the --stdin option, so we can bulk-process all the files we're calculating stats for. Unfortunately, we don't have anything in place to do this yet.
While we're at it, we should use the filtering helpers so we exclude binary files and others.
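To illustrate the git-check-attr(1) idea, here is a rough sketch in Python (not Gitaly's actual Go code). It feeds NUL-separated paths to `git check-attr --stdin -z` and parses the NUL-separated `<path> <attribute> <value>` triples that come back. The function names are hypothetical. The sketch assumes a regular checkout, so it does not address how attributes would be read in a bare repository, and it ignores Linguist's built-in path defaults.

```python
import subprocess

# Attributes from the Linguist overrides doc that remove files from the stats.
LINGUIST_ATTRS = ["linguist-vendored", "linguist-generated", "linguist-documentation"]


def parse_check_attr(raw: bytes) -> dict:
    """Parse `git check-attr -z` output: repeated <path> NUL <attr> NUL <value> NUL."""
    fields = raw.split(b"\0")
    if fields and fields[-1] == b"":
        fields.pop()  # drop the empty piece after the trailing NUL
    if len(fields) % 3 != 0:
        raise ValueError("unexpected check-attr output")
    attrs = {}
    for i in range(0, len(fields), 3):
        path, name, value = (f.decode("utf-8", "surrogateescape") for f in fields[i:i + 3])
        attrs.setdefault(path, {})[name] = value
    return attrs


def is_excluded(file_attrs: dict) -> bool:
    """True if any override is set (`attr`) or explicitly true (`attr=true`)."""
    return any(file_attrs.get(name) in ("set", "true") for name in LINGUIST_ATTRS)


def check_attrs(repo_dir: str, paths: list) -> dict:
    """Query all paths in one git process instead of one process per file."""
    stdin = b"".join(p.encode("utf-8", "surrogateescape") + b"\0" for p in paths)
    result = subprocess.run(
        ["git", "check-attr", "--stdin", "-z", *LINGUIST_ATTRS],
        cwd=repo_dir, input=stdin, stdout=subprocess.PIPE, check=True,
    )
    return parse_check_attr(result.stdout)
```

With something like this in place, the stats calculation would skip every path for which `is_excluded(attrs[path])` is true before handing the remaining files to language detection.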