Track the commit SHA for a project's languages data
The following discussion from gitlab-ce!19480 should be addressed:
- @nick.thomas started a discussion: (+6 comments)
  When we drop a worker, either because the feature is disabled or because another worker is already running, we lose information about the generated statistics being out of date. Unlike Geo replication, this information isn't critically important, so perhaps it's not a big deal, but it still makes me a little uncomfortable.
  A subsequent git push will bring them back up to date, of course.
  I had a think about alternative things we could do here. Tracking the SHA we've indexed, as we do for elasticsearch and geo, would allow us to make the "out of date statistics" case discoverable. WDYT?
With the merge of gitlab-ce!19480, we start tracking the languages a repository contains. However, there are a number of cases where those statistics can become outdated. GitLab doesn't notice, and so the out-of-date information is displayed to the user as if it's up to date.
Since we don't know the data is out of date, we don't have any strategy to automatically bring it back up to date. Instead, we rely on the user noticing and performing a git push against the project to bring it up to date again.
Scenarios in which the information in the database can lag behind:
- The repository_languages feature flag is enabled, then disabled for a time and re-enabled. git push notifications while disabled are discarded.
- Two git push commands happen close together. The second git push is discarded.
When we generate the repository_languages statistics, we pass a specific commit ID to Gitaly and ask it to generate statistics for that commit. If we store the commit ID in the database along with the generated statistics, we can check whether any of these things have happened.
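As a rough illustration of that check, the sketch below compares a stored commit ID against the repository's current HEAD. It is written in Python for brevity rather than as GitLab code; `LanguageStats`, `commit_sha`, and `stats_outdated` are hypothetical names, not existing models or columns.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class LanguageStats:
    """Hypothetical record: language shares plus the commit they were computed from."""
    languages: Dict[str, float]
    commit_sha: Optional[str]  # None means statistics were never generated


def stats_outdated(stats: Optional[LanguageStats], head_sha: str) -> bool:
    """Return True when the stored statistics do not describe the HEAD commit."""
    if stats is None or stats.commit_sha is None:
        return True
    return stats.commit_sha != head_sha


# Simulate "two pushes close together": statistics exist for the first push,
# but the job for the second push was discarded, so HEAD has moved on.
stats = LanguageStats({"Ruby": 80.0, "JavaScript": 20.0}, commit_sha="aaa111")
head_sha = "bbb222"
print(stats_outdated(stats, head_sha))  # True
```

Both scenarios listed above reduce to the same condition: the stored commit ID no longer matches HEAD, regardless of why the update job was dropped.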
Two things we can do with this data:
- Hide the statistics bar on the project page (or make it say "unknown") if it's more than a few commits out of date
- Automatically schedule update jobs for outdated repositories
This is analogous to what we currently do with Elasticsearch's commit and blob indexing: we have an index_statuses table that tracks the last-indexed SHA.
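The two uses of a stored commit ID listed above could be combined roughly as follows. This is a sketch under stated assumptions, not a proposed implementation: the threshold of 3 for "more than a few commits", the in-memory `history` list standing in for the repository's commit log, and the names `commits_behind` and `plan` are all hypothetical.

```python
from typing import Dict, List, Optional

HIDE_THRESHOLD = 3  # assumed reading of "more than a few commits"


def commits_behind(history: List[str], indexed_sha: Optional[str]) -> Optional[int]:
    """Count the commits between the indexed SHA and HEAD.

    `history` is ordered newest first, with HEAD at index 0. Returns None when
    the SHA is unknown, e.g. never indexed or no longer present in the history.
    """
    if indexed_sha is None or indexed_sha not in history:
        return None
    return history.index(indexed_sha)


def plan(history: List[str], indexed_sha: Optional[str]) -> Dict[str, bool]:
    """Decide whether to show the statistics bar and whether to schedule an update."""
    behind = commits_behind(history, indexed_sha)
    outdated = behind is None or behind > 0
    hide = behind is None or behind > HIDE_THRESHOLD
    return {"show_bar": not hide, "schedule_update": outdated}


history = ["e5", "d4", "c3", "b2", "a1"]  # HEAD is "e5"
print(plan(history, "e5"))  # up to date: show the bar, nothing to schedule
print(plan(history, "d4"))  # one commit behind: still shown, update scheduled
print(plan(history, "a1"))  # four commits behind: hidden, update scheduled
```

Treating an unknown SHA as outdated keeps the behaviour conservative: the bar is hidden and an update is scheduled whenever the stored commit cannot be related to the current history.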