Track the commit SHA for a project's languages data
The following discussion from !19480 (merged) should be addressed:
- @nick.thomas started a discussion: (+6 comments) When we drop a worker, either because the feature is disabled or because another worker is already running, we lose information about the generated statistics being out of date. Unlike Geo replication, this information isn't critically important, so perhaps it's not a big deal, but it still makes me a little uncomfortable. A subsequent `git push` will bring them back up to date, of course. I had a think about alternative things we could do here. Tracking the SHA we've indexed, as we do for elasticsearch and geo, would allow us to make the "out of date statistics" case discoverable. WDYT?
With the merge of !19480 (merged), we start tracking the languages a repository contains. However, there are a number of cases where those statistics can become outdated. GitLab doesn't notice, and so the out-of-date information is displayed to the user as if it's up to date.
Since we don't know the data is out of date, we don't have any strategy to automatically bring it back up to date. Instead, we rely on the user noticing and performing a `git push` against the project to bring it up to date again.
Scenarios in which the information in the database can lag behind:
- The `repository_languages` feature flag is enabled, then disabled for a time and re-enabled. `git push` notifications received while it is disabled are discarded.
- Two `git push` commands happen close together. The second `git push` is discarded.
When we generate the `repository_languages` statistics, we pass a specific commit ID to Gitaly and ask it to generate statistics for that commit. If we store the commit ID in the database along with the generated statistics, we can check whether any of these things have happened.
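As a rough illustration of that check, here is a minimal sketch in plain Ruby. The `LanguageStats` struct, its `commit_sha` field, and the `outdated?` method are hypothetical names made up for this example; they are not existing GitLab models or columns.

```ruby
# Hypothetical record: the language percentages plus the commit ID
# that was passed to Gitaly when they were generated.
LanguageStats = Struct.new(:languages, :commit_sha) do
  # The statistics are stale if no commit ID was ever recorded, or if the
  # branch head has moved past the commit they were generated for.
  def outdated?(head_sha)
    commit_sha.nil? || commit_sha != head_sha
  end
end

stats = LanguageStats.new({ "Ruby" => 80.0, "JavaScript" => 20.0 }, "a1b2c3")

up_to_date = stats.outdated?("a1b2c3") # head matches the stored commit
after_push = stats.outdated?("d4e5f6") # a later push was dropped
```

With the stored commit ID, both scenarios above (a disabled feature flag and a discarded second push) show up the same way: the stored SHA no longer matches the head of the branch.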
Two things we can do with this data:
- Hide the statistics bar on the project page (or make it say "unknown") if it's more than a few commits out of date
- Automatically schedule update jobs for outdated repositories
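
Both uses could be sketched roughly as follows. Everything here is an assumption for illustration: the `Project` struct, the `commits_behind` count (how it would be computed is left open), the `MAX_COMMITS_BEHIND` threshold, and both method names are hypothetical.

```ruby
# Hypothetical threshold: how far the statistics may lag before we hide them.
MAX_COMMITS_BEHIND = 3

# Hypothetical in-memory stand-in for a project and its stored statistics SHA.
Project = Struct.new(:id, :head_sha, :stats_sha, :commits_behind)

# Use 1: show the language bar only if statistics exist and are recent enough.
def show_language_bar?(project)
  !project.stats_sha.nil? && project.commits_behind <= MAX_COMMITS_BEHIND
end

# Use 2: select the projects whose statistics should get an update job.
def projects_needing_update(projects)
  projects.select { |project| project.stats_sha != project.head_sha }
end

projects = [
  Project.new(1, "aaa", "aaa", 0),  # up to date
  Project.new(2, "bbb", "b00", 2),  # slightly behind: still shown, but refreshed
  Project.new(3, "ccc", "c00", 9),  # far behind: hidden and refreshed
  Project.new(4, "ddd", nil, nil)   # never generated: hidden and refreshed
]

visible_ids = projects.select { |project| show_language_bar?(project) }.map(&:id)
stale_ids = projects_needing_update(projects).map(&:id)
```

In a real implementation the second method would presumably be a database query feeding a background worker rather than an in-memory filter.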
This is analogous to what we do with Elasticsearch's commit and blob indexing at present: we have an `index_statuses` table that tracks the last-indexed SHA.