Track the commit SHA for a project's languages data

The following discussion from gitlab-ce!19480 should be addressed:

  • @nick.thomas started a discussion: (+6 comments)

    When we drop a worker, either because the feature is disabled or because another worker is already running, we lose information about the generated statistics being out of date. Unlike Geo replication, this information isn't critically important, so perhaps it's not a big deal, but it still makes me a little uncomfortable.

    A subsequent git push will bring them back up to date, of course.

    I had a think about alternative things we could do here. Tracking the SHA we've indexed, as we do for elasticsearch and geo, would allow us to make the "out of date statistics" case discoverable. WDYT?

With the merge of gitlab-ce!19480, we start tracking the languages a repository contains. However, there are a number of cases where those statistics can become outdated. GitLab doesn't notice, and so the out-of-date information is displayed to the user as if it's up to date.

Since we don't know the data is out of date, we don't have any strategy to automatically bring it back up to date. Instead, we rely on the user noticing and performing a git push against the project to bring it up to date again.

Scenarios in which the information in the database can lag behind:

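  • The repository_languages feature flag is enabled, then disabled for a time, then re-enabled. git push notifications received while the flag is disabled are discarded.
  • Two git push commands happen close together. The worker for the second push is dropped because the worker for the first is still running, so the statistics only reflect the first push.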
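
To make the two scenarios concrete, here is a small, purely illustrative simulation. It is not GitLab code: the `Project` class, `handle_push`, and `finish_worker` are hypothetical names, and the drop conditions are modelled only as described above (flag disabled, or a worker already running).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Project:
    # All fields are hypothetical stand-ins for real project state.
    head_sha: Optional[str] = None       # latest pushed commit
    analysed_sha: Optional[str] = None   # commit the statistics reflect
    worker_running: bool = False


def handle_push(project: Project, new_sha: str, feature_enabled: bool) -> bool:
    """Record the push; return True if a detection worker was accepted."""
    project.head_sha = new_sha
    # Assumed drop conditions: feature flag off, or a worker is already running.
    if not feature_enabled or project.worker_running:
        return False
    project.worker_running = True
    return True


def finish_worker(project: Project, sha: str) -> None:
    """The worker completes and the statistics now reflect `sha`."""
    project.analysed_sha = sha
    project.worker_running = False


# Scenario 2: two pushes close together.
p = Project()
first_accepted = handle_push(p, "aaa", feature_enabled=True)
second_accepted = handle_push(p, "bbb", feature_enabled=True)  # dropped
finish_worker(p, "aaa")
# The statistics now describe "aaa" while the repository is at "bbb".
```

In both scenarios the dropped push leaves nothing behind, which is why the lag currently goes unnoticed.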

When we generate the repository_languages statistics, we pass a specific commit ID to Gitaly and ask it to generate statistics for that commit. If we store that commit ID in the database alongside the generated statistics, we can compare it with the repository's latest commit and detect whether either of these scenarios has occurred.
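
A minimal sketch of that check, assuming the stored record carries the commit ID next to the language shares. The `LanguageStatistics` type, its field names, and `statistics_outdated` are hypothetical and do not reflect the actual schema.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class LanguageStatistics:
    # Hypothetical record: the commit the statistics were generated for,
    # stored together with the statistics themselves.
    commit_sha: str
    shares: Dict[str, float]  # language name -> percentage of the repository


def statistics_outdated(stats: Optional[LanguageStatistics], latest_sha: str) -> bool:
    """True if no statistics exist or they were generated for another commit."""
    if stats is None:
        return True
    return stats.commit_sha != latest_sha


stored = LanguageStatistics(commit_sha="aaa", shares={"Ruby": 80.0, "Go": 20.0})
up_to_date = not statistics_outdated(stored, latest_sha="aaa")
lagging = statistics_outdated(stored, latest_sha="bbb")
```

A plain equality check only says that the data is stale, not by how much; measuring the distance would need the commit history.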

Two things we can do with this data:

  • Hide the statistics bar on the project page (or make it say "unknown") if it's more than a few commits out of date
  • Automatically schedule update jobs for outdated repositories
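
Both ideas could be sketched roughly as follows. Everything here is an assumption made for illustration: the function names, the representation of history as a simple oldest-to-newest list of SHAs, and the threshold of 3 standing in for "a few commits".

```python
from typing import List, Optional


def commits_behind(history: List[str], stored_sha: Optional[str]) -> Optional[int]:
    """Number of commits between the stored SHA and the newest commit.

    `history` is ordered oldest to newest. Returns None when the stored SHA
    is missing or no longer part of the history (for example after a force push).
    """
    if stored_sha is None or stored_sha not in history:
        return None
    return len(history) - 1 - history.index(stored_sha)


def show_statistics_bar(behind: Optional[int], max_behind: int = 3) -> bool:
    """Show the bar only when the lag is known and within the threshold."""
    return behind is not None and behind <= max_behind


def needs_update_job(behind: Optional[int]) -> bool:
    """Schedule an update when the lag is unknown or greater than zero."""
    return behind is None or behind > 0


history = ["a", "b", "c", "d", "e"]
slightly_behind = commits_behind(history, "b")   # 3 commits behind "e"
far_behind = commits_behind(history, "a")        # 4 commits behind "e"
```

With these assumptions, statistics that are 3 commits behind would still be shown while an update job is scheduled, and statistics that are 4 commits behind would be hidden until the job finishes.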

This is analogous to what we currently do for Elasticsearch's commit and blob indexing: an index_statuses table tracks the last-indexed SHA.