Backfill project snippets statistics and refresh related ones

What does this MR do?

In this MR we implement a background migration to create or update the statistics of all project snippets. It also updates the snippets_size column in the associated project statistics. Finally, it updates the snippets_size field in the root namespace of each project.

The snippet statistics feature is implemented to automatically update the project and namespace statistics. Nevertheless, because the feature was merged iteratively, snippet statistics and the other associated statistics can be in different states. For example:

  • There can be snippets without statistics
  • There can be snippets with statistics, but with default values
  • There can be snippets with updated values but that didn't update the associated project statistics
  • There can be snippets with updated values, with associated project statistics updated, but not the namespace ones.
  • …

That's why we need to ensure that we recalculate the associated project and namespace statistics even if the snippet statistics are up to date.

There are around 93,013 project snippets in production. These records belong to ~37K projects, which in turn belong to ~29K namespaces.

We've run several tests with different configurations in staging, and each iteration takes around 0.04s. This means the total time to migrate all project snippets would be around one hour. Nevertheless, because we're performing the migration in batches of 500 snippets, the total elapsed time will be closer to 6 hours.

Theoretically, each batch should run in 1.6 minutes; nevertheless, we've set a delay interval of 3 minutes to give Sidekiq and Gitaly enough room to cool off and avoid stressing them.

Ref #223817 (closed)

bin/rake db:migrate

== 20200709101408 SchedulePopulateProjectSnippetStatistics: migrating =========
== 20200709101408 SchedulePopulateProjectSnippetStatistics: migrated (0.0940s)

bin/rake db:rollback

== 20200709101408 SchedulePopulateProjectSnippetStatistics: reverting =========
== 20200709101408 SchedulePopulateProjectSnippetStatistics: reverted (0.0000s)

Query made in the scheduling migration:

SELECT snippets.id
FROM snippets
INNER JOIN projects ON projects.id = snippets.project_id
WHERE snippets.type = 'ProjectSnippet'
ORDER BY projects.namespace_id ASC,
         snippets.project_id ASC,
         snippets.id ASC;

With warm caches, this query takes 0.6ms (query plan).
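As a rough sketch of the scheduling logic (names and helper calls below are illustrative, not the actual MR code — the real migration uses GitLab's migration helpers), the ordered snippet ids can be sliced into fixed-size batches, each scheduled with an increasing delay:

```ruby
# Hypothetical sketch of the batch scheduling, not the actual MR code.
# Real GitLab migrations use helpers such as migrate_in; here we only
# simulate slicing the ordered ids and computing each batch's delay.

BATCH_SIZE = 500
DELAY_INTERVAL = 3 * 60 # 3 minutes between batches, in seconds

def schedule_batches(snippet_ids, batch_size: BATCH_SIZE, interval: DELAY_INTERVAL)
  snippet_ids.each_slice(batch_size).with_index(1).map do |ids, index|
    # The real code would do something along the lines of:
    #   migrate_in(index * interval, 'PopulateProjectSnippetStatistics', [ids])
    { delay: index * interval, ids: ids }
  end
end

batches = schedule_batches((1..1200).to_a)
# 1200 ids in batches of 500 => 3 batches, scheduled at 180s, 360s, 540s
```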

Explanation about the data distribution in the scheduling migration

This background migration is quite particular because it can potentially trigger some background jobs to update the RootStorageStatistics record associated with the snippet.

We don't want to enqueue a lot of jobs or perform several extra operations. Ideally, we should only trigger one job per namespace. For that, we can leverage job idempotency and reduce the number of scheduled jobs.

For example, imagine we have a project with several snippets. When we update the first one, a new job is enqueued; when we update the rest, since an identical job is already in the queue, no new jobs are enqueued.

So, basically, we need to group the data by namespace.
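The deduplication behaviour can be illustrated with a toy model (a minimal sketch; the worker name and queue mechanics below are made up, not GitLab's actual deduplication implementation):

```ruby
require 'set'

# Toy model of idempotent-job deduplication: enqueuing a job that is
# identical (same worker class, same arguments) to one already in the
# queue is a no-op. The worker name below is hypothetical.
class FakeQueue
  def initialize
    @jobs = Set.new
  end

  # Set#add? returns nil when the element was already present
  def enqueue(worker, *args)
    @jobs.add?([worker, args]) ? :enqueued : :deduplicated
  end

  def size
    @jobs.size
  end
end

queue = FakeQueue.new
# Updating three snippets under the same root namespace triggers the
# same statistics-refresh job with identical arguments three times:
results = 3.times.map { queue.enqueue('RootStatisticsWorker', 42) }
# Only the first call actually adds a job; the other two are dropped
```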

We could easily do that and schedule the background migration using a range of namespace ids. But the data distribution is not optimal here, and it could backfire: there can be namespaces with thousands of snippets and others with only one.

We should aim for a constant distribution in the background migration. That's why we opted for sorting the snippets by namespace and project, and passing the snippet ids to the background migration. That way, we always have a constant number of snippets in each batch.

Inside the background migration, we also group the data by project because it allows us to save some db operations.
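A small, made-up example of why slicing the globally ordered snippet ids gives constant-size batches, while batching by namespace id ranges does not:

```ruby
# Made-up data: one namespace with 1000 snippets, ten namespaces with
# one snippet each (the skewed distribution described above).
snippets = Array.new(1000) { |i| { id: i + 1, namespace_id: 1 } } +
           Array.new(10)   { |i| { id: 1001 + i, namespace_id: 2 + i } }

# Batching by namespace id ranges of 5 yields wildly uneven batches:
by_namespace_range = snippets.group_by { |s| (s[:namespace_id] - 1) / 5 }
                             .values.map(&:size)

# Slicing the globally ordered snippet ids into groups of 500 yields
# constant-size batches regardless of how snippets are distributed:
ordered_ids = snippets.sort_by { |s| [s[:namespace_id], s[:id]] }
                      .map { |s| s[:id] }
constant_batches = ordered_ids.each_slice(500).map(&:size)
```

With this data, the namespace-range batches come out as [1004, 5, 1], while the id-based batches are [500, 500, 10].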

Does this MR meet the acceptance criteria?

Conformity
