ProjectDailyStatisticsWorker deduplication strategy
`ProjectDailyStatisticsWorker` regularly accumulates far too many jobs in the `project_daily_statistics` queue.
`ProjectDailyStatisticsWorker` is called with a single argument, the `project_id`, so it shows up as a duplicate quite often in our new duplicate-of log:
https://log.gprd.gitlab.net/goto/6ca7a1341fae85518e44aa61b2a1510e
It turns out we cannot deduplicate this: the worker maintains the `fetch_count` column by incrementing whatever value previous runs left behind, so it counts on each job running exactly once.
```ruby
increment_fetch_count_sql = <<~SQL
  INSERT INTO #{table_name} (project_id, date, fetch_count)
  VALUES (#{project.id}, '#{Date.today}', 1)
  ON CONFLICT (project_id, date) DO UPDATE SET fetch_count = #{table_name}.fetch_count + 1
SQL
```
This job is scheduled on a hot path, the `git_upload_pack` endpoint, so seeing this many "duplicates" is expected.
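For context, the scheduling side is roughly a one-line enqueue on every fetch. The hook name below is illustrative, not GitLab's actual upload-pack code:

```ruby
# Illustrative sketch: every git fetch/pull enqueues one job with the same
# single argument, the project id, which is why duplicates pile up.
def after_git_upload_pack(project)
  ProjectDailyStatisticsWorker.perform_async(project.id)
end
```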
So in summary, today we have:
- A git fetch/pull happens
- `ProjectDailyStatisticsWorker` is scheduled
- The job updates a row with `fetch_count + 1` (sketched below)
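A minimal sketch of that flow, assuming a `perform(project_id)` signature; the class plumbing is simplified and only the upsert SQL mirrors the snippet above:

```ruby
# Sketch of the current worker; not the actual implementation.
class ProjectDailyStatisticsWorker
  include ApplicationWorker

  def perform(project_id)
    project = Project.find_by_id(project_id)
    return unless project

    # Every run adds exactly one to fetch_count, so dropping a "duplicate"
    # job would undercount fetches -- this is why deduplication is unsafe today.
    ActiveRecord::Base.connection.execute(<<~SQL)
      INSERT INTO project_daily_statistics (project_id, date, fetch_count)
      VALUES (#{project.id}, '#{Date.today}', 1)
      ON CONFLICT (project_id, date) DO UPDATE
        SET fetch_count = project_daily_statistics.fetch_count + 1
    SQL
  end
end
```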
Reason to do this work:
The `ProjectDailyStatisticsWorker` processed 500k "duplicate" jobs in the past 7 days, and spent 20 hours on that. Duplicate jobs are jobs that get scheduled when there is already a job in the queue for the same worker with the same arguments. (#178 (closed))
Proposal:
- A git fetch/pull happens
- Redis holds a `<fetch-count>:<project-id>` key which we `INCR` on every fetch.
- On a fetch, we also schedule `UpdateProjectDailyStatisticsWorker` (a new worker) in an hour, wrapped in a lease that expires in 1 hour. This means we will only write the counter to the database once every hour per project.
- The new worker reads the count from Redis, which contains the count for that hour, writes it to the database, and resets the counter (rough sketch after this list).
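A rough sketch of how this could look. The key format and worker internals are assumptions for illustration, as is the use of `Gitlab::Redis::SharedState` and `Gitlab::ExclusiveLease`; it is not a finished design:

```ruby
# Fetch path: count the fetch in Redis and schedule the flush at most once
# per project per hour (guarded by an exclusive lease).
def count_fetch(project)
  key = "fetch-count:#{project.id}"

  Gitlab::Redis::SharedState.with { |redis| redis.incr(key) }

  lease = Gitlab::ExclusiveLease.new("update_project_daily_statistics:#{project.id}",
                                     timeout: 1.hour.to_i)
  return unless lease.try_obtain

  UpdateProjectDailyStatisticsWorker.perform_in(1.hour, project.id)
end

# New worker: flush the accumulated counter to the database and reset it.
class UpdateProjectDailyStatisticsWorker
  include ApplicationWorker

  def perform(project_id)
    key = "fetch-count:#{project_id}"

    # GETSET returns the old value and resets the key in one step, so
    # increments racing with the flush land in the next flush instead of
    # being lost.
    count = Gitlab::Redis::SharedState.with { |redis| redis.getset(key, 0) }.to_i
    return if count == 0

    # Same upsert as today, but adding the whole hourly count in one write.
    ActiveRecord::Base.connection.execute(<<~SQL)
      INSERT INTO project_daily_statistics (project_id, date, fetch_count)
      VALUES (#{project_id}, '#{Date.today}', #{count})
      ON CONFLICT (project_id, date) DO UPDATE
        SET fetch_count = project_daily_statistics.fetch_count + #{count}
    SQL
  end
end
```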

