ProjectDailyStatisticsWorker deduplication strategy
`ProjectDailyStatisticsWorker` regularly accumulates far too many jobs in the `project_daily_statistics` queue.
`ProjectDailyStatisticsWorker` is called with a single argument, the `project_id`, so it shows up as a duplicate quite often in our new duplicate-of log:
https://log.gprd.gitlab.net/goto/6ca7a1341fae85518e44aa61b2a1510e
It turns out we cannot deduplicate this: the worker maintains the `fetch_count` column by incrementing whatever value previous runs left behind, so it counts on each job running exactly once.
```ruby
increment_fetch_count_sql = <<~SQL
  INSERT INTO #{table_name} (project_id, date, fetch_count)
  VALUES (#{project.id}, '#{Date.today}', 1)
  ON CONFLICT (project_id, date) DO UPDATE SET fetch_count = #{table_name}.fetch_count + 1
SQL
```
This job is scheduled on a hot path, the `git_upload_pack` endpoint, so seeing this many "duplicates" is expected.
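For context, the scheduling side is roughly a one-line enqueue on every fetch. The hook name below is illustrative, not GitLab's actual upload-pack code:

```ruby
# Illustrative sketch: every git fetch/pull enqueues one job with the same
# single argument, the project id, which is why duplicates pile up.
def after_git_upload_pack(project)
  ProjectDailyStatisticsWorker.perform_async(project.id)
end
```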
So in summary, today we have:
- A git fetch/pull happens
- `ProjectDailyStatisticsWorker` is scheduled
- The job updates a row with `fetch_count + 1` (sketched below)
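A minimal sketch of that flow, assuming a `perform(project_id)` signature; the class plumbing is simplified and only the upsert SQL mirrors the snippet above:

```ruby
# Sketch of the current worker; not the actual implementation.
class ProjectDailyStatisticsWorker
  include ApplicationWorker

  def perform(project_id)
    project = Project.find_by_id(project_id)
    return unless project

    # Every run adds exactly one to fetch_count, so dropping a "duplicate"
    # job would undercount fetches -- this is why deduplication is unsafe today.
    ActiveRecord::Base.connection.execute(<<~SQL)
      INSERT INTO project_daily_statistics (project_id, date, fetch_count)
      VALUES (#{project.id}, '#{Date.today}', 1)
      ON CONFLICT (project_id, date) DO UPDATE
        SET fetch_count = project_daily_statistics.fetch_count + 1
    SQL
  end
end
```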
Reason to do this work:
The `ProjectDailyStatisticsWorker` processed 500k "duplicate" jobs in the past 7 days, and spent 20 hours on that. Duplicate jobs are jobs that get scheduled when there is already a job in the queue for the same worker with the same arguments. (#178 (closed))
Proposal:
- A git fetch/pull happens
- Redis holds a `<fetch-count>:<project-id>` key which we `INCR` on every fetch.
- On a fetch, we also schedule `UpdateProjectDailyStatisticsWorker` (a new worker) in an hour, wrapped in a lease that expires in 1 hour. This means we will only write the counter to the database once every hour per project.
- The new worker reads the count from Redis, which contains the count for that hour, writes it to the database, and resets the counter (rough sketch after this list).
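A rough sketch of how this could look. The key format and worker internals are assumptions for illustration, as is the use of `Gitlab::Redis::SharedState` and `Gitlab::ExclusiveLease`; it is not a finished design:

```ruby
# Fetch path: count the fetch in Redis and schedule the flush at most once
# per project per hour (guarded by an exclusive lease).
def count_fetch(project)
  key = "fetch-count:#{project.id}"

  Gitlab::Redis::SharedState.with { |redis| redis.incr(key) }

  lease = Gitlab::ExclusiveLease.new("update_project_daily_statistics:#{project.id}",
                                     timeout: 1.hour.to_i)
  return unless lease.try_obtain

  UpdateProjectDailyStatisticsWorker.perform_in(1.hour, project.id)
end

# New worker: flush the accumulated counter to the database and reset it.
class UpdateProjectDailyStatisticsWorker
  include ApplicationWorker

  def perform(project_id)
    key = "fetch-count:#{project_id}"

    # GETSET returns the old value and resets the key in one step, so
    # increments racing with the flush land in the next flush instead of
    # being lost.
    count = Gitlab::Redis::SharedState.with { |redis| redis.getset(key, 0) }.to_i
    return if count == 0

    # Same upsert as today, but adding the whole hourly count in one write.
    ActiveRecord::Base.connection.execute(<<~SQL)
      INSERT INTO project_daily_statistics (project_id, date, fetch_count)
      VALUES (#{project_id}, '#{Date.today}', #{count})
      ON CONFLICT (project_id, date) DO UPDATE
        SET fetch_count = project_daily_statistics.fetch_count + #{count}
    SQL
  end
end
```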

