Deduplicate Namespaces::ScheduleAggregationWorker as a PoC for job deduplication
During the last 7 days, we performed about 900,000 duplicate jobs for Namespaces::ScheduleAggregationWorker.
This worker gets scheduled in several cases through the UpdateProjectStatistics concern. In total, we've seen these callers:
| Caller (json.meta.caller_id.keyword) | Count | Sum of json.duration | Average json.duration |
|---|---|---|---|
| ProjectImportScheduleWorker | 57,528,644 | 46,921,993.56 | 1.632 |
| PostReceive | 76,136,085 | 21,501,246.07 | 0.565 |
| UpdateAllMirrorsWorker | 57,774,717 | 20,301,395.10 | 0.703 |
| BuildFinishedWorker | 52,909,823 | 6,103,576.09 | 0.231 |
| PipelineProcessWorker | 60,419,877 | 5,540,395.56 | 0.183 |
| /api/:version/jobs/:id | 61,731,197 | 4,308,231.52 | 0.14 |
| /api/:version/jobs/request | 49,571,565 | 2,189,096.06 | 0.088 |
| Repositories::GitHttpController#git_upload_pack | 76,976,916 | 1,793,317.51 | 0.047 |
| PipelineUpdateWorker | 18,106,869 | 1,217,426.92 | 0.134 |
| RepositoryUpdateMirrorWorker | 30,791,186 | 1,095,830.15 | 0.071 |
Source: https://log.gprd.gitlab.net/goto/b5adba783edac35da929be89e3889d83
This worker creates a Namespaces::AggregationSchedule record to be consumed by the Namespaces::RootStatisticsWorker.
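This is also why duplicate jobs are pure wasted work: creating the schedule record has find-or-create semantics, so a second identical run changes nothing. A minimal, self-contained sketch of that behavior (the class and method names below are illustrative stand-ins, not the real GitLab models):

```ruby
# In-memory stand-in for Namespaces::AggregationSchedule, used only to
# illustrate why running the worker twice with the same argument is a no-op.
class FakeAggregationSchedule
  @records = {}

  class << self
    attr_reader :records

    # find-or-create semantics: a second call with the same namespace_id
    # returns the existing record instead of creating a new one.
    def safe_find_or_create_by(namespace_id:)
      @records[namespace_id] ||= new(namespace_id)
    end
  end

  attr_reader :namespace_id

  def initialize(namespace_id)
    @namespace_id = namespace_id
  end
end

# Illustrative worker: perform only ensures the schedule record exists.
class ScheduleAggregationWorkerSketch
  def perform(namespace_id)
    FakeAggregationSchedule.safe_find_or_create_by(namespace_id: namespace_id)
  end
end
```

Running `perform(7)` twice returns the same record both times and leaves a single entry in the store, which is what makes the job safe to deduplicate.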
To deduplicate it, we need to mark the worker as idempotent and add it to the list of workers eligible for deduplication.
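Conceptually, deduplication computes an idempotency key from the worker class and its arguments and drops a newly scheduled job while an identical one is still waiting to run. A self-contained Ruby sketch of that "deduplicate until the job starts executing" idea (class and method names here are illustrative, not GitLab's actual implementation):

```ruby
require 'set'
require 'digest'

# Toy scheduler that deduplicates jobs: a job is dropped if an identical
# job (same worker class and arguments) is already enqueued and has not
# started yet.
class DeduplicatingScheduler
  def initialize
    @pending = Set.new # idempotency keys of enqueued-but-unstarted jobs
    @queue = []
  end

  # Returns true if the job was enqueued, false if it was deduplicated.
  def schedule(worker, *args)
    key = idempotency_key(worker, args)
    return false if @pending.include?(key)

    @pending << key
    @queue << [worker, args, key]
    true
  end

  # Pop and run the next job. The key is released before perform runs,
  # so a job scheduled while an identical one executes is enqueued again.
  def run_next
    worker, args, key = @queue.shift
    return if worker.nil?

    @pending.delete(key)
    worker.new.perform(*args)
  end

  private

  def idempotency_key(worker, args)
    Digest::SHA256.hexdigest("#{worker.name}:#{args.inspect}")
  end
end
```

With this strategy, duplicates are only dropped while a matching job is still pending; once a job starts, a new identical job can be enqueued, so no scheduling request is silently lost forever.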
It is easy to detect if something goes wrong: in the worst case, no Namespaces::RootStatisticsWorker jobs get scheduled anymore. But no data is lost; we would just need to start creating the Namespaces::AggregationSchedule records again, which would cause Namespaces::RootStatisticsWorker to be scheduled again.
This is a test of #42 (closed) on a queue where it has less impact than enabling deduplication on all queues in one go.