Deduplicate Namespaces::ScheduleAggregationWorker as a PoC for job deduplication
During the last 7 days, we performed about 900,000 duplicate jobs for Namespaces::ScheduleAggregationWorker.
This worker gets scheduled in several places through the UpdateProjectStatistics concern (a simplified sketch of that path follows the table below). In total, we've seen these callers:
| Caller (`json.meta.caller_id.keyword`) | Count | Sum of `json.duration` | Average `json.duration` |
|---|---|---|---|
| ProjectImportScheduleWorker | 57,528,644 | 46,921,993.56 | 1.632 |
| PostReceive | 76,136,085 | 21,501,246.07 | 0.565 |
| UpdateAllMirrorsWorker | 57,774,717 | 20,301,395.10 | 0.703 |
| BuildFinishedWorker | 52,909,823 | 6,103,576.09 | 0.231 |
| PipelineProcessWorker | 60,419,877 | 5,540,395.56 | 0.183 |
| /api/:version/jobs/:id | 61,731,197 | 4,308,231.52 | 0.14 |
| /api/:version/jobs/request | 49,571,565 | 2,189,096.06 | 0.088 |
| Repositories::GitHttpController#git_upload_pack | 76,976,916 | 1,793,317.51 | 0.047 |
| PipelineUpdateWorker | 18,106,869 | 1,217,426.92 | 0.134 |
| RepositoryUpdateMirrorWorker | 30,791,186 | 1,095,830.15 | 0.071 |

Source: https://log.gprd.gitlab.net/goto/b5adba783edac35da929be89e3889d83
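For context, here is a simplified sketch of how the concern wires model updates to this worker. The callback and method names are illustrative, not the exact implementation:

```ruby
# Illustrative sketch of the UpdateProjectStatistics concern: any model
# update that changes project statistics ends up scheduling the worker,
# which is why so many different callers show up in the table above.
module UpdateProjectStatistics
  extend ActiveSupport::Concern

  included do
    after_commit do
      # One job per update; without deduplication, N updates affecting
      # the same namespace enqueue N identical jobs.
      Namespaces::ScheduleAggregationWorker.perform_async(project.namespace_id)
    end
  end
end
```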
This worker creates a Namespaces::AggregationSchedule record, which is consumed by Namespaces::RootStatisticsWorker.
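Duplicates are harmless here because creating that record is effectively a find-or-create. A rough sketch of the perform, with illustrative names (the guard and the root-namespace lookup are assumptions, not the exact implementation):

```ruby
def perform(namespace_id)
  namespace = Namespace.find_by_id(namespace_id)
  return unless namespace

  root = namespace.root_ancestor
  # Already scheduled: nothing to do, which is what makes the job idempotent.
  return if Namespaces::AggregationSchedule.exists?(namespace_id: root.id)

  # Namespaces::RootStatisticsWorker later picks this record up.
  Namespaces::AggregationSchedule.create!(namespace_id: root.id)
rescue ActiveRecord::RecordNotUnique
  # Lost a race against a concurrent duplicate job; safe to ignore.
end
```

Note the check-then-create race is closed by rescuing the unique-constraint violation, so running the job once or N times has the same end state.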
To do this, we need to mark the worker as idempotent and add it to the list of workers that may be deduplicated.
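A minimal sketch of that change, assuming GitLab's ApplicationWorker convention where idempotent! opts the worker into the deduplication middleware:

```ruby
# app/workers/namespaces/schedule_aggregation_worker.rb (sketch)
module Namespaces
  class ScheduleAggregationWorker
    include ApplicationWorker

    # Declares that running this job twice with the same arguments has
    # the same effect as running it once, so identical jobs that are
    # still pending in the queue can be dropped.
    idempotent!

    def perform(namespace_id)
      # ... find-or-create the Namespaces::AggregationSchedule record ...
    end
  end
end
```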
We can easily see if something goes wrong: in the worst case, no Namespaces::RootStatisticsWorker jobs get scheduled anymore. But no data is lost; we would just need to start creating the Namespaces::AggregationSchedule records again, which would cause Namespaces::RootStatisticsWorker to be scheduled again.
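If we did hit that worst case, recovery could be a one-off that re-enqueues the scheduling worker for root namespaces. A hypothetical console snippet (each_batch is GitLab's batching helper; the parent_id: nil scope and batch size are assumptions):

```ruby
# Hypothetical recovery one-off: re-enqueue the scheduling worker so the
# Namespaces::AggregationSchedule records get created again.
Namespace.where(parent_id: nil).each_batch(of: 1_000) do |batch|
  batch.pluck(:id).each do |namespace_id|
    Namespaces::ScheduleAggregationWorker.perform_async(namespace_id)
  end
end
```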
This is a test of #42 (closed) on a single queue, which has less impact than enabling deduplication on all queues in one go.