Split reactive_caching queue work
Problem
The ReactiveCachingWorker performs a mix of expensive DB/CPU-bound work (MergeRequest below) and other work that has external dependencies:
json.meta.related_class.keyword: Descending | Count | 50th percentile of json.duration | 95th percentile of json.duration | 99th percentile of json.duration |
---|---|---|---|---|
MergeRequest | 186,461 | 0.243 | 29.552 | 77.663 |
Missing | 43,659 | 0.695 | 12.63 | 70.344 |
Environment | 28,351 | 0.725 | 3.581 | 5.628 |
Clusters::Cluster | 23,267 | 0.409 | 1.245 | 10.034 |
Prometheus::ProxyService | 18,212 | 0.034 | 1.78 | 2.767 |
Projects::Serverless::FunctionsFinder | 3,878 | 0.051 | 0.472 | 1.449 |
ErrorTracking::ProjectErrorTrackingSetting | 3,310 | 0.536 | 1.204 | 2.439 |
Metrics::Dashboard::GrafanaMetricEmbedService | 3,028 | 0.172 | 0.775 | 1.512 |
Clusters::Applications::Prometheus | 2,491 | 0.881 | 4.134 | 6.307 |
PodLogs::ElasticsearchService | 2,436 | 1.214 | 2.455 | 4.291 |
PodLogs::KubernetesService | 2,371 | 0.993 | 2.485 | 4.235 |
Grafana::ProxyService | 930 | 0.376 | 5.277 | 6.875 |
Clusters::KnativeServicesFinder | 489 | 0.445 | 1.403 | 2.02 |
PrometheusService | 332 | 0.131 | 3.334 | 5.506 |
SshHostKey | 63 | 0.036 | 0.82 | 2.462 |
Source: https://log.gprd.gitlab.net/goto/eceeedc94dcf46015404ef6aa86a4100
Taking a closer look at the MergeRequest work alone, it breaks down into some expensive DB/CPU-bound work:
json.args.keyword: Descending | Count | 50th percentile of json.duration | 95th percentile of json.duration | 99th percentile of json.duration | 50th percentile of json.db_duration_s | 95th percentile of json.db_duration_s | 99th percentile of json.db_duration_s | 50th percentile of json.cpu_s | 95th percentile of json.cpu_s | 99th percentile of json.cpu_s |
---|---|---|---|---|---|---|---|---|---|---|
MergeRequest | 266990 | 0.28959025557282736 | 26.102468550326694 | 72.5902692496675 | 0.02382557378410509 | 0.9291014493265197 | 8.006470014650452 | 0.05736682659398582 | 13.299713508977286 | 32.0774735349686 |
(empty) | 211324 | 0.2685395307670134 | 33.70175896301224 | 75.48868837615332 | 0.021803826229914645 | 0.34156274192297265 | 1.3030881320113263 | 0.05363908793717763 | 14.40842907944788 | 32.88702559437787 |
Ci::CompareTestReportsService | 167075 | 0.3051229784618997 | 51.445711854540484 | 78.76643916774934 | 0.022430414532161325 | 0.298067969247473 | 1.012541648497135 | 0.06334885030368746 | 25.660160963330966 | 33.68258516055735 |
Ci::CompareDependencyScanningReportsService | 22628 | 0.4278630319144505 | 3.208946057373913 | 6.3040682556515675 | 0.07445493490459064 | 1.118517578335908 | 2.8839473080635103 | 0.16759984159859725 | 1.2499942469566978 | 2.2076902154286766 |
Ci::CompareSastReportsService | 21547 | 0.9571904279219411 | 32.56472627520558 | 50.706503562018895 | 0.16121082347405108 | 14.628090273170233 | 28.489983847300152 | 0.28199653379788003 | 15.248128792691354 | 21.10525130244659 |
Ci::GenerateExposedArtifactsReportService | 20479 | 0.13570954201289057 | 1.0256806015245836 | 2.3195084776197152 | 0.015647163934152133 | 0.16199858472089862 | 0.49252982335090556 | 0.03031762798136114 | 0.1810661454207051 | 0.5972474047541607 |
Ci::CompareMetricsReportsService | 15668 | 0.22502747449195262 | 1.0662240886688217 | 1.9223772287368763 | 0.02296785388935457 | 0.3232782589814556 | 0.6733212133248644 | 0.04407075179730747 | 0.08951310886339069 | 0.1827012864748636 |
Ci::CompareLicenseScanningReportsService | 7559 | 0.22488950697027496 | 1.0280218554867633 | 2.083878092765807 | 0.03390210321965703 | 0.39452704389459936 | 1.1840766346454619 | 0.05054281143417576 | 0.21663270954574837 | 0.29869629830121985 |
Ci::CompareDastReportsService | 6947 | 1.166966150793837 | 6.111275633176166 | 8.752203531265252 | 0.12091184689691573 | 1.2040272338951326 | 2.6099918238321873 | 0.3717579220400916 | 2.6780309463398795 | 3.357187442779539 |
Ci::CompareContainerScanningReportsService | 5087 | 0.3547566209270647 | 8.949523471461399 | 26.69304637908936 | 0.04121196497645643 | 4.2067577951993655 | 12.793855390548721 | 0.07475935427284745 | 4.359703940153117 | 11.39702380180359 |
Source: https://log.gprd.gitlab.net/goto/4a8b077fe4f6cb52ffc22124ff95c5c5
It turns out that the majority of the work in this queue should be low-urgency, because the work is either:
- Slow DB/CPU work (i.e. CI reports)
- External dependency (which normally performs well, but shouldn't really be set as high-urgency)
Today, the ReactiveCachingWorker is set as high-urgency, so its slow jobs can hurt our SLOs (see https://docs.gitlab.com/ee/development/sidekiq_style_guide.html#job-urgency).
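To make the SLO risk concrete, the p99 figures from the first table can be compared against an execution target. The sketch below is illustrative only: the 10-second target is an assumption (the style guide linked above is the authoritative source for the real figure), the hash holds only a subset of the table rows, and `over_target` is a hypothetical helper name.

```ruby
# p99 of json.duration per related class, copied from a subset of the
# rows in the first table above.
P99_DURATION = {
  "MergeRequest" => 77.663,
  "Environment" => 5.628,
  "Clusters::Cluster" => 10.034,
  "Prometheus::ProxyService" => 2.767,
  "Grafana::ProxyService" => 6.875,
  "SshHostKey" => 2.462
}.freeze

# ASSUMPTION: execution target for high-urgency jobs, in the same unit as
# json.duration. This value is illustrative, not taken from the issue.
HIGH_URGENCY_TARGET = 10.0

# Returns the names of the classes whose p99 duration exceeds the target.
def over_target(p99_by_class, target = HIGH_URGENCY_TARGET)
  p99_by_class.select { |_klass, p99| p99 > target }.keys
end

puts over_target(P99_DURATION).inspect
```

Under that assumed target, only MergeRequest and Clusters::Cluster from this subset exceed it at p99, which matches the observation that the MergeRequest work dominates the slow tail.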
Proposal
- Move ReactiveCachingWorker to a default/low urgency
- Keep the MergeRequest work (which is slow, mostly CPU-bound) in the reactive_caching queue
- Move the low-urgency work with external dependencies to a separate worker
- Move any other work that doesn't have external dependencies and performs well to a high-urgency queue (no work seems to meet that criterion, though)
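The split proposed above can be sketched as a routing decision per related class. This is a minimal illustration, not the actual implementation: the grouping of classes into profiles, the `reactive_caching_external` queue name, the `Route` struct, and the `route_for` helper are all hypothetical. Only the reactive_caching queue name comes from the proposal itself.

```ruby
# Hypothetical routing result: which queue a job goes to, at what urgency,
# and whether the work depends on an external service.
Route = Struct.new(:queue, :urgency, :external_dependencies)

# ASSUMPTION: grouping of classes by work profile, for illustration only.
SLOW_CPU_BOUND = %w[MergeRequest].freeze
EXTERNAL_DEPENDENCY = %w[
  Environment
  Clusters::Cluster
  Prometheus::ProxyService
  Grafana::ProxyService
  PodLogs::ElasticsearchService
  PodLogs::KubernetesService
].freeze

# Picks a route for the class that triggered the reactive caching job.
def route_for(related_class)
  if EXTERNAL_DEPENDENCY.include?(related_class)
    # Low-urgency work with external dependencies goes to a separate worker.
    Route.new("reactive_caching_external", :low, true)
  else
    # Slow CPU-bound work (SLOW_CPU_BOUND) and anything unclassified stay
    # in the existing queue at low urgency, the conservative default.
    Route.new("reactive_caching", :low, false)
  end
end

puts route_for("MergeRequest").queue
puts route_for("Grafana::ProxyService").queue
```

No branch returns a high urgency here, reflecting the note above that no work currently seems to qualify for a high-urgency queue.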
Old description
After #223 (closed), we can look at splitting up the ReactiveCachingWorker into separate queues by their profile: perhaps when SshHostKey uses it, it runs faster or isn't urgent. It's also possible that this can't be done: maybe all derived classes end up with similar execution profiles.