Log Sidekiq job time spend on CPU
We merged the https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/31866.
However, this is not working as expected as we are using Process.times
that accounts
all CPU time from all threads. This results in wrong results returned and attached
to each structured_logging
.
Take a look at this execution of Chaos::CPUSpin
:
sidekiq_1 | {"severity":"INFO","time":"2019-09-13T09:21:48.202Z","class":"Chaos::CpuSpinWorker","retry":3,"queue":"chaos:chaos_cpu_spin","backtrace":true,"queue_namespace":"chaos","jid":"c96cab955a7bfb11beb228ee","created_at":"2019-09-13T09:21:48.200Z","correlation_id":"49d505a6-5f81-4da7-8f57-b291b2e7e684","enqueued_at":"2019-09-13T09:21:48.200Z","pid":6,"message":"Chaos::CpuSpinWorker JID-c96cab955a7bfb11beb228ee: start","job_status":"start","scheduling_latency_s":0.002}
sidekiq_1 | {"severity":"INFO","time":"2019-09-13T09:21:49.008Z","class":"Chaos::CpuSpinWorker","retry":3,"queue":"chaos:chaos_cpu_spin","backtrace":true,"queue_namespace":"chaos","jid":"9da49ab0530f75d99d4cb294","created_at":"2019-09-13T09:21:48.890Z","correlation_id":"f5b97418-b50b-449c-9904-ac8535ba8f53","enqueued_at":"2019-09-13T09:21:48.891Z","pid":6,"message":"Chaos::CpuSpinWorker JID-9da49ab0530f75d99d4cb294: start","job_status":"start","scheduling_latency_s":0.117}
sidekiq_1 | {"severity":"INFO","time":"2019-09-13T09:22:18.242Z","class":"Chaos::CpuSpinWorker","retry":3,"queue":"chaos:chaos_cpu_spin","backtrace":true,"queue_namespace":"chaos","jid":"c96cab955a7bfb11beb228ee","created_at":"2019-09-13T09:21:48.200Z","correlation_id":"49d505a6-5f81-4da7-8f57-b291b2e7e684","enqueued_at":"2019-09-13T09:21:48.200Z","pid":6,"message":"Chaos::CpuSpinWorker JID-c96cab955a7bfb11beb228ee: done: 30.04 sec","job_status":"done","scheduling_latency_s":0.002,"duration":30.04,"system_s":0.0,"user_s":30.016,"completed_at":"2019-09-13T09:22:18.242Z"}
sidekiq_1 | {"severity":"INFO","time":"2019-09-13T09:22:48.263Z","class":"Chaos::CpuSpinWorker","retry":3,"queue":"chaos:chaos_cpu_spin","backtrace":true,"queue_namespace":"chaos","jid":"9da49ab0530f75d99d4cb294","created_at":"2019-09-13T09:21:48.890Z","correlation_id":"f5b97418-b50b-449c-9904-ac8535ba8f53","enqueued_at":"2019-09-13T09:21:48.891Z","pid":6,"message":"Chaos::CpuSpinWorker JID-9da49ab0530f75d99d4cb294: done: 59.256 sec","job_status":"done","scheduling_latency_s":0.117,"duration":59.256,"system_s":0.0,"user_s":59.185,"completed_at":"2019-09-13T09:22:48.263Z"}
We expect the user_s
to be of 15s
. Why? We run these two jobs at the same time for max duration of 30s. It means that each job should consume 15s of thread time, the total of process time being 30s.
We really want to use getrusage(RUSAGE_THREAD)
. This is not currently exported by Ruby which means that we should write native extension to add this.
For time being it seems that we cannot show system_s
and child_s
, but we can show user_s
with usage of:
Process.clock_gettime(Process::CLOCK_THREAD_CPUTIME_ID)
After using that these are results of Sidekiq execution:
sidekiq_1 | {"severity":"INFO","time":"2019-09-13T09:33:23.091Z","class":"Chaos::CpuSpinWorker","retry":3,"queue":"chaos:chaos_cpu_spin","backtrace":true,"queue_namespace":"chaos","jid":"20f414d46f4148320d7d6bd4","created_at":"2019-09-13T09:33:23.085Z","correlation_id":"8d8d6ac1-662e-4ec9-aabf-5ad360e579ca","enqueued_at":"2019-09-13T09:33:23.090Z","pid":6,"message":"Chaos::CpuSpinWorker JID-20f414d46f4148320d7d6bd4: start","job_status":"start","scheduling_latency_s":0.001}
sidekiq_1 | {"severity":"INFO","time":"2019-09-13T09:33:24.138Z","class":"Chaos::CpuSpinWorker","retry":3,"queue":"chaos:chaos_cpu_spin","backtrace":true,"queue_namespace":"chaos","jid":"64acd3be6b4837bad4117d32","created_at":"2019-09-13T09:33:23.989Z","correlation_id":"2c9ee529-9bde-4446-8c74-9863fb241a93","enqueued_at":"2019-09-13T09:33:23.990Z","pid":6,"message":"Chaos::CpuSpinWorker JID-64acd3be6b4837bad4117d32: start","job_status":"start","scheduling_latency_s":0.148}
sidekiq_1 | {"severity":"INFO","time":"2019-09-13T09:33:53.240Z","class":"Chaos::CpuSpinWorker","retry":3,"queue":"chaos:chaos_cpu_spin","backtrace":true,"queue_namespace":"chaos","jid":"20f414d46f4148320d7d6bd4","created_at":"2019-09-13T09:33:23.085Z","correlation_id":"8d8d6ac1-662e-4ec9-aabf-5ad360e579ca","enqueued_at":"2019-09-13T09:33:23.090Z","pid":6,"message":"Chaos::CpuSpinWorker JID-20f414d46f4148320d7d6bd4: done: 30.149 sec","job_status":"done","scheduling_latency_s":0.001,"duration":30.149,"system_s":0,"user_s":15.509,"completed_at":"2019-09-13T09:33:53.240Z"}
sidekiq_1 | {"severity":"INFO","time":"2019-09-13T09:33:54.217Z","class":"Chaos::CpuSpinWorker","retry":3,"queue":"chaos:chaos_cpu_spin","backtrace":true,"queue_namespace":"chaos","jid":"64acd3be6b4837bad4117d32","created_at":"2019-09-13T09:33:23.989Z","correlation_id":"2c9ee529-9bde-4446-8c74-9863fb241a93","enqueued_at":"2019-09-13T09:33:23.990Z","pid":6,"message":"Chaos::CpuSpinWorker JID-64acd3be6b4837bad4117d32: done: 30.079 sec","job_status":"done","scheduling_latency_s":0.148,"duration":30.079,"system_s":0,"user_s":15.393,"completed_at":"2019-09-13T09:33:54.217Z"}
Completely as expected!