
Log worker_id for Puma and Sidekiq

Matthias Käppler requested to merge 364539-log-worker-id into master

What does this MR do and why?

See #364539 (closed)

In !66694 (merged) we added a worker's process ID to application logs. This was done to support diagnosis during production incidents.

In &8105 I found that, in addition to the PID, having the logical worker_id would be even more useful. In Thanos we do not collect PIDs, since that label value would be essentially unbounded. PIDs are also ephemeral; for instance, puma_0 might get restarted and hence run under a new PID, but it is still the same worker. So when breaking down by the pid label in Thanos (which is actually the worker ID, not the process ID), we currently cannot correlate these data with logs.

Here I extend InstrumentationHelper to also log PidProvider#worker_id. Since InstrumentationHelper is used by both Puma and Sidekiq, this works for both.
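
A minimal sketch of the shape of the change, not the exact diff: it assumes the existing Gitlab::InstrumentationHelper.add_instrumentation_data entry point and the Prometheus::PidProvider module shared by Puma and Sidekiq; the helper method name is illustrative.

# Sketch only: illustrates how the logical worker id could be added to the
# structured log payload. Method names other than add_instrumentation_data
# and PidProvider.worker_id are assumptions for illustration.
module Gitlab
  module InstrumentationHelper
    def add_instrumentation_data(payload)
      instrument_worker_id(payload)
      # ... existing instrumentation keys (db_duration_s, etc.) ...
    end

    def instrument_worker_id(payload)
      # PidProvider returns a logical worker id such as "puma_0" or "sidekiq_0",
      # which stays stable across process restarts, unlike the numeric PID.
      payload[:worker_id] = ::Prometheus::PidProvider.worker_id
    end
  end
end
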

For Puma, we currently run 7 processes in SaaS: 1 master + 6 workers. So the label cardinality is 7. For Sidekiq, we run a single process, so label cardinality is 1.

I also got a 👍 from the Scalability group on Slack (internal only) that this should not be a problem in terms of added log volume, and the call into PidProvider is cheap.

Screenshots or screen recordings

$ tail -n1 log/sidekiq.log                         
{"severity":"INFO",...,"worker_id":"sidekiq_0",...,"db_duration_s":0.099319}
$ tail -n1 log/development_json.log 
{"method":"GET",...,"worker_id":"puma_1",...,"duration_s":2.04893}

How to set up and validate locally

Run Puma or Sidekiq and grep for worker_id in the logs (see above).

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #364539 (closed)
