
Run jemalloc stats report by timer

Aleksei Lipniagov requested to merge 362900-jemalloc-stats-report into master

What does this MR do and why?

PoC for #362900 (closed).

On a regular, timer-based schedule, it pulls a Jemalloc stats dump from every Puma worker (a rough sketch of the loop is included below).

FF rollout issue: #367845 (closed)

SRE rollout support issue: gitlab-com/gl-infra/delivery#2486 (closed)
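Roughly, each Puma worker runs a small background thread that sleeps for the configured interval and then writes a stats dump into the reports dir. Below is a minimal sketch of that loop with illustrative names only; it is not the actual class in this MR, and where the real code asks jemalloc itself for the dump, a plain `File.write` stands in:

```ruby
# Minimal sketch of the per-worker timer loop (names are illustrative).
class JemallocStatsReporter
  def initialize(interval_seconds:, reports_dir:)
    @interval_seconds = interval_seconds
    @reports_dir = reports_dir
  end

  # Spawns a background thread inside the worker process.
  def start
    Thread.new do
      loop do
        sleep(@interval_seconds)
        write_report
      end
    end
  end

  private

  def write_report
    path = File.join(@reports_dir, "jemalloc_stats.#{Process.pid}.#{Time.now.to_i}")
    # The real implementation asks jemalloc for the stats dump
    # (malloc_stats_print); a plain File.write stands in here.
    File.write(path, "<jemalloc stats dump>")
  end
end
```

Wired up from an initializer it would be something like `JemallocStatsReporter.new(interval_seconds: 3600, reports_dir: 'tmp').start`.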

Next steps

  • This MR (1) goes through reviews, including SRE review
  • I asked John to help with the emptyDir k8s configuration for .com
  • We can merge this MR once the code is approved, since the feature stays disabled until we add the ENV var and flip the FF (see the sketch of the double gate after this list), so merging shouldn't be blocked.
  • After it's merged, I'll open an MR (2) to add GITLAB_DIAGNOSTIC_REPORTS_ENABLED to our staging or canary environments
  • After MR (2) with the ENV vars is merged and we have redeployed, we can activate the FF while keeping an eye on the metrics (although I don't expect to see anything out of the ordinary on canary)
  • Once it looks good on canary and we can confirm that reports are being generated, I will open an MR (3) to add GITLAB_DIAGNOSTIC_REPORTS_ENABLED to production. Once again, we'll follow up with the FF activation, now on prod.
  • In parallel, we can work on the reports upload feature
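For reference, the "ENV var + FF" gate described above would look roughly like this. GITLAB_DIAGNOSTIC_REPORTS_ENABLED comes from this MR, while the :report_jemalloc_stats flag name is an assumption (see the FF rollout issue for the real name):

```ruby
# Sketch of the double gate: both the per-environment ENV var and the
# feature flag must be on. The flag name :report_jemalloc_stats is an
# assumption; GITLAB_DIAGNOSTIC_REPORTS_ENABLED is from this MR.
def diagnostic_reports_active?
  Gitlab::Utils.to_boolean(ENV['GITLAB_DIAGNOSTIC_REPORTS_ENABLED']) &&
    Feature.enabled?(:report_jemalloc_stats)
end
```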

TODO

First iteration

  • Run a CPU utilization test with and without reports (and vary the frequency)
  • Cover everything with specs
  • Restrict the growth of the report dir (in code, in the volume config, or both)
  • Consider adding an ENV var switch to enable reports per node
  • Ask SRE to pull Jemalloc reports from production, both for Puma and Sidekiq workers. Note the size of the reports and how long they take to produce. Request issue: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15993
  • Consider and decide whether we want to use a per-process thread or a centralized master thread that would signal the workers to report (more in !91283 (comment 1011494980))
  • Investigate how this affects performance. Set an aggressive report-generation frequency and check how it hits the resources, via GPT or ab over a single endpoint. Results: https://docs.google.com/document/d/1ODuAvY5dKnuftgpwis3KK3-4ighpQRsma-mVp1l8cR8/edit?usp=sharing
  • Make the reports folder configurable: ENV variable + safe default (e.g. /tmp)
  • Try to reuse the Daemon class for the timer implementation
  • Include the logical worker id alongside the PID in the report filename (see the filename sketch after this list)
  • Run the initializer only for Puma/Sidekiq (currently it is a general initializer, which would run even in rails c)
  • Investigate and fix CI failures (review-deploy is failing)
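For the filename item above, a possible scheme is sketched below. Both the helper (Prometheus::PidProvider.worker_id, which GitLab uses for metrics labels like "puma_0") and the exact format are assumptions, not what this MR currently ships:

```ruby
# Possible filename scheme for the "logical worker id" TODO item.
# Prometheus::PidProvider.worker_id (e.g. "puma_0") is assumed to be a
# suitable source for the logical id; the format itself is illustrative.
def report_file_path(reports_dir)
  worker_id = ::Prometheus::PidProvider.worker_id
  File.join(reports_dir, "jemalloc_stats.#{worker_id}.#{Process.pid}.#{Time.now.to_i}")
end
```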

Stretch goals / Follow-ups

  • [-] We may want to profile the report generation in depth (code execution). More details can be found in the MR that introduced the report.
  • [-] Gzip the reports (concern: this is a CPU-heavy operation and would need additional performance tests)
  • Configure automatic report cleanups
  • [-] Add another report

Risks & Performance

Storage:

  • We'll put an additional restriction on the dir (see the discussion on emptyDir in the comments)
  • On production, each report was around 2.5 MB (more details).
  • Running every hour, that is ~60 MB/day per worker if no cleanups are done.
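The daily figure is simply 24 hourly reports × ~2.5 MB ≈ 60 MB per worker process. As for the in-code restriction, a rough sketch of the kind of cap we could apply is below; the 250 MB value and the names are placeholders, and the real limit is still being discussed in the emptyDir thread:

```ruby
# Illustrative size cap for the reports dir: drop the oldest reports once
# the total exceeds the cap. 250 MB is a placeholder (~100 hourly reports
# at ~2.5 MB each).
MAX_REPORTS_DIR_BYTES = 250 * 1024 * 1024

def prune_reports(reports_dir)
  files = Dir.glob(File.join(reports_dir, 'jemalloc_stats.*'))
             .sort_by { |f| File.mtime(f) }

  while files.any? && files.sum { |f| File.size(f) } > MAX_REPORTS_DIR_BYTES
    File.delete(files.shift) # oldest first
  end
end
```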

System performance:

  • On production, each report took 2-10 seconds (more details)
  • To test it locally, I ran GCK in production mode.
  • All configs were the GCK defaults (2 Puma workers).
  • I ran the Apache benchmark with `ab -t 300 -c 8 "http://localhost:3000/api/v4/projects"`.
  • Full results with no reporting and with reporting every 1, 10, and 30 seconds: https://docs.google.com/document/d/1ODuAvY5dKnuftgpwis3KK3-4ighpQRsma-mVp1l8cR8/edit?usp=sharing
  • Based on those results, we shouldn't expect any visible performance impact (especially since we are not going to run reports frequently)
  • Keep in mind that GCK reports are generated much faster (~1 s on GCK vs 5-10 s on prod) and take less space (< 1 MB on GCK vs ~2.5 MB on prod)
  • Still, running reports every hour shouldn't have any visible impact

How to set up and validate locally

I suggest testing locally with GCK:
to pull an actual Jemalloc report, libjemalloc must be on LD_PRELOAD (more),
and GCK is already configured that way.

  1. Pull this branch: 362900-jemalloc-stats-report
  2. Set smaller timeouts in JemallocStats, e.g. 10 seconds each
  3. Open the reports path (currently tmp/ - this could change, refer to the code) and check that the reports are there (a quick check snippet follows below).
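For step 3, a quick way to confirm reports are being written (the jemalloc_stats filename prefix is an assumption here, adjust the glob to whatever the branch actually writes):

```ruby
# List any generated reports under tmp/ from the GitLab root.
Dir.glob('tmp/jemalloc_stats*').each do |path|
  puts format('%s  %d bytes  %s', path, File.size(path), File.mtime(path))
end
```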

Screenshots or screen recordings

Screenshot_2022-06-28_at_17.58.14

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #362900 (closed)

