Use median over multiple runs for mem-on-boot

Aleksei Lipniagov requested to merge 351454-avoid-outliers-in-mem-on-boot into master

What does this MR do and why?

We noticed outliers in the derailed perf:mem results.
Check this MR's pipeline as an illustration: there's a ~20 MB difference between consecutive runs of the same script.
We don't have a clear explanation yet; I opened an issue to investigate and discuss (link).
Our primary goal here is to make the data more reliable than what we currently produce.

Spikes are rare and happen less than once in 10 runs, based on my experiments with GDK and GCK. To prevent them from skewing the data in the Metrics Reports (the dropdown on the MR page where we currently publish the data), I suggest running the script 5 times (we could reduce that to 3 later) and taking the median of the results. We'll keep the reports from all 5 runs so that we can investigate the spikes.
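
To illustrate why the median is the safer choice here, a minimal sketch follows. The readings are made-up values in MB (not taken from any real pipeline), with one run showing a ~20 MB spike.

```python
from statistics import mean, median

# Hypothetical memory-on-boot readings in MB from 5 runs of the same script.
# The fourth value simulates a rare ~20 MB spike.
runs_mb = [410.0, 411.0, 409.0, 430.0, 410.0]

average = mean(runs_mb)   # pulled upwards by the single spike
middle = median(runs_mb)  # third value after sorting, ignores the spike

print(f"mean:   {average} MB")
print(f"median: {middle} MB")
```

With these assumed numbers the mean moves to 414.0 MB, while the median stays at 410.0 MB, which is what a typical run reports.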

It won't affect the overall pipeline execution time:

  • we run derailed in the memory-on-boot job
  • it is in the test stage
  • it takes ~8 minutes to prepare the box, run derailed perf:mem 5 times, and calculate the median
  • the typical test job in the same stage takes ~40 minutes, so it won't block moving to the next stage
  • total CI compute time is also barely affected (about 5 extra minutes of computational time per pipeline)
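
The reasoning in the list above can be sketched as simple arithmetic. This assumes that jobs in a stage run in parallel, so the stage lasts as long as its slowest job. The job name typical-test-job is a placeholder, and the 3-minute "before" duration is inferred from the ~8 minute and ~5 minute figures rather than measured.

```python
def stage_wall_clock(job_minutes):
    """Stage duration when all jobs run in parallel: the slowest job wins."""
    return max(job_minutes.values())


def stage_compute(job_minutes):
    """Total compute time spent by the stage: the sum of all job durations."""
    return sum(job_minutes.values())


# Assumed durations in minutes; only 40 and 8 come from the description.
before = {"typical-test-job": 40, "memory-on-boot": 3}
after = {"typical-test-job": 40, "memory-on-boot": 8}

wall_clock_delta = stage_wall_clock(after) - stage_wall_clock(before)
compute_delta = stage_compute(after) - stage_compute(before)

print(f"wall-clock change: {wall_clock_delta} min")
print(f"compute change:    {compute_delta} min")
```

Under these assumptions the stage still finishes after 40 minutes, and only the total compute time grows by 5 minutes.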

I also suggest doubling the artifact expiry time (from 31 days to 62 days) so that it spans two self-managed releases for now. This would help us understand how the value evolves.

How to set up and validate locally

  • Open the pipeline of this MR
  • Find the memory-on-boot job (test stage, where we run all the unit specs and others)

  • Open the job
  • Check that we have artifacts of all 5 runs (they will be numbered by execution time)
  • We use the median, which is the third value of the 5 runs once they are sorted
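
As a rough sketch of that last step, the helper below picks the median and also reports which numbered run produced it, which may help when matching it to an artifact. The function name median_run and the sample values are hypothetical; the actual script in this MR may do this differently.

```python
def median_run(results):
    """Return (run_number, value) for the median of an odd number of runs.

    Run numbers start at 1 and follow execution order.
    """
    if len(results) % 2 == 0:
        raise ValueError("expected an odd number of runs")
    # Pair each value with its run number, then sort by value.
    ordered = sorted(enumerate(results, start=1), key=lambda pair: pair[1])
    # The middle element of the sorted list is the median.
    return ordered[len(ordered) // 2]


# Hypothetical readings in MB, listed in execution order (runs 1 to 5).
sample = [412.0, 409.5, 431.2, 410.1, 411.3]
run_number, value = median_run(sample)
print(f"median is {value} MB, produced by run {run_number}")
```

For 5 runs the middle index is 2 (zero-based), which is the third sorted value mentioned above.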

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #351454 (closed)

Edited by Aleksei Lipniagov