Use median over multiple runs for mem-on-boot (!81560)

Aleksei Lipniagov requested to merge 351454-avoid-outliers-in-mem-on-boot into master Feb 24, 2022

What does this MR do and why?

We noticed outliers in the derailed perf: mem results.
Check this MR's pipeline for an illustration: there is a ~20 MB difference between consecutive runs of the same script.
We don't have a clear explanation yet; I opened an issue to investigate and discuss (link).
Our primary goal here is to make the data we produce more reliable than it currently is.

Spikes are rare and happen less than once in 10 runs, based on my experiments with GDK and GCK. To prevent them from skewing the data in the Metrics Reports (the dropdown on the MR page where we currently publish the data), I suggest running the script 5 times (we could reduce that to 3 later) and taking the median of the results. We'll keep the reports from all 5 runs so that we can investigate the spikes.
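To illustrate why the median helps here, below is a minimal Python sketch. It is not the script used in this MR: the function name `robust_memory_mb` and the sample readings are hypothetical, chosen only to show one ~20 MB spike among 5 runs.

```python
from statistics import median


def robust_memory_mb(samples_mb):
    """Return the median of per-run memory readings (in MB)."""
    if not samples_mb:
        raise ValueError("need at least one run")
    return median(samples_mb)


# Hypothetical readings from 5 consecutive runs; the fourth is a ~20 MB spike.
runs = [251.3, 250.9, 251.1, 271.4, 251.0]

# The median ignores the single spike, while the mean is pulled upwards by it.
print("median:", robust_memory_mb(runs))
print("mean:  ", sum(runs) / len(runs))
```

With a single outlier in 5 runs, the median stays at a typical value, whereas the mean shifts by roughly a fifth of the spike's size.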

It won't affect the overall pipeline execution time:

we run derailed in the memory-on-boot job
it is in the test stage
it takes ~8 minutes to prepare the box, run derailed perf: mem 5 times, and calculate the median
a typical test job in the same stage takes ~40 minutes, so it won't delay the move to the next stage
overall CI cost also won't be affected much (about 5 extra minutes of total compute time per pipeline)

I also suggest doubling the artifact expiry time (from 31 days to 62 days) so that it spans two self-managed releases for now. That would help us understand how the value evolves over time.

How to set up and validate locally

Open the pipeline of this MR
Check the memory-on-boot job (test stage, where we run all the unit specs and other tests)

Open the job
Check that it has artifacts from all 5 runs (they are numbered in execution order)
We use the third value (the median) of the 5 runs once they are sorted
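When checking the artifacts, it can help to know which numbered run produced the reported value. The sketch below is one way to find it, assuming the runs are numbered from 1 in execution order; the helper `median_run` and the readings are hypothetical and not part of this MR.

```python
def median_run(samples_mb):
    """Return (run_number, value) for the median run.

    Runs are numbered from 1 in execution order. An odd number of runs is
    required so that the median is an actual measured value.
    """
    if len(samples_mb) % 2 == 0:
        raise ValueError("expects an odd, non-zero number of runs")
    # Pair each reading with its run number, then sort by the reading.
    ranked = sorted(enumerate(samples_mb, start=1), key=lambda pair: pair[1])
    # The middle element of the sorted list is the median (third of five).
    return ranked[len(ranked) // 2]


# Hypothetical readings in execution order; run 4 is the spike.
run_number, value = median_run([251.3, 250.9, 251.1, 271.4, 251.0])
print(f"median comes from run {run_number}: {value} MB")
```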

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

I have evaluated the MR acceptance checklist for this MR.

Related to #351454 (closed)

Edited Feb 25, 2022 by Aleksei Lipniagov
