(2023Q1) Gas effort: gas parameter updates

This is a sub-milestone of %(2023Q1) Gas effort. It can be seen as a continuation of %(OKR 2022Q4 - ???) Gas Effort: Gas costs regressions detected twice a week and %(OKR 2022Q4 - ?.?) Gas Effort: Collect the errors of snoop inference.

This milestone is continued in %(2023Q2) Gas effort: gas parameter updates, and many remaining items are backlogged in %(2023Q2) Gas effort.

Work break down

Some technical details are reported below.

Quality scores of benchmarks are reported: #4131 (closed).
Describe processes to trigger a full run of the benchmarks and how to interpret the results.
- Process is described here.
- Need to define who has access to the process: in review.
Improve how we publish benchmark results: #5097 (closed).
Benchmarks are evaluated based on their quality score: #4130 (closed).
The flakiness of some benchmarks actually comes from a deep, fundamental flaw in the benchmark dependency analysis. See this note.

Recurring activities

Update gas parameters when their value changes.
- 2 gas parameters updated.
3/38 flaky benchmarks fixed:
- 3 by fixing their model;
- 0 by fixing incorrect benchmark code;
- 0 by reinforcement (loop amplification).
Use and improve the quality score algorithm, if needed.
- Current method: T-value: !7544 (merged).
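
As a generic illustration of what a t-value measures (not a reproduction of !7544, whose actual computation may differ), the sketch below fits a simple linear model to benchmark timings and returns the t-value of the slope, that is, the estimated slope divided by its standard error. The function name `slope_t_value` and the sample data are hypothetical.

```python
import math

def slope_t_value(xs, ys):
    """t-value of the slope in an ordinary least-squares fit y = a + b*x.

    Hypothetical helper for illustration only; not the project's code.
    """
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares, then standard error of the slope.
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    if std_err == 0:
        # Perfect fit: the t-value is unbounded.
        return math.copysign(math.inf, slope)
    return slope / std_err

# Assumed sample data: input sizes and timings that grow almost linearly.
sizes = [1, 2, 3, 4, 5]
clean_timings = [2.1, 3.9, 6.2, 7.8, 10.1]
noisy_timings = [5.0, 1.0, 4.0, 2.0, 5.0]

print(slope_t_value(sizes, clean_timings))  # large: the model explains the data well
print(slope_t_value(sizes, noisy_timings))  # near zero: the slope is not significant
```

A large absolute t-value suggests the fitted coefficient is well determined by the data, while a value near zero suggests the model does not capture the measured behaviour.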

Note: there are some issues with running benchmarks on M1 processors (affecting TIMER_LATENCY). We should avoid benchmarking on this architecture.

Context

There is still a long way to go before fully automatic gas parameter updates can be enabled. For now, alerts are sent after each benchmark run. They are inspected manually to determine which are false alarms and which seem to represent an actual change in gas.

However, we still get many false alarms, which can have various causes:

the benchmark machine did not perform well on a specific benchmark for some reason (for instance, it may have been running another program at the same time);
the benchmark is incorrect, which can happen for example if the function was declared to be quadratic while it is in fact linear;
the benchmark code has issues, for instance if it measures more than just the function, such as parts of the setup;
the value is so small that a variation of a few nanoseconds already represents a large percentage difference.
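
To make the last cause concrete, the sketch below compares the same absolute drift of 3 ns on a small and on a large measurement. The 10% alert threshold, the function name, and the sample values are assumptions made for illustration; the source does not state which threshold the alerts use.

```python
# Hypothetical alert threshold, in percent; not taken from the actual alerting setup.
ALERT_THRESHOLD_PERCENT = 10.0

def relative_change_percent(old_ns: float, new_ns: float) -> float:
    """Change from old_ns to new_ns, expressed as a percentage of old_ns."""
    return (new_ns - old_ns) * 100.0 / old_ns

def would_alert(old_ns: float, new_ns: float) -> bool:
    """True if the relative change exceeds the (assumed) alert threshold."""
    return abs(relative_change_percent(old_ns, new_ns)) > ALERT_THRESHOLD_PERCENT

# The same 3 ns drift on two measurements of very different magnitude.
small_old, small_new = 10.0, 13.0          # about +30%
large_old, large_new = 10_000.0, 10_003.0  # about +0.03%

print(would_alert(small_old, small_new))  # True
print(would_alert(large_old, large_new))  # False
```

Under these assumptions, timer noise alone is enough to trigger an alert for the small value, while the same noise is invisible on the large one.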

We're following the directions below to try and improve the situation:

using and improving the quality scores of benchmarks to fix incorrect models;
reinforcing benchmarks of small values with loop amplification;
improving benchmark reporting to reduce analysis time.
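
The idea behind loop amplification can be sketched as follows: instead of timing one call, whose duration may be comparable to the timer resolution, the call is repeated many times and the total is divided by the number of repetitions. This is a minimal Python illustration under assumed names (`time_once`, `time_amplified`), not the benchmark framework's implementation, and the loop overhead is included in the measurement.

```python
import time

def time_once(f) -> int:
    """Time a single call of f, in nanoseconds. Noisy for very fast functions."""
    start = time.perf_counter_ns()
    f()
    return time.perf_counter_ns() - start

def time_amplified(f, repetitions: int = 1000) -> float:
    """Average time per call of f over `repetitions` calls, in nanoseconds.

    Timer latency and one-off disturbances are spread over all repetitions,
    so their share of the per-call estimate shrinks as repetitions grows.
    """
    if repetitions <= 0:
        raise ValueError("repetitions must be positive")
    start = time.perf_counter_ns()
    for _ in range(repetitions):
        f()
    elapsed = time.perf_counter_ns() - start
    return elapsed / repetitions

def tiny_operation():
    # Stand-in for a very cheap function under benchmark.
    return 1 + 1

print(time_once(tiny_operation))
print(time_amplified(tiny_operation, repetitions=10_000))
```

The per-call figure from `time_amplified` is typically more stable across runs than the one from `time_once`, which is why amplification helps with benchmarks of small values.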

How to interpret scores

See this note.

TODO: simplify even further (for a developer outside the gas team) and add examples.

Technical details

This section gives more details on specific items from the Work break down section.