TL;DR
Fix #2224 (closed)
This MR improves the speed of round/offset calculation with a 5x speed-up based on the simplified datatype brought by MR !3984 (merged).
┌───────────┬──────────┬──────────┬──────────┬──────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├───────────┼──────────┼──────────┼──────────┼──────────┼────────────┤
│ binsearch │ 248.50us │ 133.75kw │ 8.09w │ 8.09w │ 100.00% │
│ sqrt │ 50.94us │ 20.15kw │ 0.43w │ 0.43w │ 20.50% │
└───────────┴──────────┴──────────┴──────────┴──────────┴────────────┘
Problem
We want to optimize function round_and_offset
in module round_repr.ml
to use an explicit solution driven by a quadratic equation. We denote:
- Input:
\tau_0 \in \mathbb{N}^*
, the first round duration - Input:
\tau_{\varepsilon} \in \mathbb{N}^*
, the delay increment per round - Input:
\text{level\_offset} \in \mathbb{R}^+
- Find the round and its offset
Let us define the sequence \tau_k
as the round duration of round k
:
{(\tau_k)}_{k\in\mathbb N} ~:~
\left\{\begin{matrix}
\tau_0 &=& \tau_0\\
\tau_{k+1} &=& \tau_{k} + \tau_{\varepsilon}
\end{matrix}\right.
Example
\tau_0 |
\tau_{\varepsilon} |
\text{level\_offset} |
round | its offset |
---|---|---|---|---|
30 | 15 | 32 | 1 | 2 |
30 | 15 | 75 | 2 | 0 |
The equation
From a level_offset
, we want to find the greatest round
such that the offset
(w.r.t. to this round) is the least (and non-negative).
In other words, the equation is to maximize \text{round}
such that \text{offset} \geq 0
and
\begin{aligned}
\sum_{k=0}^{\text{round} - 1} \tau_k + \text{offset} &= \text{level\_offset}
\end{aligned}
We deliberately minimize the \text{offset}
to 0 to solve the equation in \text{round}
\begin{aligned}
\sum_{k=0}^{\text{round} - 1} \tau_k &= \text{level\_offset}
\end{aligned}
💡 It is possible to unfold the sum due to its commutative properties, we have\begin{aligned} \sum_{k=0}^{\text{round} - 1} \tau_k &= \tau_0 + (\tau + \tau_{\varepsilon}) + (\tau + 2 \tau_{\varepsilon}) + (\tau + 3\tau_{\varepsilon}) + \dots + (\tau + (\text{round} - 1) \times \tau_{\varepsilon}) \\ &= \text{round} \times \tau_0 + \left ( \sum_{k = 1}^{\text{round} - 1} k \right ) \tau_{\varepsilon}\\ &= \text{round} \times \tau_0 + \frac{(\text{round} - 1) \times \text{round}}{2}\times \tau_{\varepsilon} \end{aligned}
The equation to solve is now explicited as
\begin{aligned}
\text{round} \times \tau_0 + \frac{(\text{round} - 1) \times \text{round}}{2} \times \tau_{\varepsilon} &= \text{level\_offset}
\end{aligned}
In order to further use a lemma below, we need to get rid of this division
\begin{aligned}
2 \times \text{round} \times \tau_0 + (\text{round} - 1) \times \text{round} \times \tau_{\varepsilon} &= 2 \times \text{level\_offset}
\end{aligned}
After expanding the terms, the quadratic equation in \text{round}
is
\begin{aligned}
\tau_{\varepsilon} \times \text{round}^2 + \left ( 2 \tau_0 - \tau_{\varepsilon} \right ) \times \text{round} - 2 \times \text{level\_offset} &= 0
\end{aligned}
Thereby, the discriminant is
\begin{aligned}
\Delta = \left ( 2 \tau_0 - \tau_{\varepsilon} \right )^2 + 8 \times \tau_{\varepsilon} \times \text{level\_offset}
\end{aligned}
Hence,
\begin{aligned}
\text{round} = \frac{\tau_{\varepsilon} - 2 \tau_0 + \sqrt \Delta}{2 \tau_{\varepsilon}}
\end{aligned}
Once upon a time in the world of integer arithmetics
To recall, we want to compute in integer arithmetics. This means we should postpone floating-prone computations to top-level formulae. The problem now lies in the square root above \Delta
which impossibly ensures the yielded value to be integer. Nevertheless, a sound approximation is possible using a lemma on quotients with floor.
Lemma (equation 3.11, Graham-Knuth-Patashnik, p. 72).
\forall x \in \mathbb R \quad \forall m \in \mathbb Z \quad \forall n \in \mathbb N^*
\quad \left \lfloor \frac{m + x}{n} \right \rfloor = \left \lfloor \frac{m + \left \lfloor x\right \rfloor}{n} \right \rfloor
From that we can state that,
\begin{aligned}
\left \lfloor \text{round} \right \rfloor &= \left \lfloor \frac{\tau_{\varepsilon} - 2 \tau_0 + \sqrt \Delta}{2 \tau_{\varepsilon}} \right \rfloor \\
&= \left \lfloor \frac{\tau_{\varepsilon} - 2 \tau_0 + \left \lfloor \sqrt \Delta \right \rfloor}{2 \tau_{\varepsilon}} \right \rfloor
\end{aligned}
Finally, we can compute integer square-root \left \lfloor \sqrt \Delta \right \rfloor
using either Newton-Raphson or 4-adic induction [Kreitz, 2003].
Space
For space-complexity, \Delta
can grow really big but still linearly in O (\text{level\_offset})
. We need to ensure that \Delta
does not overflow 2^{63}-1
.
\begin{aligned}
\Delta &\quad\leq\quad 2^{63} - 1\\
\left ( 2 \tau_0 - \tau_{\varepsilon} \right )^2 + 8 \times \tau_{\varepsilon} \times \text{level\_offset} &\quad\leq\quad 2^{63} - 1\\
\text{level\_offset} &\quad\leq\quad \frac{2^{63} - 1 - (2 \tau_0 - \tau_{\varepsilon})^2}{8\tau_{\varepsilon}}
\end{aligned}
If we have a \tau_{0} = 30
and \tau_{\varepsilon} = 15
, this means that a reasonably large upper bound for \text{level\_offset}
is 10^{16} ~ \text{sec}
. In other words, 10^9
years. The risk of reaching the overflow is thereby low.
Time
In terms of time-complexity w.r.t. to \text{level\_offset}
, is this optimal? Not everything is given for free and what costs the most is the integer square root operation \left \lfloor\sqrt{\Delta} \right \rfloor
. The 4-adic induction and the Newton-Raphson method provide a logarithmic algorithm and this would make the solution O (\text{log}~ \text{level\_offset})
. In this problem, we choose Newton-Raphson as 4-adic induction tends to reach overflow on call stack.
Benchmarks
All the following values were measured with @richard.bonichon's micro-benchmarks repository (using the core_bench library).
Initially, out first benchmarks were conducted on a uniformly distributed sample of values of \text{level\_offset}
(ranging from 0
to 1000
). The following table exhibits the results using a uniform distribution of sample and show a x5 speed-up.
┌───────────┬──────────┬──────────┬──────────┬──────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├───────────┼──────────┼──────────┼──────────┼──────────┼────────────┤
│ binsearch │ 849.44us │ 244.80kw │ 16.02w │ 16.02w │ 100.00% │
│ sqrt │ 176.01us │ 22.46kw │ 0.72w │ 0.72w │ 20.72% │
└───────────┴──────────┴──────────┴──────────┴──────────┴────────────┘
In reality, we can argue that the values for this input are not uniformly distributed. Here is a chart of how values for \text{level\_offset}
are distributed in practice during a 12-hour span:
We repeat the benchmark experiment using the statistical distribution as the sample, and we measured the following values. We can observe that the x5 speed-up is preserved even for high incidence around input 0
:
┌───────────┬──────────┬──────────┬──────────┬──────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├───────────┼──────────┼──────────┼──────────┼──────────┼────────────┤
│ binsearch │ 248.50us │ 133.75kw │ 8.09w │ 8.09w │ 100.00% │
│ sqrt │ 50.94us │ 20.15kw │ 0.43w │ 0.43w │ 20.50% │
└───────────┴──────────┴──────────┴──────────┴──────────┴────────────┘
These experiments tend to show that this solution remain effective whether the distribution is uniform or Pareto.