spectrum: speed up beamforming gain calculation in three-gpp-spectrum-propagation-loss-model.cc
Here are the profiles before and after the patch for `lena-lte-comparison-user --simTag=test2-user --trafficScenario=2 --simulator=5GLENA --technology=LTE --numRings=2 --ueNumPergNb=5 --calibration=false --freqScenario=0 --operationMode=FDD --direction=UL --RngRun=1 --RngSeed=1` (from the nr codebase), built with the release profile.
As can be seen, sincos is the main villain (57% of retired, i.e. completed and not aborted, instructions). The caller of all those sincos is none other than `ThreeGppSpectrumPropagationLossModel::CalcBeamformingGain()`.
The reason those sincos calls take almost half of the simulation time (cycles not in halt) is not just that trigonometric functions are slow: the phases passed as arguments are first wrapped to something like [-π/2, π/2] for the best performance. With carrier frequencies in the GHz range and delays of hundreds of nanoseconds, the raw phase -2π·f·τ is hundreds of radians, so this argument reduction is always exercised, and it causes a ton of branch misses, which starve the CPU backend and slow everything down.
Since the delay components of the channel don't change frequently, nor do the frequency bands or the number of clusters, we can in theory cache the computed sincos of the propagation delay phase, eliminating both the trigonometric calls and the branch misses from wrapping. After doing this, we get the following profile.
In terms of speedup, we get about 1.4x (~300s -> ~214s).
P.S.: A previous version of this MR incorrectly claimed a much larger speedup, but the number of events was completely off...
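For illustration, a minimal sketch of the caching idea described above (hypothetical names, not the identifiers from the patch):

```cpp
#include <cmath>
#include <complex>
#include <vector>

// Hypothetical cache of exp(-j*2*pi*f_sb*tau_c) per (subband, cluster).
// Names (subbandFrequencies, clusterDelays) are illustrative only.
std::vector<std::vector<std::complex<double>>>
BuildDelaySincosCache(const std::vector<double>& subbandFrequencies, // subband centers [Hz]
                      const std::vector<double>& clusterDelays)      // per-cluster delay [s]
{
    std::vector<std::vector<std::complex<double>>> cache(subbandFrequencies.size());
    for (std::size_t sb = 0; sb < subbandFrequencies.size(); ++sb)
    {
        cache[sb].reserve(clusterDelays.size());
        for (double tau : clusterDelays)
        {
            const double phase = -2.0 * M_PI * subbandFrequencies[sb] * tau;
            // std::polar evaluates cos/sin once; the result is reused on every
            // gain calculation until the delays or bands change.
            cache[sb].push_back(std::polar(1.0, phase));
        }
    }
    return cache;
}
```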
Activity
added module: spectrum label
assigned to @Gabrielcarvfer
added 1 commit
- 76cacd0d - spectrum: optimize beamforming cluster gain calculation
- Resolved by Gabriel Ferreira
Well, you caught my attention :) There are a few things I don't understand here.
First, the outer loop seems unnecessarily complicated in construction, probably a remnant of the `while`. Could it be simply

```cpp
auto sbit = tempPsd->ConstBandsBegin();
for (auto vit = tempPsd->ValuesBegin(); vit != tempPsd->ValuesEnd(); ++vit, ++sbit)
```
Second, the inner `for` loops all run over the same range, but only modify a single `temp[i]` each iteration. Could these loops be combined (eliminating the need for a vector `temp`)?

```cpp
for (auto vit...)
{
    const double fsb = sbit->fc;
    std::complex<double> subbandsGain(0, 0);
    for (uint16_t cIndex = 0; cIndex < numCluster; ++cIndex)
    {
        const double delay = -2 * M_PI * fsb * channelParams->m_delay[cIndex];
        std::complex<double> temp(std::cos(delay), std::sin(delay));
        temp *= longTerm[cIndex] * doppler[cIndex];
        subbandsGain += temp;
    }
    const double gain = std::norm(subbandsGain);
    *vit *= gain;
}
```
I guess this is essentially reverting to the previous version with a single inner loop. Perhaps you split it to understand which computation line was expensive?
Third, the scaling by `longTerm * doppler` is unmodified by either of these loops, so it could be hoisted outside completely (at the cost of reintroducing the vector `temp`):

```cpp
PhasedArrayModel::ComplexVector longTermDoppler = longTerm * doppler; // element-wise multiplication
MatrixArray<double> twoPiDelay = -2 * M_PI * channelParams->m_delay;  // scalar multiply
for (auto vit...)
{
    MatrixArray<double> fsbDelay = sbit->fc * twoPiDelay;             // scalar multiply
    PhasedArrayModel::ComplexVector temp(numCluster);
    for (uint16_t cIndex = 0; cIndex < numCluster; ++cIndex)
    {
        const double delay = fsbDelay[cIndex];
        temp[cIndex] = std::complex<double>(std::cos(delay), std::sin(delay));
    }
    *vit *= std::norm(DotProduct(temp, longTermDoppler));             // vector dot product
}
```
(This assumes we embellish `MatrixArray` and `ComplexVector` with Expression Templates, or expand out the commented lines. The `DotProduct` is just `std::inner_product(std::begin(temp), std::end(temp), std::begin(longTermDoppler), std::complex<double>(0.0))`.)
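For the record, a self-contained sketch of that `DotProduct` (plain `std::vector` alias; unconjugated product, matching the sum the loop computes):

```cpp
#include <complex>
#include <numeric>
#include <vector>

using ComplexVector = std::vector<std::complex<double>>;

// Plain dot product: sum_c a[c] * b[c]. Note the complex-typed initial
// value; a bare 0.0 would make the accumulator a double and fail to compile.
std::complex<double> DotProduct(const ComplexVector& a, const ComplexVector& b)
{
    return std::inner_product(a.begin(), a.end(), b.begin(),
                              std::complex<double>(0.0));
}
```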
Fourth, the vector `longTerm` is weird:

- It's instantiated (l. 140), but never populated, so it should be empty.
- Its size is tested in an assert (l. 171). Why doesn't this fire?
- It's indexed (above, or in the patch l. 249, or in the original l. 243). How can that work if it's empty?
Edited by Peter Barnes
added performance label
- Resolved by Eduardo Almeida
I have been looking at and testing this for some time. All tests are passing. I checked the `ThreeGppSpectrumPropagationLossModel` logs up to 0.22s (800k lines), and they are matching perfectly.

Tried a few different seeds, but still can't explain the difference in the number of events.
Testing with `lena-lte-comparison-user --simTag=test2-user --trafficScenario=2 --simulator=5GLENA --technology=LTE --numRings=2 --ueNumPergNb=5 --calibration=false --freqScenario=0 --operationMode=FDD --direction=UL --RngRun=1 --RngSeed=1`:

| Seed | Original (A) | Modded (B) | Difference (A-B) |
|------|--------------|------------|------------------|
| 1    | 19883834     | 19818277   | 65557            |
| 2    | 19534947     | 19505701   | 29246            |
| 3    | 19458601     | 19402595   | 56006            |

The maximum error found when comparing the two approaches was about 6.31089e-29...
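The check behind that error figure is a plain elementwise comparison of the resulting PSD values; a hypothetical sketch (not part of the MR):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Max absolute elementwise difference between the PSDs produced by the
// original and the cached implementations (hypothetical helper; assumes
// both vectors have the same length).
double MaxAbsError(const std::vector<double>& original, const std::vector<double>& modded)
{
    double maxErr = 0.0;
    for (std::size_t i = 0; i < original.size(); ++i)
    {
        maxErr = std::max(maxErr, std::abs(original[i] - modded[i]));
    }
    return maxErr; // on the order of 6e-29 in the runs above
}
```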
Edited by Gabriel Ferreira

added 42 commits
- 0754a894...e48ffcf4 - 41 commits from branch nsnam:master
- b42541c2 - spectrum: precompute channel delay sincos
Still a long way to go to get this perfect, but 37% -> 10% of time spent on sincos is a win in my book.

It can probably get to a 2x speedup with improvements to the data layout, since most of the time is now wasted waiting for data to become available.

`std::complex<double>` being 16 bytes wide means my 256 kB L1d cache can fit only 16k values in the best-case scenario. The biggest SpectrumValue I saw had 1-2k bands, and the cached sincos for the doppler component has that many values times the number of clusters (e.g. 16), so we end up flushing the entire L1d cache to compute that last loop.
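Spelled out, that footprint arithmetic (band and cluster counts from above; a sketch, not code from the patch):

```cpp
#include <complex>
#include <cstddef>
#include <cstdio>

int main()
{
    const std::size_t bands = 2000;   // largest SpectrumValue observed above
    const std::size_t clusters = 16;  // e.g. number of clusters
    const std::size_t bytes = bands * clusters * sizeof(std::complex<double>);
    // 2000 * 16 * 16 B = 512000 B ~ 500 KiB: roughly twice a 256 kB L1d,
    // so the cached sincos table alone evicts everything else.
    std::printf("cached sincos footprint: %zu KiB\n", bytes / 1024);
}
```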
Reducing the precision to `std::complex<float>` gives an additional ~1.1x speedup, but I'm not sure how much precision is required, so I'm not touching it for now.

Edited by Gabriel Ferreira

Trying fancy floating-point compression with Zfp by LLNL. If it works, it should give a nice speedup.
Update: didn't work...
Edited by Gabriel Ferreira

While Intel might not have the best processors, they certainly do have the best tools. It clearly points to the memory bottleneck in

```cpp
auto subsbandGain = (doppler.GetValues() * channelParams->m_cachedDelaySincos[i]).sum();
```
Also, the P-core vs E-core memory performance difference shows their strategy is garbage (for our use case) and we should probably pin simulations to performance cores.
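One way to do that pinning (a Linux-only sketch; which logical CPUs are P-cores is machine-specific, so 0-15 below is purely illustrative; `taskset -c 0-15 ./ns3 run ...` achieves the same from the shell):

```cpp
#include <sched.h> // Linux-only: cpu_set_t, sched_setaffinity

// Pin the current process to a core range so the simulation stays on the
// P-cores. Check /sys/devices/cpu_core/cpus for the actual P-core ids;
// 0-15 here is an assumption for illustration.
bool PinToCores(int first, int last)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = first; cpu <= last; ++cpu)
    {
        CPU_SET(cpu, &set);
    }
    // pid 0 = the calling process
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}
```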
Edited by Gabriel Ferreira

Sidetracking as usual: a friend of mine brought to my attention that this performance difference between the P-cores (Golden Cove) and E-cores (Gracemont) is actually in the load-store part: the E-cores are missing a load unit, load and store scheduling is shared, and the number of registers/queue entries is much smaller.
I didn't notice while profiling, but it is actually in the results.
Edited by Gabriel Ferreira

added 245 commits
- 0f627666...efcbd25d - 244 commits from branch nsnam:master
- 87b6070e - spectrum: cache sincos in ThreeGppSpectrumPropagationLossModel
added 1 commit
- 1d1899e0 - spectrum: cache sincos in ThreeGppSpectrumPropagationLossModel
- Resolved by Gabriel Ferreira
So, what is the final performance gain after adding this optimization? Thanks!
added 1 commit
- b21de954 - spectrum: cache sincos in ThreeGppSpectrumPropagationLossModel
added 1 commit
- 147ffdac - spectrum: cache sincos in ThreeGppSpectrumPropagationLossModel
For the `cttc-nr-mimo-demo`, ~1.36x with `cttc-nr-mimo-demo --numHPortsGnb=1 --xPolGnb=1 --rankLimit=2 --gnbUeDistance=300 --enableMimoFeedback=1`

Edited by Gabriel Ferreira

mentioned in merge request cttc-lena/nr!111 (merged)
mentioned in merge request !1846 (merged)
added 54 commits
- 147ffdac...fec2c7b4 - 53 commits from branch nsnam:master
- 2a400f4e - spectrum: cache sincos in ThreeGppSpectrumPropagationLossModel
added 1 commit
- d3e57dd2 - spectrum: cache sincos in ThreeGppSpectrumPropagationLossModel
added 1 commit
- 8e0a5b13 - spectrum: cache sincos in ThreeGppSpectrumPropagationLossModel
added 1 commit
- aae8b5a7 - spectrum: cache sincos in ThreeGppSpectrumPropagationLossModel
- Resolved by Gabriel Ferreira