
spectrum: speed up beamforming gain calculation in three-gpp-spectrum-propagation-loss-model.cc

All threads resolved!

Here are the before and after profiles for the patches, running lena-lte-comparison-user --simTag=test2-user --trafficScenario=2 --simulator=5GLENA --technology=LTE --numRings=2 --ueNumPergNb=5 --calibration=false --freqScenario=0 --operationMode=FDD --direction=UL --RngRun=1 --RngSeed=1 (from the nr codebase), using the release build profile.

Before

Screenshot_from_2023-10-30_12-19-21

As can be seen, sincos is the main villain (57% of retired/completed/unaborted instructions). The caller of most of those sincos calls is none other than ThreeGppSpectrumPropagationLossModel::CalcBeamformingGain().

The reason those sincos calls take almost half of the simulation time (cycles not in halt) is not only that trigonometric functions are slow, but also that the phases passed as arguments are first wrapped into a range like [-Pi/2, Pi/2] for best precision and performance. This wrapping causes a ton of branch misses, which starve the CPU backend and slow everything down.

Since the channel's delay components don't change frequently, nor do the frequency bands or the number of clusters, we can in theory cache the computed phase term for the propagation delay, cutting down on both the trigonometric calls and the branch misses caused by the wrapping. After doing this, we get the following.

After

Screenshot_from_2023-10-30_13-55-56

In terms of runtime, this gives about a 1.4x speedup (~300 s -> ~214 s).
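
For reference, here is a rough sketch of the caching idea with illustrative names (not the exact code from the patch): the phasor exp(-j·2π·f_sb·τ_n) depends only on the subband frequencies and the per-cluster delays, so it can be computed once per channel realization and reused on every gain computation, leaving only complex multiply-adds in the hot loop. In the patch, the cached table shows up later in this thread as channelParams->m_cachedDelaySincos.

#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Illustrative sketch (hypothetical names, not the patch's API): precompute the
// per-(subband, cluster) phasor exp(-j * 2 * pi * f_sb * tau_n) once, reuse it later.
using DelaySincosTable = std::vector<std::vector<std::complex<double>>>; // [subband][cluster]

DelaySincosTable
PrecomputeDelaySincos(const std::vector<double>& subbandFrequencies, // f_sb per subband
                      const std::vector<double>& clusterDelays)      // tau_n per cluster
{
    DelaySincosTable table(subbandFrequencies.size(),
                           std::vector<std::complex<double>>(clusterDelays.size()));
    for (std::size_t sb = 0; sb < subbandFrequencies.size(); ++sb)
    {
        for (std::size_t n = 0; n < clusterDelays.size(); ++n)
        {
            const double phase = -2.0 * M_PI * subbandFrequencies[sb] * clusterDelays[n];
            table[sb][n] = std::complex<double>(std::cos(phase), std::sin(phase)); // the only sincos, paid once
        }
    }
    return table;
}

// Per-call hot loop: the Doppler term still changes over time, but no sincos is left here.
double
SubbandGain(const DelaySincosTable& table,
            std::size_t sb,
            const std::vector<std::complex<double>>& longTermDoppler) // longTerm[n] * doppler[n]
{
    std::complex<double> acc(0.0, 0.0);
    for (std::size_t n = 0; n < longTermDoppler.size(); ++n)
    {
        acc += table[sb][n] * longTermDoppler[n];
    }
    return std::norm(acc);
}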

P.S.: A previous version of this MR incorrectly claimed a much larger speedup, but the number of events was completely off... :sweat_smile:

Edited by Gabriel Ferreira

Merge request reports


Activity

  • added 1 commit

    • 76cacd0d - spectrum: optimize beamforming cluster gain calculation


  • Hmm, weird shenanigans. It worked just fine, now slowed down again. Must be a memory layout thing.

  • Reworking ChannelParams to group all cluster data close together, which should make things more cache friendly and hopefully keep the CPU backend fed.
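
    (Rough sketch of the kind of layout meant here, with made-up member names: keep the per-cluster quantities the hot loop reads in flat, contiguous arrays rather than separately allocated objects.)

    #include <complex>
    #include <vector>

    // Hypothetical sketch only; the member names do not match the real ChannelParams.
    struct ClusterData
    {
        // One flat array per quantity, indexed by cluster, so a pass over all clusters
        // walks contiguous memory instead of chasing scattered allocations.
        std::vector<double> delays;                      // tau_n
        std::vector<std::complex<double>> longTermGains; // long-term gain per cluster
        std::vector<std::complex<double>> dopplerTerms;  // per-cluster Doppler phasor
    };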

  • Closing this temporarily while I work around the trigonometric functions.

    • Resolved by Gabriel Ferreira

      Well, you caught my attention :) There are a few things I don't understand here.

      First, the outer loop seems unnecessarily complicated in its construction, probably a remnant of the earlier while loop. Could it be simplified to something like

      auto sbit = tempPsd->ConstBandsBegin(); // the two iterators have different types, so sbit is hoisted out of the for-init
      for (auto vit = tempPsd->ValuesBegin(); vit != tempPsd->ValuesEnd(); ++vit, ++sbit)

      Second, the inner for loops all run over the same range, but only modify a single temp[i] each iteration. Could these loops be combined (eliminating the need for a vector temp)?

      for (auto vit...) // same loop header as above
      {
          const double fsb = sbit->fc;
          std::complex<double> subbandsGain(0, 0);
          for (uint16_t cIndex = 0; cIndex < numCluster; ++cIndex)
          {
              const double delay = -2 * M_PI * fsb * channelParams->m_delay[cIndex];
              std::complex<double> temp(std::cos(delay), std::sin(delay));
              temp *= longTerm[cIndex] * doppler[cIndex];
              subbandsGain += temp;
          }
          const double gain = std::norm(subbandsGain);
          *vit *= gain;
      }

      I guess this is essentially reverting to the previous version with a single inner loop. Perhaps you split it to understand which computation line was expensive?

      Third, the scaling by longTerm * doppler is unmodified by either of these loops, so could be hoisted outside completely (at the cost of reintroducing the vector temp):

      PhasedArrayModel::ComplexVector longTermDoppler = longTerm * doppler;  // element-wise multiplication
      MatrixArray<double> twoPiDelay = -2 * M_PI * channelParams->m_delay;   // scalar multiply
      for (auto vit...)
      {
          MatrixArray<double> fsbDelay = sbit->fc * twoPiDelay;              // scalar multiply
          PhasedArrayModel::ComplexVector temp(numCluster);
          for (uint16_t cIndex = 0; cIndex < numCluster; ++cIndex)
          {
              const double delay = fsbDelay[cIndex];
              temp[cIndex] = std::complex<double>( cos(delay), sin(delay) );
          }
          *vit *= std::norm(DotProduct(temp, longTermDoppler));              // vector dot product
      }

      (This assumes we embellish MatrixArray and ComplexVector with Expression Templates, or expand out the commented lines. The DotProduct is just `std::inner_product(std::begin(temp), std::end(temp), std::begin(longTermDoppler), std::complex<double>(0.0, 0.0))`.)
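
      For completeness, here is a self-contained version of that DotProduct, assuming ComplexVector can be treated like a std::vector of std::complex<double> (the complex zero as initial value matters; starting from plain 0.0 would make std::inner_product accumulate into a double and drop the imaginary parts):

      #include <complex>
      #include <numeric>
      #include <vector>

      std::complex<double>
      DotProduct(const std::vector<std::complex<double>>& a,
                 const std::vector<std::complex<double>>& b)
      {
          // Unconjugated complex dot product, matching the formula above.
          return std::inner_product(std::begin(a), std::end(a), std::begin(b),
                                    std::complex<double>(0.0, 0.0));
      }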

      Fourth, the vector longTerm is weird:

      • It's instantiated (l. 140), but never populated, so it should be empty.
      • Its size is tested in an assert (l. 171). Why doesn't this fire?
      • It's indexed (above, or the patch l. 249, or the original l. 243). How can that work if it's empty?
      Edited by Peter Barnes
  • reopened

  • Gabriel Ferreira changed the description


  • added 1 commit


  • Eduardo Almeida
  • I have been looking at and testing this for some time. All tests are passing. I checked the ThreeGppSpectrumPropagationLossModel logs up to 0.22 s (800k lines), and they match perfectly.

    Tried a few different seeds, but still can't explain the difference in the number of events.

    Testing with

    lena-lte-comparison-user --simTag=test2-user --trafficScenario=2 --simulator=5GLENA --technology=LTE --numRings=2 --ueNumPergNb=5 --calibration=false --freqScenario=0 --operationMode=FDD --direction=UL --RngRun=1 --RngSeed=1
    Seed  Original (A)  Modded (B)  Difference (A-B)
    1     19883834      19818277    65557
    2     19534947      19505701    29246
    3     19458601      19402595    56006

    The maximum error found when comparing the two approaches was like 6.31089e-29...

    Edited by Gabriel Ferreira
  • Gabriel Ferreira added 42 commits


  • Gabriel Ferreira marked this merge request as ready


  • added 1 commit

    • 0f627666 - spectrum: precompute channel delay sincos


  • Still a long way to go to get this perfect, but going from 37% to 10% of the time spent on sincos is a win in my book.

    CPU Cycles: [profiler screenshot]

    It can probably get to a 2x speedup with improvements to the data layout, since most of the time is now wasted waiting for data to become available.

    std::complex<double> being 16 bytes wide means my 256 kB L1d cache can fit only 16k values in the best-case scenario.

    The biggest SpectrumValue I saw had 1-2k bands, and the cached sincos for the Doppler component has that many values times the number of clusters (e.g. 16), so we end up flushing the entire L1d cache to compute that last loop.

    Reducing the precision to std::complex<float> gives an additional speedup of ~1.1x, but I'm not sure how much precision is required, so I'm not touching it for now.
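
    Back-of-the-envelope numbers behind that, using the larger end of the figures above (illustrative constants, not code from the patch):

    #include <complex>
    #include <cstddef>

    constexpr std::size_t l1dBytes = 256 * 1024;                     // 256 kB L1d, as quoted above
    constexpr std::size_t valueBytes = sizeof(std::complex<double>); // 16 bytes per cached value
    constexpr std::size_t valuesFittingL1d = l1dBytes / valueBytes;  // 16384 values at best

    constexpr std::size_t numBands = 2000;  // larger end of the 1-2k bands observed
    constexpr std::size_t numClusters = 16; // e.g. 16 clusters
    constexpr std::size_t cachedValues = numBands * numClusters;     // 32000 values
    constexpr std::size_t cachedBytes = cachedValues * valueBytes;   // 512000 bytes, roughly 2x the L1d
    static_assert(cachedBytes > l1dBytes, "the cached sincos table alone exceeds the L1d capacity");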

    Stalled cycles caused by the backend: [profiler screenshot]

    Edited by Gabriel Ferreira
  • Trying fancy floating-point compression with zfp from LLNL. If it works, it should give a nice speedup. :smile:

    Update: it didn't work... :rolling_eyes:

    Edited by Gabriel Ferreira
  • While Intel might not have the best processors, they certainly do have the best tools. It clearly points to the memory bottleneck in

    auto subsbandGain = (doppler.GetValues() * channelParams->m_cachedDelaySincos[i]).sum();

    [profiler screenshot]

    Also, the P-core vs E-core memory performance difference shows that their strategy is garbage (for our use case), and we should probably pin simulations to the performance cores.

    Edited by Gabriel Ferreira
  • Sidetracking as usual: a friend of mine brought to my attention that this performance difference between the P-cores (Golden Cove) and E-cores (Gracemont) is actually in the load-store part. The E-cores are missing a load unit, the scheduling of loads and stores is shared, and the number of registers/entries is much smaller.

    I didn't notice it while profiling, but it is actually visible in the results.

    Profiling from operator*: [profiler screenshot]
    Edited by Gabriel Ferreira
  • Gabriel Ferreira added 245 commits


  • added 1 commit

    • 1d1899e0 - spectrum: cache sincos in ThreeGppSpectrumPropagationLossModel


  • added 1 commit

    • b21de954 - spectrum: cache sincos in ThreeGppSpectrumPropagationLossModel


  • added 1 commit

    • 147ffdac - spectrum: cache sincos in ThreeGppSpectrumPropagationLossModel


  • For the cttc-nr-mimo-demo, ~1.36x speedup with

    cttc-nr-mimo-demo --numHPortsGnb=1 --xPolGnb=1 --rankLimit=2 --gnbUeDistance=300 --enableMimoFeedback=1

    Before: [profiler screenshot]

    After: [profiler screenshot]

    Edited by Gabriel Ferreira
  • mentioned in merge request cttc-lena/nr!111 (merged)

  • Gabriel Ferreira mentioned in merge request !1846 (merged)


  • added 54 commits


  • Updated the caching to support numerology changes in case the number of RBs doesn't change (the channel bandwidth grows in the same proportion as the RB bandwidth).
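
    (A rough sketch of the kind of check this implies, with hypothetical names: a numerology change can keep the number of RBs constant while scaling every subband frequency, so comparing sizes alone would reuse a stale table; the frequencies themselves have to match.)

    #include <vector>

    // Hypothetical names only; the real cache lives inside the channel model.
    struct DelaySincosCache
    {
        std::vector<double> subbandFrequencies; // the frequencies the cached table was built for
        // ... the cached complex phasors would be stored alongside ...
    };

    bool
    CacheIsValid(const DelaySincosCache& cache, const std::vector<double>& currentFrequencies)
    {
        // Same number of subbands and the same center frequencies; a numerology change
        // that scales the frequencies invalidates the cache even with an unchanged RB count.
        return cache.subbandFrequencies == currentFrequencies;
    }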

  • added 1 commit

    • d3e57dd2 - spectrum: cache sincos in ThreeGppSpectrumPropagationLossModel


  • added 1 commit

    • 8e0a5b13 - spectrum: cache sincos in ThreeGppSpectrumPropagationLossModel


  • added 2 commits

    • 045ae0c9 - 1 commit from branch nsnam:master
    • 9b160b7d - spectrum: cache sincos in ThreeGppSpectrumPropagationLossModel


  • added 1 commit

    • aae8b5a7 - spectrum: cache sincos in ThreeGppSpectrumPropagationLossModel


  • Eduardo Almeida
  • added 1 commit

    • de5a061f - Update file matrix-based-channel-model.h


  • Gabriel Ferreira resolved all threads

