Performance left on the table
Hi,
I've exchanged a few emails with @tomhenderson about performance profiling of ns-3.
After porting ns-3 to Windows (MinGW and started working on ClangCL/MSVC), I've seen pretty bad performance, which makes sense as it was never optimized for Windows shenanigans.
I've since started looking for the reasons by looking at long running examples (e.g. wifi-he-network, dsr and wifi-manager).
Here is a short list of problems I've found:
1. ns-3 flushes streams way too much for a single threaded application
In the DSR example, about a third of the simulation time is spent writing to a AsciiPhyReceiveSinkWithContext. If you look into it, you will see a OutputStreamWrapper, which writes to an output trace file. Some platforms may cache this by a bit, but not if you explicitly/implicitly flush it. Windows doesn't cache anything while writing and spends about a third of the CPU time on this.
Best solution I could think of: change OutputStreamWrapper to return a stringstream instead of the ofstream. Writes are kept in memory until explicitly flushed, when the contents of the stringstream are copied to the ofstream. Gabrielcarvfer/ns-3-dev@4ef829a5
2. De-referencing cold data in the middle of hot loops
2.1 De-referencing cold data in dsr example
After running the profiler again, I've looked for high cache miss rates in areas with high CPU utilization.
DsrPassiveBuffer::AllEqual was the perfect candidate for optimization.
The problem resides in the loop https://gitlab.com/nsnam/ns-3-dev/-/blob/master/src/dsr/model/dsr-passive-buff.cc#L117, which de-references a ton of cold data and fails the test most of the time, as all packets are bundled together.
Possible solution: replace the vector with a map of vectors, addressed by destination Ip, as in Gabrielcarvfer/ns-3-dev@256182bc. Results are very good, with 94% reduction of cache misses and the most called function of the module now spends about 2.4% of CPU time instead of 44%. There was not a huge speedup like with the IO.
2.2 De-referencing cold data in wifi-manager-example
I've used one of the argument sets used for testing "--wifiManager=Ideal --standard=802.11ax-5GHz --serverChannelWidth=160 --clientChannelWidth=160 --serverShortGuardInterval=1600 --clientShortGuardInterval=1600 --serverNss=4 --clientNss=4 --stepTime=0.1"
As you can probably see, TtlExceeded is awful. About 20% of CPU time of the entire simulation.
Why is that? Time manipulation/creation.
Where does that come from? Two places: (Packet timestamp + m_maxDelay) and Simulator::Now() in this line https://gitlab.com/nsnam/ns-3-dev/-/blob/master/src/wifi/model/wifi-mac-queue.cc#L97.
Here is the core module stats.
Possible parts of the final solution:
- Cache Simulator::Now() to prevent the creation and destruction of new Time(). This is huge, but not enough as you still need to de-reference cold data/instructions with GetSimulatorImpl, etc, etc. Gabrielcarvfer/ns-3-dev@683ca267
- Simulator::Now() calls outside the loops that call TtlExpired. Gabrielcarvfer/ns-3-dev@11f7cf70
- Precompute timestamp of packet + maxdelay. If the maxDelay doesn't change, you won't spend time recomputing a new Time and then delete it right away. Gabrielcarvfer/ns-3-dev@4324ca9e
How big of a problem it was? It was huge. Here are the result after these changes.
Doesn't seem to have improved much, but it was at this point that I've realized only showing percentages was not the brightest idea. Here are the absolute results.
About 3x reduction in total CPU samples. 18x reduction in core module CPU time for the wifi-manager-example. It didn't speed up the execution in Windows by much, as the stringstream flushes in the end are quite slow. Using a separate thread to flush every 30 seconds or so would probably help.
The reduction in creation/destruction of Time() by caching them is also important for Cairo users. As an example, here are the results for the first 10 seconds of dsr example without the previously mentioned optimizations. Most of the core module CPU time was spent creating Time() instances that would be destroyed right away.
P.S.: I need to re-run the Coz profiler to post the results that indicated these problems in the first place.