Massively multicore + shared memory doesn’t scale
We discussed 1) why implicit hardware cache coherency doesn’t scale to massively multicore, and 2) how hardware cache coherency could hypothetically be replaced with scalable, software-driven coherency if shared data is immutable and shared only via explicit messages. We’re referring to the shared-memory variant of the MIMD computing hardware model.
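As a minimal sketch of that software-driven model (a hypothetical illustration in Rust, not from the discussion above): the shared data is built once and never mutated, each worker gets a cheap reference to it, and results flow back only through explicit messages, so no coherency protocol is needed to keep copies in sync.

```rust
use std::sync::{mpsc, Arc};
use std::thread;

fn main() {
    // Immutable shared data: once built, no thread mutates it, so no
    // hardware coherency traffic is needed to keep cached copies consistent.
    let table: Arc<Vec<u64>> = Arc::new((0..1_000_000).collect());

    let (tx, rx) = mpsc::channel();

    for worker in 0..4usize {
        let table = Arc::clone(&table); // cheap pointer copy; data stays read-only
        let tx = tx.clone();
        thread::spawn(move || {
            // Each worker reads its slice of the immutable table and
            // reports its result only via an explicit message.
            let chunk = table.len() / 4;
            let sum: u64 = table[worker * chunk..(worker + 1) * chunk].iter().sum();
            tx.send((worker, sum)).unwrap();
        });
    }
    drop(tx); // close the channel so the receive loop below terminates

    for (worker, sum) in rx {
        println!("worker {worker}: partial sum = {sum}");
    }
}
```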
Additionally, the following charts seem to show that shared memory (even immutable shared memory) doesn’t scale to massively multicore, because of the power consumption required to move so much bandwidth around. The AMD 2990WX isn’t proportionally as performant as the 2950X on tasks which require appreciable memory bandwidth, because the 2990WX connects only half of its cores directly to the DDR memory bus. The EPYC 7601 connects all of its cores to the memory bus, but its Infinity Fabric core interconnects consume the majority of the power.
Analogous to how Internet connectivity is much slower than memory bus connectivity, core interconnect bandwidth (and latency) won’t scale proportionally with the increase in cores. So software has to be written to minimize unnecessary communication between cores, so that computing hardware can be designed for the distributed-memory variant of MIMD. Just as with the design of the client-server paradigm, multicore software must be designed to minimize distant (i.e. non-local, not cached) memory resource sharing by keeping needed data close to the core that will work with it, as sketched below. For example, power consumption is ~6X greater (2 vs. 11 pJ/b) for inter-chip versus on-chip Infinity Fabric interconnects, a ratio that will worsen because AMD isn’t shrinking the inter-chip Infinity Fabric I/O from 14nm to 7nm when it shrinks the chiplets to 7nm (presumably because the off-chip I/O consumes so much more power than the 7nm chiplet logic that the 7nm feature size can’t drive it without duplicated circuitry, which would waste 7nm die area).
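Here is a minimal sketch of that keep-data-local design, assuming a hypothetical sharded lookup service: each worker thread exclusively owns one shard of the data, and other threads send it request messages rather than pulling its cache lines across the interconnect.

```rust
use std::sync::mpsc;
use std::thread;

const SHARDS: usize = 4;

fn main() {
    let mut request_txs = Vec::new();

    // One worker per shard: each worker exclusively owns the keys where
    // key % SHARDS == shard_id, so its data stays hot in that core's cache.
    for shard_id in 0..SHARDS {
        let (tx, rx) = mpsc::channel::<(u64, mpsc::Sender<u64>)>();
        request_txs.push(tx);
        thread::spawn(move || {
            for (key, reply) in rx {
                debug_assert_eq!(key as usize % SHARDS, shard_id);
                let value = key * 2; // stand-in for a lookup in locally owned data
                let _ = reply.send(value);
            }
        });
    }

    // Route the request to the worker that owns the key, instead of
    // letting every core touch (and thus cache and invalidate) all data.
    let key: u64 = 42;
    let (reply_tx, reply_rx) = mpsc::channel();
    request_txs[key as usize % SHARDS].send((key, reply_tx)).unwrap();
    println!("value for key {key}: {}", reply_rx.recv().unwrap());
}
```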
For example, mathematically: if we have 2 cores that don’t share the same local cache but do share the same global memory, the probability that any one¹ of their local caches will contain the requested portion of that global memory is lower than if each core only accesses half of the global memory and each request is directed to the core which addresses the half of global memory that can fulfill it. As the number of cores increases, the difference in probabilities increases, and thus the interconnect bandwidth increases, because the caches have to be filled by communication from another cache.
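To make that concrete, here is a sketch under simplifying assumptions not stated above: requests address global memory uniformly at random, there are N cores, and each core’s cache holds a fraction c of global memory.

```latex
% Assumptions (illustrative): uniformly random requests, N cores,
% each core's cache holds a fraction c of global memory.
P_{\text{hit}}^{\text{shared}} \approx c
\qquad \text{vs.} \qquad
P_{\text{hit}}^{\text{partitioned}} \approx \min(Nc,\, 1)
```

The gap min(Nc, 1) - c widens as N grows, so without partitioning an ever-larger share of requests must be filled over the interconnect.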
1. Given that each request will only be sent to one of the cores.
