Commit 6f9a1551 authored by Steve Leak

Merge branch 'sleak/cori-app-perf-and-porting' into 'master'

Sleak/cori app perf and porting

See merge request !430
parents 2438ce75 1fe9762d
# Getting Started and Optimization Strategy
There are several important differences between the Cori KNL ("Knight's
Landing", or Xeon Phi) node architecture and the Xeon architecture used
by Cori Haswell nodes or by Edison. This page will walk you through the
high-level steps to prepare an application to perform well on Cori KNL.
## How Cori KNL Differs From Edison or Cori Haswell
Cori KNL is a "many-core" architecture, meaning that instead of a few
cores optimized for latency-sensitive code, Cori KNL nodes have many
(68) cores optimized for vectorized code. Some key differences are:
| Cori Intel Xeon Phi (KNL) | Cori Haswell (Xeon) | Edison (Xeon "Ivybridge") |
| --------------------------|---------------------|---------------------------|
| 68 physical cores on one socket | 16 physical cores on each of two sockets (32 total) | 12 physical cores on each of two sockets (24 total) |
| 272 virtual cores per node | 64 virtual cores per node | 48 virtual cores per node |
| 1.4 GHz | 2.3 GHz | 2.4 GHz |
| 8 double precision operations per cycle | 4 double precision operations per cycle | 4 double precision operations per cycle |
| 96 GB of DDR memory and 16 GB of MCDRAM high-bandwidth memory | 128 GB of DDR memory | 64 GB of DDR memory |
| ~450 GB/sec memory bandwidth (MCDRAM) | | ~100 GB/sec memory bandwidth |
| 512-bit-wide vector units | 256-bit-wide vector units | 256-bit-wide vector units |
## Important Aspects of an Application to Optimize for Cori KNL
There are three important areas of improvement to consider for Cori KNL:
1. Evaluating and improving your Vector Processing Unit (VPU) utilization and efficiency. As shown in the table above, the Cori KNL processors have vector units that are 8 double-precision values wide, meaning that if your code produces scalar rather than vector instructions, you miss out on a potential 8x speedup. Vectorization is described in more detail in [Vectorization](../vectorization.md).
2. Identifying and exposing more node-level [parallelism](../parallelism.md) in your application. An MPI+X programming approach is encouraged, where MPI provides a layer of inter-node communication and X is a deliberately chosen intra-node parallelization layer: X could again be MPI, or OpenMP, pthreads, PGAS, etc. A minimal MPI+OpenMP sketch follows this list.
3. Evaluating and optimizing for your [memory bandwidth and latency](../mem_bw.md) requirements. Many codes run at NERSC are limited not by CPU clock speed or vector width but by waiting for memory accesses. The memory hierarchy in Cori KNL nodes differs from that in Haswell nodes, so while memory-bandwidth optimizations will benefit both, different optimizations will benefit each architecture to a different degree.
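For illustration, here is a minimal sketch of the MPI+X approach with X = OpenMP. Details such as the thread-support level and the output format are illustrative assumptions; on Cori this would typically be compiled with the `ftn` compiler wrapper plus the compiler's OpenMP flag. It simply reports which rank and thread each piece of work lands on.
```fortran
program hybrid_hello
  use mpi
  use omp_lib
  implicit none
  integer :: ierr, provided, rank, nranks, tid, nthreads

  ! Request FUNNELED support: only the master thread makes MPI calls here.
  call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

  ! MPI ranks provide the inter-node layer; OpenMP threads provide the
  ! intra-node "X" layer of parallelism.
  !$omp parallel private(tid, nthreads)
  tid = omp_get_thread_num()
  nthreads = omp_get_num_threads()
  print '(4(a,i0))', 'rank ', rank, ' of ', nranks, ': thread ', tid, ' of ', nthreads
  !$omp end parallel

  call MPI_Finalize(ierr)
end program hybrid_hello
```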
# Memory Bandwidth
Consider the following loop:
!!! example
    ```fortran
    do i = 1, n
       do j = 1, m
          c = c + a(i) * b(j)
       end do
    end do
    ```
CPUs perform arithmetic by reading an item from each of two registers,
combining them in some way (e.g. by adding) and putting the result into a
third register. There are only a few registers in the CPU, so the above
loop is implemented something like this:
```
fetch c -> r1
(outer loop over i)
  fetch a(i) -> r2
  (inner loop over j)
    fetch b(j) -> r3
    mul r2, r3 -> r4
    add r1, r4 -> r1
  (repeat)
store r1 -> c
```
If `a`, `b` and `c` are `double precision` numbers, i.e. 8 bytes each, then
we must fetch 8 bytes (one element of `b`) for each multiply-and-add pair: we have an
*operational intensity* of 2/8 = 1/4 operations per byte.
If we [vectorize](vectorization.md) the loop, then with KNL AVX-512 we
can do 8 loop iterations simultaneously. But we then need to load 8 values of
`b`, so our operational intensity is still 16/64 = 1/4. Overall we will read
all of `b`, `n` times: for `2*m*n` operations we will load `8*m*n` bytes.
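As a minimal sketch of what the vectorized form looks like (assuming an OpenMP 4.0+ compiler; the `!$omp simd` directive only asserts that the reduction is safe to vectorize, and most compilers will auto-vectorize this simple loop anyway):
```fortran
do i = 1, n
   ! assert that the inner reduction over j is safe to vectorize
   !$omp simd reduction(+:c)
   do j = 1, m
      c = c + a(i) * b(j)
   end do
end do
```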
A KNL core can load 16 double precision values from its L1 cache per cycle,
but if `b` has more than 4096 elements (32 KB) it will not fit in L1, and those
`8*m*n` bytes will need to be fetched from the L2 cache or beyond. As the diagram below
illustrates, fetching from each step further out in the memory hierarchy requires
traversing narrower, longer and more-shared links, so the performance is
being limited not by the CPU but by the rate at which the memory system can
deliver data to the CPU.
<a name="knl-mem-pipes"></a>
![knl-mem-pipes](images/knl-mem-pipes.png)
The [roofline model](../programming/performance-debugging-tools/roofline.md)
is a useful tool for identifying whether performance is being limited by the
CPU, memory bandwidth or something else.
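As a rough worked example, using approximate Cori KNL figures (68 cores at 1.4 GHz, each capable of 8 double-precision operations per cycle, and ~450 GB/sec of MCDRAM bandwidth):
!!! example
    Peak compute is roughly `68 * 1.4e9 * 8 ≈ 760` GFLOP/s per node, while MCDRAM can
    deliver roughly 450 GB/sec, i.e. about 1.7 operations for every byte moved. The loop
    above performs only 1/4 of an operation per byte, so even with data streaming from
    MCDRAM it sits well below the bandwidth "roofline" and the cores spend most of their
    time waiting for data.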
If you identify that memory bandwidth is a limiting factor then you can modify
the code to reuse data in L1, for example with the following transform:
!!! example
    ```fortran
    do jblock = 1, m, block_size
       do i = 1, n
          do j = jblock, min(jblock + block_size - 1, m)
             c = c + a(i) * b(j)
          end do
       end do
    end do
    ```
Now, if we choose `block_size` so that a block of `b` fits in L1 cache, each subsequent
iteration of the `i` loop will again traverse the part of `b` that is held in L1. We
will still move `8*m*n` bytes over the core/L1 boundary, but only about `8*(m/block_size)*n`
bytes across the slower L1/L2 boundary.
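A self-contained sketch of this blocking, as a compilable program, is shown below. The specific `block_size` of 1024 doubles (8 KB) is an illustrative assumption that leaves room in a 32 KB L1 data cache for `a(i)` and other data; in a real code it should be tuned.
```fortran
program blocked_reduction
  implicit none
  integer, parameter :: n = 20000, m = 20000
  ! 1024 doubles = 8 KB: comfortably within a 32 KB L1 data cache
  integer, parameter :: block_size = 1024
  double precision :: a(n), b(m), c
  integer :: i, j, jblock

  call random_number(a)
  call random_number(b)
  c = 0.0d0

  ! each block of b stays in L1 while the i loop sweeps over all of a
  do jblock = 1, m, block_size
     do i = 1, n
        do j = jblock, min(jblock + block_size - 1, m)
           c = c + a(i) * b(j)
        end do
     end do
  end do

  print *, 'c =', c
end program blocked_reduction
```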
On KNL, jobs that can fit within, or make good use of, the 16 GB/node of high-bandwidth
MCDRAM will benefit from its much higher bandwidth compared to DDR. (On Cori, KNL nodes
are configured to use MCDRAM as a very large last-level cache.)
[Process and thread affinity](../jobs/affinity/index.md) is important on KNL primarily
because pairs of cores share a large L2 cache. On Haswell nodes the narrowest point in the
memory hierarchy, illustrated below, is the QPI link between the two sockets, so affinity
is important there to ensure that processes use the most-local memory.
<a name="haswell-mem-pipes"></a>
![haswell-mem-pipes](images/haswell-mem-pipes.png)
@@ -71,8 +71,10 @@ nav:
- Policy: data/policy.md
- Performance:
- Readiness: performance/readiness.md
- Getting started on KNL: performance/knl/getting-started.md
- Vectorization : performance/vectorization.md
- Parallelism : performance/parallelism.md
- Memory Bandwidth: performance/mem_bw.md
- Profiling : performance/profiling.md
- I/O:
- Overview : performance/io/index.md
@@ -186,6 +188,8 @@ nav:
- Cori System: systems/cori/index.md
- Interconnect : systems/cori/interconnect/index.md
- KNL Modes : systems/cori/knl_modes/index.md
- Application Porting and Performance:
- Getting started on KNL: performance/knl/getting-started.md
- Edison : systems/edison/index.md
- PDSF:
- PDSF System: systems/pdsf/index.md