Sebastian Ohlmann requested to merge non_temporal_stores into develop Oct 10, 2019

Description

In the C implementation of the FD kernel (operate_inc.c), use non-temporal store intrinsics instead of normal store instructions (aligned case only). This avoids loading the output array into the cache, because the store instructions can write directly to memory. For a 25-point stencil, this should save about 1/(25+1), i.e. about 4%, of the cache lines touched during processing. Since the kernels are memory-bound, the kernel should speed up by roughly the same amount.

In my tests, execution was a few percent faster (about 3%).

To ensure correctness, a fence instruction is needed at the end of the kernel. It makes sure that any code running afterwards sees the fully written output of the FD operation.
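As an illustration of the pattern described above, here is a minimal sketch. It is not the code from operate_inc.c: the function name `stencil3_stream`, the 3-point stencil (instead of 25 points), and the use of SSE2 are assumptions made to keep the example short. The intrinsics `_mm_stream_pd` and `_mm_sfence` are the standard x86 ones. On targets without SSE2 the sketch falls back to plain stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_loadu_pd, _mm_stream_pd, _mm_sfence, ... */
#endif

/* Hypothetical 3-point stencil (illustration only):
 *   out[i] = w[0]*in[i-1] + w[1]*in[i] + w[2]*in[i+1]   for 1 <= i < n-1.
 * The boundary elements out[0] and out[n-1] are left untouched. */
static void stencil3_stream(const double *in, double *out, size_t n,
                            const double w[3])
{
  size_t i = 1;
  assert(in != out); /* the sketch assumes separate input and output arrays */

#if defined(__SSE2__)
  /* Scalar peel: _mm_stream_pd requires a 16-byte aligned destination. */
  while (i + 1 < n && ((uintptr_t)(out + i) & 15u) != 0) {
    out[i] = w[0] * in[i - 1] + w[1] * in[i] + w[2] * in[i + 1];
    i++;
  }

  const __m128d w0 = _mm_set1_pd(w[0]);
  const __m128d w1 = _mm_set1_pd(w[1]);
  const __m128d w2 = _mm_set1_pd(w[2]);

  /* Two output points per iteration; both must be interior points. */
  for (; i + 2 < n; i += 2) {
    __m128d acc = _mm_mul_pd(w0, _mm_loadu_pd(in + i - 1));
    acc = _mm_add_pd(acc, _mm_mul_pd(w1, _mm_loadu_pd(in + i)));
    acc = _mm_add_pd(acc, _mm_mul_pd(w2, _mm_loadu_pd(in + i + 1)));
    /* Non-temporal store: the output line is not first read into the cache. */
    _mm_stream_pd(out + i, acc);
  }

  /* Fence: order the streaming stores before any later stores, so code that
   * runs after the kernel observes the completed output. */
  _mm_sfence();
#endif

  /* Scalar remainder (and the whole loop when SSE2 is unavailable). */
  for (; i + 1 < n; i++) {
    out[i] = w[0] * in[i - 1] + w[1] * in[i] + w[2] * in[i + 1];
  }
}
```

The inputs are still read with ordinary loads, so they are cached and reused across neighbouring stencil points. Only the output, which is written once and not read again inside the kernel, bypasses the cache.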

This MR depends on !673 (merged).

Checklist

I have checked that my code follows the Octopus coding standards

Use non-temporal store instructions
