The inner loop of this kernel performs 47 floating-point operations, 18 array reads and 4 array writes.
-
The reads form two 9-point stencils (each point's 2D neighbors, including diagonals) centered at X(I,J) and Y(I,J), which accounts for the 18 reads (2 x 9). The writes are unit-stride stores to the independent arrays AA, DD, RX and RY, one store per array.
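
The reference counts above can be checked with a small sketch. It only enumerates the stencil offsets and tallies references per iteration; it does not reproduce the kernel's arithmetic. The names STENCIL_9, READ_ARRAYS, WRITE_ARRAYS and references_per_iteration are hypothetical and introduced here for illustration.

```python
# Offsets (di, dj) of a 9-point stencil: the center point plus its
# eight 2D neighbors, including the diagonals.
STENCIL_9 = [(di, dj) for dj in (-1, 0, 1) for di in (-1, 0, 1)]

READ_ARRAYS = ("X", "Y")                  # each read through a 9-point stencil
WRITE_ARRAYS = ("AA", "DD", "RX", "RY")   # each stored once at (I, J)


def references_per_iteration():
    """Return (reads, writes) issued by one inner-loop iteration."""
    reads = len(READ_ARRAYS) * len(STENCIL_9)
    writes = len(WRITE_ARRAYS)
    return reads, writes
```

Calling references_per_iteration() reproduces the 18 reads and 4 writes quoted for the inner loop.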
-
The two 9-point stencils should exhibit good temporal locality, provided the Scache can hold three contiguous columns of X and three of Y at the same time.
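
As a rough capacity check for this condition, the sketch below computes the bytes needed to keep three columns of each of X and Y resident. It assumes 8-byte array elements and takes the Scache capacity as a parameter, since neither value is stated here. It is a capacity bound only and ignores associativity and the write streams. The function names are hypothetical.

```python
def stencil_working_set_bytes(column_length, element_bytes=8,
                              arrays=2, columns=3):
    """Bytes needed to hold `columns` columns of each of `arrays` arrays.

    element_bytes=8 is an assumption (double-precision data).
    """
    return arrays * columns * column_length * element_bytes


def fits_in_scache(column_length, scache_bytes, element_bytes=8):
    """True if three columns of X and of Y fit in scache_bytes (capacity only)."""
    needed = stencil_working_set_bytes(column_length, element_bytes)
    return needed <= scache_bytes
```

For example, with a hypothetical column length of 513 elements the working set is 2 x 3 x 513 x 8 = 24624 bytes, so fits_in_scache(513, scache_bytes) holds for any Scache of at least that size.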
-
In addition, we need to make sure that the writes to AA, DD, RX, and RY do not interfere with X and Y in the Scache.
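
One way to reason about this interference is to look at which cache sets the concurrent reference streams map to. The sketch below is a simplified model, not a description of the actual Scache: it assumes a set-associative cache indexed directly by byte address, and the line size, set count, associativity, array size and padding are all hypothetical values chosen for illustration. It takes a snapshot of the addresses touched at one point of the sweep and reports any set that more streams map to than the cache has ways.

```python
NAMES = ("X", "Y", "AA", "DD", "RX", "RY")


def cache_set_index(byte_address, line_bytes, num_sets):
    """Set that byte_address maps to in a simple address-indexed cache."""
    return (byte_address // line_bytes) % num_sets


def stream_addresses(bases, column_bytes, offset_bytes):
    """Addresses touched at one point of the sweep: three adjacent
    columns of X and of Y, plus one element of each write array."""
    streams = {}
    for name in ("X", "Y"):
        for dj in (-1, 0, 1):
            streams[f"{name}(J{dj:+d})"] = (
                bases[name] + offset_bytes + dj * column_bytes)
    for name in ("AA", "DD", "RX", "RY"):
        streams[name] = bases[name] + offset_bytes
    return streams


def overcommitted_sets(streams, line_bytes, num_sets, ways):
    """Return {set_index: [stream names]} for sets shared by more
    streams than the cache has ways."""
    by_set = {}
    for name, address in streams.items():
        index = cache_set_index(address, line_bytes, num_sets)
        by_set.setdefault(index, []).append(name)
    return {s: names for s, names in by_set.items() if len(names) > ways}


# Hypothetical cache geometry and array shape (assumptions, not measured).
LINE_BYTES, NUM_SETS, WAYS = 64, 512, 3
COLUMN_BYTES = 512 * 8                 # 512 eight-byte elements per column
ARRAY_BYTES = 512 * COLUMN_BYTES       # 512 columns per array

# Arrays laid out back to back, with and without padding between them.
packed = {n: i * ARRAY_BYTES for i, n in enumerate(NAMES)}
padded = {n: i * (ARRAY_BYTES + 8 * LINE_BYTES) for i, n in enumerate(NAMES)}
```

Under these assumed values, the back-to-back layout places the center columns of X and Y and all four write streams in the same set, while padding each array by a few cache lines spreads the streams over distinct sets. The model ignores how the streams advance along a column, so it indicates where conflicts are likely rather than proving their absence.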