How to Use the Cache in Example I
The inner loop of this kernel performs 47 floating-point operations, 18 array reads, and 4 array writes.
- The reads come from two 9-point stencils (each point plus its eight 2D neighbors, including diagonals) centered at X(I,J) and Y(I,J); the writes are unit-stride stores to the independent arrays AA, DD, RX, and RY.
- The two 9-point stencil references should exhibit good temporal locality, provided we can hold three contiguous columns (J-1, J, and J+1) of both X and Y in the Scache simultaneously.
- In addition, we need to make sure that the writes to AA, DD, RX, and RY do not interfere with X and Y in the Scache.
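
The capacity condition in the bullets above can be checked with a quick back-of-the-envelope calculation. The following is a minimal sketch, not part of the original example: the function names, the 8-byte element size, and the 4 MiB cache size are all assumptions chosen for illustration. It only tests whether three columns of each stencil array fit in the cache; it does not model associativity, so it says nothing about whether the writes to AA, DD, RX, and RY actually conflict with X and Y.

```python
def stencil_working_set_bytes(n_rows, elem_bytes=8, n_arrays=2, n_columns=3):
    """Bytes needed to keep n_columns contiguous columns of each stencil array.

    n_rows     -- length of one column (the leading array dimension)
    elem_bytes -- size of one array element; 8 assumes double precision
    n_arrays   -- number of stencil arrays (X and Y in this example)
    n_columns  -- columns touched by a 9-point stencil (J-1, J, J+1)
    """
    return n_rows * elem_bytes * n_arrays * n_columns


def fits_in_cache(n_rows, cache_bytes, elem_bytes=8):
    """Capacity check only: True if the stencil working set fits in the cache."""
    return stencil_working_set_bytes(n_rows, elem_bytes) <= cache_bytes


# Reads per inner-loop iteration: two 9-point stencils.
READS_PER_ITERATION = 2 * 9

if __name__ == "__main__":
    cache_bytes = 4 * 1024 * 1024  # hypothetical 4 MiB cache, for illustration
    for n_rows in (1000, 200_000):
        need = stencil_working_set_bytes(n_rows)
        print(n_rows, need, fits_in_cache(n_rows, cache_bytes))
```

With these assumed values, a column length of 1000 needs 48,000 bytes for the six columns, which fits comfortably, while a column length of 200,000 needs 9,600,000 bytes and does not.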