IBM (Marc Snir) considers C = A*B where all are 64 by 64 matrices with triple DO loop to execute C[i][j] = C[i][j] + A[i][k] * B[k][j] running on RS6000 model 590
|
Initial disappointment:
-
IBM ESSL Library runs at 253 megaflops
-
JDK1.1 plus JIT runs at 3.8 megaflops -- off by a factor of 65
|
Runtime checks (array indices in bounds, null pointer checks) could essentially be removed by better compiler
-
removing checks gives you 33.3 megaflops (only a factor of 8)
-
C and Java with same rules give SAME performance
|
Using rectangular arrays (not vector of vectors) gives 44 megaflops (off by a factor of 6)
|
Using Hardware fused multiply add gives 64 megaflops (factor of 4)
|
Remaining factor requires use of associativity to block computation
|