I recommend publication, but I have two comments.
a) The authors propose a factorized model between CPU, L1 cache,
   and memory. Is this approach valid for all applications, or
   does its success depend on the special characteristics of
   matrix multiplication, which has an unusually large number of
   CPU operations per memory access?
   Some comment on this point would be useful.
   
b) The figures in the results section (Section 7) do not reproduce well
   in black and white. They should be improved.
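
As a rough illustration of the concern in comment (a), the sketch below compares CPU operations per memory access for dense matrix multiplication and for a simple vector addition. It assumes an idealized count in which every operand element is moved between memory and cache exactly once. This is my simplification, not the authors' model, and the function names are hypothetical.

```python
def matmul_intensity(n: int) -> float:
    """Operations per memory access for C = A * B with n x n matrices.

    Idealized assumption: A and B are each read once and C is written once.
    """
    flops = 2 * n**3          # n multiplies and n adds per output element
    accesses = 3 * n**2       # read A, read B, write C
    return flops / accesses   # grows linearly with n (2n/3)


def vector_add_intensity(n: int) -> float:
    """Operations per memory access for z = x + y with length-n vectors."""
    flops = n                 # one add per element
    accesses = 3 * n          # read x, read y, write z
    return flops / accesses   # constant (1/3), independent of n


if __name__ == "__main__":
    for n in (100, 1000):
        print(n, matmul_intensity(n), vector_add_intensity(n))
```

Under these assumptions the ratio grows with n for matrix multiplication but stays fixed for the vector operation, which is why the authors should say whether the factorized model still holds for kernels of the second kind.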