I would recommend publication, but I have two comments:

a) The authors propose a factorized model between CPU, L1 cache, and memory. Is this approach valid for all applications, or does its success depend on the special characteristics of matrix multiplication, which has an unusually large number of CPU operations per memory access? Some comment here would be useful.

b) The figures in the results section (Section 7) do not show well in black and white. They should be improved.