Referee's Report
Concurrency and Computation: Practice and Experience

Paper: Efficient Communication Using Message Prediction for Cluster of Multiprocessors
Authors: A. Afsahi, N. Dimopoulos

Referee Recommendation: 2. accepted provided changes are made as suggested

Referee Comments for Authors:

The paper describes several methods for predicting which message is going to be consumed next, by analysing the history of past messages and detecting patterns that are assumed to recur in the future. This particular aspect is certainly very interesting, although I am unsure whether it really needs such a lengthy report. It would be much more interesting if you could also combine it with the other two aspects you mention in your paper:
- deciding where and how this message is to be moved in the cache
- efficient cache-remapping and late binding mechanism

Your paper basically detects some communication patterns. It would be much more interesting if your analysis also led to some specific improvement. The paper is also full of details and contains many places that are very difficult to understand. I comment on these later.

Related work is good but misses some very important work at the compiler level that is directly related to the fundamental problem you are addressing, namely reducing the performance loss due to communication. Your paper states that performance is lost due to extra copies if the receive has not yet been invoked at the time when the send puts a message at the receiver site. There is a large body of compiler work that tries to avoid this by moving non-blocking receives and sends as far up in the code as possible and moving the blocking wait as far down in the code as possible. Doing so increases the chances that the corresponding receive has been invoked before the message arrives at the receiver. Moreover, through latency overlapping, communication costs can in the best case be eliminated altogether.
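
As a rough illustration of the point about posting receives early, the following toy model counts how many messages reach the receiver before a matching receive has been posted and therefore need an extra copy through a temporary buffer. It is only a sketch under simplifying assumptions: the names Event and count_extra_copies, the event kinds, and the timestamps are hypothetical, and it models neither a real message passing library nor the system in the paper.

```python
from dataclasses import dataclass


@dataclass
class Event:
    time: float  # hypothetical timestamp at the receiver
    kind: str    # "post_recv" (receive posted) or "arrive" (message arrives)
    tag: int     # message tag used for matching


def count_extra_copies(events):
    """Count messages that had to be buffered because no receive was posted yet."""
    posted = {}      # tag -> number of posted receives still waiting for a message
    unexpected = {}  # tag -> number of messages held in a temporary buffer
    extra_copies = 0
    # Process events in time order; on a tie, treat the receive as posted first.
    for ev in sorted(events, key=lambda e: (e.time, e.kind != "post_recv")):
        if ev.kind == "post_recv":
            if unexpected.get(ev.tag, 0) > 0:
                # The message is already buffered: copy it to the user buffer.
                unexpected[ev.tag] -= 1
                extra_copies += 1
            else:
                posted[ev.tag] = posted.get(ev.tag, 0) + 1
        elif ev.kind == "arrive":
            if posted.get(ev.tag, 0) > 0:
                # A receive is waiting: deliver directly, no extra copy.
                posted[ev.tag] -= 1
            else:
                unexpected[ev.tag] = unexpected.get(ev.tag, 0) + 1
    return extra_copies


# Receive posted late (after the message arrives) versus hoisted early.
late_receive = [Event(1.0, "arrive", 7), Event(2.0, "post_recv", 7)]
hoisted_receive = [Event(0.0, "post_recv", 7), Event(1.0, "arrive", 7)]

print(count_extra_copies(late_receive))
print(count_extra_copies(hoisted_receive))
```
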
These optimizations can be done at the program level by the compiler but are very hard to realize in systems close to the hardware or in the message passing library. I really believe that this work should be mentioned in your paper. Please include at least the work done by:

M. Gupta et al. A unified framework for optimizing communication in data parallel programs. IEEE TPDS, 7(7), July 96

T. Fahringer et al. Buffer-Safe and Cost-Driven Communication Optimizing. Journal of Parallel and Distributed Computing, Academic Press, 57(1), April 99

Now some more comments and questions about your paper:

You mention several times that it is unclear at the send site to which final receive buffer address a message has to be sent. In my opinion the receiver address is clear and unique in almost all cases. Only the memory reference to where the data is placed may be unclear if the recv has not yet been invoked. Please make it clear whether this is exactly what you mean. Sometimes I had the feeling that you claim that not even the receiver process is known to the sender, which I doubt for realistic message passing programs. Everything is possible, but your paper is not clear about what you are actually targeting with your work.

On page 10 you state that communication traces do not affect the communication patterns. This is correct, but they may still have an impact on the performance, because tracing may further delay a send and thus avoid extra message copies at the receive site.

Same page: the last 2 sentences are unclear. Why is this clear for the BT, SP and CG applications? The last sentence is also not clear. You just state the claim without explaining why. For someone who doesn't know the codes, it is not clear.

Fig. 5 on page 14: I don't understand why there is no curve for window sizes 0 to 1 or 2 for the SP code. Also, why are LRU and FIFO zero for window sizes between 0 and 40 for the BT code? The drawings in this figure are really not well explained.
You basically just describe what one sees, but the behavior is not explained.

Typo on page 15: replace "in he Tag" by "in the Tag".

Section 6.2: I did not understand whether, for the tagging predictor, you only look at a specific communication receive call and compare the pattern only for this call. This would mean that you are evaluating hits/misses for the same recv call but for different execution instances. You don't evaluate across different receive calls, right?

How accurate are the functions at the top of page 21? Please add an experiment that demonstrates their accuracy.