Referee's Report
Concurrency and Computation: Practice and Experience

Paper: Efficient Communication Using Message Prediction for Cluster
	of Multiprocessors
Authors: A. Afsahi, N. Dimopoulos

Referee Recommendation:

2. accepted provided changes are made as suggested

Referee Comments for Authors:

The paper describes several methods for predicting which message is going to
be consumed next, by analysing the history of past messages and by
detecting patterns which are assumed to recur in the future.
This particular aspect is certainly very interesting, although I am unsure
whether it really needs such a lengthy report. It would be much
more interesting if you could also combine it with the other two aspects
you mention in your paper:

	- deciding where and how this message is to be moved in the cache
	- efficient cache-remapping and late binding mechanism

Your paper basically detects some communication patterns. It would
be much more interesting if your analysis also led to some specific
improvement.

Also, the paper is full of details and contains many places that are very
difficult to understand. I comment on these later.

Related work is good but misses some very important work at the compiler
level, which is directly related to the fundamental problem you are
addressing, namely reducing the performance loss due to communication.
Your paper states that performance is lost due to extra copies if the
receive has not yet been invoked at the time when the send puts a message
at the receiver site. There is a large body of compiler work that tries
to avoid this by moving non-blocking receives and sends as far up in
the code as possible and moving blocking waits as far down in the code
as possible. By doing so, the chances are increased that the corresponding
receive has been invoked before the message arrives at the receiver. Moreover,
through latency overlapping, communication costs can in the best case be
eliminated altogether. These optimizations can be done at the program level by
the compiler but are very hard to realize in systems close to the hardware
or in the message passing library. I really believe that this work should
be mentioned in your paper.
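
To make the point concrete, here is a minimal sketch in Python. It is a toy
model of my own and is not taken from your paper or from the work cited below;
the function name and all timestamps are assumptions chosen for illustration.
It assumes that a message needs an extra copy exactly when it arrives before
its matching receive has been posted, so hoisting the receive removes the copy:

```python
def count_extra_copies(post_times, arrival_times):
    """Count messages that arrive before their matching receive is posted.

    Toy model (an assumption for illustration): message i arrives at
    arrival_times[i] and its receive is posted at post_times[i]. A message
    that arrives first must be buffered and copied later.
    """
    if len(post_times) != len(arrival_times):
        raise ValueError("each message needs exactly one matching receive")
    return sum(1 for post, arrival in zip(post_times, arrival_times)
               if arrival < post)


# Hypothetical timestamps for three messages.
arrivals = [5, 12, 20]
late_posts = [8, 10, 25]      # receives left where the programmer wrote them
hoisted_posts = [0, 0, 0]     # non-blocking receives moved to the top

copies_late = count_extra_copies(late_posts, arrivals)
copies_hoisted = count_extra_copies(hoisted_posts, arrivals)
```

In this model the late placement buffers two of the three messages, while the
hoisted placement buffers none. A real evaluation would of course have to use
measured posting and arrival times.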

Please include at least the work done by

M. Gupta et al. A unified framework for optimizing communication in
	data parallel programs. IEEE TPDS, 7(7), July 1996.

T. Fahringer et al. Buffer-Safe and Cost-Driven Communication Optimization.
	Journal of Parallel and Distributed Computing,
	Academic Press, 57(1), April 1999.

Now some more comments and questions about your paper:

You mention several times that it is unclear at the send site to which
final receive buffer address a message has to be sent. In my opinion the receiver
address is clear and unique in almost all cases. Only the memory reference
to where the data is placed may be unclear if the recv has not yet been invoked.
Please make it clear whether this is exactly what you mean. Sometimes I had
the feeling that you claim that not even the receiver process is known to the
sender, which I doubt for realistic message passing programs. Everything is
possible, but your paper is not clear about what you are actually targeting
with your work.

On page 10 you state that communication tracing does not affect the communication
patterns. This is correct, but tracing may still have an impact on performance
because it may further delay a send and thus avoid extra message copies at the
receive site.

Same page: the last two sentences are unclear. Why is this clear for the BT, SP
and CG applications? In the last sentence you just state a claim without
explaining why. To someone who doesn't know the codes, it is not clear.

Fig. 5 on page 14: I don't understand why there is no curve for window sizes from
0 to 1 or 2 for the SP code. Also, why are LRU and FIFO zero for window sizes
between 0 and 40 for the BT code? The plots in this figure are really not well
explained. You basically just describe what one sees, but the behavior is not
explained.

Typo on page 15: replace "in he Tag" by "in the Tag".

Section 6.2: I did not understand whether, for the tagging predictor, you look
only at one specific receive call and compare the pattern for only this call.
This would mean that you are evaluating hits/misses for the same recv call but
for different execution instances. You don't evaluate across different receive
calls, right?
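
To show why the distinction matters, here is a minimal sketch in Python. It
uses a toy last-value predictor of my own, not your tagging predictor; the
function name, the call-site labels and the trace are all hypothetical. It
computes the hit ratio once with the history kept per receive call and once
with a single history shared across all receive calls:

```python
def hit_ratio(trace, per_call_site):
    """Hit ratio of a toy last-value predictor over (call_site, sender) pairs.

    The predictor guesses that the next sender equals the previous sender
    seen under the same key. The key is the receive call site when
    per_call_site is True, and a single shared key otherwise.
    """
    last_seen = {}
    hits = 0
    for call_site, sender in trace:
        key = call_site if per_call_site else None
        if last_seen.get(key) == sender:
            hits += 1
        last_seen[key] = sender
    return hits / len(trace) if trace else 0.0


# Hypothetical trace: two receive calls, each always fed by the same sender.
trace = [("recv_A", 1), ("recv_B", 2)] * 4

ratio_per_site = hit_ratio(trace, per_call_site=True)
ratio_global = hit_ratio(trace, per_call_site=False)
```

On this trace the per-call history misses only the first instance of each
call, while the shared history never hits. The reported hit ratios therefore
cannot be interpreted unless the paper states which of the two is evaluated.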

How accurate are the functions at the top of page 21? Please add an experiment
that demonstrates their accuracy.