Referee 1 ****************************************************************

Referee Recommendation: Accepted provided changes are made as follows:
1)	Improve description of the parallel implementation, particularly page 7,
to provide greater clarity.
2)	Fix grammatical errors and some notational errors and acronym definitions.
I am returning my copy of the paper with editorial corrections as a help to
the authors.
3)	Note that Figures 6 and 7 are interchanged, but correctly called out in
the text.  I don't see how Fig 4b is a rhombic dodecahedron. Maybe I'm missing
something?

Review 

This paper is very interesting and an important contribution to the literature
on parallel methods for Fluid Particle Models (FPM). However, it needs a great
deal of grammatical editing before I would recommend publication. I have
edited the text to provide my suggestions as to changes, but the authors
should check the journal's standards before resubmitting. I found two figures
mislabeled (Figures 6 and 7 are interchanged). I have double-checked the
equations against the Espanol reference (Ref. 17), and I don't see any
problems, but must confess that I do not know this field well enough to make
an independent check of the equations.  The authors do a generally good job of
describing the FPM (perhaps providing more than needed), before going into the
parallel implementation. I felt the descriptions in the parallel
implementation section, particularly page 7, were confusing and sometimes
redundant. Tightening up the comparison of FPM and MD would help the
"readability" of the paper. The use of lower case in the acronym for the
"mutual nearest neighborhood (mnn)" concept really confused me, as I thought
that mnn referred to lattice planes! The acronym should be capitalized as
"mutual nearest neighborhood (MNN)" to avoid confusion. ccNUMA is not
defined when it is first used (page 13), but is defined on page 15. It should
be defined when first used. If these changes were made, the paper would be much
more readable and would warrant publication. I would suggest that the authors
make these corrections and resubmit the paper for review before it is
published.

Referee 2 ****************************************************************

Review of paper Parallel Implementation of the Fluid Particle Model for
Simulating Complex Fluids in the Mesoscale by K. Boryczko, W. Dzwinel, and D.
Yuen, submitted to Concurrency, Practice and Experience, June 2001.

In this paper, the authors describe a parallel implementation of the fluid
particle model (FPM) to simulate fluids at the mesoscale. They describe the
physical modeling, compare it with other methods in its class, and explain how
to discretize and parallelize the algorithm. Results are presented on two
architectures (the SGI Origin and the IBM Power 3) for up to 32 processors. They
discuss the relative merits of the two architectures as they relate to the FPM.
Simulation results are shown to demonstrate the versatility of both the
mathematical model and the parallel numerical implementation.

The paper is pretty well written, with some grammatical errors that should be
corrected. Some are mentioned in the detailed comment section.


Detailed comments:
1)	p. 3, 2nd line. "working parallel" -> "working in parallel"
2)	p. 3, 2nd paragraph. "takes an advantage" -> "has an advantage"
3)	p. 4, eq. 5. Should   be replaced by  ?   with  ?
4)	p. 4, eq. 5. What are the question marks?
5)	p. 4. Why is   defined after eqs. 2-6? The temperature does not appear,
unless that is what the question marks are meant to refer to. Same for  .
6)	p. 5, 2nd paragraph. Which combination of the three forces defined on p. 4
constitutes the non-central force?
7)	p. 6, top paragraph. Why mention the fact that kinetic theory considers
zero pressure? On the previous page, a formula for pressure is given. So the
question (which is not clear from the text) is: is the iterative procedure
that is mentioned (used to match coefficients of FPM forces) used by the
authors? If not, why not, since the zero pressure would lead to bad matching?
8)	p. 6. Assumptions of (18) are partially retrieved from Ref. [17]. What is
the physical interpretation? Why make this assumption at all? Then one
postulates weight functions to be the same as DPD? The authors should give
some justification. Without any justification, one could choose anything.
9)	p. 6, eq. 18. What is the "dot" in  ?
10)	p. 6. What is the limit of   as the fluid becomes continuous? That relates
to computational cost in the limit of a continuous fluid.
11)	p. 7, 1st paragraph. The P subsystems overlap. This should be mentioned.
12)	p. 7, 1st paragraph. Does "node" refer to a node on a multiprocessor
machine? If yes, state the number of processors per node on both
architectures. If not, what does it mean?
13)	Partitioning into a mesh of subboxes in the x, y, z directions is more
efficient because it maximizes the computation-to-communication ratio.
14)	p. 7, bottom. When discussing cache misses, more evidence should be
given. We don't know the cache sizes of the architectures considered, nor the
method by which the cache misses are measured. Profiling tools should be used to verify
that the assertions made are correct. Also, other types of table arrangements
may lead to a lower number of cache misses.
15)	p. 8. Exactly how many variables are required in an MD code, and
how many in an FPM code? State a precise number. You do have the codes! Cache
reuse could be higher with the FPM (temporal locality) when there are more
variables per droplet, offsetting poorer spatial locality. Statements should
not be made this glibly.
16)	p. 8. Why are predictor-corrector methods time consuming? Memory I
understand.
17)	p. 8. "decided employing" -> "decided to employ"
18)	p. 9. I don't understand the necessity of using higher order for angular
momentum due to stability. What is the CFL required for stability if the order of
accuracy for angular momentum is not increased?
19)	p. 9. "box generating artifacts" -> "box generating numerical artifacts"
20)	p. 10. "code very clumsily" -> "code very clumsy"
21)	p. 10. Authors mention transition rules. What are they?
22)	p. 11. I find the entire discussion on clustering hard to understand. That
could be lack of knowledge on the part of this reviewer.
23)	p. 12. Fig. 6 in 2nd paragraph is really Fig. 7
24)	p. 12. R14000/500 CPUs or R14000/500 MHz?
25)	p. 12. "In Fig 7a we depict" -> "In Fig. 6, we depict"
26)	p. 12, last paragraph. When the computational box is increased in the xy
plane, does that mean the number of boxes increases proportionately? Authors
should explain this paragraph more precisely. ccNUMA architectures have the
supposed advantage that the memory latency between objects far apart is close
to that when objects are close together (SGI version at least). Some evidence
of the cost of interacting particles being far from each other in memory
should be mentioned.
27)	p. 13, paragraph 2. I don't understand the statement regarding
"communication delay between distant processor.." on an IBM SP. With the
current switches, the cost to transfer data between any two nodes is a
constant. If not, authors should elaborate.
28)	p. 13. Is the code an MPI code on both machines? The cost of MPI latency is
probably much higher than that of memory latency. I don't quite believe the
arguments in paragraph 3. Is it not true that maintaining cache coherency on a
ccNUMA machine is not as much of an issue for an MPI code where the user
controls the location of the data and the data communication?
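
As a rough illustration of comment 13, the following is a minimal sketch and
not the authors' code. It assumes a periodic box, short-range forces so that
communication scales with the face area of a subdomain, and a hypothetical
helper name (comp_to_comm_ratio). It compares the volume-to-surface ratio of
one subdomain for a slab decomposition and for a mesh of subboxes at the same
processor count.

```python
def comp_to_comm_ratio(box, procs_per_axis):
    """Volume of one subdomain divided by the area of its faces that
    border another subdomain (periodic box assumed)."""
    # Edge lengths of a single subdomain along x, y, z.
    lx, ly, lz = (b / p for b, p in zip(box, procs_per_axis))
    px, py, pz = procs_per_axis
    volume = lx * ly * lz  # proportional to computation
    surface = 0.0          # proportional to communication
    # An axis that is actually split contributes two exchange faces.
    if px > 1:
        surface += 2 * ly * lz
    if py > 1:
        surface += 2 * lx * lz
    if pz > 1:
        surface += 2 * lx * ly
    return volume / surface if surface else float("inf")


# Same processor count (64), two decompositions of a unit box.
slab = comp_to_comm_ratio((1.0, 1.0, 1.0), (64, 1, 1))
mesh = comp_to_comm_ratio((1.0, 1.0, 1.0), (4, 4, 4))
print(f"slab: {slab:.5f}  mesh: {mesh:.5f}")
```

Under these assumptions the mesh of subboxes gives the larger ratio once the
processor count exceeds a handful, which is the point of the comment.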


To conclude, I believe the paper should be published after the comments above
are addressed. Of most concern is the short discussion stating the effects of
the algorithm on cache misses, without providing additional corroborative
evidence (multiple resolutions, multiple processors, details of the cache
size, memory bandwidth of the machines). It seems that in a paper whose major
contribution is the parallel implementation of an algorithm, more details
about the architecture, and results from profiling tools should be provided.
The Origins have hardware counters that can be used to assess cache misses at
different resolutions, when different numbers of processors are used.


Referee 3 ****************************************************************

E: Referee Comments (For Author and Editor)

This paper describes a parallel implementation of the Fluid Particle Model and
presents its performance results. The paper is clear and easy to read. The
equations seem right, though I have not verified them. The authors are
encouraged to double check.

F: Presentation Changes

The paper needs work in the following areas:

1) Figure Captions - It would be useful to put the captions below the figures.
This way the reader will not have to flip pages to find the description.

2) Tables - Please reorder the tables such that they are in ascending order.

3) Fig. 6 (CPU and Speedup) - It is quite confusing to have so many graphs in
one figure. Perhaps separate figures would be better.

4) On page 12, section 3.5, "In Fig. 7a ...". There is no Fig. 7a. It should be
Fig. 6. I believe Fig. 6 and Fig. 7 are in the wrong order.

5) Some citations, i.e., 35-38 are used.