Referee 1 ****************************************************************

Referee Recommendation: Accepted provided changes are made as follows:

1) Improve the description of the parallel implementation, particularly page 7, to provide greater clarity.
2) Fix grammatical errors, some notational errors, and acronym definitions. I am returning my copy of the paper with editorial corrections as a help to the authors.
3) Note that Figures 6 and 7 are interchanged, but correctly called out in the text. I don't see how Fig. 4b is a rhombic dodecahedron. Maybe I'm missing something?

Review

This paper is very interesting and an important contribution to the literature on parallel methods for Fluid Particle Models (FPM). However, it needs a great deal of grammatical editing before I would recommend publication. I have edited the text to show my suggested changes, but the authors should check the journal's standards before resubmitting. I found two figures mislabeled (Figures 6 and 7 are interchanged). I have double-checked the equations against the Espanol reference (Ref. 17), and I don't see any problems, but I must confess that I do not know this field well enough to make an independent check of the equations.

The authors do a generally good job of describing the FPM (perhaps providing more than needed) before going into the parallel implementation. I felt the descriptions in the parallel implementation section, particularly page 7, were confusing and sometimes redundant. Tightening up the comparison of FPM and MD would help the "readability" of the paper.

The use of lower case in the acronym for the "mutual nearest neighborhood (mnn)" concept confused me, as I thought that mnn referred to lattice planes! The acronym should be capitalized as "mutual nearest neighborhood (MNN)" to avoid confusion. ccNUMA is not defined when it is first used (page 13), but is defined on page 15. It should be defined when first used.
If these changes were made, the paper would be much more readable and would warrant publication. I would suggest that the authors make these corrections and resubmit the paper for review before it is published.

Referee 2 ****************************************************************

Review of paper Parallel Implementation of the Fluid Particle Model for Simulating Complex Fluids in the Mesoscale by K. Boryczko, W. Dzwinel, and D. Yuen, submitted to Concurrency, Practice and Experience, June 2001.

In this paper, the authors describe a parallel implementation of the fluid particle model (FPM) to simulate fluids at the mesoscale. They describe the physical modeling, compare it with other methods in its class, and explain how to discretize and parallelize the algorithm. Results are presented on two architectures (the SGI Origin and the IBM Power 3) with up to 32 processors. They discuss the relative merits of the two architectures as they relate to the FPM. Simulation results are shown to demonstrate the versatility of both the mathematical model and the parallel numerical implementation.

The paper is pretty well written, with some grammatical errors that should be corrected. Some are mentioned in the detailed comment section.

Detailed comments:

1) p. 3, 2nd line. "working parallel" → "working in parallel"
2) p. 3, 2nd paragraph. "takes an advantage" → "has an advantage"
3) p. 4, eq. 5. Should be replaced by ? with ?
4) p. 4, eq. 5. What are the question marks?
5) p. 4. Why is defined after eqs. 2-6? The temperature does not appear, unless that is what the question marks are meant to refer to. Same for .
6) p. 5, 2nd paragraph. Which combination of the three forces defined on p. 4 constitutes the non-central force?
7) p. 6, top paragraph. Why mention the fact that kinetic theory considers zero pressure? On the previous page, a formula for pressure is given.
So the question is (and it is not clear from the text): is the iterative procedure that is mentioned (used to match coefficients of FPM forces) used by the authors? If not, why not, since the zero pressure would lead to bad matching?
8) p. 6. Assumptions of (18) are partially retrieved from Ref. [17]. What is their physical interpretation? Why make these assumptions at all? Then one postulates weight functions to be the same as in DPD? The authors should give some justification. Without any justification, one could choose anything.
9) p. 6, eq. 18. What is the "dot" in ?
10) p. 6. What is the limit of as the fluid becomes continuous? That relates to computational cost in the limit of a continuous fluid.
11) p. 7, 1st paragraph. The P subsystems overlap. This should be mentioned.
12) p. 7, 1st paragraph. Does "node" refer to a node on a multiprocessor machine? If yes, state the number of processors per node on both architectures. If not, what does it mean?
13) Partitioning into a mesh of subboxes in the x, y, and z directions is more efficient because it maximizes the computation-to-communication ratio.
14) p. 7, bottom. When discussing cache misses, more evidence should be given. We don't know the cache sizes of the architectures considered, nor the method by which the misses are measured. Profiling tools should be used to verify that the assertions made are correct. Also, other types of table arrangements may lead to a lower number of cache misses.
15) p. 8. Exactly how many variables are required in an MD code, and how many in an FPM code? State a precise number. You do have the codes! Cache reuse could be higher with the FPM (temporal locality) when there are more variables per droplet, offsetting poorer spatial locality. Statements should not be made this glibly.
16) p. 8. Why are predictor-corrector methods time consuming? Memory I understand.
17) p. 8. "decided employing" → "decided to employ"
18) p. 9.
I don't understand the necessity of using a higher order for angular momentum due to stability. What is the CFL required for stability if the order of accuracy for angular momentum is not increased?
19) p. 9. "box generating artifacts" → "box generating numerical artifacts"
20) p. 10. "code very clumsily" → "code very clumsy"
21) p. 10. The authors mention transition rules. What are they?
22) p. 11. I find the entire discussion on clustering hard to understand. That could be lack of knowledge on the part of this reviewer.
23) p. 12. Fig. 6 in the 2nd paragraph is really Fig. 7.
24) p. 12. R14000/500 CPUs or R14000/500 MHz?
25) p. 12. "In Fig 7a we depict" → "In Fig. 6, we depict"
26) p. 12, last paragraph. When the computational box is increased in the xy plane, does that mean the number of boxes increases proportionately? The authors should explain this paragraph more precisely. ccNUMA architectures have the supposed advantage that the memory latency between objects far apart is close to that when objects are close together (SGI version at least). Some evidence of the cost of interacting particles being far from each other in memory should be mentioned.
27) p. 13, paragraph 2. I don't understand the statement regarding "communication delay between distant processor.." on an IBM SP. With the current switches, the cost to transfer data between any two nodes is a constant. If not, the authors should elaborate.
28) p. 13. Is the code an MPI code on both machines? The cost of MPI latency is probably much higher than that of memory latency. I don't quite believe the arguments in paragraph 3. Is it not true that maintaining cache coherency on a ccNUMA machine is not as much of an issue for an MPI code, where the user controls the location of the data and the data communication?

To conclude, I believe the paper should be published after the comments above are addressed.
Of most concern is the short discussion stating the effects of the algorithm on cache misses, without providing additional corroborative evidence (multiple resolutions, multiple processors, details of the cache size, memory bandwidth of the machines). It seems that in a paper whose major contribution is the parallel implementation of an algorithm, more details about the architecture and results from profiling tools should be provided. The Origins have hardware counters that can be used to assess cache misses at different resolutions, when different numbers of processors are used.

Referee 3 ****************************************************************

E: Referee Comments (For Author and Editor)

This paper describes a parallel implementation of the Fluid Particle Model and presents its performance results. The paper is clear and easy to read. The equations seem right, though I have not verified them. The authors are encouraged to double-check.

F: Presentation Changes

The paper needs work in:

1) Figure captions - It would be useful to put the captions below the figures. This way the reader will not have to flip pages to find the description.
2) Tables - Please reorder the tables so that they are in ascending order.
3) Fig. 6 (CPU and Speedup) - It is quite confusing to have so many graphs on one figure. Perhaps separate figures would be better.
4) On page 12, section 3.5, "In Fig. 7a ...". There is no Fig. 7a. It should be Fig. 6. I believe Fig. 6 and Fig. 7 are in the wrong order.
5) Some citations, i.e., 35-38 are used.