Subject: C467: VGDS: A Distributed Data Structure Framework for Scientific Computation Date: Sun, 01 Oct 2000 16:34:42 -0400 From: Geoffrey Fox Organization: Florida State University To: Jan-Jan Wu BCC: fox@csit.fsu.edu To Jan Jan Wu Dear Janet, I append 3 referee report(s) on your paper C467: VGDS: A Distributed Data Structure Framework for Scientific Computation I would be happy to publish your paper if you addressed the changes suggested by the referees. This looks non-trivial but doable quite quickly to me. Please include a discussion of your changes and their answer to the referees in your resubmittal. If this is persuasive, we can publish your paper without further refereeing. If you have any question as to how hard to work on a particular issue, feel free to contact me. I thank you for your interest in Concurrency.Practice and Experience. Please send all communication -- including the resubmission -- electronically if possible using the address fox@csit.fsu.edu If you should need a "real address", please use: Geoffrey Fox Computational Science and Information Technology Florida State University 400 Dirac Science Library Tallahassee Florida 32306-4130 850-644-4587 but easiest is cell phone 3152546387 Referee One ----------- This is an interesting paper that is well-suited for publication in CP & E. The authors present some good background information and motivation for the VGDS framework, and make a strong case fpr libraries of data structures architecturally layered via the VGDS abstraction - in such a way that the modeling is conceptually clear, while making implementations practicable. The description of the framework itself is comprehensive enough to understand the architecture well. Similarly, the critical issues in implementation are identified in Section 3, followed by a concise description of their practical implementation. I would suggest a slightly more detailed discussion of the coherence issues on page 7. The case study in the following section helps to consolidate the framework; I would like to see discussions of other data structures if possible. The most attractive aspect of the project and the paper is that it includes results for several applications and interestingly, the results indicate a very high level of efficiency (90%). One question I had concerns scalability -- this issue should at least be discussed further or preferably, empirical results on scaling should be provided. Finally, the authors should indicate or comment on the availability of the software, since as a "practice and experience" work, it is very likely to be of interest and value to practitioners. If these minor suggestions are incorporated, the paper will become worthy of publication and I thus recommend acceptance. Referee Two ----------- E: Referee Comments (For Author and Editor) 1. The paper is poorly structured and it is difficult the read and understand. 2. The English should be thoroughly checked, there are numerous typos and complicated sentences which require rephrasing. 3. The results are poorly presented. F: Presentation Changes 1. Introduction The lack of scalable distributed data structures in C++ is not the only reason for the slow acceptance of C++ by the HPC community, maybe the authors need to give a more balanced view. In the introduction the authors mention that the performance issue of distributed data structures needs to be addressed, however there is no evidence that this problem was seriously investigated. The measurements obtained for 4 workstations mentioned in section 5 provide little information about the performance and scalability of distributed data structures. 2. VGS framework a.) Formatting Table 1, move the caption. b.) Section 2.3. Data Decomposition Classes The global data structure are... (change to is ) c.) Section 2.3. Communication Classes Rephrase the following sentence: " Currently VGDS supports a message abstraction that flattens the attributes and contents of a data structure section to be communicated into a liner buffer on the sending side..." Explanation for incremental data redistribution required. " During the course of computation, if the VGDS data object detects that a remapping is necessary (e.g. for load balancing purpose), it invokes the remap method in Mapper, which in turn redistributes the data structure incrementally." 3. Implementation issues a.) Section 3.1 Move Figure 5 to the relevant paragraph in the text i.e. to page 8. The data distribution for regular arrays is missing from Figure 5, although the following sentence from page 8 refers to it. "Figure 5 shows the duplication mechanism for a regular array, an unstructured mesh, and an adaptive Barnes-Hut tree for N-body algorithms." b.) Section 3.2. Instead of "Optimisatin for Good Performance" use "Performance Optimisation" There is no need to split this section into four sub-sections. c.) Section 3.2.1. "Block distribution is oftenly..." change to often used. Typing error procerssors "Each ORB node is assigned a weight equal to the total number of operations performed in updating the states of elements over all its descendant procerssors (processors)." Typing error "At each steop (step) of the load-balancing process it is necessary to move bodies from the overloaded child to the non-overloaded child." d.) Section 3.2.3 Typing error in "traveral". e.) Section 3.2.4. "The communication style in computation..." change to communication pattern. Rephrase paragraph starting with "VGDS employs a randomized scheduling ..." 4. Case Study:BH Tree a.) 4.3. Parallel Implementation What is self-contained? "To make the paper self-contained, we briefly describe our parallel implementation of the BH algorithm, upon which the BH-Tree library is built." Reformat text in Figure 8, line spacing between (b) and (c). b.) 4.4. The Tree Framework Typing error "frameowrk" "To eliminate duplicated programming investments in developing similar tree-based scientific codes, we have developed a VGDS tree frameowrk." c.) Section 4.4.1 Base tree layer Rephrase the paragraph starting with: " Tree reduction computes the data of a tree node according..." d.) Section 4.4.2 Barnes-Hut tree Rephrase the following sentence: "The BH_tree layer supports tree operations required ..." Rephrase the following sentence: "The computation class Interaction (Figure 12) goes through the essential data list and calls for functions to compute body-to-body and body-to-cluster interactions defined by the user of Interaction." e.) Section 4.4.4 What is class 15, typo error in orthogonal. "The ORBPartitioner class 15 inherits the Mapper class and de_nes functions that are specialized for the orthogoal (orthogonal) recursive bisection partitioning method." Repetition from 3.2.4 "The communication phases in tree codes and unstructured mesh codes..." Repetition from 3.2.4 " VGDS employs a randomized protocol for all-to-some communication..." 5. Experimental results Figure 14 is not referenced in this section. Table 2 is in Related work section. More detailed elaboration of the following statement is required: "The VGDS class libraries significantly reduced the code sizes (e.g. for each of the two tree codes, from over ten thousand lines down to a few hundred lines) and the development time (from over six months down to a few days) of these applications, compared with their message-passing C counterparts." Section 5 is very short and does not present enough experimental data to highlight the efficiency of distributed data structures. This is a core section of the paper and it requires significant rework. 6. Related work This section needs rewriting. Figure 15 is not referenced, move Figure 15 and Table 2 to a different section. Rephrase the following sentence: "In a related work by ourselves [6], we report abstractions of adaptive load balancing mechanisms and complex, many-to-many communications as C++ classes for supporting HPC challenging applications." Typing error in matrix "Instead of tackling one particular data structure such as arrays or matrics ..." Missing reference "This looks very much like the structure of the Nexus run-time system which provides support for CC++ [] where multiple threads are mapped onto a context (Vnode in POOMA) and multiple contexts are mapped onto a physical node." Clarify the following sentence "Indeed, our hand-written prototypes substantially outperform all the efforts mentioned, while allowing greater generality." 7.Conclusion Can you provide some data to support the following statement "The penalty due to object orientation is about 10% compared with its C-only implementation." Typing error in reducing "We are currently investigating possible approaches to reduing such overhead." Referee Three ----------- General Comments ---------------- - The contribution of the paper seems to be a description of an object-oriented framework that integrates well-known technology from a number of areas. - The way in which a user actually works with this framework and specifies an application is not completely clear. At which level will an algorithm be specified? Does the user have to fill in low-level details such as the data-to-processor mapping (Fig. 2) explicitly or is this routine automatically generated from other information such as the data distribution? - The rather limited experiments described in the paper do not adequately address the issue of performance. Nor does the rest of the paper deal with this issue in a proper way. Just one isolated example: does the system provide support for an detection of invariant communication schedules in loops, either through compiler or runtime technology or via appropriate attributes? - Related work in languages, compilers, and runtime systems is NOT adequately treated -- in particular, as far as performance issues are concerned. - The English must be improved (this is not a major problem) Specific Comments ----------------- P1: The comment regarding HPF is not adequate. In particular, HPF has *not* been accepted by the US user community, the major reason being performance problems related to HPF-1 and poor quality of commercial compilers. Fortran 95 has structures and pointers and can represent arbitrary data structures; HPF-2 deals with irregular data structures and distributions, and irregular data accesses via pointers or subscripted subscripts. The goal of portability expressed in the last paragraph is difficult to achieve in view of different architecture types (SMP, distributed memory, clusters of SMPs). In fact the paper deals only with distributed memory architectures explicitly. P3: The "novel" algorithmic strategies mentioned in the second paragraph are relevant but not at all new. P4: The derivation of an unstructured mesh class from the Graph class is not as simple as indicated. A number of sophisticated libraries deal with such grids. P5: It is not clear which data distributions are provided by the framework and what kind of automatic support is associated with them. Similarly for communication, it is unclear how logical send- and receive sets are derived within the Mapper class. P6: Similar to above: must the user specify a remapping explicitly (or is it generated automatically based on a set of known parameters)? The description of how communication is handled in the system is unclear. Is the user responsible for handling communication (this seems to be the case from the description in 3.2.3) or are communication schedules derived automatically. The VGDS systems does not seem to support communication schedule reuse. P7: The "data coherence" scheme describes a simple variant of the well-known overlap/shadow concept and seems to be restricted to neighborhood communication. This is not general enough for the kinds of problems discussed in the paper. P8: Figure 5 does not show boundary duplication for a regular array as mentioned in the text. The algorithmic description presented in Figure 6 is very imprecise. Again the important issue of whether a new communication schedule has to be computed for an iteration (e.g. if a remapping occurred or data access patterns changed) is not discussed at all. P17: The experimental results are limited and unconvincing. The authors mention that their implementation of the Barnes-Hut code with VGDS outperforms all previous approaches described in the literature. There is however no comparison to other implementations supporting this claim. P17/19: Related work is not discussed adequately -- concerning language-related work (HPF and C++ oriented), as well as compiler and runtime technology, and work on graph and tree partitioning (e.g. METIS).