Subject: C467: VGDS: A Distributed Data Structure Framework for Scientific
     Computation
Date: Sun, 01 Oct 2000 16:34:42 -0400
From: Geoffrey Fox <fox@csit.fsu.edu>
Organization: Florida State University
To: Jan-Jan Wu <wuj@iis.sinica.edu.tw>
BCC: fox@csit.fsu.edu

To Jan Jan Wu

Dear Janet,

I append 3 referee report(s) on your paper
C467: VGDS: A Distributed Data Structure Framework for Scientific Computation

I would be happy to publish your paper if you addressed the changes
suggested by the referees. This looks non-trivial but doable quite quickly to me.
Please include a discussion of your changes and their answer to the
referees in your resubmittal. If this is persuasive, we can publish your paper
without further refereeing. If you have any question as to how hard to work
on a particular issue, feel free to contact me.

I thank you for your interest in Concurrency.Practice and Experience.
Please send all communication -- including the resubmission -- electronically
if possible using the address
fox@csit.fsu.edu

If you should need a "real address", please use:
Geoffrey Fox
Computational Science and Information Technology
Florida State University
400 Dirac Science Library
Tallahassee Florida 32306-4130
850-644-4587 but easiest is cell phone 3152546387

Referee One
-----------
        This is an interesting paper that is well-suited for
        publication in CP & E. The authors present some good
        background information and motivation for the VGDS
        framework, and make a strong case fpr libraries of
        data structures architecturally layered via the
        VGDS abstraction - in such a way that the modeling
        is conceptually clear, while making implementations
        practicable. The description of the framework itself
        is comprehensive enough to understand the architecture
        well. Similarly, the critical issues in implementation
        are identified in Section 3, followed by a concise
        description of their practical implementation. I would
        suggest a slightly more detailed discussion of the
        coherence issues on page 7. The case study in the
        following section helps to consolidate the framework;
        I would like to see discussions of other data structures
        if possible. The most attractive aspect of the project and
        the paper is that it includes results for several
        applications and interestingly, the results indicate
        a very high level of efficiency (90%). One question
        I had concerns scalability -- this issue should at
        least be discussed further or preferably, empirical
        results on scaling should be provided. Finally,
        the authors should indicate or comment on the availability
        of the software, since as a "practice and experience"
        work, it is very likely to be of interest and
        value to practitioners. If these minor suggestions
        are incorporated, the paper will become worthy of
        publication and I thus recommend acceptance.

Referee Two
-----------


        E: Referee Comments (For Author and Editor)

        1. The paper is poorly structured and it is difficult the read and
        understand.

        2. The English should be thoroughly checked, there are numerous typos
        and
           complicated sentences which require rephrasing.

        3. The results are poorly presented.

        F: Presentation Changes

        1. Introduction
        The lack of scalable distributed data structures in C++ is not the only
        reason for the
        slow acceptance of C++ by the HPC community, maybe the authors need to
        give a more balanced view.

        In the introduction the authors mention that the performance issue of
        distributed data structures
        needs to be addressed, however there is no evidence that this problem
        was seriously
        investigated. The measurements obtained for 4 workstations mentioned in
        section 5 provide
        little information about the performance and scalability of distributed
        data structures.

        2. VGS framework
        a.) Formatting Table 1, move the caption.

        b.) Section 2.3. Data Decomposition Classes
        The global data structure are... (change to is )

        c.) Section 2.3. Communication Classes

        Rephrase the following sentence:
        " Currently VGDS supports a message abstraction that flattens the
        attributes and contents of a
        data structure section to be communicated into a liner buffer on the
        sending side..."

        Explanation for incremental data redistribution required.
        " During the course of computation, if the VGDS data object detects that
        a remapping is necessary (e.g. for load
        balancing purpose), it invokes the remap method in Mapper, which in turn
        redistributes the data structure
        incrementally."

        3. Implementation issues

        a.) Section 3.1
        Move Figure 5 to the relevant paragraph in the text i.e. to page 8.

        The data distribution for regular arrays is missing from Figure 5,
        although
        the following sentence from page 8 refers to it.
        "Figure 5 shows the duplication mechanism for a regular array, an
        unstructured mesh, and an adaptive
        Barnes-Hut tree for N-body algorithms."

        b.) Section 3.2.
        Instead of "Optimisatin for Good Performance" use "Performance
        Optimisation"
        There is no need to split this section into four sub-sections.

        c.) Section 3.2.1.
        "Block distribution is oftenly..." change to often used.

        Typing error procerssors
        "Each ORB node is assigned a weight equal to the total number of
        operations performed in updating the states of
        elements over all its descendant procerssors (processors)."

        Typing error
        "At each steop (step) of the load-balancing process it is necessary to
        move bodies from the overloaded child to the
        non-overloaded child."

        d.) Section 3.2.3
        Typing error in "traveral".

        e.) Section 3.2.4.
        "The communication style in computation..."  change to communication
        pattern.

        Rephrase paragraph starting with "VGDS employs a randomized scheduling
        ..."

        4. Case Study:BH Tree

        a.) 4.3. Parallel Implementation
        What is self-contained?
        "To make the paper self-contained, we briefly describe our parallel
        implementation of the BH algorithm, upon which
        the BH-Tree library is built."

        Reformat text in Figure 8, line spacing between (b) and (c).

        b.) 4.4. The Tree Framework
        Typing error "frameowrk"
        "To eliminate duplicated programming investments in developing similar
        tree-based scientific codes, we have
        developed a VGDS tree frameowrk."

        c.) Section 4.4.1 Base tree layer
        Rephrase the paragraph starting with:
        " Tree reduction computes the data of a tree node according..."

        d.) Section 4.4.2 Barnes-Hut tree
        Rephrase the following sentence:
        "The BH_tree layer supports tree operations required ..."

        Rephrase the following sentence:
        "The computation class Interaction (Figure 12) goes through the
        essential data list and calls for functions to
        compute body-to-body and body-to-cluster interactions defined by the
        user of Interaction."

        e.) Section 4.4.4
        What is class 15, typo error in orthogonal.
        "The ORBPartitioner class 15 inherits the Mapper class and de_nes
        functions that are specialized for the orthogoal
        (orthogonal) recursive bisection partitioning method."

        Repetition from 3.2.4
        "The communication phases in tree codes and unstructured mesh codes..."

        Repetition from 3.2.4
        " VGDS employs a randomized protocol for all-to-some communication..."

        5. Experimental results

        Figure 14 is not referenced in this section. Table 2 is in Related work
        section.

        More detailed elaboration of the following statement is required:
        "The VGDS class libraries significantly reduced the code sizes (e.g. for
        each of the two
        tree codes, from over ten thousand lines down to a few hundred lines)
        and the development time
        (from over six months down to a few days) of these applications,
        compared with their
        message-passing C counterparts."

        Section 5 is very short and does not present enough experimental data to
        highlight the
        efficiency of distributed data structures. This is a core section of the
        paper and it
        requires significant rework.

        6. Related work

        This section needs rewriting. Figure 15 is not referenced, move Figure
        15 and Table 2 to a different section.

        Rephrase the following sentence:
        "In a related work by ourselves [6], we report abstractions of adaptive
        load balancing mechanisms and complex,
        many-to-many communications as C++ classes for supporting HPC
        challenging applications."

        Typing error in matrix
        "Instead of tackling one particular data structure such as arrays or
        matrics ..."

        Missing reference
        "This looks very much like the structure of the Nexus run-time system
        which provides support for CC++ [] where
        multiple threads are mapped onto a context (Vnode in POOMA) and multiple
        contexts are mapped onto a physical node."

        Clarify the following sentence
        "Indeed, our hand-written prototypes substantially outperform all the
        efforts mentioned, while allowing greater
        generality."

        7.Conclusion

        Can you provide some data to support the following statement
        "The penalty due to object orientation is about 10% compared with its
        C-only implementation."

        Typing error in reducing
        "We are currently investigating possible approaches to reduing such
        overhead."

Referee Three
-----------

General Comments
----------------

- The contribution of the paper seems to be a description
  of an object-oriented framework that integrates well-known
  technology from a number of areas.

- The way in which a user actually works with this framework
  and specifies an application is not completely clear.
  At which level will an algorithm be specified?
  Does the user have to fill in low-level details such as the
  data-to-processor mapping (Fig. 2) explicitly or is this routine
  automatically generated from other information such as the
  data distribution?

- The rather limited experiments described in the paper do not
  adequately address the issue of performance. Nor does the rest
  of the paper deal with this issue in a proper way. Just one
  isolated example: does the system provide support for an detection of
  invariant communication schedules in loops, either through
  compiler or runtime technology or via appropriate attributes?

- Related work in languages, compilers, and runtime systems is NOT
  adequately treated -- in particular, as far as performance issues
  are concerned.

- The English must be improved (this is not a major problem)

Specific Comments
-----------------

P1: The comment regarding HPF is not adequate. In particular,
HPF has *not* been accepted by the US user community, the
major reason being performance problems related to HPF-1
and poor quality of commercial compilers.
Fortran 95 has structures and pointers and can represent
arbitrary data structures;  HPF-2 deals with irregular
data structures and distributions, and irregular data accesses
via pointers or subscripted subscripts.

The goal of portability expressed in the last paragraph is difficult
to achieve in view of different architecture types (SMP, distributed
memory, clusters of SMPs). In fact the paper deals only with
distributed memory architectures explicitly.

P3: The "novel" algorithmic strategies mentioned in the second
paragraph are relevant but not at all new.

P4: The derivation of an unstructured mesh class from the Graph
class is not as simple as indicated. A number of sophisticated
libraries deal with such grids.

P5: It is not clear which data distributions are provided by the
framework and what kind of automatic support is associated with them.
Similarly for communication, it is unclear how logical send- and receive sets
are derived within the Mapper class.

P6: Similar to above: must the user specify a  remapping explicitly
(or is it generated automatically based on a set of known parameters)?
The description of how communication is handled in the system is unclear.
Is the user responsible for handling communication (this seems to be the case
from the description in 3.2.3) or are communication schedules derived automatically.
The VGDS systems does not seem to support communication schedule reuse.

P7: The "data coherence" scheme describes a simple variant of the
well-known overlap/shadow concept and seems to be restricted to
neighborhood communication. This is not general enough for the
kinds of problems discussed in the paper.

P8: Figure 5 does not show boundary duplication for a regular array as mentioned
in the text.  The algorithmic description presented in Figure 6 is very imprecise.
Again the important issue of whether a new communication schedule has to be
computed for an iteration (e.g. if a remapping occurred or data access patterns
changed) is not discussed at all.

P17: The experimental results are limited and unconvincing.
The authors mention that their implementation of the Barnes-Hut code with VGDS
outperforms all previous approaches described in the literature.
There is however no comparison to other implementations supporting this claim.

P17/19: Related work is not discussed adequately -- concerning
language-related work (HPF and C++ oriented), as well as compiler
and runtime technology, and work on graph and tree partitioning (e.g. METIS).