Review of “CRAFT: A Framework for F90/HPF Compiler Optimizations” by Chen et al.

Referee 1

C. Summary Information

Paper Number: C345
Paper Title: CRAFT: A Framework for F90/HPF Compiler Optimizations
Author(s): Marina Chen, James Cowie, and Jan-Jan Wu

Referee Name: Charles Koelbel
Referee Address: CRPC, Mail Stop 132, Rice University, 6100 Main Street, Houston, TX 77005
Signature:

Referee Recommendations: (3) reject

D. Referee’s Comments (for editor only)

I have difficulty giving a firm answer on this paper. The work is important, and a nice summary of Jan-Jan’s thesis. But there’s not much new here, the organization and presentation could use some help, and the comparisons to related work are pretty dated. I feel like I’ve read this before, but that’s probably due to having seen the conference versions of various parts. (I looked at the last year of Scientific Programming, where I thought a CRAFT summary had appeared, but didn’t find anything. My mind must be going…) I’m rejecting it because it looks very hard to improve the presentation sufficiently, and it’s not clear that the final product will be interesting even then.

E. Referee’s Comments (for Author and Editor)

Paper Number: C345

The paper provides an overview of the authors’ work over the past several years on CRAFT, an optimization system for HPF and similar languages. The work is fundamentally sound; all the transformations are correct and seem feasible to implement. The benchmarks, although targeted at kernels rather than full programs, are encouraging. However, the overall analysis of the system leaves something to be desired. Most comparisons are against strawmen (e.g. measuring only the “transpose” version of ADI, when on many machines the pipelined version is faster) or against outdated systems (e.g. only 6 out of 66 papers in the bibliography are from 1995 or 1996, and two of those are self-references). As a result, I cannot enthusiastically recommend this paper, although I cannot roundly condemn it either. Some more specialized points of content follow.

Page 1: “…work on optimization for data movement and node code performance is mostly done in the context of specialised, hand-crafted code written in assembly code…” Not true; see the work by Lam and Anderson for many higher-level examples. Also, there are many more operations that require data redistribution than the ones listed here (to name just one, REDISTRIBUTE is a long-range effect).

Page 3: The description of HPF data mapping directives is neither accurate nor clear. HPF uses a two-stage mapping scheme (three stages, if the mapping to physical processor numbers is included). If one of these stages is not specified, then the compiler is forced to generate it internally, not (as suggested here) to “inherit” incremental information from another procedure. Given the authors’ experience with automatically-generated mappings, this is a strange constraint to put on the compiler. (A small directive sketch follows the page 7 comment below.)

Page 4: The communication algebra seems to be identical to Li and Chen (cited in the bibliography). If so, this should be noted here; if not, please explain what the novel features are.

Page 7: Please give the rules for the algebra, or a reference to them! These are the key to ensuring that XFORM-1 is correct. Also, the cost model for communication is extraordinarily crude: just counting the number of operators implies that a shift by one causes as much communication as a full transpose operation. Even Jingke Li’s thesis in 1990 used a much better approximation.
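As a rough illustration of this page 7 point (my numbers, not the paper’s): for an n x n array distributed (BLOCK,*) over P processors, a shift by one along the distributed dimension moves a single boundary row per processor, while a transpose moves essentially each processor’s whole slice in an all-to-all pattern:

    Vol_shift(1)   = O(n)       words per processor
    Vol_transpose  = O(n^2 / P) words per processor

An operator count prices both at 1.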
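And for the page 3 point, a minimal sketch of the two-stage scheme (illustrative only; the array, template, and processor names are mine, not the paper’s):

    REAL A(100,100), B(100,100)
    !HPF$ PROCESSORS PROCS(4)
    !HPF$ TEMPLATE T(100,100)
    !HPF$ ALIGN A(I,J) WITH T(I,J)          ! stage 1: align arrays to a template
    !HPF$ ALIGN B(I,J) WITH T(J,I)          ! stage 1: a transposed alignment
    !HPF$ DISTRIBUTE T(BLOCK,*) ONTO PROCS  ! stage 2: distribute the template

If either stage is left unspecified for some array, the compiler must synthesize it itself; nothing in the language says it is “inherited” from another procedure.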
Page 10: “The length of a communication expression only depends on the number of levels in nested procedure calls, which normally is a small constant (less than 4 in most application programs).” I have no doubt this statement is true for the programs tested, but it seems unlikely to be true for full-scale Fortran programs like NASTRAN. Also, is the call graph depth measured statically or dynamically? In other words, if the program is recursive, does the communication expression grow without bound? This is a real concern, since HPF does have recursion and it is a natural expression for modern numerical algorithms such as multigrid relaxation and wavelet transforms. (A sketch of the concern appears below.)

Page 11: “Even though in HPF parallel loops can be explicitly given by the user, data partitioning (specified by DISTRIBUTE directives) can expose further opportunities for increasing parallelism.” This is not exactly true; the parallelism comes from the operations on the data, not the data layout. (For example, ADI performs one pass of sequential sweeps over a distributed dimension, as would a naive version of 2-D FFT.)

Page 15: The explanation of BLOCK and CYCLIC distributions in butterfly communication patterns is backwards. BLOCK distributions are good for small offsets and die on large ones, while CYCLIC has the advantage on large offsets. (A worked example appears below.)

Page 16: The interleaved reduction optimization is interesting, but seems to be rediscovering the blocked matrix operations that linear algebra researchers have been using for a couple of decades. How does this idea compare with Lam et al.’s work on choosing block sizes, or Kennedy and Carr’s work on automatic blocking for cache?

Page 19: The examples of dependence are confused. There are no backward dependences in the examples (indeed, a backward dependence would be one of the anti-dependences eliminated in an earlier stage of the algorithm).

Page 20: Where did k in procedure interleaved-reduction come from?

Page 21: The list of meta-transformations refers to two procedures (boundary-separate and increase-granularity) that do not appear in the paper.

Page 22: The description of the CRAFT compiler is very confusing. You describe a SemRep-to-SemRep transformation system, with hand-compilation of SemRep to CM-5 code. But which CM-5 code: assembler? C calling CMMD? Nodal CM-Fortran? Whatever your back-end translator is, please tell us the compiler/assembler version number. Also, the introduction promises a CRAFT compiler for the CM-2, but it does not appear here. (Not a big loss, but you should correct the intro…)

Page 29: “The Fortran D compiler [32, 60] currently only handle [sic] a small subset of HPF’s data layouts…” Note that the papers listed are over 4 years old. There has been some work done in the meantime.

Page 30: “By formulating data movement using linear transformations, optimization for non-linear alignments, such as CSHIFT and replication, and data redistributions are not possible.” This is why the Fortran 90D group used pattern matching for those cases.

Page 30: “The Vienna Fortran compiler [11, 66] currently only supports arbitrary rectilinear block distributions.” That may have been true 3 years ago, but the compiler has since improved.

Page 30: “The ADAPT systme also simplifies data movement problem by restricting itself to an universal communication model. [sic]” Come again? A universal model is restricted? In any case, what does this difference imply for comparing ADAPT to CRAFT?
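The sketch promised under page 10 (hypothetical code, not from the paper): a recursive multigrid V-cycle in Fortran 90. Its call depth is controlled by a runtime argument, so any communication expression whose length tracks the call nesting has no static bound.

    ! Hypothetical illustration; the smoothing steps are trivial stand-ins.
    RECURSIVE SUBROUTINE VCYCLE(U, LEVEL)
      REAL, DIMENSION(:) :: U
      INTEGER :: LEVEL
      U = 0.5 * U                                     ! stand-in for a pre-smoothing sweep
      IF (LEVEL > 1) CALL VCYCLE(U(::2), LEVEL - 1)   ! recurse on the coarser grid
      U = 0.5 * U                                     ! stand-in for a post-smoothing sweep
    END SUBROUTINE VCYCLE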
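And the worked example promised under page 15 (my own, not from the paper): count, for an offset d, how many butterfly-style pairs (i, i+d) have both endpoints on the same processor. With n = 16 elements on P = 4 processors, offset 1 is local for 12 of 15 pairs under BLOCK but for none under CYCLIC; offset 8 is local for none under BLOCK but for all 8 under CYCLIC.

    ! Illustrative sketch only; the parameters are mine, not the paper's.
    PROGRAM BUTTERFLY_LOCALITY
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 16, P = 4, B = N / P  ! elements, processors, block size
      INTEGER, PARAMETER :: OFFSETS(2) = (/ 1, 8 /)   ! small offset vs. large (butterfly) offset
      INTEGER :: I, K, D, NBLOCK, NCYCLIC

      DO K = 1, 2
         D = OFFSETS(K)
         NBLOCK = 0
         NCYCLIC = 0
         DO I = 0, N - D - 1
            IF (I / B == (I + D) / B) NBLOCK = NBLOCK + 1         ! same block => local
            IF (MOD(I, P) == MOD(I + D, P)) NCYCLIC = NCYCLIC + 1 ! same residue => local
         END DO
         PRINT *, 'offset', D, ': BLOCK', NBLOCK, 'local, CYCLIC', NCYCLIC, 'local of', N - D, 'pairs'
      END DO
    END PROGRAM BUTTERFLY_LOCALITY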
Page 30: With all the commercial systems, I would think that the Portland Group HPF compiler would be mentioned. NAS is now quoting benchmark numbers from it that are within 50% of optimized MPI benchmarks.

F. Presentation Changes

Paper Number: C345

Reference Changes/Additions:

Although the references are generally dated, a number of older research projects were also ignored. Pingali and Rogers described pipelining (closely related, if not identical, to processor-memory skewing) in papers leading up to Rogers’ 1990 thesis. Sadayappan and Ramanujam tiled loops for communication, producing wavefront patterns not dissimilar to the skewing here. Tseng’s thesis also has material on pipelining loops. Sadayappan et al. have looked at tensor-product formulations of communication generation that can apparently express the block-cyclic permutation, although it is unclear whether these can be automated in a practical system. Kennedy and von Hanxleden produced the first dataflow communication placement algorithm, which has since been extended by Saltz et al. and by Kennedy and Sethi. See above for other work that should be mentioned.

Other Comments:

Check the labels on figures and loop nests again; many appear incorrect in the text. Do a thorough grammar check on the document. How do the phases in Figure 11 relate to those in Figure 1? I had a lot of trouble getting intuition about many parts of the paper: how the communication algebra worked, what order the optimized interleaved reduction loop performed operations in, why the block-cyclic permutation requires P&V transformations. I don’t know what to advise in order to improve this. It almost seems that the authors are too close to the work, and need to step back to see what is really important.