Review of “CRAFT: A Framework for F90/HPF Compiler Optimizations” by Chen et al.

Referee 1

C. Summary Information

Paper Number: C345
Paper Title: CRAFT: A Framework for F90/HPF Compiler Optimizations
Author(s): Marina Chen, James Cowie, and Jan-Jan Wu

Referee Name: Charles Koelbel
Referee Address: CRPC, Mail Stop 132, Rice University, 6100 Main Street, Houston, TX 77005
Signature:

Referee Recommendations: (3) reject

D. Referee’s Comments (for editor only)

I have difficulty giving a firm answer on this paper. The work is important, and a nice summary of Jan-Jan’s thesis. But there’s not much new here, the organization and presentation could use some help, and the comparisons to related work are pretty dated. I feel like I’ve read this before, but that’s probably due to having seen the conference versions of various parts. (I looked at the last year of Scientific Programming, where I thought a CRAFT summary had appeared, but didn’t find anything. My mind must be going…) I’m rejecting it because it looks very hard to improve the presentation sufficiently, and it’s not clear that the final product will be interesting even then.

E. Referee’s Comments (for Author and Editor)

Paper Number: C345

The paper provides an overview of the authors’ work over the past several years on CRAFT, an optimization system for HPF and similar languages. The work is fundamentally sound; all the transformations are correct and seem feasible to implement. The benchmarks, although targeted at kernels rather than full programs, are encouraging. However, the overall analysis of the system leaves something to be desired. Most comparisons are against strawmen (e.g. measuring only the “transpose” version of ADI, when on many machines the pipelined version is faster) or against outdated systems (e.g. only 6 out of 66 papers in the bibliography are from 1995 or 1996, and two of those are self-references). As a result, I cannot enthusiastically recommend this paper, although I cannot roundly condemn it either. Some more specialized points of content follow.

Page 1: “…work on optimization for data movement and node code performance is mostly done in the context of specialised, hand-crafted code written in assembly code…” Not true; see the work by Lam and Anderson for many higher-level examples. Also, there are many more operations that require data redistribution than the ones listed here (to name just one, REDISTRIBUTE is a long-range effect).

Page 3: The description of HPF data mapping directives is neither accurate nor clear. HPF uses a two-stage mapping scheme (three stages, if the mapping to physical processor numbers is included). If one of these stages is not specified, then the compiler is forced to generate it internally, not (as suggested here) to “inherit” incremental information from another procedure. Given the authors’ experience with automatically-generated mappings, this is a strange constraint to put on the compiler. (A small directive sketch follows the page 7 comment below.)

Page 4: The communication algebra seems to be identical to Li and Chen (cited in the bibliography). If so, this should be noted here; if not, please explain what the novel features are.

Page 7: Please give the rules for the algebra, or a reference to them! These are the key to ensuring that XFORM-1 is correct. Also, the cost model for communication is extraordinarily crude: just counting the number of operators implies that a shift by one causes as much communication as a full transpose operation. Even Jingke Li’s thesis in 1990 used a much better approximation.
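As a rough illustration of this page 7 point (my numbers, not the paper’s): for an n x n array distributed (BLOCK,*) over P processors, a shift by one along the distributed dimension moves a single boundary row per processor, while a transpose moves essentially each processor’s whole slice in an all-to-all pattern:

    Vol_shift(1)   = O(n)       words per processor
    Vol_transpose  = O(n^2 / P) words per processor

An operator count prices both at 1.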
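And for the page 3 point, a minimal sketch of the two-stage scheme (illustrative only; the array, template, and processor names are mine, not the paper’s):

    REAL A(100,100), B(100,100)
    !HPF$ PROCESSORS PROCS(4)
    !HPF$ TEMPLATE T(100,100)
    !HPF$ ALIGN A(I,J) WITH T(I,J)          ! stage 1: align arrays to a template
    !HPF$ ALIGN B(I,J) WITH T(J,I)          ! stage 1: a transposed alignment
    !HPF$ DISTRIBUTE T(BLOCK,*) ONTO PROCS  ! stage 2: distribute the template

If either stage is left unspecified for some array, the compiler must synthesize it itself; nothing in the language says it is “inherited” from another procedure.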
Page 10: “The length of a communication expression only depends on the number of levels in nested procedure calls, which normally is a small constant (less than 4 in most application programs).” I have no doubt this statement is true for the programs tested, but it seems unlikely to be true for full-scale Fortran programs like NASTRAN. Also, is the call graph depth measured statically or dynamically? In other words, if the program is recursive, does the communication expression grow without bound? This is a real concern, since HPF does have recursion and it is a natural expression for modern numerical algorithms such as multigrid relaxation and wavelet transforms. (A sketch of the concern appears below.)

Page 11: “Even though in HPF parallel loops can be explicitly given by the user, data partitioning (specified by DISTRIBUTE directives) can expose further opportunities for increasing parallelism.” This is not exactly true; the parallelism comes from the operations on the data, not the data layout. (For example, ADI performs one pass of sequential sweeps over a distributed dimension, as would a naive version of 2-D FFT.)

Page 15: The explanation of BLOCK and CYCLIC distributions in butterfly communication patterns is backwards. BLOCK distributions are good for small offsets and die on large ones, while CYCLIC has the advantage on large offsets. (A worked example appears below.)

Page 16: The interleaved reduction optimization is interesting, but seems to be rediscovering the blocked matrix operations that linear algebra researchers have been using for a couple of decades. How does this idea compare with Lam et al.’s work on choosing block sizes, or Kennedy and Carr’s work on automatic blocking for cache?

Page 19: The examples of dependence are confused. There are no backward dependences in the examples (indeed, a backward dependence would be one of the anti-dependences eliminated in an earlier stage of the algorithm).

Page 20: Where did k in procedure interleaved-reduction come from?

Page 21: The list of meta-transformations refers to two procedures (boundary-separate and increase-granularity) that do not appear in the paper.

Page 22: The description of the CRAFT compiler is very confusing. You describe a SemRep-to-SemRep transformation system, with hand-compilation of SemRep to CM-5 code. But which CM-5 code: assembler? C calling CMMD? Nodal CM-Fortran? Whatever your back-end translator is, please tell us the compiler/assembler version number. Also, the introduction promises a CRAFT compiler for the CM-2, but it does not appear here. (Not a big loss, but you should correct the intro…)

Page 29: “The Fortran D compiler [32, 60] currently only handle [sic] a small subset of HPF’s data layouts…” Note that the papers listed are over 4 years old. There has been some work done in the meantime.

Page 30: “By formulating data movement using linear transformations, optimization for non-linear alignments, such as CSHIFT and replication, and data redistributions are not possible.” This is why the Fortran 90D group used pattern matching for those cases.

Page 30: “The Vienna Fortran compiler [11, 66] currently only supports arbitrary rectilinear block distributions.” That may have been true 3 years ago, but the compiler has since improved.

Page 30: “The ADAPT systme also simplifies data movement problem by restricting itself to an universal communication model. [sic]” Come again? A universal model is restricted? In any case, what does this difference imply for comparing ADAPT to CRAFT?
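The sketch promised under page 10 (hypothetical code, not from the paper): a recursive multigrid V-cycle in Fortran 90. Its call depth is controlled by a runtime argument, so any communication expression whose length tracks the call nesting has no static bound.

    ! Hypothetical illustration; the smoothing steps are trivial stand-ins.
    RECURSIVE SUBROUTINE VCYCLE(U, LEVEL)
      REAL, DIMENSION(:) :: U
      INTEGER :: LEVEL
      U = 0.5 * U                                     ! stand-in for a pre-smoothing sweep
      IF (LEVEL > 1) CALL VCYCLE(U(::2), LEVEL - 1)   ! recurse on the coarser grid
      U = 0.5 * U                                     ! stand-in for a post-smoothing sweep
    END SUBROUTINE VCYCLE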
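And the worked example promised under page 15 (my own, not from the paper): count, for an offset d, how many butterfly-style pairs (i, i+d) have both endpoints on the same processor. With n = 16 elements on P = 4 processors, offset 1 is local for 12 of 15 pairs under BLOCK but for none under CYCLIC; offset 8 is local for none under BLOCK but for all 8 under CYCLIC.

    ! Illustrative sketch only; the parameters are mine, not the paper's.
    PROGRAM BUTTERFLY_LOCALITY
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 16, P = 4, B = N / P  ! elements, processors, block size
      INTEGER, PARAMETER :: OFFSETS(2) = (/ 1, 8 /)   ! small offset vs. large (butterfly) offset
      INTEGER :: I, K, D, NBLOCK, NCYCLIC

      DO K = 1, 2
         D = OFFSETS(K)
         NBLOCK = 0
         NCYCLIC = 0
         DO I = 0, N - D - 1
            IF (I / B == (I + D) / B) NBLOCK = NBLOCK + 1         ! same block => local
            IF (MOD(I, P) == MOD(I + D, P)) NCYCLIC = NCYCLIC + 1 ! same residue => local
         END DO
         PRINT *, 'offset', D, ': BLOCK', NBLOCK, 'local, CYCLIC', NCYCLIC, 'local of', N - D, 'pairs'
      END DO
    END PROGRAM BUTTERFLY_LOCALITY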
Page 30: With all the commercial systems, I would think that the Portland Group HPF compiler would be mentioned. NAS is now quoting benchmark numbers from it that are within 50% of optimized MPI benchmarks.

F. Presentation Changes

Paper Number: C345

Reference Changes/Additions:

Although the references are generally dated, a number of older research projects were also ignored. Pingali and Rogers described pipelining (closely related, if not identical, to processor-memory skewing) in papers leading up to Rogers’ 1990 thesis. Sadayappan and Ramanujam tiled loops for communication, producing wavefront patterns not dissimilar to the skewing here. Tseng’s thesis also has material on pipelining loops. Sadayappan et al. have looked at tensor-product formulations of communication generation that can apparently express the block-cyclic permutation, although it is unclear whether these can be automated in a practical system. Kennedy and von Hanxleden produced the first dataflow communication placement algorithm, which has since been extended by Saltz et al. and by Kennedy and Sethi. See above for other work that should be mentioned.

Other Comments:

Check the labels on figures and loop nests again; many appear incorrect in the text. Do a thorough grammar check on the document. How do the phases in Figure 11 relate to those in Figure 1? I had a lot of trouble getting intuition about many parts of the paper: how the communication algebra worked, what order the optimized interleaved reduction loop performed operations in, why the block-cyclic permutation requires P&V transformations. I don’t know what to advise in order to improve this. It almost seems that the authors are too close to the work, and need to step back to see what is really important.