Subject: final paper submission for CPE
Resent-Date: Sun, 07 Mar 1999 10:05:01 -0500
Resent-From: Geoffrey Fox <gcf@npac.syr.edu>
Resent-To: p_gcf@boss.npac.syr.edu
Date: Fri, 26 Feb 1999 18:47:52 +0800 (CST)
From: Jan-Jan Wu <wuj@iis.sinica.edu.tw>
To: gcf@npac.syr.edu
CC: wuj@iis.sinica.edu.tw

Dear Professor Fox,

Enclosed please find the final paper entitled "CRAFT: A Framework for F90/HPF Compiler Optimizations", co-authored by Jan-Jan Wu, Marina Chen, and James Cowie, which has been accepted for publication in Concurrency: Practice and Experience. We are very sorry that we have lost the paper number. The acceptance letter mentioned that minor revisions were required, but that once revised, the paper could be published without further reviewing. We have revised the manuscript following the reviewers' suggestions and have formatted the final paper following John Wiley & Sons' instructions. The following are a summary of what we have done for the revision and the uuencoded gzipped PostScript file of the final paper. Please let me know if you need any further information from me.

Sincerely yours,
Jan-Jan Wu

============ PART 1: Summary of Revision ==================

We have included a summary of contributions in the Conclusion section. We have also included much more motivation and detail about the optimization algorithms, especially the communication algebra associated with the XFORM-1 optimizations (Section 2). A broader survey of recent work, including the theses by Ghuloum and Roth, DIGITAL's HPF compiler, PGI's PGHPF compiler, and IBM's xlHPF compiler, is included (Section 5). In order to study the impact of the optimizations on today's machines, we have also hand-compiled some of the benchmark programs for the IBM SP-2 and an UltraSparc workstation cluster. Our experimental results demonstrate that our optimizations are also applicable to modern parallel machines (Section 4, pages 26-30).
The following is the list of revisions that have been included in the manuscript as suggested by the reviewers.

------------------

> Page 5-6: What guarantees that g_1 is invertible?

Most of the HPF alignment operators are reshape morphisms, which are essentially bijective functions defined over index domains. Bijective functions are invertible. For an operator f which is injective but not bijective, a reshape morphism f' can be derived from f by restricting the domain and range of f' to the image of f (page 6).

> Page 7:
> In example: "We need to reduce the length of \theta." What is the
> "length" of \theta?
>
> What is the justification that the amount of communication is
> determined by the number of operators, and how does this relate to
> the "owner computes" rule?
>
> How do you justify your order of simplifying communication
> expressions?
>
> It is extremely unclear what your communication algebras look
> like. To keep the paper self-contained, you should insert at least
> some representative rules of your algebra in this paper.

In this paper, a communication expression \theta is extracted according to the "owner computes" rule. \theta represents the amount of communication in a straightforward implementation, in which the communication is carried out by executing the operators in the expression one by one from right to left. For example, the expression \theta in Figure 5 is the composition of five operators (i.e., it has length five); a straightforward implementation would require two global address calculations and three data movements. Our goal is to reduce communication by reducing the number of operators (i.e., the length) of \theta. The reduction is guided by a set of algebraic rules that guarantee its soundness while satisfying the goal of reducing communication. Details of the communication algebra and its associated properties are collected in a separate paper (reference [65]).
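To make the idea of length reduction concrete, here is a minimal sketch in Python. The operator names and the two rewrite rules (identity elimination and fusion of adjacent shifts) are invented for illustration only; the actual rules are those of the communication algebra in reference [65].

```python
# Toy sketch: reduce the length of a communication expression theta,
# represented as a list of operator tuples, by repeated rewriting.
# The rules here are illustrative, not the paper's actual algebra.

def simplify(theta):
    """Apply rewrite rules until the operator count is stable."""
    changed = True
    while changed:
        changed = False
        out = []
        i = 0
        while i < len(theta):
            op = theta[i]
            nxt = theta[i + 1] if i + 1 < len(theta) else None
            if op == ("id",):
                # Rule 1: an identity operator moves no data; drop it.
                i += 1
                changed = True
            elif nxt is not None and op[0] == "shift" and nxt[0] == "shift":
                # Rule 2: fuse adjacent shifts into one (or none, if
                # they cancel), reducing two data movements to at most one.
                k = op[1] + nxt[1]
                if k != 0:
                    out.append(("shift", k))
                i += 2
                changed = True
            else:
                out.append(op)
                i += 1
        theta = out
    return theta

theta = [("shift", 2), ("shift", -2), ("id",), ("transpose",)]
print(simplify(theta))   # [('transpose',)]
```

The length of the expression drops from four operators to one, so the straightforward right-to-left execution performs one data movement instead of four.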
To make this paper self-contained, we have included a brief introduction to the communication algebra and a set of algebraic rules (pages 8-9).

> Page 17: In loop nest 6, why is the upper bound of the DO_T loop
> log p rather than log N? In loop nest 7, what is T in the upper
> bound of the DO_T loop in phase 2?

As the referee correctly pointed out, the upper bound of the DO_T loop in Loop Nest 6 should be log N, and the T in the upper bound of the DO_T loop in phase 2 should be log N.

> Page 21: Procedure increase-granularity is referenced but never
> defined.

It has been included in the revised manuscript (page 18).

> You should include a comparison of your work with Jerry Roth's
> recent PhD thesis from Rice, "Optimizing Fortran90D/HPF Programs
> for Distributed-Memory Computers".

A comparison with this work has been included in the Related Work section (page 31).

> There is an overall question about the timeliness of this work. The
> target of these optimizations is the CM/2 and the CM/5. The paper
> needs to make the case that these analyses and transformations are
> applicable to today's machines. I'm not saying that the compiler needs
> to be retargetted, but I'm suggesting at least a discussion of how
> today's platforms differ from these targets and how we should view
> these optimizations in light of these differences. Even better would
> be some timing results of some of these transformations (done by hand)
> on today's machines.

We agree with this referee's concern. In order to study the impact of the CRAFT optimizations on today's machines, we have hand-compiled some of the benchmark programs for the IBM SP-2 and an UltraSparc workstation cluster. Our experimental results demonstrate that both machines can also benefit from these optimizations. Please see pages 26-30 of the manuscript for a more detailed discussion.

> Section 3.5 addresses a really critical problem (how to organize
> this complex set of optimizations within the compiler).
> I think it
> deserves more attention. Maybe an example showing the complexity of
> the ordering decision. "...we have found the following linear ordering
> ... to be sufficient..." isn't very satisfying. (There is some
> discussion here but it doesn't feel sufficient.)

The focus of our current work is to investigate individual optimizing transformations. Our current strategies for deciding the ordering of transformations are admittedly somewhat ad hoc, and we have started to study this problem. In the revised paper, we have made it clear that this paper focuses on the discussion of individual optimizing transformations.

> In the conclusion is the statement "Although most of the XFORM-2
> optimizations are simple and familiar ... we feel that the major
> contribution of this work is that we have made an effort to identify
> appropriate contexts for applying these transformations and suggesting
> ways to compose them ..." One of the problems the paper suffers from
> is that it is not really clear what the main contribution is. It
> seems, at times, more like a snapshot/overview of what exists. If this
> is the main contribution, it would be better to organize the paper
> around this theme, and of course to make it clear in the abstract and
> the intro, and then to make clear how each section contributes to this claim.

The contribution of XFORM-2 is a collection of loop-level optimizing transformations, which consists of a set of new optimizations (e.g., {\em block-cyclic permutation} and {\em interleaved reduction}) and a set of refined techniques adopted from existing work (e.g., {\em processor-memory skewing} and {\em increasing granularity}). In the revised manuscript, we have tried to make this clear (pages 15, 16, 35).

> The first paragraph of section 2 makes XFORM-1 sound like it's mainly
> about interprocedural issues. But, as I understand it now, it operates
> on all movement. If this is true it should be made clearer in the
> first paragraph.
Yes, XFORM-1 works for both intra- and interprocedural data movement. We have tried to make this clearer.

> Section 2.3. The communication algebra makes sense and specifies
> possible alternate expressions. But, in general, expressions are small,
> so presumably there are only a few possible alternatives. Why are
> heuristics necessary? Why not simply find the best of the
> possibilities? It might be best to use heuristics but in this case it
> needs to be justified.

The complexity of the algebraic simplification depends on the depth of nested procedure calls, which may range from a small constant to a large number depending on the application program. For efficiency of the compiler, it is desirable to minimize the execution time of the algebraic simplification procedure. This is why we think a heuristic algorithm is appropriate.

> Section 2.4 "...owner computes ... the amount of communication is
> determined by the number of operators in a communication
> expression... a communication expression is simplified if its operator
> count is decreased." Hmmm. Well, the number of high-level communications
> is *limited* by the number of operators.
> But some operators are on aligned data and don't require communication.

A communication expression represents the amount of communication required in a straightforward implementation, in which the communication is carried out by executing the operators in the expression one by one. The goal of the communication algebra is to reduce communication in a systematic way: it minimizes the number of operators in a communication expression according to a set of algebraic rules (i.e., not arbitrarily). Operators that act on aligned data are detected during the simplification process, and their communication is reduced to the identity function (i.e., no communication is required).
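As a concrete illustration of the owner-computes rule discussed above, consider the assignment A(i) = B(i+1) under a simple block distribution. The sketch below is hypothetical (the distribution, sizes, and names are invented for illustration, not taken from the paper): the owner of A(i) performs the computation, so B(i+1) must be communicated exactly when its owner is a different processor.

```python
# Hypothetical illustration of the "owner computes" rule for the
# assignment A(i) = B(i+1), with N elements block-distributed over
# P processors. All parameters here are invented for illustration.

N, P = 16, 4
block = N // P

def owner(i):
    """Processor that owns element i under a block distribution."""
    return i // block

# Under owner-computes, processor owner(i) evaluates A(i) = B(i+1),
# so B(i+1) must be communicated whenever owner(i+1) != owner(i).
# For aligned references (e.g., A(i) = B(i)), the owners always
# coincide and no communication is needed at all.
remote = [i for i in range(N - 1) if owner(i + 1) != owner(i)]
print(remote)   # only the block-boundary elements: [3, 7, 11]
```

Only 3 of the 15 references cross a processor boundary; an aligned reference would cross none, which is why simplification reduces such operators to the identity.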
> Also, the amount of communication, or more importantly the cost of the
> communication, is not determined by the number of communications. This
> is an important distinction.

That is right. For instance, the cost of two shift operations is likely to be lower than that of one transpose operation. The set of algebraic rules is carefully designed so that only reducible operators get reduced; therefore, an expression that can be minimized to two shift operations can never be 'reduced' to one transpose operation. Details about the communication algebra can be found in reference [65].

> Section 3.2. In the subsection entitled "Adjusting Granularity of
> Pipelining": "the compiler will need to consider all possible
> permutations of parallel loop ordering in order to adjust the
> granularity of pipelining. However, we have found the following
> heuristic works well for many codes on several modern MPP systems."
> But the number of legal parallel loop orderings for a loop nest is
> typically very small. You need to defend the use of heuristics. And
> again the choice of this specific heuristic needs to be explained and
> defended.

The heuristic is based on the observation that the ratios of unit communication cost to unit computation cost on many modern parallel machines are around two orders of magnitude. This observation helps to prune the search space. The heuristic of first minimizing the granularity of pipelining, and then increasing the granularity by an appropriate amount via strip mining, works well for balancing parallelism overhead against communication overhead.

> Section 3.2 "technically the transformation is simple and familiar
> (loop skewing ...)..." Does this mean that the transformation
> presented here is identical to loop skewing? If so, then that should be
> stated clearly. If it is not, then the differences should be made
> clear.

We have modified this paragraph as follows: "This optimization is an application of the loop skewing technique in the context of partitioned iterative spatial loop nests."
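The strip-mining step of the granularity heuristic described above can be sketched as follows. This is a generic illustration of strip mining, not the CRAFT increase-granularity procedure itself; the strip size B would in practice be derived from the machine's communication/computation cost ratio, but here it is just a parameter.

```python
# Generic sketch of strip mining: a single loop is split into an
# outer loop over strips and an inner loop within each strip. In a
# pipelined computation, each strip would end with one coarser-grained
# communication instead of one message per iteration.

def strip_mined_sum(a, B):
    """Sum the elements of a, processed in strips of B iterations."""
    total = 0
    n = len(a)
    for lo in range(0, n, B):                  # outer loop over strips
        for i in range(lo, min(lo + B, n)):    # inner loop within a strip
            total += a[i]
        # <- one communication per strip would be issued here,
        #    trading pipeline latency for fewer, larger messages
    return total

print(strip_mined_sum(list(range(10)), 4))   # 45
```

The result is unchanged by the transformation; only the placement (and hence the granularity) of the per-strip communication points differs.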
> Section 3.4 "The transformations we describe here is more general and
> can parallelize reductions that are carried over outer loops in which
> loop-carried dependences may exist." How does this compare with:
> A. Fisher and A. Ghuloum. Parallelizing complex scans and reductions.
> In {\em Proceedings of SIGPLAN '94, Conference on Programming
> Languages Design and Implementation}, June 1994.

We have included a comparison with this existing work (page 26).

> Section 4.1 "a CRAFT SIMD compiler for the CM-5 ..." Is this a typo?
> The CM-5 is not a SIMD machine. (Also, sometimes it is written CM-5 and
> other times CM/5.)

The CM-5 supports two modes, MIMD mode and SIMD mode. In SIMD mode, a global network is used to synchronize the execution of instructions.

> Related work: There is a reference to DIGITAL's MPP
> compiler: "... supports a subset of HPF standard alignment and
> distribution specifications." This is very old. Digital doesn't
> currently have an MPP product. They do have a significant HPF compiler,
> however. It supports all the HPF alignment and distribution directives
> and performs significant communication optimizations. Some current
> references: ...

We have updated our bibliography. A few more compilers are surveyed, including DIGITAL HPF by DEC, PGHPF by the Portland Group Inc., xHPF by APR, and xlHPF by IBM (Section 5).

> When are users likely to use processor-memory skewing?
> It seems profitable only when using scalar code, but
> overall performance is much better when using vector units.

Theoretically, processor-memory skewing should be profitable for both scalar code and vector code. We cannot explain why processor-memory skewing has a negative impact on the CM-5 when using vector units (both the CRAFT-optimized version and the CMSSL library routine underperform the CRAFT-unoptimized version). Unfortunately, we do not have access to other machines with vector units, so we have no other sets of timing data for comparison.
However, our experimental results show that processor-memory skewing is profitable for scalar codes on a variety of machines, including the CM-5, the IBM SP-2, and the UltraSparc workstation cluster.

> The CM-5 is a dead architecture. The authors should spend
> more time describing how these optimizations are likely to
> be used on something like an SP-2 or a network of PCs.

We have hand-compiled some of the benchmark programs for the IBM SP-2 and an UltraSparc workstation cluster. The experimental results show that many of these optimizations are applicable to modern machines. Details are reported in Section 4, pages 26-30.

> Do a better job of providing related work citations for each
> optimization, so the reader knows which are truly original
> and which are refinements of other efforts.

Revised as suggested.

=============== Part 2: the final paper, in uuencoded gzipped format =====

-------------------------cut here-------------------------------------

Name: wu.ps.gz
Type: PostScript Document (application/postscript)
Encoding: x-uuencode