The reviewer found himself disagreeing with many of the authors' claims and their review of prior work. From a conceptual or theoretical point of view the authors' thesis is well founded. It is their translation of that framework into practice that I find lacking or misleading. To put this another way: vendors, past and present, use either the same techniques or alternatives to them in producing BLAS DGEMM. I believe that if the article is published without revision, it will mislead and/or misrepresent this subject matter to the SISC readership. On a more positive note, their work on reducing the complexity of applying recursive Morton ordering to the iterative algorithm is novel and of interest. Also, their improvements in extending the applicability and efficiency of Morton ordering to arbitrary-sized matrices are novel and of interest. In the rest of this report I will detail why I believe the above general remarks to be true. Note that numbered references refer to the authors' bibliography, while lettered references refer to references supplied by the reviewer.

Abstract: The first sentence is NOT true on some processors, e.g., the IBM POWER1, POWER2, and POWER3 architectures. For DGEMM, performance in excess of 90% of peak is obtained; see [F]. The Algorithms and Architecture approach is used in [F], which is different from the authors' approach. This approach is straightforward, structured, and highly efficient. The performance of the authors' codes is below 80%, 75%, and 74% of peak on the three platforms they measure performance on. The abstract states these performance levels significantly outperform other libraries (they mention ATLAS) and other competing methodologies. Yet on the Pentium III ATLAS achieves 78%. Note 78 > 74. Later on they claim advances such as an "oscillating iterative algorithm" and a "variable recursion cut-off criterion" for Strassen. This reviewer knew about versions of both of these advances in the early 1980s. The Strassen advance was used by ESSL in its 1987 release. This reviewer does agree with the need for a kernel interface.

I. Introduction - "Surprisingly, . . ." is too sweeping a statement. For example, if a vendor can uniformly achieve 90% or more of peak performance in a simple, straightforward manner for a particular processor, then additional or new techniques have marginal appeal. Their remarks about Strassen are relative. The authors do NOT compare their results against the peak rate of the processor. Rarely do their results for Strassen exceed the peak rate of the processor, and in cases where this occurs the matrix size is quite large; i.e., the compounding effect of Strassen should also be quite large.

Section 1.1 - The authors fail to mention ESSL, first released in February 1986, and the work of the Cedar project at the University of Illinois (circa 1985) in their references [1-5]. There is a series of papers by the ESSL group, e.g., [A, B, C, D, E, F, G, J, K, M, N, T, U, V, W, X]. In regard to alternate storage formats, the authors do not mention new formats sent to the BLAS forum during September 1999 by Fred Gustavson. These formats are a generalization of both of the traditional formats, packed and full. These formats have appeared in [L, O, P, Q, R, S, T, V, X, AA]. The second paragraph of page two speaks about the importance of the architecture component. There are no references. See [F] for a description of the Algorithms and Architecture approach. Paragraph 3 of page two mentions PHiPAC and ATLAS.
The idea for PHiPAC came about when Jim Demmel challenged his student to equal the performance of the ESSL library. Also, many of the coding techniques used by ATLAS are identical to ESSL's coding of its kernels, including the important on-chip DGEMM kernel. The numerical stability of Strassen's algorithm is sometimes better than that of the high school algorithm, which was a surprise to this reviewer.

Section 1.2 - This reviewer thinks the authors' claims are extravagant and misleading.
Claim 1. Other authors have done the same thing.
Claim 2. I don't believe these claims. I think the authors' implementations of other authors' work are either incorrect or inefficient and hence make these claims suspect.
Claim 3. I would agree with this claim. However, I am skeptical about the need to code using Morton ordering.
Claim 4. I agree. Other authors, however, have made this claim.
Claim 5. I was unaware of this "conventional wisdom." I have also found this claim to be true without the use of Morton ordering.
Claim 6. I know of Strassen code that greatly exceeds the performance of the code developed by the authors; see [H, I]. Yet Claim 6 states the opposite.
Claim 7. This claim speaks about new results: (i) the oscillating iterative technique is a known technique; (ii) use of standard 2D ordering requires no index checking; (iii) the IBM ESSL Strassen code had smooth and very high performance.
Claim 8. This claim expresses an opinion with which this reviewer agrees. However, other researchers do not agree with this need.
Claim 9. Again, this claim expresses an opinion that the authors believe the paper's results demonstrate. It is not clear to me that this claim has been justified.

Section 2 - Paragraph 1, Line 8 - Suggest "used" instead of "made use". Paragraph 2 - It is not clear to this reviewer that there isn't a relationship between CPU and memory. If there is, then treating them independently could lead to less than optimal performance. Paragraph 3, last sentence - I don't think the paper demonstrates the claim made by this sentence. For example, this paper is about DGEMM and doesn't explicitly address matrix factorization. Paragraph 4, Sentence 3 - These comparisons are made with codes provided by the authors. It is this reviewer's belief that the authors have produced either incorrect or inefficient implementations of other authors' codes/algorithms.

Section 3 - Paragraph 1, Line 1 - The reference list is incomplete; e.g., [F, L, O, P, Q, S, T, W, X, Y, AA] are missing. Paragraph 2 - It is not clear to this reviewer that Morton ordering should be used. If there are NS submatrices, each stored in row/col order, then a table of size NS can be used instead (a sketch of this alternative follows the Section 3.1 remarks below). Paragraph 3, last sentence - The authors describe a new result that overcomes previous difficulties with Morton ordering. This is a strong point of the paper.

Section 3.1 - Paragraph 1 - Using a table of size NS does not have the difficulties alluded to. Paragraph 2 - See the remark on Paragraph 1. Paragraph 3, Sentence 1 - Newer formats, see [AA], called block hybrid formats, do not require padding. Some of them don't even require tables. Sentence 2 - Only row/col ordering is used, along with tables. Sentence 3 - The use of a binary tree was chosen for ease of programming. Overhead is not an issue because recursion is combined with blocking. For examples, see [L, page 749] and [X]. Sentence 4 - This statement is misleading. The reviewer believes it would be easy to incorporate the two-miss characteristic into Gustavson's formulation. Sentence 5 - This reviewer thinks the results of Section 7.2 misrepresent Gustavson's methods. Sentence 6 - This reviewer thinks the statement is misleading. Clearly, with tables, the submatrices can be referenced in any order.
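To make the table-based alternative concrete, the reviewer sketches it below next to the usual bit-interleaved Morton location code. This is only the reviewer's illustration, written in C under the reviewer's own assumptions (a matrix partitioned into nb x nb blocks of order NB, each block stored contiguously in row/col order); none of the names or code are taken from the authors' paper.

    /* Reviewer's sketch, not the authors' code.  The matrix is partitioned
       into nb x nb blocks of order NB, i.e., NS = nb*nb submatrices, each
       stored contiguously in row/col (row-major) order. */

    /* Morton (Z-order) location code of block (bi, bj): interleave the bits
       of bi and bj.  This is the index arithmetic whose cost the authors'
       incremental location code (Section 4.1) is designed to reduce. */
    static unsigned morton_code(unsigned bi, unsigned bj) {
        unsigned code = 0;
        for (unsigned b = 0; b < 16; b++) {
            code |= ((bj >> b) & 1u) << (2*b);      /* column bits: even positions */
            code |= ((bi >> b) & 1u) << (2*b + 1);  /* row bits: odd positions     */
        }
        return code;
    }

    /* Table-based alternative: a table of NS base pointers indexed by the
       ordinary row-major block index bi*nb + bj.  The blocks themselves may
       be laid out in memory in any order at all; the table hides that choice
       at the cost of one load per block access. */
    typedef struct {
        int      nb;    /* number of blocks per row/column                */
        int      NB;    /* block order                                    */
        double **base;  /* NS = nb*nb entries, one base pointer per block */
    } BlockedMatrix;

    static double *block_of(const BlockedMatrix *m, int bi, int bj) {
        return m->base[bi * m->nb + bj];
    }

    /* Element (i, j) of the full matrix, addressed through the table. */
    static double get_elem(const BlockedMatrix *m, int i, int j) {
        const double *blk = block_of(m, i / m->NB, j / m->NB);
        return blk[(i % m->NB) * m->NB + (j % m->NB)];
    }

When NS is small, as the reviewer argues is the usual case (see the Section 3.2 remarks below), the table fits in cache and the extra load is negligible; this is the basis of the reviewer's objection that Morton ordering is not required.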
Section 3.2 - Here new work is being described about storing the NS submatrices. The reviewer liked the results. Paragraph 7 - Usually the size of the L1 cache is large enough so that NS is small. In that case tables can be used with negligible overhead. Paragraph 9 - There are no problems with the use of tables except for extra bookkeeping in the programming.

Sections 3.3 and 3.4 - Both of these sections contain interesting results related to the handling of odd-size matrices.

Section 4 - The authors should mention issues related to converting to and from row/col format. This referee predicts that any format change is NOT likely to be adopted. At the very least, the authors should mention that their data is initially in the new format and that any cost of converting to and from the standard format is not included.

Section 4.1 - Sentence 1 - The referee objects to the word "must". As a programming maxim, it is good to choose a data structure that matches an algorithm. For example, in [12] this maxim is followed and positive results are obtained using it. Sentence 2 - In [L], results showing the same thing for Cholesky and dense LU are reported, but in this case the maxim is not followed. Paragraph 2 - The new result about the incremental location code shows a reduced overhead. However, on some platforms the cost of standard row/col addressing is free (zero cost). The point here is that standard addressing is optimized in the hardware architecture.

Section 4.2 - Line 2 - Insert "of" after "block". Paragraphs 2 and 3 - Positive features of the authors' methods are mentioned which some other algorithms don't possess. However, there are still other algorithms, different from the authors', that have the same positive features; e.g., the use of tables or storing the submatrices in standard row/col order.

Section 4.3 - When it helps, the use of the two-miss feature is important. However, prefetching in the kernel tends to mitigate the usefulness of this feature.

Section 4.4 - Last sentence of Paragraph 1 - It may interest the authors to know that, for matrices with uniform random values in the range from zero to one, Strassen's algorithm has better accuracy than the standard algorithm. This fact was used successfully in ESSL. Paragraph 2 - Strassen's algorithm was used in [H, I]. Excellent performance results were obtained. For example, for N = 5000 and using 64 processors, Strassen gave 271 MFLOPs per processor, where the peak rate of each processor was 266 MFLOPs. Here each processor had a matrix of order 1250. In contrast, the authors' code reaches the peak rate at n = 3000 on the Challenge 10000 and n = 5000 on the Alpha. Paragraph 3 - Many of the algorithmic ideas used by Huss-Lederman et al. were previously used by ESSL in 1986. Paragraph 3 - The authors might mention that reporting apparent MFLOPs is current practice in the literature.

Section 4.4.1 - Paragraph 1 - This type of algorithm was used by ESSL for RISC processors. On vector machines, other ideas were adopted to take advantage of using full vector length. Paragraph 2 - A bordering technique was used in ESSL. Paragraph 3 - This approach was used for RISC processors in ESSL. Paragraph 4 - Other approaches that handle this problem also exist. (A sketch of a Strassen recursion with a cut-off to a kernel follows these Section 4 remarks.)
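For reference, the reviewer sketches below what a Strassen recursion with a cut-off to an optimized multiply kernel looks like, the kind of scheme referred to in the Abstract remarks ("variable recursion cut-off criterion") and in the Section 4.4 remarks above. This is only the reviewer's illustration, restricted for brevity to square row-major matrices of order a power of two, with a plain triple loop standing in for a tuned DGEMM kernel; it is neither the authors' code nor ESSL's, and a variable criterion would choose the threshold per machine and problem rather than using a fixed constant.

    #include <stdlib.h>
    #include <string.h>

    /* Plain triple loop standing in for an optimized DGEMM kernel. */
    static void kernel_mm(int n, const double *A, const double *B, double *C) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double s = 0.0;
                for (int k = 0; k < n; k++) s += A[i*n + k] * B[k*n + j];
                C[i*n + j] = s;
            }
    }

    static void madd(int n, const double *X, const double *Y, double *Z) {
        for (int i = 0; i < n*n; i++) Z[i] = X[i] + Y[i];
    }
    static void msub(int n, const double *X, const double *Y, double *Z) {
        for (int i = 0; i < n*n; i++) Z[i] = X[i] - Y[i];
    }

    /* Copy quadrant (bi, bj) of an n-by-n row-major matrix to/from an (n/2)-order buffer. */
    static void getq(int n, const double *X, int bi, int bj, double *Q) {
        int h = n/2;
        for (int i = 0; i < h; i++)
            memcpy(Q + i*h, X + (bi*h + i)*n + bj*h, (size_t)h * sizeof(double));
    }
    static void putq(int n, double *X, int bi, int bj, const double *Q) {
        int h = n/2;
        for (int i = 0; i < h; i++)
            memcpy(X + (bi*h + i)*n + bj*h, Q + i*h, (size_t)h * sizeof(double));
    }

    /* Strassen with a recursion cut-off: at or below 'cutoff' the kernel is
       called directly.  n must be a power of two. */
    void strassen(int n, int cutoff, const double *A, const double *B, double *C) {
        if (n <= cutoff) { kernel_mm(n, A, B, C); return; }
        int h = n/2;
        size_t bytes = (size_t)h * h * sizeof(double);
        double *a[4], *b[4], *m[7], *t1, *t2;
        for (int i = 0; i < 4; i++) { a[i] = malloc(bytes); b[i] = malloc(bytes); }
        for (int i = 0; i < 7; i++) m[i] = malloc(bytes);
        t1 = malloc(bytes); t2 = malloc(bytes);
        getq(n, A, 0, 0, a[0]); getq(n, A, 0, 1, a[1]); getq(n, A, 1, 0, a[2]); getq(n, A, 1, 1, a[3]);
        getq(n, B, 0, 0, b[0]); getq(n, B, 0, 1, b[1]); getq(n, B, 1, 0, b[2]); getq(n, B, 1, 1, b[3]);
        madd(h, a[0], a[3], t1); madd(h, b[0], b[3], t2); strassen(h, cutoff, t1, t2, m[0]);   /* M1 = (A11+A22)(B11+B22) */
        madd(h, a[2], a[3], t1);                          strassen(h, cutoff, t1, b[0], m[1]); /* M2 = (A21+A22)B11       */
        msub(h, b[1], b[3], t2);                          strassen(h, cutoff, a[0], t2, m[2]); /* M3 = A11(B12-B22)       */
        msub(h, b[2], b[0], t2);                          strassen(h, cutoff, a[3], t2, m[3]); /* M4 = A22(B21-B11)       */
        madd(h, a[0], a[1], t1);                          strassen(h, cutoff, t1, b[3], m[4]); /* M5 = (A11+A12)B22       */
        msub(h, a[2], a[0], t1); madd(h, b[0], b[1], t2); strassen(h, cutoff, t1, t2, m[5]);   /* M6 = (A21-A11)(B11+B12) */
        msub(h, a[1], a[3], t1); madd(h, b[2], b[3], t2); strassen(h, cutoff, t1, t2, m[6]);   /* M7 = (A12-A22)(B21+B22) */
        madd(h, m[0], m[3], t1); msub(h, t1, m[4], t1); madd(h, t1, m[6], t1); putq(n, C, 0, 0, t1); /* C11 = M1+M4-M5+M7 */
        madd(h, m[2], m[4], t1);                                               putq(n, C, 0, 1, t1); /* C12 = M3+M5       */
        madd(h, m[1], m[3], t1);                                               putq(n, C, 1, 0, t1); /* C21 = M2+M4       */
        msub(h, m[0], m[1], t1); madd(h, t1, m[2], t1); madd(h, t1, m[5], t1); putq(n, C, 1, 1, t1); /* C22 = M1-M2+M3+M6 */
        for (int i = 0; i < 4; i++) { free(a[i]); free(b[i]); }
        for (int i = 0; i < 7; i++) free(m[i]);
        free(t1); free(t2);
    }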
Section 5 - Paragraph 1, Sentence 2 - This referee thinks this aspect is more important. Paragraph 2 - This referee agrees with the authors. However, lobbying for a new interface is a thankless task. But I hope the authors succeed. Paragraph 4 - This referee thinks that optimized level 3 kernels are the correct building blocks. Paragraph 5 - ESSL RISC kernels don't use assembler. Algorithmic prefetching is used by ESSL; see [E, F]. There is also the concept of level 3 prefetching, which is necessary on some RISC processors. Paragraph 6 - The first sentence is misleading in that it gives the impression that this is being done for the first time. Also, in Paragraph 5, the authors indicate "a proof-of-concept." This sentence indicates a new major contribution of the research. It is also claimed that these ideas are only used in isolation. This referee knows these ideas are used in totality by ESSL where appropriate.

Section 6 - Paragraph 1 - Figure 6.1 is too high level. The top and bottom boxes constitute the entire "action" boxes for some other competing approaches; e.g., see [F] and [AD]. The contribution of the authors' approach is to state which one of four competing algorithms should be used in the middle area of Figure 6.1. Paragraph 2, Line 2 - Suggest "more poorly" or "poorer" instead of "poorly". Paragraph 2 - The Algorithms and Architecture approach (see [F]) can be viewed as a polyalgorithmic treatment. Paragraph 3 - Another approach, not mentioned, is the FLAME approach; see [AB]. Paragraph 4 - Here conversion between various formats is mentioned for the first time. Sentences 3 and 4 - The authors use the results of Chatterjee et al. to conclude the cost is minimal, say 2-5%. This referee tends to disagree. In [L, page 739], it is pointed out that the level 3 BLAS are called multiple times by factorization algorithms. Sentence 5 takes a more balanced approach. Sentence 6 - Reference [L, page 739] makes this point. Paragraph 5, Sentence 1 - It is perhaps better that libraries take this approach; see [X], where this approach has already been successfully taken for Cholesky factorization. Paragraph 6, Sentence 1 - Others have reached the same conclusion. ESSL, for example, has based its BLAS and factorization algorithms on kernel routines and on routines for the top-level block of Figure 6.1. Paragraph 6, Line -1 - Suggest replacing "with" by "resulting in".

Section 7 - Suggest stating the peak MFLOPs rate for each processor, e.g., 390, 1000, and 550 MFLOPs, respectively.

Section 7.1 - A semi-log plot also reveals significant information about performance. The reviewer wonders about performance for matrix sizes in the range from, say, 10 to 100. For example, ESSL, see [F], obtained over 200 MFLOPs (out of a 266 MFLOPs peak) for an order 20 matrix on POWER2. For n ≈ 40 the result was 240 MFLOPs, which is better than 90% of peak. At n ≈ 50 the result was at 95% of peak. The Figure 7.1 plots appear to start at n = 50, while the Figure 7.2 plots appear to start at n = 300 and n = 100, respectively. Clearly, the performance doesn't approach the peak rate of these processors. Section 7.1, Paragraph 4, Sentence 2 - ESSL, see [F, pages 569 and 570], takes precautions to avoid so-called bad LDAs. Hence, the statement in Sentence 3 is misleading. Nonetheless, the reviewer agrees with the authors about using alternate data formats; e.g., see [X], Figure 22. Paragraph 5 - Other matrix multiply algorithms use the polyalgorithm approach; e.g., see [F], and more recently [O, AD].

Section 7.2 - This reviewer thinks the results of this section are misleading. For example, take the Pentium III.
The reviewer has measured ATLAS performance at n = 1000 for a 600 MHz Pentium III at 462 MFLOPs. Now 11/12 of 462 is 424 (scaling from 600 MHz to the 550 MHz part measured by the authors), and Figure 7.4 lists ATLAS' performance at slightly below 400 MFLOPs. Also, the results for Chatterjee and Gustavson appear too low. For example, the method of Gustavson et al. [12] traverses a binary tree, calling kernel routines at the leaves. The kernel routines familiar to this referee perform near the peak rate of the processor. Figure 7.3 suggests the Challenge 10000 kernel for the Gustavson et al. [12] algorithm runs at about 300/390 = 77% of peak. Yet the authors show full matrix multiply running near 320 MFLOPs, or 81% of peak. This referee will not comment on the Alpha part of Figure 7.4. For the Pentium III, the ITXGEMM kernel runs at about 460 MFLOPs for a 550 MHz model. Yet Figure 7.4 says Gustavson's method [12] runs at about 325 MFLOPs. Hence, this reviewer does not believe the authors' conclusions. Let me suggest a possible explanation. Sentence 3 says native BLAS or ATLAS were used. I believe ATLAS always uses data copying. I don't know what native BLAS means. I assume it means vendor-supplied BLAS? I hope the authors did not use the BLAS supplied with netlib. So, an algorithm such as Gustavson's, which was constructed to only call an optimized kernel routine, should NOT be doing data copying to put an operand into a format that it is already in. This extra overhead would possibly explain the authors' performance results in regard to, say, the Gustavson et al. method (a sketch of the packing pass in question follows the Section 7.3 remarks below). Paragraph 2, Sentence 7 - I don't think "significantly" is justified in describing ATLAS' performance. For the SGI, Figure 7.3 performance ranges from slightly over 275 to slightly over 300 MFLOPs. For the Alpha it ranges between 650 and 750 MFLOPs up to size 4000. The results after size 4000 are probably due to ATLAS' inability to allocate buffers to block for higher levels of memory. For the Pentium III, ATLAS' performance is nearly level up to size 4500. Paragraph 3 - Gustavson's method, mentioned above, was meant only to call kernel routines. This referee suggests that the "fat interface" is only a small part of the problem. He thinks it is mainly a data copying problem. He notes that some vendor-supplied BLAS, e.g., ESSL, only do a data copy when there can be a cache conflict problem or when the cost of the data copy is minimal; see [F, pages 569, 570].

Section 7.3 - Sentence 2 - The sentence is misleading. The performance does not approach the peak rate of the processors studied. Hence, reasons should be supplied why any other method cannot surpass the performance achieved by the authors. Sentence 3 - It may be possible for other algorithms using traditional formats to achieve the same or better performance. Sentence 5 - This referee does not believe this conclusion; see [S, Z] for counterexamples. Sentence 6 - It appears that Chatterjee's method could adopt some of the authors' results.
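To make the data-copying point under Section 7.2 concrete, the reviewer sketches below the kind of copy-in pass a conventional GEMM performs to pack a column-major operand into contiguous blocks before calling its kernel. The sketch is the reviewer's own (the blocked layout and all names are assumptions, not any particular library's internals); the point is simply that this O(n^2) packing pass is pure overhead when an operand is already stored block-contiguously, and a method built to call the kernel directly on such operands should skip it.

    #include <string.h>

    /* Pack a column-major M x N operand with leading dimension lda into
       contiguous MB x NB blocks, block after block.  A GEMM that insists on
       its own internal format performs a pass like this over each operand;
       when the caller's data is already block-contiguous the pass is wasted
       work, which is the overhead suspected in the Section 7.2 comparisons. */
    static void pack_blocked(int M, int N, int MB, int NB,
                             const double *A, int lda, double *Apacked) {
        double *dst = Apacked;
        for (int jb = 0; jb < N; jb += NB) {
            int nb = (N - jb < NB) ? N - jb : NB;
            for (int ib = 0; ib < M; ib += MB) {
                int mb = (M - ib < MB) ? M - ib : MB;
                for (int j = 0; j < nb; j++)   /* copy one block, column by column */
                    memcpy(dst + (size_t)j * mb,
                           A + (size_t)(jb + j) * lda + ib,
                           (size_t)mb * sizeof(double));
                dst += (size_t)mb * nb;
            }
        }
    }

Whether the comparison codes of Section 7.2 incur such a pass on operands that are already in their preferred format is exactly what the reviewer is asking the authors to check.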
Section 8 - Sentence 3 - The referee does NOT agree with this claim. Sentence 4 - I don't believe ATLAS uses prefetching in its kernel design. Paragraph 2, Sentence 3 - This sentence is misleading. As mentioned earlier, matching an algorithm to its data structure and vice versa is a good maxim. However, there may be exceptions. So this reviewer agrees that it is not essential. Sentence 4 - I believe this result is already known. Sentences 5 and 6 - These statements have the "ring of truth." However, it is not clear which "polyalgorithm" is best. The referee thinks other "polyalgorithms" have unifying themes; e.g., see ESSL and [AD]. Paragraph 3 - As mentioned, I don't agree with this conclusion. The level 3 BLAS were designed and advocated by dense linear algebra numerical analysts for improving performance and extending the functionality of then-established libraries such as LINPACK and EISPACK. A flaw in this design was pointed out in [L, page 739]. More recently, see [S, Z]. Paragraph 4 - The references just cited, along with [L, P, R, T, V, X, AA], have already investigated some of the future work mentioned in this paragraph. Paragraph 4, Sentence 3 - This area also contains recent research; see [V, W, AB, AD].

REFERENCES

[A] R. C. Agarwal, F. G. Gustavson, "A Parallel Implementation of Matrix Multiplication and LU Factorization on IBM 3090," book chapter in "Aspects of Computation on Asynchronous Parallel Processors," Margaret Wright, Editor, North Holland, 1989, pp. 217-221.
[B] R. C. Agarwal, F. G. Gustavson, J. McComb, S. Schmidt, "Engineering and Scientific Subroutine Library Release 3 for IBM ES/3090 Vector Multiprocessors," IBM Systems Journal, Vol. 28, No. 2, 1989, pp. 345-350.
[C] R. C. Agarwal, F. G. Gustavson, "Vector and Parallel Algorithms for Cholesky Factorization on IBM 3090," Proceedings of Supercomputing '89, Reno, Nevada, November 13-17, 1989, pp. 225-233.
[D] R. C. Agarwal, F. G. Gustavson, M. Zubair, "A High Performance Algorithm Using Pre-Processing for the Sparse Matrix-Vector Multiplication," Proceedings of Supercomputing '92, Minneapolis, Minn., November 16-20, 1992, pp. 32-41.
[E] R. C. Agarwal, F. G. Gustavson, M. Zubair, "Improving Performance of Linear Algebra Algorithms for Dense Matrices Using Algorithmic Prefetch," IBM Journal of Research and Development, Vol. 38, May 1994, pp. 265-276.
[F] R. C. Agarwal, F. G. Gustavson, M. Zubair, "Exploiting Functional Parallelism of POWER2 to Design High Performance Numerical Algorithms," IBM Journal of Research and Development, Vol. 38, September 1994, pp. 563-576.
[G] R. C. Agarwal, F. G. Gustavson, M. Zubair, "A High Performance Matrix-Multiplication Algorithm on a Distributed-Memory Parallel Computer Using Overlapped Communication," IBM Journal of Research and Development, Vol. 38, November 1994, pp. 673-682.
[H] R. C. Agarwal, F. G. Gustavson, S. M. Balle, M. Joshi, P. Palkar, "A High Performance Matrix Multiplication Algorithm for MPPs," Applied Parallel Computing, Proceedings of the Second International Workshop, PARA '95, Lyngby, Denmark, August 21-24, 1995, Jack Dongarra, Kaj Madsen, Jerzy Wasniewski, Editors, Lecture Notes in Computer Science 1041, pp. 1-8.
[I] R. C. Agarwal, S. M. Balle, F. G. Gustavson, M. Joshi, P. Palkar, "A Three-Dimensional Approach to Parallel Matrix Multiplication," IBM Journal of Research and Development, Vol. 39, No. 5, September 1995.
[J] F. G. Gustavson, A. Gupta, "A New Parallel Algorithm for Tridiagonal Symmetric Positive Definite Systems of Equations," Applied Parallel Computing, Industrial Computation and Optimization, Proceedings of the Third International Workshop, PARA '96, August 18-21, 1996, Jerzy Wasniewski, Jack Dongarra, Kaj Madsen, Dorte Olesen, Editors, Lecture Notes in Computer Science 1184, September 1996, pp. 341-349.
[K] R. C. Agarwal, F. G. Gustavson, M. Zubair, "Performance Tuning on IBM RS/6000 POWER2 Systems," Proceedings of PARA '96, August 1996.
[L] F. G. Gustavson, "Recursion Leads to Automatic Variable Blocking for Dense Linear Algebra Algorithms," IBM Journal of Research and Development, Vol. 41, No. 6, November 1997, pp. 737-755.
[M] A. Gupta, F. G. Gustavson, M. Joshi, S. Toledo, "The Design, Implementation, and Evaluation of a Banded Linear Solver for Distributed Memory Parallel Computers," ACM TOMS, March 1998.
[N] E. Elmroth, F. Gustavson, "New Serial and Parallel Recursive QR Factorization Algorithms for SMP Systems," Applied Parallel Computing, Large Scale Scientific and Industrial Problems, Proceedings of the 4th International Workshop, PARA '98, Umeå, Sweden, June 14-17, 1998, Bo Kågström, Jack Dongarra, Erik Elmroth, Jerzy Wasniewski, Editors, Lecture Notes in Computer Science 1541, pp. 120-128.
[O] F. Gustavson, A. Henriksson, I. Jonsson, B. Kågström, P. Ling, "Recursive Blocked Data Formats and BLAS's for Dense Linear Algebra Algorithms," Applied Parallel Computing, Large Scale Scientific and Industrial Problems, Proceedings of the 4th International Workshop, PARA '98, Umeå, Sweden, June 14-17, 1998, Bo Kågström, Jack Dongarra, Erik Elmroth, Jerzy Wasniewski, Editors, Lecture Notes in Computer Science 1541, pp. 195-206.
[P] F. Gustavson, A. Henriksson, I. Jonsson, B. Kågström, P. Ling, "Superscalar GEMM-based Level 3 BLAS -- The On-going Evolution of a Portable and High-Performance Library," Applied Parallel Computing, Large Scale Scientific and Industrial Problems, Proceedings of the 4th International Workshop, PARA '98, Umeå, Sweden, June 14-17, 1998, Bo Kågström, Jack Dongarra, Erik Elmroth, Jerzy Wasniewski, Editors, Lecture Notes in Computer Science 1541, pp. 207-215.
[Q] J. Wasniewski, B. S. Andersen, F. Gustavson, "Recursive Formulation of Cholesky Algorithm in Fortran 90," Applied Parallel Computing, Large Scale Scientific and Industrial Problems, Proceedings of the 4th International Workshop, PARA '98, Umeå, Sweden, June 14-17, 1998, Bo Kågström, Jack Dongarra, Erik Elmroth, Jerzy Wasniewski, Editors, Lecture Notes in Computer Science 1541, pp. 574-578.
[R] B. S. Andersen, J. Wasniewski, F. Gustavson, A. Karaivanov, P. Y. Yalamov, "LAWRA: Linear Algebra with Recursive Algorithms," Proceedings of the Third International Conference on Parallel Processing and Applied Mathematics, Kazimierz Dolny, Poland, September 14-17, 1999, R. Wyrzykowski, B. Mochnacki, H. Piech, J. Szopa, Editors, pp. 63-76.
[S] F. G. Gustavson, "New Generalized Data Structures for Matrices Lead to a Variety of High Performance Dense Linear Algebra Algorithms," November 15, 1999, Proceedings of the PDC Annual Conference, Simulation and Visualization on the Grid, B. Engquist, L. Johnsson, M. Hammill, and F. Short, Editors, Springer, New York, December 1999, pp. 46-61.
[T] Fred Gustavson, Alexander Karaivanov, Minka Marinova, Jerzy Wasniewski, Plamen Yalamov, "A Fast Minimal Storage Symmetric Indefinite Solver," Applied Parallel Computing, 5th International Workshop, PARA 2000, Bergen, Norway, June 19-21, 2000, Lecture Notes in Computer Science, 10 pages.
[U] Fred Gustavson and Isak Jonsson, "High Performance Cholesky Factorization via Blocking and Recursion That Uses Minimal Storage," Applied Parallel Computing, 5th International Workshop, PARA 2000, Bergen, Norway, June 19-21, 2000, Lecture Notes in Computer Science, 10 pages.
[V] Erik Elmroth and Fred Gustavson, "High-Performance Library Software for QR Factorization," Applied Parallel Computing, 5th International Workshop, PARA 2000, Bergen, Norway, June 19-21, 2000, Lecture Notes in Computer Science, 11 pages.
[W] E. Elmroth, F. G. Gustavson, "Applying Recursion to Serial and Parallel QR Factorization Leads to Better Performance," IBM J. Res. Develop., Vol. 44, No. 4, July 2000, pp. 605-624.
[X] Fred Gustavson and Isak Jonsson, "Minimal Storage High Performance Cholesky Factorization via Blocking and Recursion," IBM J. Res. Develop., Vol. 44, No. 6, November 2000, pp. 823.
[Y] Erik Elmroth and Fred G. Gustavson, "A New Much Faster and Simpler Algorithm for LAPACK DGELS," accepted by BIT, September 2000, 12 pages.
[Z] Fred G. Gustavson, "New Generalized Data Structures for Matrices Lead to a Variety of High Performance Algorithms," Proceedings of the IFIP TC2/WG2.5 Working Conference on the Architecture of Scientific Software, Ottawa, Canada, October 2-4, 2000, Ronald F. Boisvert and Ping Tak Peter Tang, Editors, Kluwer, Boston, 14 pages.
[AA] B. S. Andersen, F. Gustavson, and J. Wasniewski, "A Recursive Formulation of the Cholesky Factorization Operating on a Matrix in Packed Storage Form," accepted for publication in ACM TOMS, January 2001, 34 pages.
[AB] J. A. Gunnels, G. M. Henry, R. A. van de Geijn, "FLAME: Formal Linear Algebra Methods Environment," Proceedings of the IFIP TC2/WG2.5 Working Conference on the Architecture of Scientific Software, Ottawa, Canada, October 2-4, 2000, Ronald F. Boisvert and Ping Tak Peter Tang, Editors, Kluwer, Boston, 14 pages; unpublished, see http://www.cs.utexas.edu/users/flame/pubs.html.
[AC] Fred G. Gustavson, "New Generalized Data Structures for Matrices Lead to a Variety of High Performance Algorithms," Proceedings of the IFIP TC2/WG2.5 Working Conference on the Architecture of Scientific Software, Ottawa, Canada, October 2-4, 2000, Ronald F. Boisvert and Ping Tak Peter Tang, Editors, Kluwer, Boston, 14 pages.
[AD] ITXGEMM; see http://www.cs.utexas.edu/users/flame/pubs.html.
[AE] Robert van de Geijn, "Using PLAPACK: Parallel Linear Algebra Package," The MIT Press, 1997.