Referee 1 **************************************************************** This paper presents an algorithm for finding matrix block structure by exploring the matrix sparsity pattern. The proposed algorithm is interesting but the paper is not very well presented. Firstly, the paper did not present an example to show why it is important to take structure information when forming a pre-conditioner (which the paper has stated in the introduction section). It is important to present the example as it will illustrate the motivation behind the proposed work. Secondly, the paper did not discuss the cost of locating the "split point" in a matrix; Thirdly, the paper need more citations either to justify some of its claims or for further readings. Finally, in my opinion, the current state of the paper can only be a research note rather than a full paper. I would suggest the author to re-submit the paper for further evaluation. Referee 2 **************************************************************** E: Referee Comments (For Author and Editor) * The idea to use natural blocks of a matrix, without reordering, to guide an iterative procedure looks attractive and original; however, the experimentation section is disappointing. The comparison with an even split does not bring much because this is clearly a bad approach for the matrix tested. Section 1 says that using the natural blocks of the matrix is useful, but the experiments do not show that this can be better than allowing reordering of the matrix. Therefore in my opinion, the interest of the approach still needs to be proven. For that, the paper should include a part describing existing techniques that aim at defining blocks of a matrix for an iterative method, possibly allowing reordering and a comparison with those should be made. For example, wouldn't a general partitioner working on the adjacency graph of the (symmetrized) matrix provide better blocks at lower cost, possibly better than the splitting called "optimal" ? This would be interesting to study. * The algorithms proposed appear very costly but there might be some simple ways to decrease their costs. For instance in Figure 2, if row segment j is 0 for a given i_0, then a flag(i_0) could be set in order to avoid testing the smaller row segment j again for i > i_0. * The discussions on parallel aspects of the algorithms is not mature. The text should not insist on these aspects because parallel versions are not ready to be implemented/experimented. The focus should be on proving that the method works well, and possibly then on refining the sequential implementation. F: Presentation Changes - The presentation of the algorithms and tables could be improved. - Figures should be centered. - The description of the algorithms (sections 2 and 3) could be clearer. - It would be good to have a state of the art of existing techniques to define blocks and a conclusion showing in details the benefits brought by the new heuristic. More detailed comments follow (lines are for text, excluding tables and figures): p2, l24: "lose us anything" should be reworded. p2, l27: "compressed storage scheme" is not clear. If you mean compressed row storage then it makes sense to use the upper triangular part of the matrix. P2, l-2: "a zero element" -> "a nonzero element" (I think) p3, l1: "spit"->split p3, l6&7: Instead of running the algorithm of figure 2 in parallel for different values of p, couldn't some info from a previous pass with a larger band be reused ? p3, section 3: "or regular domains that have already been subjected to a Cuthill McKee ordering" There is an apparent contradiction. If such an ordering is applied, then as said in section 1, the natural block information from the application is lost. Following the logic, there is then no reason to use a method that does not allow reordering the matrix anymore. If on the other hand it is worth to combine reordering techniques with the natural detection of blocks proposed, then this should be experimented. p3, l-8 to -3: the paragraph is difficult to understand. It supposes that the matrix is distributed in some way but there is no information on this. P4, beg.: It would be useful to recall the mechanics of an alternating Schwarz preconditioner and how convergence may depend on the definition of the blocks and the connections between blocks. P4, l12: "one small of size" -> "a small one of size" P4, l13: "we simulated 8 processors throughout" is not clear. P4, l3: "The is" -> "The matrix is" P5&7, tables 1&2: Since the general splitting gives a better convergence than the "optimal" splitting in table 2, the term "optimal" could be replaced by "natural" or "regular". Referee 3 **************************************************************** E: Referee Comments (For Author and Editor) The manuscript describes two partitioning algorithms, one for regular and one for general matrices, to automatically determine natural block structures of the matrix. For parallel iterative solvers, this block structure can then be used for domain-decomposition based preconditioners. The algorithms are only sketched, not explained in detail. Some figures would help in sections 2 and 3. In section 4, results are presented for a simple matrix struture only. Moreover, iteration counts are given for merely two simple, relatively small matrices. It is more interesting to see how the algorithms behave for large, irregularly structured matrices. It is also important to see how they can deal with many processors (domains), let's say more than 100. To prove the effectiveness and the efficiency of the algorithms described, a lot of test cases (also from different applications) have to be examined. An alternative approach is to use graph partitioning software like METIS. With the multi-objective option, also the values of the matrix non-zeros can be considered. From my experience, this gives at least as good iteration counts compared with even splitting and the Jacobi preconditioner as described for the partitioning algorithms in the paper. METIS definitely can handle unstructured cases. Thus also a comparison with METIS partitioning or other (graph) partitioning software would be helpful. F: Presentation Changes Both the algorithmic and the result part have to be described in much more detail. For the algorithmic part, some figures would help a lot. In the result part, more test cases are necessary. A comparison with standard partitioning software would also be good. Furthermore, spelling has to be checked thoroughly. Referee 4 **************************************************************** This paper deals with algorithms for the automatic determination of structure in sparse matrices. Many - especially very large sparse matrices arise from the discretization of partial differential equations and therefore have an inherent structure which can be used in algorithms like domain decomposition. Once the matric has been constructed, the original structure is lost. This paper deals with algorithms which can reconstruct this information given only the matrix, but no information on where the matrix comes from. This is potentially an important and useful problem. Though a good interface between PDE and solver should probably pass the extra information (like grid structure, blocks, subdomains, etc) to the solver, in many realistic applications this information is not available and a black box solver must be used given only the matrix and no extra information. So the topic is promising, but unfortunately, the 6 page paper itself is quite disappointing. The algorithms are described poorly. Except a few code fragments the description of the algorithms remains informal and vague. I have not been able to understand what the algorithms do in detail. The hints at their parallelization are unintelligible. Even worse, there is hardly any attempt to discuss the properties of these algorithms. What is the complexity and how does it depend on the specific matrices considered? What role play the data structures for the matrices? Of course it would also be interesting to study the the quality of the automatically constructed partitions and compare them to those obtained manually for the same problem. At least for model problems, as they arise in typical 2D and 3D grid problems, the algorithmic complexity of the algorithms could be analyzed. In that case, the type of partitions could be compared to those from typical grid partitioning approaches. Such a more detailed analysis would indeed be interesting, but the current paper does too little in that direction. Referee 5 **************************************************************** E: Referee Comments (For Author and Editor) Many PDE problems have multiple variable or degrees of freedom at each grid point (say u,v, w,p) and taking advantage of such dense block structures in the sparse matrix is key to achieving good performance. In this paper, the block structure refers to subdomains decretized by a regular stencil on a logically rectangular grid. The author suggests some heuristic to identify such subdomains based on the symmetrized graph structure but does not use numerical entries of the matrix. The author might consider using matlab-like matrix notation to describe the intended algorithm, then separately address the issue of efficient implementation in a distributed memory parallel environment. The author might also specify what assumptions are made in the parallel implementation such say matrix is stored in compress block row format and the upper triangular part is distributed in block rows. The author might consider giving definitions of some terms such as "split point" or "beginnings/endings" of blocks. The impact of the automatic block repartition can be judged by the effectiveness of the preconditioner as given by number of iterations, and on the overall run time. The first can be determined even using MATLAB on a serial machine. The second issue has to take into account the effect of possible load imbalance or higher message volume. The author might also consider expanding the numerical experiments on more test cases perhaps obtained from the Fortran Market or the Harwell Boeing library. The author might comment on the class of problem (say finite difference on elliptic problem on a "brick" domain) that this heuristic might be useful. The example problem used is based on a two-material problem (heat conduction?) with large differences in material properties. The discretization used for each material seems to product different sparsity pattern. If the same discretization technique were used on a similar grid (say both are logically rectangular grid with same grid dimensions), then the heuristic on automatic block detection that is based only on the graph structure would fail to identify the two sub-domains. F: Presentation Changes The author might consider adding a figure to describe the two-material problem used to generate the sparse matrices. The author might help the reader by adding in the caption what the main point or what the reader should find or focus on each figure.