Referee 1 ****************************************************************

This paper presents an algorithm for finding matrix block structure by exploring the matrix
sparsity pattern.

The proposed algorithm is interesting but the paper is not very well presented. Firstly,
the paper did not present an example to show why it is important to take structure
information when forming a pre-conditioner (which the paper has stated in the introduction
section).

It is important to present the example as it will illustrate the motivation behind the
proposed work. Secondly, the paper did not discuss the cost of locating the "split point"
in a matrix; Thirdly, the paper need more citations either to justify some of its claims or
for further readings. Finally, in my opinion, the current state of the paper can only be a
research note rather than a full paper.

I would suggest the author to re-submit the paper for further evaluation.


Referee 2 ****************************************************************

E: Referee Comments (For Author and Editor)

* The idea to use natural blocks of a matrix, without reordering, to
guide an iterative procedure looks attractive and original; however,
the experimentation section is disappointing. The comparison with
an even split does not bring much because this is clearly a bad
approach for the matrix tested. Section 1 says that using the
natural blocks of the matrix is useful, but the experiments do not
show that this can be better than allowing reordering of the matrix.
Therefore in my opinion, the interest of the approach still needs
to be proven. For that, the paper should include a part describing
existing techniques that aim at defining blocks of a matrix for an
iterative method, possibly allowing reordering and a comparison
with those should be made. For example, wouldn't a general
partitioner working on the adjacency graph of the (symmetrized)
matrix provide better blocks at lower cost, possibly  better than
the splitting called "optimal" ? This would be interesting to
study.

* The algorithms proposed appear very costly but there might be some
simple ways to decrease their costs. For instance in Figure 2, if
row segment j is 0 for a given i_0, then a flag(i_0) could be set in
order to avoid testing the smaller row segment j again for i > i_0.

* The discussions on parallel aspects of the algorithms is not mature.
The text should not insist on these aspects because parallel versions
are not ready to be implemented/experimented. The focus should be
on proving that the method works well, and possibly then on refining
the sequential implementation.

F: Presentation Changes

- The presentation of the algorithms and tables could be improved.
- Figures should be centered.
- The description of the algorithms (sections 2 and 3) could be clearer.

- It would be good to have a state of the art of existing techniques to
define blocks and a conclusion showing in details the benefits brought
by the new heuristic.

More detailed comments follow (lines are for text, excluding tables and
figures):

p2, l24:  "lose us anything" should be reworded.
p2, l27:  "compressed storage scheme" is not clear.
          If you mean compressed row storage then it makes sense to
          use the upper triangular part of the matrix.
P2, l-2:  "a zero element" -> "a nonzero element" (I think)
p3, l1:   "spit"->split
p3, l6&7: Instead of running the algorithm of figure 2 in parallel for
         different values of p, couldn't some info from a previous pass
with
         a larger band be reused ?

p3, section 3: "or regular domains that have already been subjected to a
 Cuthill McKee ordering"  There is an apparent contradiction. If
such an ordering is applied, then as said in section 1, the natural
block information from the application is lost. Following the logic,
there is then no reason to use a method that does not allow reordering
the matrix anymore. If on the other hand it is worth to combine
reordering techniques with the natural detection of blocks proposed, then
this should be experimented.

p3, l-8 to -3: the paragraph is difficult to understand. It supposes
that the matrix is distributed in some way but there is no information
on this.

P4, beg.: It would be useful to recall the mechanics of an alternating
         Schwarz preconditioner and how convergence may depend on the
         definition of the blocks and the connections between blocks.

P4, l12: "one small of size" -> "a small one of size"

P4, l13: "we simulated 8 processors throughout" is not clear.

P4, l3:  "The is" -> "The matrix is"

P5&7, tables 1&2: Since the general splitting gives a better convergence

           than the "optimal" splitting in table 2, the term "optimal"
could be replaced by "natural" or "regular".


Referee 3 ****************************************************************

E: Referee Comments (For Author and Editor)

The manuscript describes two partitioning algorithms, one for regular
and one for general matrices, to automatically determine natural block
structures of the matrix. For parallel iterative solvers, this block
structure can then be used for domain-decomposition based
preconditioners.

The algorithms are only sketched, not explained in detail. Some figures
would help in sections 2 and 3.

In section 4, results are presented for a simple matrix struture only.
Moreover, iteration counts are given for merely two simple, relatively
small matrices.

It is more interesting to see how the algorithms behave for large,
irregularly structured matrices. It is also important to see how they
can deal with many processors (domains), let's say more than 100.

To prove the effectiveness and the efficiency of the algorithms described,
a lot of test cases (also from different applications) have to be
examined.

An alternative approach is to use graph partitioning software like
METIS. With the multi-objective option, also the values of the
matrix non-zeros can be considered. From my experience, this gives at least
as good iteration counts compared with even splitting and the
Jacobi preconditioner as described for the partitioning algorithms in the
paper. METIS definitely can handle unstructured cases. Thus also
a comparison with METIS partitioning or other (graph) partitioning
software would be helpful.


F: Presentation Changes

Both the algorithmic and the result part have to be described in much
more detail. For the algorithmic part, some figures would help a lot.
In the result part, more test cases are necessary. A comparison with
standard partitioning software would also be good.

Furthermore, spelling has to be checked thoroughly.


Referee 4 ****************************************************************

This paper deals with algorithms for the automatic determination
of structure in sparse matrices. Many - especially very large sparse
matrices arise from the discretization of partial differential equations
and therefore have an inherent structure which can be used in algorithms
like domain decomposition. Once the matric has been constructed, the
original structure is lost. This paper deals with algorithms which can
reconstruct this information given only the matrix, but no information
on where the matrix comes from.

This is potentially an important and useful problem. Though a
good interface between PDE and solver should probably pass the extra
information (like grid structure, blocks, subdomains, etc) to the  solver,
in many realistic applications this information is not available and
a black box solver must be used given only the matrix and no extra
information.

So the topic is promising, but unfortunately, the 6 page paper itself
is quite disappointing.  The algorithms are described poorly. Except a
few code fragments the description of the algorithms remains informal
and vague. I have not been able to understand what the algorithms do in
detail. The hints at their parallelization are unintelligible.

Even worse, there is hardly any attempt to discuss the properties of these
algorithms.  What is the complexity and how does it depend on the specific
matrices considered? What role play the data structures for the matrices?

Of course it would also be interesting to study the the quality of the
automatically constructed partitions and compare them to those obtained
manually for the same problem.

At least for model problems, as they arise in typical 2D and 3D
grid problems, the algorithmic complexity of the algorithms could be
analyzed. In that case, the type of partitions could be compared to those
from typical grid partitioning approaches. Such a more detailed analysis
would indeed be interesting, but the current paper does too little in that
direction.

Referee 5 ****************************************************************

E: Referee Comments (For Author and Editor) 

Many PDE problems have multiple variable or degrees of freedom at each grid point (say u,v,
w,p) and taking advantage of such dense block structures in the sparse matrix is key to
achieving good performance. In this paper, the block structure refers to subdomains
decretized by a regular stencil on a logically rectangular grid.  The author suggests some
heuristic to identify such subdomains based on the symmetrized graph structure but does not
use numerical entries of the matrix.

The author might consider using matlab-like matrix notation to describe the intended
algorithm, then separately address the issue of efficient implementation in a distributed
memory parallel environment. The author might also specify what assumptions are made in the
parallel implementation such say matrix is stored in compress block row format and the
upper triangular part is distributed in  block rows.

The author might consider giving definitions of some terms such as "split point" or
"beginnings/endings" of blocks.

The impact of the automatic block repartition can be judged by the effectiveness of the
preconditioner as given by number of iterations, and on the overall run time. The first can
be determined even using MATLAB on a serial machine. The second issue has to take into
account the effect of possible load imbalance or higher message volume. The author might
also consider expanding the numerical experiments on more test cases perhaps obtained from
the Fortran Market or the Harwell Boeing library. The author might comment on the class of
problem (say finite difference on elliptic problem on a "brick" domain) that this heuristic
might be useful.

The  example problem used is based on a two-material problem (heat conduction?) with large
differences in material properties. The discretization used for each material seems to
product different sparsity pattern. If the same discretization technique were used on a
similar grid (say both are logically rectangular grid with same grid dimensions), then the
heuristic on automatic block detection that is based only on the graph structure would fail
to identify the two sub-domains.
  
  
  
  
  
  

F: Presentation Changes 

The author might consider adding a figure to describe the two-material problem used to
generate the sparse matrices.

The author might help the reader by adding  in the caption what the main point or what the
reader should find or focus on each figure.