I have drawn some sketches below. I hope you can understand them.

1. Model for computation and I/O

          parallel compute nodes (upper applications/clients)

  ______          ______          ______          ______
 | cn1  |        | cn2  |        | cn3  |        | cn4  |
 |______|        |______|        |______|        |______|
      \               \            /               /
       \               \          /               /
 |          my I/O base (consists of I/O nodes)          |
       /               /          \               \
      /               /            \               \
  ______          ______          ______          ______
 | disk1|        | disk2|        | disk3|        | disk4|
 |______|        |______|        |______|        |______|

(Note: the choice of four compute nodes and four disks is arbitrary.)

2. Mapping global matrix data structures onto scattered disks.

(1) Assume that I need to maintain a matrix A(4x4) as below. In general,
it can be any matrix A(mxn).

A(4x4) =  [  a11   a12   a13   a14 ]
          [  a21   a22   a23   a24 ]
          [  a31   a32   a33   a34 ]
          [  a41   a42   a43   a44 ]

(2) Appearance of the logical file (linear -- row-major for example)

|  a11 a12 a13 a14 a21 a22 a23 a24 a31 a32 a33 a34 a41 a42 a43 a44 |

(3) Normal method of partitioning the above global data structure among
local disks.

       disk1               disk2               disk3               disk4
 _________________   _________________   _________________   _________________
| a11 a12 a13 a14 | | a21 a22 a23 a24 | | a31 a32 a33 a34 | | a41 a42 a43 a44 |
|_________________| |_________________| |_________________| |_________________|
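
To make the row-major mapping in (2) and (3) concrete, here is a minimal
sketch. The function names are my own, and I assume that the number of
matrix elements divides evenly among the disks, with each disk receiving
one contiguous chunk of the linear file.

```python
def row_major_offset(i, j, n_cols):
    """Position of element (i, j) (0-based) in the linear row-major file."""
    return i * n_cols + j


def disk_of_offset(offset, total, n_disks):
    """Disk index (0-based) when the file is cut into equal contiguous chunks.

    Assumption: total is a multiple of n_disks.
    """
    chunk = total // n_disks
    return offset // chunk


def row_major_layout(n_rows, n_cols, n_disks):
    """Return, for each disk, the element names it stores, in file order."""
    layout = [[] for _ in range(n_disks)]
    for i in range(n_rows):
        for j in range(n_cols):
            off = row_major_offset(i, j, n_cols)
            d = disk_of_offset(off, n_rows * n_cols, n_disks)
            layout[d].append(f"a{i + 1}{j + 1}")
    return layout


for d, elems in enumerate(row_major_layout(4, 4, 4), start=1):
    print(f"disk{d}:", " ".join(elems))
```

For the 4x4 example on four disks, each disk ends up holding exactly one
row, as in the sketch above.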

(4) Potential disadvantages of the above normal method (partition according to
  For instance, a compute node (application) requires the matrix elements
  by column, while the data are stored in row-major order among the
  scattered disks.

                            compute node

    ____                    _____                   _____                   ____
  _| a11|                  | a12 |                 | a13 |                 |a14 |
 | |____|                  |_____|                 |_____|                 |____|
 | | a21| -----            | a22 |                 | a23 |                 |a24 |
 | |____|      \           |_____|                 |_____|                 |____|
 | | a31| ------\----      | a32 |                 | a33 |                 |a34 |
 | |____|        \   \     |_____|                 |_____|                 |____|
 | | a41| --------\---\--- | a42 |                 | a43 |                 |a44 |
 | |____|          \   \  \|_____|                 |_____|                 |____|
 |                  \   \  \
 |                   \   \__\________________
 \                    \      \_______________\ _______________________
 _\________________   _\_________________    _\________________     _________________
| a11 a12 a13 a14  | | a21 a22 a23 a24   |  | a31 a32 a33 a34  |   | a41 a42 a43 a44 |
|__________________| |___________________|  |__________________|   |_________________|


It is obvious that, in order to meet the client's requirements, we have to
do 4x4 = 16 transfers of data (each of the 4 columns needs one element
from each of the 4 disks).
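
The count can be checked with a small sketch. Here I assume that one
"transfer" means one disk touched while serving one column request, and
that whole rows are dealt out to the disks in equal contiguous groups.
The function names are hypothetical, chosen only for this illustration.

```python
def row_partition_owner(i, j, n_rows=4, n_disks=4):
    """Disk (0-based) holding element (i, j) under the row-wise partition.

    Assumption: n_rows is a multiple of n_disks. The column index j does
    not matter here because a whole row lives on a single disk.
    """
    return i // (n_rows // n_disks)


def column_read_transfers(n_rows, n_cols, owner):
    """Total transfers needed to fetch every column, one column at a time."""
    total = 0
    for j in range(n_cols):
        # Each distinct disk touched by this column costs one transfer.
        disks_touched = {owner(i, j) for i in range(n_rows)}
        total += len(disks_touched)
    return total


print(column_read_transfers(4, 4, row_partition_owner))  # prints 16
```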

Of course, the above example is an extremely bad one.
However, I want to devise a method which is eclectic, or suitable, in most
cases.

I will describe my method in the next mail soon.

(5) Mapping global matrix data among disks in "BLOCKS".

Still, assume that I need to maintain a matrix A(4x4) as below. 

A(4x4) =  [  a11   a12   a13   a14 ]
          [  a21   a22   a23   a24 ]
          [  a31   a32   a33   a34 ]
          [  a41   a42   a43   a44 ]

I can divide it into blocks as follows:

          [  a11   a12  |   a13   a14 ]
          [  a21   a22  |   a23   a24 ]             block1  |  block2
                        |                                   | 
    ____________________|__________________      ___________|_____________
                        |                                   |  
          [  a31   a32  |   a33   a34 ]             block3  |  block4
          [  a41   a42  |   a43   a44 ]                     | 

Appearance of the logical file (linear -- according to "blocks")

       block1          block2          block3          block4
|  a11 a12 a21 a22 a13 a14 a23 a24 a31 a32 a41 a42 a33 a34 a43 a44  |
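
The block-ordered linear file above can be generated as in the following
sketch. It is only an illustration under my own assumptions: blocks are
visited left to right and then top to bottom, and the elements inside a
block are stored in row-major order. The name block_order is hypothetical.

```python
def block_order(n_rows, n_cols, block_rows, block_cols):
    """List the (i, j) indices (0-based) in block-by-block file order."""
    order = []
    for bi in range(0, n_rows, block_rows):          # block row
        for bj in range(0, n_cols, block_cols):      # block column
            for i in range(bi, min(bi + block_rows, n_rows)):
                for j in range(bj, min(bj + block_cols, n_cols)):
                    order.append((i, j))
    return order


names = [f"a{i + 1}{j + 1}" for i, j in block_order(4, 4, 2, 2)]
print(" ".join(names))
```

For the 4x4 matrix with 2x2 blocks this prints the same sequence as the
sketch: a11 a12 a21 a22 a13 a14 a23 a24 a31 a32 a41 a42 a33 a34 a43 a44.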

"Blocks" method of partitioning the above global data structures between 
local disks.

       disk1               disk2               disk3               disk4
 _________________   _________________   _________________   _________________
| a11 a12 a21 a22 | | a13 a14 a23 a24 | | a31 a32 a41 a42 | | a33 a34 a43 a44 |
|_________________| |_________________| |_________________| |_________________|

  For instance, a compute node (application) requires the matrix elements
  by column, while the data are stored in blocks among the scattered disks.

                            compute node

    ____                    _____                   _____                   ____
   | a11|                  | a12 |                 | a13 |                 |a14 |
  {|____|                 {|_____|                 |_____|                 |____|
 /{| a21|                /{| a22 |                 | a23 |                 |a24 |
 | |____|               |  |_____|                 |_____|                 |____|
 | | a31|               |  | a32 |                 | a33 |                 |a34 |
 | |____|}              |  |_____|                 |_____|                 |____|
 | | a41|}              |  | a42 |                 | a43 |                 |a44 |
 | |____|               |  |_____|                 |_____|                 |____|
 |         _____________|
 |________/_    \                 
 ___/____/__\____\__  ___________________    __________________     _________________
| a11 a12  a21  a22| | a13 a14 a23 a24   |  | a31 a32 a41 a42  |   | a33 a34 a43 a44 |
|__________________| |___________________|  |__________________|   |_________________|


It is obvious that, in order to meet the client's requirements, we only
have to do 4x2 = 8 transfers of data (each of the 4 columns now touches
only 2 disks).
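
The two layouts can be compared with a small sketch. As before, I assume
that one "transfer" means one disk touched while serving one row or
column request, that the matrix is square, and that the blocks form a
square grid with one block per disk. All names are hypothetical.

```python
def row_owner(i, j, n=4, n_disks=4):
    """Disk (0-based) of element (i, j) when whole rows go to disks."""
    return i // (n // n_disks)


def block_owner(i, j, n=4, grid=2):
    """Disk (0-based) of element (i, j) for a grid x grid block layout."""
    b = n // grid                      # side length of one block
    return (i // b) * grid + (j // b)


def count_transfers(owner, n, by):
    """Transfers to read the whole matrix one column (or row) at a time."""
    total = 0
    for k in range(n):
        if by == "column":
            cells = [(i, k) for i in range(n)]
        else:
            cells = [(k, j) for j in range(n)]
        # One transfer per distinct disk touched by this request.
        total += len({owner(i, j) for i, j in cells})
    return total


for name, owner in (("rows", row_owner), ("blocks", block_owner)):
    print(name,
          "column reads:", count_transfers(owner, 4, "column"),
          "row reads:", count_transfers(owner, 4, "row"))
```

Under these assumptions the row-wise layout needs 16 transfers for column
reads but only 4 for row reads, while the block layout needs 8 for either
access pattern. In that sense the block layout is a compromise: it is
never the best case, but it avoids the worst case.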

How can I give a formal proof of my idea to show that it is

I have to leave now because the lab will be closed until 2:00 pm this 


