There are classically two ways to improve the performance of computers. The first is to improve the performance of a single processor: for example, raising the integration density of a chip shortens the signal paths inside it and so increases the internal signal transmission rate, improving its performance. This technique, however, is ultimately limited by the propagation speed of light. The second is to exploit the parallelism of the problem to be solved, designing parallel algorithms and parallel architectures to improve overall performance; this approach is widely used in current research to speed up computation.

Let us illustrate the latter kind of performance gain with one of the most frequently used computations, matrix multiplication. Say we have two matrices, one of l rows by m columns and the other of m rows by n columns, and we want to compute their product, a matrix of l rows by n columns. On a single processor the calculation is normally performed with three nested loops, so the running time is proportional to l*m*n. In the parallel algorithm, the processors can be thought of as arranged in a ring topology: the first matrix is distributed by rows (`block' distribution) and the second matrix by columns (`block-cyclic' distribution) across the processors. Each processor then repeatedly (1) multiplies its local rows of the first matrix by the columns of the second matrix it currently holds, and (2) passes those columns along the ring to its neighbour, until the original columns return to the processor that first held them. In this way the computation time is reduced to roughly l*m*n/p plus the communication time, where p is the number of processors. To improve performance further, the programmer should try to increase the computation-to-communication ratio of the program.
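To make the ring algorithm concrete, here is a minimal, single-process Python sketch. It simulates the p processors with plain loops rather than a real message-passing library such as MPI, so the rotation of the `owner' list below stands in for the column blocks being passed along the ring. The function names (`matmul_seq', `ring_matmul') and the assumption that the block sizes divide evenly are ours, for illustration only.

```python
import random

def matmul_seq(A, B):
    """Reference three-nested-loop multiplication: time proportional to l*m*n."""
    l, m, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(l)]
    for i in range(l):
        for j in range(n):
            for k in range(m):
                C[i][j] += A[i][k] * B[k][j]
    return C

def ring_matmul(A, B, p):
    """Simulate p processors arranged in a ring.

    A (l x m) is split into p row blocks, B (m x n) into p column blocks.
    In each of p steps every processor multiplies its row block of A with
    the column block of B it currently holds, then forwards that block to
    its ring neighbour; after p steps every B block has visited every
    processor and the full l x n product is assembled.
    """
    l, m, n = len(A), len(B), len(B[0])
    assert l % p == 0 and n % p == 0, "this sketch assumes evenly divisible blocks"
    rb, cb = l // p, n // p                       # row/column block sizes
    A_blocks = [A[r * rb:(r + 1) * rb] for r in range(p)]
    B_blocks = [[row[c * cb:(c + 1) * cb] for row in B] for c in range(p)]
    owner = list(range(p))                        # which B block each processor holds
    C = [[0.0] * n for _ in range(l)]
    for _ in range(p):                            # p multiply-then-forward steps
        for proc in range(p):                     # runs concurrently on real hardware
            c = owner[proc]
            for i in range(rb):
                for j in range(cb):
                    C[proc * rb + i][c * cb + j] = sum(
                        A_blocks[proc][i][k] * B_blocks[c][k][j] for k in range(m))
        owner = owner[-1:] + owner[:-1]           # pass blocks along the ring
    return C

if __name__ == "__main__":
    random.seed(0)
    l, m, n, p = 4, 3, 6, 2
    A = [[random.random() for _ in range(m)] for _ in range(l)]
    B = [[random.random() for _ in range(n)] for _ in range(m)]
    C1, C2 = matmul_seq(A, B), ring_matmul(A, B, p)
    assert all(abs(C1[i][j] - C2[i][j]) < 1e-12 for i in range(l) for j in range(n))
```

On real hardware the inner loop over `proc' would execute concurrently, one iteration per processor, and the rotation of `owner' would be an actual send/receive between ring neighbours; the p multiply-then-forward steps are exactly what the l*m*n/p-plus-communication-time estimate above counts.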