There are classically two ways to improve the performance of computers. The first is to improve the performance of a single processor: for example, raising the integration density of a chip shortens the signal paths inside it and so increases the internal signal transmission rate, improving its performance. This technique, however, is ultimately limited by the propagation speed of light. The second is to exploit the parallelism of the problem to be solved, designing parallel algorithms and parallel architectures to improve overall performance; this approach is widely used in current research to speed up computation.

Let us illustrate the latter kind of performance gain with one of the most frequently used computations, matrix multiplication. Say we have two matrices, one of l rows by m columns and the other of m rows by n columns, and we want to compute their product, a matrix of l rows by n columns. On a single processor the calculation is normally performed with three nested loops, so the running time is proportional to l*m*n. In the parallel algorithm, the processors can be thought of as arranged in a ring topology: the first matrix is distributed by rows (`block' distribution) and the second matrix by columns (`block-cyclic' distribution) across the processors. Each processor then repeatedly (1) multiplies its local rows of the first matrix by the columns of the second matrix it currently holds, and (2) passes those columns along the ring to its neighbour, until the original columns return to the processor that first held them. In this way the computation time is reduced to roughly l*m*n/p plus the communication time, where p is the number of processors. To improve performance further, the programmer should try to increase the computation-to-communication ratio of the program.
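To make the ring algorithm concrete, here is a minimal, single-process Python sketch. It simulates the p processors with plain loops rather than a real message-passing library such as MPI, so the rotation of the `owner' list below stands in for the column blocks being passed along the ring. The function names (`matmul_seq', `ring_matmul') and the assumption that the block sizes divide evenly are ours, for illustration only.

```python
import random

def matmul_seq(A, B):
    """Reference three-nested-loop multiplication: time proportional to l*m*n."""
    l, m, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(l)]
    for i in range(l):
        for j in range(n):
            for k in range(m):
                C[i][j] += A[i][k] * B[k][j]
    return C

def ring_matmul(A, B, p):
    """Simulate p processors arranged in a ring.

    A (l x m) is split into p row blocks, B (m x n) into p column blocks.
    In each of p steps every processor multiplies its row block of A with
    the column block of B it currently holds, then forwards that block to
    its ring neighbour; after p steps every B block has visited every
    processor and the full l x n product is assembled.
    """
    l, m, n = len(A), len(B), len(B[0])
    assert l % p == 0 and n % p == 0, "this sketch assumes evenly divisible blocks"
    rb, cb = l // p, n // p                       # row/column block sizes
    A_blocks = [A[r * rb:(r + 1) * rb] for r in range(p)]
    B_blocks = [[row[c * cb:(c + 1) * cb] for row in B] for c in range(p)]
    owner = list(range(p))                        # which B block each processor holds
    C = [[0.0] * n for _ in range(l)]
    for _ in range(p):                            # p multiply-then-forward steps
        for proc in range(p):                     # runs concurrently on real hardware
            c = owner[proc]
            for i in range(rb):
                for j in range(cb):
                    C[proc * rb + i][c * cb + j] = sum(
                        A_blocks[proc][i][k] * B_blocks[c][k][j] for k in range(m))
        owner = owner[-1:] + owner[:-1]           # pass blocks along the ring
    return C

if __name__ == "__main__":
    random.seed(0)
    l, m, n, p = 4, 3, 6, 2
    A = [[random.random() for _ in range(m)] for _ in range(l)]
    B = [[random.random() for _ in range(n)] for _ in range(m)]
    C1, C2 = matmul_seq(A, B), ring_matmul(A, B, p)
    assert all(abs(C1[i][j] - C2[i][j]) < 1e-12 for i in range(l) for j in range(n))
```

On real hardware the inner loop over `proc' would execute concurrently, one iteration per processor, and the rotation of `owner' would be an actual send/receive between ring neighbours; the p multiply-then-forward steps are exactly what the l*m*n/p-plus-communication-time estimate above counts.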