A Functional-Link Neural Network in MPI

by Ulf Dittmer


Building on earlier work of mine ("Implementation of a Neural Network on a Data Parallel Computer", CIS 712/CIS 731 Semester Project, Fall 1995), I chose to implement a functional-link neural network architecture in C + MPI. This type of network was introduced by Pao ("Adaptive Pattern Recognition and Neural Networks", Addison-Wesley, 1989, pp. 197+); this implementation differs in some ways from the original model.

Parallelization

Parallel execution occurs in terms of input samples. All available input samples are distributed evenly among the processors; each processor determines the weight changes necessitated by the input samples assigned to it. After that, the weight changes have to be summed up over all input samples (which means over all processors), and the sum communicated back to each CPU. This is done by an MPI_Allreduce working on an array that contains all weight changes. Now the actual weight change can be performed, taking learning rate and momentum into account.
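
The core of this step could look roughly like the following sketch. The names (w, dw, last_dw, eta for the learning rate, alpha for the momentum term, nweights) are illustrative and not necessarily those used in the program:

#include <mpi.h>
#include <stdlib.h>

/* Sum the locally computed weight changes over all processors and
   apply them to the weights, taking learning rate and momentum into
   account.  All names here are illustrative. */
void apply_weight_changes(double *w, double *dw, double *last_dw,
                          int nweights, double eta, double alpha)
{
    double *global_dw = malloc(nweights * sizeof(double));
    int i;

    /* one MPI_Allreduce over the whole array of weight changes */
    MPI_Allreduce(dw, global_dw, nweights, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    for (i = 0; i < nweights; i++) {
        double delta = eta * global_dw[i] + alpha * last_dw[i];
        w[i]       += delta;
        last_dw[i]  = delta;
    }
    free(global_dw);
}

Since every processor receives the same summed weight changes, all CPUs perform identical weight updates, so their copies of the network stay consistent without further communication.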

After each weight change the remaining error of the resulting network is checked. For this check, every CPU determines how many of its input samples were not within maxep of their desired output value, as well as the summed-up total error over all of its samples. Both numbers are summed over all processors by MPI_Allreduce. If for any sample the desired value was not reached, or if the total error is larger than maxe, the weight-update/error-check cycle is repeated.
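
A sketch of this check, with illustrative names (local_err is this CPU's summed error over its samples, local_misses the number of its samples that were not within maxep of the desired output):

#include <mpi.h>

/* Returns 1 if another weight-update/error-check cycle is needed.
   local_err and local_misses cover only this CPU's share of the
   samples; maxe is the user-supplied limit on the total error. */
int keep_training(double local_err, int local_misses, double maxe)
{
    double total_err;
    int    total_misses;

    MPI_Allreduce(&local_err, &total_err, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);
    MPI_Allreduce(&local_misses, &total_misses, 1, MPI_INT, MPI_SUM,
                  MPI_COMM_WORLD);

    return (total_misses > 0) || (total_err > maxe);
}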

Besides MPI_Allreduce, MPI_Bcast is the only other MPI call used (in the initialization phase when several user-supplied values have to be communicated to all processors). No point-to-point communication is done.
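
For the initialization phase, the broadcast of the user-supplied values could look roughly like this (the function name and root rank are assumptions):

#include <mpi.h>

/* Rank 0 reads the user-supplied parameters and broadcasts them to
   all processors; every value gets its own MPI_Bcast call here. */
void broadcast_parameters(double *maxe, double *maxep)
{
    MPI_Bcast(maxe,  1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(maxep, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
}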


The Input File

At program start the name of the file containing the training samples must be entered. This file should have the number of input nodes in the first line and the number of samples in the second line. After that, one line per sample should follow, specifying the input values and the desired output value. The following example file would train the network to calculate the bit parity of 4 inputs, returning 0 as output if the number of "1"s among the inputs is even, and 1 otherwise. A sketch of code for reading this format is shown after the example.
4
16
0.0  0.0  0.0  0.0    0.0
0.0  0.0  0.0  1.0    1.0
0.0  0.0  1.0  0.0    1.0
0.0  0.0  1.0  1.0    0.0
0.0  1.0  0.0  0.0    1.0
0.0  1.0  0.0  1.0    0.0
0.0  1.0  1.0  0.0    0.0
0.0  1.0  1.0  1.0    1.0
1.0  0.0  0.0  0.0    1.0
1.0  0.0  0.0  1.0    0.0
1.0  0.0  1.0  0.0    0.0
1.0  0.0  1.0  1.0    1.0
1.0  1.0  0.0  0.0    0.0
1.0  1.0  0.0  1.0    1.0
1.0  1.0  1.0  0.0    1.0
1.0  1.0  1.0  1.0    0.0
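
Reading a file in this format could be done roughly as follows; the function name, the memory layout and the (missing) error handling are illustrative only:

#include <stdio.h>
#include <stdlib.h>

/* Read a training file in the format described above: the number of
   input nodes, the number of samples, then one line per sample with
   the inputs followed by the desired output. */
int read_samples(const char *name, int *ninputs, int *nsamples,
                 double ***inputs, double **targets)
{
    FILE *fp = fopen(name, "r");
    int i, j;

    if (fp == NULL)
        return -1;
    fscanf(fp, "%d", ninputs);
    fscanf(fp, "%d", nsamples);

    *inputs  = malloc(*nsamples * sizeof(double *));
    *targets = malloc(*nsamples * sizeof(double));
    for (i = 0; i < *nsamples; i++) {
        (*inputs)[i] = malloc(*ninputs * sizeof(double));
        for (j = 0; j < *ninputs; j++)
            fscanf(fp, "%lf", &(*inputs)[i][j]);
        fscanf(fp, "%lf", &(*targets)[i]);
    }
    fclose(fp);
    return 0;
}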


Speedup

To test parallel performance, a network was trained to calculate the even bit parity of four binary inputs. The desired output was "0" for an even number of "1"s among the inputs, and "1" otherwise. For the solution, 20 functions were used, all of them linear combinations (or squares) of the inputs. The network was trained with various numbers of input samples and on 1, 2, 4 and 8 processors. For comparison purposes, a plain (sequential) C version of the program was also timed. The number of iterations needed differed slightly between training runs, as the weights are initialized randomly, but was around 2000 for all runs. The network was trained with a maxe value of 0.000001 and a maxep value of 0.00001. All timings are in seconds on 150 MHz DEC Alpha workstations.

As a comparison, another version was also timed, which checks the remaining error only after every tenth weight update, thus raising the ratio of productive to unproductive computation. This newer version also combines the two MPI_Allreduce calls in the check_error subroutine into one. (Something similar could be done in the init_session subroutine, by combining maxe and maxep into a two-element array and thus saving one MPI call. But that happens during initialization and is therefore not time critical.)
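
The combined reduction could look roughly like the following sketch (not the actual check_error code); packing both values into a two-element array keeps the whole check down to a single MPI_Allreduce:

#include <mpi.h>

/* Combine the total error and the miss count into one two-element
   reduction instead of two separate MPI_Allreduce calls.  The miss
   count is carried as a double so that one datatype suffices. */
int keep_training_combined(double local_err, int local_misses, double maxe)
{
    double local[2], global[2];

    local[0] = local_err;
    local[1] = (double) local_misses;
    MPI_Allreduce(local, global, 2, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    return (global[1] > 0.0) || (global[0] > maxe);
}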

 number of      |             number of input samples
 processors     |     16     64    256   1024   4096  16384
----------------+--------------------------------------------
 sequential C,  |
 one check per  |     1s     3s    10s    40s   161s   687s
 one update     |
----------------+--------------------------------------------
       1        |     1s     5s    19s    74s   298s  1245s
       2        |     9s    11s    20s    45s   159s   561s
       4        |    20s    20s    23s    35s    91s   289s
       8        |    30s    30s    31s    36s    60s   165s
----------------+--------------------------------------------
 sequential C,  |
 one check per  |     1s     2s     7s    24s   116s   491s
 ten updates    |
----------------+--------------------------------------------
       1        |     1s     5s    14s    53s   223s    ---
       2        |     4s     5s     9s    26s   111s   421s
       4        |     8s     9s     9s    19s    59s   208s
       8        |    13s    14s    14s    18s    38s   112s
----------------+--------------------------------------------


Discussion of results

Comparison of the two versions of the program

In the newer version, check_error is no longer run after every call of update_weights, but only after every tenth one. The major time consumers in check_error (besides the MPI_Allreduce call) are probably the FP multiplications, of which there are about 2 for every 5 in update_weights. Skipping nine out of ten checks therefore saves close to 2 of every 7 multiplications per cycle, so we would expect a speed increase of roughly 2/7 ~= 30%. For large numbers of samples we are getting roughly that; for smaller numbers the gain is larger, since omitting one MPI call in most update cycles makes a noticeable difference there.
The program now has a RATIO constant that determines after how many calls of update_weights check_error is called once; the two trial runs above correspond to RATIO values of 1 and 10.
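
The resulting training loop looks roughly like this sketch (the loop structure is illustrative; update_weights and check_error stand for the program's routines of the same names):

#define RATIO 10   /* check the error once per RATIO weight updates */

extern void update_weights(void);
extern int  check_error(void);   /* returns 1 while training must continue */

void train(void)
{
    int iteration = 0;
    int keep_going = 1;

    while (keep_going) {
        update_weights();
        iteration++;
        if (iteration % RATIO == 0)
            keep_going = check_error();
    }
}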

Speedup results

At least 1024 samples are needed to get any speedup at all, and at least 4096 to show decent gains. That is a comparatively large number (especially since for this problem - 4-bit parity - there are only 16 distinct samples :-), but this is not an ordinary neural network: the node functions are precalculated and just looked up at runtime; replacing the lookup fv[i][j] by an on-the-fly evaluation f(i,j) in update_weights and check_error might result in better speedups - though longer overall runtimes.
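
As a sketch of that difference (the actual node functions in the program are linear combinations or squares of the inputs; the bodies below are placeholders, not the program's real f(i,j)):

/* Precalculated: fv[i][j] holds node function j evaluated on sample i. */
double node_value_lookup(double **fv, int i, int j)
{
    return fv[i][j];
}

/* On-the-fly: recompute function j on sample i's inputs every time.
   This trades table lookups for floating-point work, raising the
   computation-to-communication ratio.  Placeholder body only. */
double node_value_recompute(double **in, int i, int j, int ninputs)
{
    double sum = 0.0;
    int k;

    for (k = 0; k < ninputs; k++)
        sum += in[i][k];                 /* e.g. a linear combination ... */
    return (j % 2) ? sum : sum * sum;    /* ... or its square */
}
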
Another reason might be that MPI_Allreduce is a rather complex, and thus necessarily time-consuming, routine. It has to gather data from all processors, sum it up, and distribute it back to all CPUs; so it might not be possible to speed up the code further (besides raising the computation-to-communication ratio).

Source code and data