A Functional-Link Neural Network in MPI
Building on earlier work of mine ("Implementation of a Neural Network on a
Data Parallel Computer", CIS 712/CIS 731 Semester Project, Fall 1995),
I chose to implement a functional-link type neural network architecture
in C + MPI. This type of network was introduced by Pao ("Adaptive Pattern
Recognition and Neural Networks", Addison-Wesley, 1989, pp. 197+); this
implementation differs in some ways from the original model:
- The network has one hidden layer. Theoretical results suggest that this
is sufficient for a wide range of problems, since the node functions are not
simple sigmoids, but higher-order functions.
- The weights of the connections between the input layer and the hidden
layer are fixed at 1. All input-layer nodes are connected
to all hidden-layer nodes; all of these are connected to all outputs.
- The original model starts out with a pool of node functions to choose
from and "retires" connections whose weights have very small absolute values,
on the theory that they are not part of the solution. This has proven hard
to achieve in practice and so is not part of this implementation. The program
restricts the choice of functions to linear combinations of the inputs,
quadratic functions, and sines/cosines.
- The outputs calculate the weighted sums of the node functions. A more
complex output function could be used, but is not necessary here: features of
the output function can be approximated rather accurately by the higher-order
node functions. Only one output is supported, but it would not be hard to
support an arbitrary number.
- The program supports two parameters: the learning rate and the
momentum. Both are standard techniques for neural networks. I use values
that have proven appropriate for most problems in my experience, 0.2 and 0.9,
respectively. The learning rate determines how much the newly calculated
weight change influences the weight; the momentum determines how much the
previously calculated weight change influences the weight. It helps keep
the learning process from getting stuck in local minima. (A small sketch of
this update rule follows after this list.)
- Two more parameters can be input at the start of the program: maxe,
the maximum allowed sum of the squares of the output errors over all training
samples, and maxep, the maximum allowed output error for any one sample.
E.g., if the desired output for a specific input sample is 1.0, the network
will consider anything within [1.0-maxep, 1.0+maxep] to be a success.
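
To make the last two points concrete, here is a minimal sketch of the output
calculation and the weight update. The names (NFUNC, fv, delta_w, and so on)
are illustrative assumptions, not taken from the actual program:

    /* Sketch of the output calculation and one weight update step.
       Assumes NFUNC node functions whose values for the current sample
       are already stored in fv[]; w[] are the weights to the output. */
    #define NFUNC 20
    double w[NFUNC];          /* weights from node functions to the output   */
    double delta_w[NFUNC];    /* previous weight changes (for the momentum)  */
    double fv[NFUNC];         /* node-function values for the current sample */
    const double lrate = 0.2, momentum = 0.9;

    double output(void)
    {
        double sum = 0.0;
        for (int i = 0; i < NFUNC; i++)     /* weighted sum of node functions */
            sum += w[i] * fv[i];
        return sum;
    }

    void update_weights(double error)       /* error = desired - actual */
    {
        for (int i = 0; i < NFUNC; i++) {
            /* new change = learning-rate term + momentum * previous change */
            double dw = lrate * error * fv[i] + momentum * delta_w[i];
            w[i]       += dw;
            delta_w[i]  = dw;
        }
    }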
Parallelization
Parallel execution occurs in terms of input samples. All available input
samples are evenly distributed among the processors; each processor determines
the weight changes necessitated by the input samples assigned to it. After
that, the weight changes have to be summed up over all input samples (which
means over all processors), and the sum communicated back to each CPU. This is
done by an MPI_Allreduce working on an array that contains all weight changes.
Now the actual change can be performed, taking learning rate and momentum
into account.
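
A minimal sketch of this reduction step (array names and sizes are assumptions
for illustration, not the actual source):

    #include <mpi.h>

    #define NWEIGHTS 20

    /* Each processor has accumulated its local weight-change contributions
       in dw_local[]; the global sum ends up in dw_global[] on every rank. */
    double dw_local[NWEIGHTS], dw_global[NWEIGHTS];

    void sum_weight_changes(void)
    {
        MPI_Allreduce(dw_local, dw_global, NWEIGHTS,
                      MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        /* dw_global[] now holds the sum over all processors and can be
           applied to the weights using learning rate and momentum. */
    }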
After each weight change, the remaining error of the resulting network
is checked. For this check, every CPU determines how many of its
input samples were not within maxep of their desired output
value, as well as the total error (sum of squared output errors) over its
samples. Both of these numbers are summed over all processors by
MPI_Allreduce. If for any sample the desired value was not reached, or if
the total error is larger than maxe, the weight-update/error-check cycle is
repeated.
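
One way to perform this check is to pack both quantities into a single
MPI_Allreduce, roughly as in the following sketch (the names are illustrative
assumptions, not the actual source):

    #include <mpi.h>

    /* Sketch of the error check: sum the local error and the local count
       of samples outside the maxep range over all processors, then decide
       whether training is finished. */
    int network_converged(double local_error, int local_misses, double maxe)
    {
        double local[2], global[2];

        local[0] = local_error;           /* sum of squared errors, this CPU */
        local[1] = (double) local_misses; /* samples outside maxep, this CPU */

        MPI_Allreduce(local, global, 2, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        /* converged only if no sample missed and the total error is small */
        return global[1] == 0.0 && global[0] <= maxe;
    }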
Besides MPI_Allreduce, MPI_Bcast is the only other MPI call used (in the
initialization phase when several user-supplied values have to be
communicated to all processors). No point-to-point communication is done.
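
For illustration, the initialization broadcast might look roughly like the
following sketch; init_session is the program's routine name, but the exact
signature and variable handling shown here are assumptions:

    #include <mpi.h>
    #include <stdio.h>

    /* Sketch of the initialization broadcast: rank 0 reads the parameters
       and sends them to all other processors. */
    void init_session(double *maxe, double *maxep)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            printf("maxe, maxep: ");
            scanf("%lf %lf", maxe, maxep);
        }
        MPI_Bcast(maxe,  1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        MPI_Bcast(maxep, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        /* (As noted later, these two broadcasts could be combined into one
           by packing both values into a two-element array.) */
    }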
The Input File
At program start the name of the file containing the training samples must
be input. This file should have the number of input nodes in its first line
and the number of samples in its second line. After that, one line per sample
should follow, giving the input values and the desired output value.
The following example file would train the network to calculate the bit
parity of 4 inputs, returning 0 as output if the number of "1"s among
the inputs is even, and 1 otherwise.
4
16
0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 1.0 1.0
0.0 0.0 1.0 0.0 1.0
0.0 0.0 1.0 1.0 0.0
0.0 1.0 0.0 0.0 1.0
0.0 1.0 0.0 1.0 0.0
0.0 1.0 1.0 0.0 0.0
0.0 1.0 1.0 1.0 1.0
1.0 0.0 0.0 0.0 1.0
1.0 0.0 0.0 1.0 0.0
1.0 0.0 1.0 0.0 0.0
1.0 0.0 1.0 1.0 1.0
1.0 1.0 0.0 0.0 0.0
1.0 1.0 0.0 1.0 1.0
1.0 1.0 1.0 0.0 1.0
1.0 1.0 1.0 1.0 0.0
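
A rough sketch of how a file in this format might be read (the function and
variable names are illustrative assumptions, not necessarily those used in
the actual program):

    #include <stdio.h>
    #include <stdlib.h>

    /* Read a training file in the format described above:
       line 1: number of inputs, line 2: number of samples,
       then one line per sample with the inputs followed by the
       desired output value. */
    int read_samples(const char *name, int *ninputs, int *nsamples,
                     double **inputs, double **desired)
    {
        FILE *f = fopen(name, "r");
        if (!f) return -1;

        fscanf(f, "%d %d", ninputs, nsamples);
        *inputs  = malloc(*nsamples * *ninputs * sizeof(double));
        *desired = malloc(*nsamples * sizeof(double));

        for (int s = 0; s < *nsamples; s++) {
            for (int i = 0; i < *ninputs; i++)
                fscanf(f, "%lf", &(*inputs)[s * *ninputs + i]);
            fscanf(f, "%lf", &(*desired)[s]);
        }
        fclose(f);
        return 0;
    }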
Speedup
To test parallel performance, a network was trained to calculate the even
bit parity of four binary inputs. The desired output was "0" for an even number
of "1"s among the inputs, and "1" otherwise. For the solution, 20 functions
were used, all of them linear combinations (or squares) of the inputs.
The network was trained with various numbers of input samples and on
1, 2, 4 and 8 processors. For comparison purposes, a plain C version of the
program was also timed. The number of iterations needed differed slightly
between training runs, since the weights are initialized randomly, but was
around 2000 for all runs. The network was trained with a maxe value of
0.000001 and a maxep value of 0.00001. All timings are in seconds on 150 MHz
DEC Alpha workstations.
As a comparison, another version was also timed, which checks for the
remaining error only after every tenth weight update, thus raising the ratio
of productive to unproductive calculation. This newer version also combines
the two MPI_Allreduce calls in the check_error subroutine into one.
(Something similar could be done in the init_session subroutine, by combining
maxe and maxep into a two-element array and thus saving one MPI_Bcast call.
But that happens during initialization, and is thus not time critical.)
                                         number of input samples
 number of processors            16     64    256   1024   4096  16384
 ----------------------------------------------------------------------
 sequential C,
 one check per one update         1s     3s    10s    40s   161s   687s
 ----------------------------------------------------------------------
          1                       1s     5s    19s    74s   298s  1245s
          2                       9s    11s    20s    45s   159s   561s
          4                      20s    20s    23s    35s    91s   289s
          8                      30s    30s    31s    36s    60s   165s
 ----------------------------------------------------------------------
 sequential C,
 one check per ten updates        1s     2s     7s    24s   116s   491s
 ----------------------------------------------------------------------
          1                       1s     5s    14s    53s   223s    ---
          2                       4s     5s     9s    26s   111s   421s
          4                       8s     9s     9s    19s    59s   208s
          8                      13s    14s    14s    18s    38s   112s
 ----------------------------------------------------------------------
Discussion of results
Comparison of the two versions of the program
In the newer version, instead of running check_error every time update_weights
is run, it is only called every tenth time. The major time consumers in
check_error (besides the MPI_Allreduce call) are probably the floating-point
multiplications, of which there are about 2 for every 5 in update_weights.
Skipping nine out of ten checks thus removes roughly 2 of every 7 units of
work, so we would expect a reduction in run time of about 2/7, i.e. roughly
30%. For large numbers of samples we are getting roughly that; for smaller
numbers the gain is larger, since omitting the check also omits its MPI call
in nine out of every ten update cycles, which makes a noticeable difference.
The program now has a RATIO constant that determines after how many calls of
update_weights check_error is called once. My trial runs correspond to RATIO
values of 1 and 10.
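
The corresponding main loop might look roughly like this sketch (the routine
signatures are assumed for illustration, not taken from the actual source):

    #define RATIO 10   /* check the error only every RATIO-th weight update */

    /* assumed prototypes -- the real routines take the network state */
    void update_weights(void);
    int  check_error(void);

    void train(void)
    {
        int iteration = 0, done = 0;

        while (!done) {
            update_weights();             /* one pass over the local samples */
            if (++iteration % RATIO == 0)
                done = check_error();     /* Allreduce of error and misses   */
        }
    }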
Speedup results
At least 1024 samples are needed to get speedups at all, and at least 4096 to
show decent gains. That is a comparatively large number (especially since for
this problem - 4 bit parity - there are only 16 distinct samples :-), but this
is not an ordinary neural network: The node functions are precalculated and
just looked up during runtime; replacing fv[i][j] by f(i,j) in update_weights
and check_error might result in better speedups - though longer runtimes.
Another reason might be that MPI_Allreduce is a rather complex, and thus
necessarily time-consuming, routine. It has to gather data from all processors,
sum it up, and distribute it back to all CPUs; so it might not be possible to
speed up the code further (besides raising the computation-to-communication ratio).
Source code and data