Proposed Experiments for NPAC (Addendum):
-------- ----------- --- ---- ----------

Continuous Data Streaming on the SP2

This memo describes an experiment in high-performance SP2 communications that can be done at NPAC. We're not able to carry out this study here at APT because we don't have access to an SP2. The experiment is intended to closely model the data communications capabilities required for data mining applications in which huge volumes of data are to be processed. In general the experiment requires that multiple communicating processes run on each node of the SP2.

The four communications paradigms to be explored are those that will support multiple communicating processes per node:

1) MPL/MPI (over IP using HSS)
2) PVMe (over user-space HSS)
3) UDP/IP datagrams over HSS (with a reliability layer atop UDP)
4) TCP/IP over HSS

(Note: It is important to understand that TCP/IP is the most attractive protocol to APT because it is portable. We will move to protocols 1, 2, or 3 only if there is a very significant performance improvement.)

The experiment is to have 2 or more processes on each node of the SP2. We assign letters to these processes on each node, A, B, C, ... up to Z. (There need not be 26 processes total.) We call process A the first process and process Z the last process, and there is an implied ordering from A to Z. The experiment is described graphically in the accompanying diagram.

The first process on each node sequentially reads a data file on a local disk. This data file is large, e.g., 200 Mbytes. The first process takes blocks of this data and distributes them at random to the B processes on each of the other nodes, including the B process on the same node. The B process on each node randomly merges together all incoming streams of data from the A processes on this and other nodes, then randomly distributes the blocks of data to the C processes on this and other nodes. The C process is identical to the B process, except that it takes data from B processes and sends it on to D processes, and so on. The data eventually finds its way to the last process, which writes it out to a local disk file. (The total size of these output disk files should be the same as the total size of the input disk files.)

The above communication scenario should be run for block sizes of 4k, 8k, 32k, 64k, 256k, and 1M, on 1, 2, 4, 8, and 12 SP2 nodes, with 2, 4, 8, 16, and 32 processes on each node. In addition, the experiments should involve both real disk files as described above and a synthetic data set in which the first process simply generates the data (all zeros) and the last process simply drops the data. This eliminates the disk bottleneck from the system.

From the structure of the above experiment it should become clear why we are interested in using TCP. There are some important things to watch out for. We expect that the lack of reverse flow control in the connectionless protocols may result in very poor behavior; e.g., if the last process on each node can deliver data to disk with bandwidth B, then each of the k inbound connections from a preceding node can send at most bandwidth B/k on average. Using TCP, one can expect this bandwidth limitation to propagate backward all the way to the first processes on each node, which will consume data from disk at roughly the rate B/k. Using connectionless protocols, we may find that the network bogs down with retransmissions of data that cannot be accepted fast enough.
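To make the forwarding step and the back-pressure concern concrete, here is a minimal sketch (not part of the proposal itself) of the loop an intermediate process (B, C, ...) might run in the TCP/IP case. All connection setup is assumed to have happened elsewhere; upstream[], downstream[], NUP, NDOWN, BLOCK, and relay() are placeholder names invented for this illustration, and blocks are assumed to be exactly BLOCK bytes long. The point of interest is that write() on a TCP socket blocks when the receiver is slow, which is exactly the reverse flow control discussed above.

    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/select.h>

    #define NUP    4             /* inbound TCP connections (from upstream)  */
    #define NDOWN  4             /* outbound TCP connections (to downstream) */
    #define BLOCK  (64 * 1024)   /* one of the block sizes to be studied     */

    int upstream[NUP];           /* connected sockets, set up elsewhere */
    int downstream[NDOWN];

    /* Read exactly n bytes; returns n on success, 0 on EOF, -1 on error. */
    static ssize_t read_block(int fd, char *buf, size_t n)
    {
        size_t got = 0;
        while (got < n) {
            ssize_t r = read(fd, buf + got, n - got);
            if (r <= 0)
                return r;
            got += (size_t)r;
        }
        return (ssize_t)n;
    }

    /* Write exactly n bytes; blocks when the receiver is slow, which is
       where TCP's back-pressure shows up. */
    static int write_block(int fd, const char *buf, size_t n)
    {
        size_t put = 0;
        while (put < n) {
            ssize_t w = write(fd, buf + put, n - put);
            if (w < 0)
                return -1;
            put += (size_t)w;
        }
        return 0;
    }

    void relay(void)
    {
        char buf[BLOCK];
        int  i, open_up = NUP;

        while (open_up > 0) {
            fd_set rset;
            int    maxfd = -1;

            FD_ZERO(&rset);
            for (i = 0; i < NUP; i++) {
                if (upstream[i] < 0)
                    continue;
                FD_SET(upstream[i], &rset);
                if (upstream[i] > maxfd)
                    maxfd = upstream[i];
            }
            if (select(maxfd + 1, &rset, NULL, NULL, NULL) < 0)
                break;

            /* "Random merge": service whichever upstream sockets are ready. */
            for (i = 0; i < NUP; i++) {
                if (upstream[i] < 0 || !FD_ISSET(upstream[i], &rset))
                    continue;
                if (read_block(upstream[i], buf, BLOCK) <= 0) {
                    close(upstream[i]);      /* this upstream is finished */
                    upstream[i] = -1;
                    open_up--;
                    continue;
                }
                /* Random distribution: forward the block to one downstream. */
                write_block(downstream[rand() % NDOWN], buf, BLOCK);
            }
        }

        for (i = 0; i < NDOWN; i++)
            close(downstream[i]);
    }

In the first (A) process the read would come from the local disk file instead of a socket, and in the last (Z) process the write would go to the local output file; otherwise the structure is the same.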
This experiment is intended to highlight these problems and to let us measure whether TCP solves them, whether it provides adequate performance, and over what scalability range. I should also mention that the transputer folks have been saying for years that connection-oriented/virtual-channel protocols are important for providing scalability and performance. I tend to think they've caught on to a very important idea.
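On the measurement side, a sketch of what each last process might do (again illustrative only; now_seconds() and report_bandwidth() are invented names): take a wall-clock timestamp before and after its receive/write loop and report delivered bandwidth. Aggregating these figures across nodes, block sizes, and processes per node gives the scalability curves we are after.

    #include <stdio.h>
    #include <sys/time.h>

    /* Wall-clock time in seconds, using gettimeofday(). */
    static double now_seconds(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
    }

    /* Call with timestamps taken before and after the transfer loop and
       the total number of bytes written to the local output file. */
    void report_bandwidth(double t0, double t1, double bytes)
    {
        double mbytes = bytes / (1024.0 * 1024.0);
        printf("delivered %.1f Mbytes in %.2f s (%.2f Mbytes/s)\n",
               mbytes, t1 - t0, mbytes / (t1 - t0));
    }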