Attachment 1. APT Experiments on the Maui SP2 System

The test run at Maui consists of running two processes on different nodes. One process acts as the receiver and the other as the sender. The processes create some number of sockets connecting the two (typically between 1 and 1000), and the sender sends data over these sockets to the receiver. The net throughput is measured.

The parameters that can be configured are: whether the process will be sender or receiver (the receiver must be started first), how many connections will be made per port, how many different ports will be listened on (the total number of connections is the product of these two), the blocksize of the data transfers, the number of blocks transferred, and the port number to be listened on (defaults to 10000, but this may be varied if port 10000 is already in use).

A sample command line of this test:

    rsh $receiver -n './multisocket -r -c 32 -b 65536 -B 400 -p 10007' &
    sleep 10
    rsh $sender -n "./multisocket -t -c 32 -b 65536 -B 400 -p 10007 $rreceiver" &
    wait

In this case, $receiver is the ethernet name of the receiving node, $sender is the ethernet name of the sending node, and $rreceiver is the switch name of the receiving node. The processes will make 32 connections (using a single listening port) and exchange 400 blocks of 64 Kbytes on each connection (the total number of bytes is 32 * 65536 * 400, or 800 Mbytes). The receiver will listen on port 10007 for the initial connections.

Using these parameters, we have achieved over 20 million bytes per second of throughput:

    Send:    et 40.151204 bytes 838860800 Mbytes/sec 20.892544 0.0u 31.0s 162im 0sw
    Receive: et 40.163916 bytes 838860800 Mbytes/sec 20.885932 0.0u 36.0s 160im 0sw
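The multisocket source is not reproduced in this memo. For reference, the following is a minimal sketch, in C, of the kind of sender loop the -t side presumably runs, with the -c/-b/-B/-p options hard-coded to the values used above. The round-robin interleaving across connections and all names here are assumptions, not the actual APT implementation.

    /*
     * Hypothetical sketch (not the actual multisocket source) of the sender
     * side: open NCONN TCP connections to the receiver and push NBLOCKS
     * blocks of BLOCKSIZE bytes over each, round-robin across connections.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    #define NCONN     32        /* -c 32    */
    #define BLOCKSIZE 65536     /* -b 65536 */
    #define NBLOCKS   400       /* -B 400   */
    #define PORT      10007     /* -p 10007 */

    int main(int argc, char **argv)
    {
        int fd[NCONN];
        int i, b;
        char *buf = calloc(1, BLOCKSIZE);
        struct hostent *h;
        struct sockaddr_in addr;

        if (argc < 2 || (h = gethostbyname(argv[1])) == NULL) {
            fprintf(stderr, "usage: sender receiver-switch-name\n");
            exit(1);
        }
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(PORT);
        memcpy(&addr.sin_addr, h->h_addr, h->h_length);

        /* Open all connections to the single listening port first. */
        for (i = 0; i < NCONN; i++) {
            fd[i] = socket(AF_INET, SOCK_STREAM, 0);
            if (connect(fd[i], (struct sockaddr *) &addr, sizeof(addr)) < 0) {
                perror("connect");
                exit(1);
            }
        }

        /* Stream NBLOCKS blocks over each connection (32*65536*400 bytes). */
        for (b = 0; b < NBLOCKS; b++) {
            for (i = 0; i < NCONN; i++) {
                int off = 0, n;
                while (off < BLOCKSIZE) {          /* write() may be partial */
                    n = write(fd[i], buf + off, BLOCKSIZE - off);
                    if (n < 0) {
                        perror("write");
                        exit(1);
                    }
                    off += n;
                }
            }
        }

        for (i = 0; i < NCONN; i++)
            close(fd[i]);
        return 0;
    }

The receiving (-r) side would be symmetric: accept NCONN connections on the listening port, read until each connection has delivered NBLOCKS blocks, and time the transfer.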
Attachment 2. Proposed Experiments for NPAC (Addendum): Continuous Data Streaming on the SP2

This memo describes an experiment in high-performance SP2 communications that can be done at NPAC. We're not able to carry out this study here at APT because we don't have access to an SP2. The experiment is intended to closely model the data communications capabilities required for data mining applications in which huge volumes of data are to be processed. In general the experiment requires that multiple communicating processes run on each node of the SP2.

The four communications paradigms to be explored are those that will support multiple communicating processes per node:

1) MPL/MPI (over IP using HSS)
2) PVMe (over user-space HSS)
3) UDP/IP datagrams over HSS (with a reliability layer atop UDP)
4) TCP/IP over HSS

(Note: It is important to understand that TCP/IP is the most attractive protocol to APT because it is portable. We will move to protocols 1, 2, or 3 only if there is a very significant performance improvement.)

The experiment is to have 2 or more processes on each node of the SP2. We assign letters to these processes on each node: A, B, C, ... up to Z. (There need not be 26 processes total.) We call process A the first process and process Z the last process, and there is an implied ordering from A to Z. The experiment is described graphically in the accompanying diagram.

The first process on each node sequentially reads a data file which is on a local disk. This data file is large, e.g., 200 Mbytes. The first process then takes blocks of this data and distributes them at random to the B processes on each of the other nodes, including the B process on this same node. The B process on each node randomly merges together all incoming streams of data from A processes on this and other nodes, then it randomly distributes the blocks of data to the C processes on this and other nodes. The C process is identical to the B process, except that it takes data from B processes and sends the data on to D processes, and so on. The data eventually finds its way to the last process, which writes the data out to a local disk file. (The total size of these output disk files should be the same as the total size of the input disk files.)

The above communication scenario should be run for block sizes of 4K, 8K, 32K, 64K, 256K, and 1M bytes, on 1, 2, 4, 8, and 12 SP2 nodes, with 2, 4, 8, 16, and 32 processes on each node. In addition, the experiments should involve both real disk files as described above and a synthetic data set, where the first process just generates the data (all zeros) and the last process just drops the data; this eliminates the disk bottleneck from the system.

From the structure of the above experiment it should become clear why we are interested in using TCP. There are some important things to watch out for. We expect that the lack of reverse flow control in the connectionless protocols may result in very poor behavior: e.g., if the last process on each node can deliver data to disk with bandwidth B, then each of the k inbound connections from the nodes preceding it can send at a bandwidth of at most B/k on average. Using TCP, one can expect this bandwidth limitation to propagate backward all the way to the first processes on each node, which will consume data from disk at roughly the rate B/k. Using connectionless protocols, we may find that the network bogs down with retransmissions of data that cannot be accepted fast enough. This experiment is intended to highlight these problems and to allow us to measure whether TCP solves them, whether it provides adequate performance, and over what scalability range.

I should also mention that the transputer folks have been saying for years that connection-oriented/virtual-channel protocols are important to provide scalability and performance. I tend to think they've caught on to a very important idea.

To the proposed experiment I would add a scenario in which the communication between some of the adjacent process groups is kept entirely local (i.e., instead of each A process sending data to the B processes on all of the nodes, have each A process send data only to the B process that resides on the same node). Also, when using TCP/IP I suggest that the connection between two processes on the same node be a Unix domain socket, rather than an Internet domain socket. This should have a large effect on the case described above.
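The memo does not include code for the proposed experiment; the following is a minimal sketch, in C, of what one middle-stage process (a B or C process) might look like under the TCP/IP paradigm. The function name relay and the arrays in_fd[] and out_fd[] are assumptions: the inbound and outbound stream sockets are taken as already connected (AF_INET, or AF_UNIX for a peer on the same node, per the suggestion above). It uses select() to merge whichever inbound streams have data ready and forwards each block to a randomly chosen outbound connection, as the experiment prescribes.

    /*
     * Hypothetical sketch (not from the memo) of one middle-stage process
     * (B, C, ...): merge blocks arriving on nin connected inbound stream
     * sockets and forward each block to one of nout outbound stream
     * sockets chosen at random.
     */
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/select.h>

    /* Read exactly len bytes; returns 1 on success, 0 on EOF, -1 on error. */
    static int read_block(int fd, char *buf, int len)
    {
        int off = 0, n;
        while (off < len) {
            n = read(fd, buf + off, len - off);
            if (n <= 0)
                return (n == 0) ? 0 : -1;
            off += n;
        }
        return 1;
    }

    /* Write exactly len bytes; returns 0 on success, -1 on error. */
    static int write_block(int fd, const char *buf, int len)
    {
        int off = 0, n;
        while (off < len) {
            n = write(fd, buf + off, len - off);
            if (n < 0)
                return -1;
            off += n;
        }
        return 0;
    }

    /*
     * in_fd[0..nin-1] and out_fd[0..nout-1] are assumed to be connected
     * stream sockets (AF_INET, or AF_UNIX for peers on the same node).
     * blocksize is one of the sizes to be swept (4K ... 1M).
     */
    void relay(int *in_fd, int nin, int *out_fd, int nout, int blocksize)
    {
        char *buf = malloc(blocksize);
        int open_count = nin;
        int i, maxfd;
        fd_set rset;

        while (open_count > 0) {
            FD_ZERO(&rset);
            maxfd = -1;
            for (i = 0; i < nin; i++) {
                if (in_fd[i] < 0)
                    continue;                /* this inbound stream has ended */
                FD_SET(in_fd[i], &rset);
                if (in_fd[i] > maxfd)
                    maxfd = in_fd[i];
            }
            if (select(maxfd + 1, &rset, NULL, NULL, NULL) < 0)
                break;

            for (i = 0; i < nin; i++) {
                if (in_fd[i] < 0 || !FD_ISSET(in_fd[i], &rset))
                    continue;
                if (read_block(in_fd[i], buf, blocksize) <= 0) {
                    close(in_fd[i]);         /* EOF or error: drop this stream */
                    in_fd[i] = -1;
                    open_count--;
                    continue;
                }
                /* Random redistribution to the next process group. */
                write_block(out_fd[rand() % nout], buf, blocksize);
            }
        }
        free(buf);
    }

A real driver would also need the connection-setup stage, an end-of-data handshake toward the next process group once all inbound streams have closed, and the timing instrumentation needed to report per-stage throughput.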