Ed:

Enclosed please find some preliminary benchmark results we conducted last week. Prior to implementing the multi-threading benchmark model APT proposed, the major purpose of this preliminary benchmark is to verify on our SP2 the 20 MB/s TCP bandwidth you observed on Maui's SP2, and to compare the same basic point-to-point bandwidth of the TCP protocol and the native MPL protocol over IP on both wide and thin nodes. In summary, the results include:

1. Point-to-point communication peak bandwidth for the TCP protocol and the MPL protocol.
2. Bandwidth results on both wide and thin nodes.
3. For the TCP benchmark, peak performance of both a single socket connection and multiple socket connections is measured (the multi-socket test should be similar to what you used for your measurement on Maui's SP2). For the MPL benchmark, only a single blocking send/receive pair is used. In the MPL tests, bandwidth over both IP and "user space" is measured.
4. Different application buffer sizes are tested.
5. Latency is not measured in any of the tests.

Please let me know if you have any questions. Thanks.

--- Gang

Listed below are the detailed results:

(1) To understand the performance impact of different buffer sizes for different protocols, some basic benchmarks on a two-node SP2 were conducted. Best bandwidth observed when transferring 800 MB of data one way from one node to another via the HPS (a rough sketch of the single-socket TCP sender is given after the notes below):

Wide nodes (smerlin11 -> smerlin12)
--------------------------------------------------------------------------------
                                  IP                      user space
--------------------------------------------------------------------------------
TCP (single socket connection)    17511.8 KB/s (b=64KB)   -
MPL (single message-passing)      11328.8 KB/s (b=256KB)  34625.9 KB/s (b=4MB)

Thin nodes (smerlin7 -> smerlin8)
--------------------------------------------------------------------------------
                                  IP                      user space
--------------------------------------------------------------------------------
TCP (single socket connection)    17403 KB/s (b=64KB)     -
MPL (single message-passing)      10043.8 KB/s (b=2MB)    34401.6 KB/s (b=2MB)

Note:

1. Different buffer sizes gave quite different bandwidths, varying over a range of 1-2 MB/s. The values reported here (b=##) are the ones corresponding to the best bandwidth observed. All timings were obtained with exclusive (non-shared) use of both the node CPUs and the HPS adapter, so they should be accurate.

2. The default system send/receive buffer size for a TCP socket is 122880 bytes on the switch (64240 on the Ethernet), while the system-wide default maximum segment size (TCP_MAXSEG) is 61428 bytes on the switch (1448 on the Ethernet). It is not clear to me what system buffer size MPL uses for message passing.

3. There are several different kinds of SP2 nodes. While they all have the same peak performance (266 MFlops), they vary greatly in memory bandwidth. The three types of nodes are:
   <1> Thin nodes: 64-bit wide memory bus.
   <2> Wide nodes with 1 or 2 memory cards: 128-bit wide memory bus. (Each card can have an arbitrary amount of memory; it is the number of cards currently installed that matters.)
   <3> Wide nodes with 4 or 8 memory cards: 256-bit wide memory bus.
   Most Maui nodes are of the first type, and some are of the second type. NPAC's 4 wide nodes belong to the third type, which is the fastest of the three.
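Roughly, the single-socket TCP test amounts to a plain timed send loop of the following form. This is only a minimal sketch under my assumptions: the peer address (10.0.0.12), the port (5000), and the error handling are placeholders, not the actual benchmark code.

/* Minimal single-socket TCP bandwidth sender (sketch only; the peer address,
 * port, and buffer size below are placeholders, not the real benchmark code). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define TOTAL_BYTES (800L * 1024 * 1024)   /* 800 MB, as in the tests above    */
#define BUF_SIZE    (64 * 1024)            /* application buffer size (b=64KB) */

int main(void)
{
    char *buf = malloc(BUF_SIZE);
    struct sockaddr_in srv;
    struct timeval t0, t1;
    long sent = 0;
    double secs;
    int s = socket(AF_INET, SOCK_STREAM, 0);

    memset(buf, 0, BUF_SIZE);
    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_port = htons(5000);                    /* placeholder port           */
    srv.sin_addr.s_addr = inet_addr("10.0.0.12");  /* placeholder switch address */

    /* The system send buffer could be tuned here with
     * setsockopt(s, SOL_SOCKET, SO_SNDBUF, ...); note 2 above quotes the defaults. */

    if (connect(s, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
        perror("connect");
        return 1;
    }

    gettimeofday(&t0, NULL);
    while (sent < TOTAL_BYTES) {                   /* send the same buffer repeatedly */
        ssize_t n = write(s, buf, BUF_SIZE);
        if (n <= 0) { perror("write"); return 1; }
        sent += n;
    }
    gettimeofday(&t1, NULL);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%.1f KB/s\n", sent / 1024.0 / secs);
    close(s);
    free(buf);
    return 0;
}

A matching receiver simply accepts the connection and reads into a buffer of the same size until the connection closes.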
(2) Peak, average, and worst bandwidth observed for the multi-socket tests in TCP over the HPS (bandwidth in bytes/s):

Wide nodes (smerlin11 & 12)

#Connections  Buff Size  #Buffers  Bandwidth High  Bandwidth Low  Bandwidth Mean (over 5 runs)
------------  ---------  --------  --------------  -------------  ----------------------------
      1        64 KB      12800     18,075,141      17,775,015     17,912,739
     16        64 KB        800     16,346,350*     16,179,606*    16,251,104*
     32        64 KB        400     16,346,350*     16,131,379*    16,256,846*

* Some other processes could have been active while these results were taken.

#Connections  Buff Size  #Buffers  Bandwidth High  Bandwidth Low  Bandwidth Mean (over 9 runs)
------------  ---------  --------  --------------  -------------  ----------------------------
     32        64 KB        400     18,067,936      16,023,000     17,114,392

Thin nodes (smerlin7 & 8)

#Connections  Buff Size  #Buffers  Bandwidth High  Bandwidth Low  Bandwidth Mean (over 9 runs)
------------  ---------  --------  --------------  -------------  ----------------------------
      1        64 KB      12800     16,468,276      15,837,918     16,053,986
     16        64 KB        800     17,801,012      17,664,813     17,744,165
     32        64 KB        400     17,677,994      17,533,509     17,631,083

(3) Bandwidth of the multi-socket tests with different (application) buffer sizes (bandwidth in bytes/s):

Thin nodes (smerlin7 & 8)

#Connections  Buff Size  #Buffers  Bandwidth High  Bandwidth Low  Bandwidth Mean (over 9 runs)
------------  ---------  --------  --------------  -------------  ----------------------------
     32        60 KB       427      17,687,004      17,548,585     17,661,463
     32        60 KB       427      17,689,432      17,533,859     17,604,787
     32        61 KB       420      17,612,465      17,518,525     17,571,681
     32        62 KB       413      17,651,203      17,514,225     17,599,476
     32        63 KB       407      17,699,964      17,530,004     17,618,872
     32        64 KB       400      17,705,514      17,515,541     17,623,432
     32        65 KB       394      17,657,039      17,580,513     17,619,166
     32        66 KB       388      17,703,212      17,592,181     17,642,788
     32        67 KB       382      17,703,860      17,621,020     17,665,972
     32        68 KB       376      17,695,564      17,484,549     17,622,689

Note:

1. For the multi-socket benchmark, the bandwidth is measured using the following model (a sketch of the client side is given after these notes):
   (1) First, the server and client programs are started, each on a different node.
   (2) The server program listens on a port, waiting for socket connection requests from the client.
   (3) The client first forks #Connections child processes, each of which sends a connection request to the server host on that port (over the HPS).
   (4) Upon receiving each connection request from the client, the server forks a new child process with a unique socket connection to the client. At this point, one connection between the client (transmitter) and the server (receiver) has been created. All further data transfer is handled by the client child and the server child.
   (5) After all #Connections connections have been created (in the client child processes), all the client children wait for a 'start-send' signal from the client parent before starting the actual data transmission over their connections.
   (6) The client parent sends the 'start-send' signal to all its children. Upon receiving the signal, each child first records a 'start' time-stamp and then transfers the data by invoking the socket 'write' 800MB/(#Connections x buffer size) times, each time with the same buffer size (and the same buffer). After all the data have been transferred, each child records an 'end' time-stamp. Thus each client child has a 'start' time-stamp and an 'end' time-stamp. The client parent collects all the time-stamps (via shared memory) and takes the earliest 'start' time-stamp and the latest 'end' time-stamp to obtain the actual elapsed time spent sending the total 800 MB over the multiple connections. This is how a single run's bandwidth is calculated.

   This is the timing from the transmitter (client) side. Timing from the receiver (server) side is not as accurate, because after a connection is created and the 'start' time-stamp is recorded, the server has to wait a short period of time until the first blocks of data arrive from the client.

2. For each test, a number of runs (5 or 9) were conducted, and the high, low, and average bandwidths measured over those runs are reported here.
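To make the model above concrete, here is a rough sketch of the client (transmitter) side. It is only an illustration under my assumptions: the server address and port are placeholders, the 'start-send' signal is implemented with a pipe, and the time-stamps are collected through an anonymous shared mapping, which may differ from the mechanisms the actual benchmark used.

/* Sketch of the multi-socket client (transmitter) described in note 1 above.
 * The peer address, port, start-signal mechanism, and shared-memory choice
 * are placeholders, not necessarily what the actual benchmark used. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define NCONN     32                      /* #Connections                    */
#define BUF_SIZE  (64 * 1024)             /* application buffer size         */
#define TOTAL     (800L * 1024 * 1024)    /* 800 MB in total                 */
#define NBUFS     (TOTAL / (NCONN * (long)BUF_SIZE))   /* #Buffers per child */

struct stamp { struct timeval start, end; };

static int connect_to_server(void)
{
    struct sockaddr_in srv;
    int s = socket(AF_INET, SOCK_STREAM, 0);

    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_port = htons(5000);                    /* placeholder port           */
    srv.sin_addr.s_addr = inet_addr("10.0.0.12");  /* placeholder switch address */
    if (connect(s, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
        perror("connect");
        exit(1);
    }
    return s;
}

int main(void)
{
    /* Per-child time-stamps are returned to the parent via shared memory. */
    struct stamp *st = mmap(NULL, NCONN * sizeof(*st), PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    int readypipe[2], startpipe[2];  /* "connected" reports; 'start-send' broadcast */
    static char buf[BUF_SIZE];
    double t0, t1;
    char c;
    int i;

    pipe(readypipe);
    pipe(startpipe);

    for (i = 0; i < NCONN; i++) {
        if (fork() == 0) {                       /* client child i                   */
            long k;
            int s = connect_to_server();

            close(startpipe[1]);                 /* so the parent's close means EOF  */
            write(readypipe[1], "r", 1);         /* tell the parent we are connected */
            read(startpipe[0], &c, 1);           /* block until 'start-send' (EOF)   */

            gettimeofday(&st[i].start, NULL);    /* 'start' time-stamp               */
            for (k = 0; k < NBUFS; k++) {        /* same buffer sent every time      */
                ssize_t done = 0;
                while (done < BUF_SIZE) {
                    ssize_t n = write(s, buf + done, BUF_SIZE - done);
                    if (n <= 0) { perror("write"); exit(1); }
                    done += n;
                }
            }
            gettimeofday(&st[i].end, NULL);      /* 'end' time-stamp                 */
            close(s);
            exit(0);
        }
    }

    for (i = 0; i < NCONN; i++)                  /* wait until every child connected */
        read(readypipe[0], &c, 1);
    close(startpipe[1]);                         /* broadcast 'start-send'           */

    for (i = 0; i < NCONN; i++)
        wait(NULL);

    /* Elapsed time = latest 'end' minus earliest 'start' across all children. */
    t0 = st[0].start.tv_sec + st[0].start.tv_usec / 1e6;
    t1 = st[0].end.tv_sec   + st[0].end.tv_usec   / 1e6;
    for (i = 1; i < NCONN; i++) {
        double a = st[i].start.tv_sec + st[i].start.tv_usec / 1e6;
        double b = st[i].end.tv_sec   + st[i].end.tv_usec   / 1e6;
        if (a < t0) t0 = a;
        if (b > t1) t1 = b;
    }
    printf("aggregate bandwidth: %.0f bytes/s over %d connections\n",
           TOTAL / (t1 - t0), NCONN);
    return 0;
}

The server side is symmetric: it forks one child per accepted connection, and each server child simply reads into a buffer of the same size until its connection closes.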
(4) Conclusions

. The best bandwidth we got under TCP is about 18 MB/s. It is not clear what factors account for the 2 MB/s difference from what APT observed (20 MB/s).

. Using multiple socket connections does not improve the aggregated bandwidth.

. The best point-to-point bandwidth in the MPL benchmark over IP is about 11.2 MB/s. This is significantly slower than the point-to-point bandwidth over bare TCP sockets (18 MB/s).

. There is some bandwidth difference between wide and thin nodes, within a range of 1-2 MB/s.

. Some parameters, such as the buffer size, are important for obtaining the peak bandwidth in both the TCP and the MPL-over-IP tests. For example, in the MPL-over-IP tests it is very easy to obtain the 7-8 MB/s bandwidth reported by many other sources (and long quoted as the 'standard' MPL peak over IP), but the peak we obtained (11.2 MB/s) required a quite specific buffer size. Notably, the same parameter (buffer size) is much less sensitive in the MPL tests over 'user space': we can easily obtain 33-34 MB/s with quite different buffer sizes. This can be attributed to the overhead introduced by the IP protocol layers: on each send/receive, the application buffer has to be copied to/from the (UNIX) kernel system buffer for message segmentation and packetization before/after the actual data transfer occurs, whereas in the 'user space' protocol that segmentation and packetization is well optimized to take advantage of the HPS adapter's configuration. Thus, when the application buffer size does not fit the system buffer well, more overhead is incurred.