Ed:

Enclosed please find some preliminary benchmark results we conducted last week. Prior to implementing the multi-threading benchmark model APT proposed, the major purpose of this preliminary benchmark is to verify on our SP2 the 20 MB/s TCP bandwidth you observed on Maui's SP2, and to compare the same basic point-to-point bandwidth of the TCP protocol and the native MPL protocol over IP on both wide and thin nodes. In summary, the results include:

1. Point-to-point communication peak bandwidth for the TCP protocol and the MPL protocol.
2. Bandwidth results on both wide and thin nodes.
3. For the TCP benchmark, peak performance of both a single socket connection and multiple socket connections is measured (the multi-socket test should be similar to what you used for your measurement on Maui's SP2). For the MPL benchmark, only a single blocking send/receive pair is used. In the MPL tests, bandwidth over both IP and "user space" is measured.
4. Different application buffer sizes are tested.
5. Latency is not measured in any of the tests.

Please let me know if you have any questions. Thanks.

--- Gang

Listed below are the detailed results:

(1) To understand the performance impact of different buffer sizes for different protocols, some basic benchmarks on a two-node SP2 were conducted. Best bandwidth observed when transferring 800 MB of data one way from one node to another via the HPS (a rough sketch of the single-socket TCP sender is given after the notes below):

Wide nodes (smerlin11 -> smerlin12)
--------------------------------------------------------------------------------
                                  IP                      user space
--------------------------------------------------------------------------------
TCP (single socket connection)    17511.8 KB/s (b=64KB)   -
MPL (single message-passing)      11328.8 KB/s (b=256KB)  34625.9 KB/s (b=4MB)

Thin nodes (smerlin7 -> smerlin8)
--------------------------------------------------------------------------------
                                  IP                      user space
--------------------------------------------------------------------------------
TCP (single socket connection)    17403 KB/s (b=64KB)     -
MPL (single message-passing)      10043.8 KB/s (b=2MB)    34401.6 KB/s (b=2MB)

Note:

1. Different buffer sizes gave quite different bandwidths, varying over a range of 1-2 MB/s. The values reported here (b=##) are the ones corresponding to the best bandwidth observed. All timings were obtained with exclusive (non-shared) use of both the node CPUs and the HPS adapter, so they should be accurate.

2. The default system send/receive buffer size for a TCP socket is 122880 bytes on the switch (64240 on the Ethernet), while the system-wide default maximum segment size (TCP_MAXSEG) is 61428 bytes on the switch (1448 on the Ethernet). It is not clear to me what system buffer size MPL uses for message passing.

3. There are several different kinds of SP2 nodes. While they all have the same peak performance (266 MFlops), they vary greatly in memory bandwidth. The three types of nodes are:
   <1> Thin nodes: 64-bit wide memory bus.
   <2> Wide nodes with 1 or 2 memory cards: 128-bit wide memory bus. (Each card can have an arbitrary amount of memory; it is the number of cards currently installed that matters.)
   <3> Wide nodes with 4 or 8 memory cards: 256-bit wide memory bus.
   Most Maui nodes are of the first type, and some are of the second type. NPAC's 4 wide nodes belong to the third type, which is the fastest of the three.
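Roughly, the single-socket TCP test amounts to a plain timed send loop of the following form. This is only a minimal sketch under my assumptions: the peer address (10.0.0.12), the port (5000), and the error handling are placeholders, not the actual benchmark code.

/* Minimal single-socket TCP bandwidth sender (sketch only; the peer address,
 * port, and buffer size below are placeholders, not the real benchmark code). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define TOTAL_BYTES (800L * 1024 * 1024)   /* 800 MB, as in the tests above    */
#define BUF_SIZE    (64 * 1024)            /* application buffer size (b=64KB) */

int main(void)
{
    char *buf = malloc(BUF_SIZE);
    struct sockaddr_in srv;
    struct timeval t0, t1;
    long sent = 0;
    double secs;
    int s = socket(AF_INET, SOCK_STREAM, 0);

    memset(buf, 0, BUF_SIZE);
    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_port = htons(5000);                    /* placeholder port           */
    srv.sin_addr.s_addr = inet_addr("10.0.0.12");  /* placeholder switch address */

    /* The system send buffer could be tuned here with
     * setsockopt(s, SOL_SOCKET, SO_SNDBUF, ...); note 2 above quotes the defaults. */

    if (connect(s, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
        perror("connect");
        return 1;
    }

    gettimeofday(&t0, NULL);
    while (sent < TOTAL_BYTES) {                   /* send the same buffer repeatedly */
        ssize_t n = write(s, buf, BUF_SIZE);
        if (n <= 0) { perror("write"); return 1; }
        sent += n;
    }
    gettimeofday(&t1, NULL);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%.1f KB/s\n", sent / 1024.0 / secs);
    close(s);
    free(buf);
    return 0;
}

A matching receiver simply accepts the connection and reads into a buffer of the same size until the connection closes.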
(2) Peak, average, and worst bandwidth observed for the multi-socket tests in TCP over the HPS (bandwidth in bytes/s):

Wide nodes (smerlin11 & 12)

#Connections  Buff Size  #Buffers  Bandwidth High  Bandwidth Low  Bandwidth Mean (over 5 runs)
------------  ---------  --------  --------------  -------------  ----------------------------
      1        64 KB      12800     18,075,141      17,775,015     17,912,739
     16        64 KB        800     16,346,350*     16,179,606*    16,251,104*
     32        64 KB        400     16,346,350*     16,131,379*    16,256,846*

* Some other processes could have been active while these results were taken.

#Connections  Buff Size  #Buffers  Bandwidth High  Bandwidth Low  Bandwidth Mean (over 9 runs)
------------  ---------  --------  --------------  -------------  ----------------------------
     32        64 KB        400     18,067,936      16,023,000     17,114,392

Thin nodes (smerlin7 & 8)

#Connections  Buff Size  #Buffers  Bandwidth High  Bandwidth Low  Bandwidth Mean (over 9 runs)
------------  ---------  --------  --------------  -------------  ----------------------------
      1        64 KB      12800     16,468,276      15,837,918     16,053,986
     16        64 KB        800     17,801,012      17,664,813     17,744,165
     32        64 KB        400     17,677,994      17,533,509     17,631,083

(3) Bandwidth of the multi-socket tests with different (application) buffer sizes (bandwidth in bytes/s):

Thin nodes (smerlin7 & 8)

#Connections  Buff Size  #Buffers  Bandwidth High  Bandwidth Low  Bandwidth Mean (over 9 runs)
------------  ---------  --------  --------------  -------------  ----------------------------
     32        60 KB       427      17,687,004      17,548,585     17,661,463
     32        60 KB       427      17,689,432      17,533,859     17,604,787
     32        61 KB       420      17,612,465      17,518,525     17,571,681
     32        62 KB       413      17,651,203      17,514,225     17,599,476
     32        63 KB       407      17,699,964      17,530,004     17,618,872
     32        64 KB       400      17,705,514      17,515,541     17,623,432
     32        65 KB       394      17,657,039      17,580,513     17,619,166
     32        66 KB       388      17,703,212      17,592,181     17,642,788
     32        67 KB       382      17,703,860      17,621,020     17,665,972
     32        68 KB       376      17,695,564      17,484,549     17,622,689

Note:

1. For the multi-socket benchmark, the bandwidth is measured using the following model (a sketch of the client side is given after these notes):
   (1) First, the server and client programs are started, each on a different node.
   (2) The server program listens on a port, waiting for socket connection requests from the client.
   (3) The client first forks #Connections child processes, each of which sends a connection request to the server host on that port (over the HPS).
   (4) Upon receiving each connection request from the client, the server forks a new child process with a unique socket connection to the client. At this point, one connection between the client (transmitter) and the server (receiver) has been created. All further data transfer is handled by the client child and the server child.
   (5) After all #Connections connections have been created (in the client child processes), all the client children wait for a 'start-send' signal from the client parent before starting the actual data transmission over their connections.
   (6) The client parent sends the 'start-send' signal to all its children. Upon receiving the signal, each child first records a 'start' time-stamp and then transfers the data by invoking the socket 'write' 800MB/(#Connections x buffer size) times, each time with the same buffer size (and the same buffer). After all the data have been transferred, each child records an 'end' time-stamp. Thus each client child has a 'start' time-stamp and an 'end' time-stamp. The client parent collects all the time-stamps (via shared memory) and takes the earliest 'start' time-stamp and the latest 'end' time-stamp to obtain the actual elapsed time spent sending the total 800 MB over the multiple connections. This is how a single run's bandwidth is calculated.

   This is the timing from the transmitter (client) side. Timing from the receiver (server) side is not as accurate, because after a connection is created and the 'start' time-stamp is recorded, the server has to wait a short period of time until the first blocks of data arrive from the client.

2. For each test, a number of runs (5 or 9) were conducted, and the high, low, and average bandwidths measured over those runs are reported here.
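To make the model above concrete, here is a rough sketch of the client (transmitter) side. It is only an illustration under my assumptions: the server address and port are placeholders, the 'start-send' signal is implemented with a pipe, and the time-stamps are collected through an anonymous shared mapping, which may differ from the mechanisms the actual benchmark used.

/* Sketch of the multi-socket client (transmitter) described in note 1 above.
 * The peer address, port, start-signal mechanism, and shared-memory choice
 * are placeholders, not necessarily what the actual benchmark used. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define NCONN     32                      /* #Connections                    */
#define BUF_SIZE  (64 * 1024)             /* application buffer size         */
#define TOTAL     (800L * 1024 * 1024)    /* 800 MB in total                 */
#define NBUFS     (TOTAL / (NCONN * (long)BUF_SIZE))   /* #Buffers per child */

struct stamp { struct timeval start, end; };

static int connect_to_server(void)
{
    struct sockaddr_in srv;
    int s = socket(AF_INET, SOCK_STREAM, 0);

    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_port = htons(5000);                    /* placeholder port           */
    srv.sin_addr.s_addr = inet_addr("10.0.0.12");  /* placeholder switch address */
    if (connect(s, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
        perror("connect");
        exit(1);
    }
    return s;
}

int main(void)
{
    /* Per-child time-stamps are returned to the parent via shared memory. */
    struct stamp *st = mmap(NULL, NCONN * sizeof(*st), PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    int readypipe[2], startpipe[2];  /* "connected" reports; 'start-send' broadcast */
    static char buf[BUF_SIZE];
    double t0, t1;
    char c;
    int i;

    pipe(readypipe);
    pipe(startpipe);

    for (i = 0; i < NCONN; i++) {
        if (fork() == 0) {                       /* client child i                   */
            long k;
            int s = connect_to_server();

            close(startpipe[1]);                 /* so the parent's close means EOF  */
            write(readypipe[1], "r", 1);         /* tell the parent we are connected */
            read(startpipe[0], &c, 1);           /* block until 'start-send' (EOF)   */

            gettimeofday(&st[i].start, NULL);    /* 'start' time-stamp               */
            for (k = 0; k < NBUFS; k++) {        /* same buffer sent every time      */
                ssize_t done = 0;
                while (done < BUF_SIZE) {
                    ssize_t n = write(s, buf + done, BUF_SIZE - done);
                    if (n <= 0) { perror("write"); exit(1); }
                    done += n;
                }
            }
            gettimeofday(&st[i].end, NULL);      /* 'end' time-stamp                 */
            close(s);
            exit(0);
        }
    }

    for (i = 0; i < NCONN; i++)                  /* wait until every child connected */
        read(readypipe[0], &c, 1);
    close(startpipe[1]);                         /* broadcast 'start-send'           */

    for (i = 0; i < NCONN; i++)
        wait(NULL);

    /* Elapsed time = latest 'end' minus earliest 'start' across all children. */
    t0 = st[0].start.tv_sec + st[0].start.tv_usec / 1e6;
    t1 = st[0].end.tv_sec   + st[0].end.tv_usec   / 1e6;
    for (i = 1; i < NCONN; i++) {
        double a = st[i].start.tv_sec + st[i].start.tv_usec / 1e6;
        double b = st[i].end.tv_sec   + st[i].end.tv_usec   / 1e6;
        if (a < t0) t0 = a;
        if (b > t1) t1 = b;
    }
    printf("aggregate bandwidth: %.0f bytes/s over %d connections\n",
           TOTAL / (t1 - t0), NCONN);
    return 0;
}

The server side is symmetric: it forks one child per accepted connection, and each server child simply reads into a buffer of the same size until its connection closes.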
(4) Conclusions

. The best bandwidth we got under TCP is about 18 MB/s. It is not clear what factors account for the 2 MB/s difference from what APT observed (20 MB/s).

. Using multiple socket connections does not improve the aggregated bandwidth.

. The best point-to-point bandwidth in the MPL benchmark over IP is about 11.2 MB/s. This is significantly slower than the point-to-point bandwidth over bare TCP sockets (18 MB/s).

. There is some bandwidth difference between wide and thin nodes, within a range of 1-2 MB/s.

. Some parameters, such as the buffer size, are important for obtaining the peak bandwidth in both the TCP and the MPL-over-IP tests. For example, in the MPL-over-IP tests it is very easy to obtain the 7-8 MB/s bandwidth reported by many other sources (and long quoted as the 'standard' MPL peak over IP), but the peak we obtained (11.2 MB/s) required a quite specific buffer size. Notably, the same parameter (buffer size) is much less sensitive in the MPL tests over 'user space': we can easily obtain 33-34 MB/s with quite different buffer sizes. This can be attributed to the overhead introduced by the IP protocol layers: on each send/receive, the application buffer has to be copied to/from the (UNIX) kernel system buffer for message segmentation and packetization before/after the actual data transfer occurs, whereas in the 'user space' protocol that segmentation and packetization is well optimized to take advantage of the HPS adapter's configuration. Thus, when the application buffer size does not fit the system buffer well, more overhead is incurred.