The High Performance Switch and its Programming Interfaces on IBM SP2

Gang Cheng, Marek Podgorny (gcheng,marek@npac.syr.edu)
Northeast Parallel Architectures Center
Syracuse University

1. Introduction

The High Performance Switch (HPS) is the central component of the IBM 9076 SP2 system: it provides the high-bandwidth, low-latency communication subsystem of the SP2. Based on our experience using the SP2, and on material collected from other references, this report describes the HPS and the programming interfaces built on it. We focus on the transport protocols, message passing libraries, programming interfaces and software layers, as well as the hardware configuration of the HPS. This is a review report whose major purpose is to give software developers a general yet detailed picture of the HPS and of the parallel programming environments available on the SP2. The report will also serve as a base understanding of the SP2 for NPAC in designing technical strategies for benchmarking the SP2 communication subsystem for APT's data mining software and applications. Some internals of the HPS hardware configuration and its supporting software are given to help in understanding the HPS from the architectural point of view.

2. Background information about HPS on SP1 and SP2

The IBM SPx has undergone several significant upgrades during the past two years, since its introduction to the market in 1992. The literature uses a number of terms that can be confusing, as some of them apply to only one of the SPx platforms. Here is what we found:

SP1: EUI (TB0), EUIH (also referred to as MPL/p), HPS Adapter-1
SP2: MPL, HPS Adapter-2, User Space (uCSSCI)

(1) Terms used for SP1

EUI is the reference name of the transport protocol on the HPS of the SP1. It was also the mnemonic for the first release of MPL that was available on the SP1. EUI-H is an experimental upgrade of EUI that was later renamed to become MPL in the SP2 environment. EUI-H was also called the light-weight or lightspeed EUI; it was a research tool that IBM later discontinued on the SP2. HPS Adapter-1 was shipped with the SP1.

(2) Terms used for SP2

MPL (Message Passing Library) is the component of the IBM AIX Parallel Environment that enables the user to write parallel applications. As its name indicates, the MPL routines enable message passing between individual tasks of a parallel application, both for communication of data and for synchronization of the tasks' operations. The enhanced HPS Adapter-2 is new to the SP2 system. The two communication modes supported on the SP2 and in MPL are AIX sockets with TCP/IP, and a new high-performance user space library. We describe the major differences between the HPS on SP2 and SP1, and the transport protocols supported, in later sections.

3. Hardware Configuration Features of HPS on SP2

The HPS on SP2 provides the internal (hardwired) message passing fabric that connects all of the SP processors together, allowing any-to-any internode connection and letting all processors send messages simultaneously. It is a packet-switched (as opposed to circuit-switched), multistage network to which switches can be added as the system is scaled upward. The peak bandwidth of the overall HPS is 40 MB/s, bidirectional. The HPS supports a multi-user environment: multiple jobs may run simultaneously over the switch, so no single user monopolizes it. With built-in error detection, it supports path redundancy, which permits routes to be generated even when there are faulty components (e.g., nodes) in the system.
When the HPS on SP2 is referred to in general, we are actually talking about three major components: the Switch Adapters, the Communication Links, and the Switch Board, as illustrated in Figure 1 for the interconnection of two nodes. They perform different functions in node-network and network-network (switching) communications, and they have different peak bandwidths and constraints due to electrical and physical configuration limits. The overall 40 MB/s HPS peak bandwidth is the minimum of the three, set by the communication links and the switch board.

Figure 1. Diagram of HPS components connecting two nodes

  ------------                    --------------                    ------------
 |   Node 1   |  communication   |              |  communication   |   Node 2   |
 |  (CPU,RAM) |      link        | Switch Board |      link        |  (CPU,RAM) |
 | HPS Adapter|<---------------->|   40 MB/s    |<---------------->| HPS Adapter|
 | 20-80 MB/s |     40 MB/s      |              |     40 MB/s      | 20-80 MB/s |
  ------------                    --------------                    ------------

Communication Link: All communication in SP2 uses point-to-point bidirectional communication links that consist of two channels, each carrying data in opposite directions to their respective output and input port pairs. For each channel, the actual transmission is via a set of 10 signal lines: 8 for data, one for the tag and one for the token. A major enhancement of the SP2 Switch is the incorporation of clock distribution into the switch communication links. In the SP1 Switch, clock distribution is accomplished via a separate clock redrive network; in SP2, each link is additionally capable of carrying redriven clock signals in both directions.

Switch Adapters: The SP2 switch adapter card connects a processor's Micro Channel to both the output and input ports of the switch network via an ASIC chip called the MSMU (Memory & Switch Management Unit). There are two versions: HPS Adapter-1 (40 MB/s) and HPS Adapter-2 (80 MB/s), with one adapter per SP node. SP2 networks are bidirectional multistage interconnection networks in which each communication link comprises two channels carrying data in opposite directions. For all SP configurations there are at least FOUR usable paths between each pair of nodes, except for node pairs that are directly attached to the same switch element.

Switch board: A switch board contains 8 logical switch chips, implemented as 16 physical chips for reliability reasons, with one switch board per SP frame. The switch operates at 40 MHz, providing a peak bandwidth of 40 MB/s over each of the two byte-wide channels of a communication link (40 MHz x 1 byte per cycle = 40 MB/s per channel). Its hardware latency is 500 ns.

Protocols:

. IP (Internet Protocol) - the default; permits shared usage of an HPS Adapter-1 or Adapter-2 by multiple processes.
. US CSS (User Space Communication Subsystem) - intended for parallel applications that require maximum communication performance. Only one process per node may use US communications. Any programming interface that makes use of Switch/US uses POE.
. LightSpeed (i.e., EUI-H on SP1; discontinued).

HPS Adapter-2 enhancements:

. an onboard microprocessor to offload protocol processing from the node CPU, reducing communication overhead
. error checking - message CRC generation and checking (on board)
. switch failure recovery
. multiplexing to permit simultaneous IP and US communications on the same node; HPS Adapter-1 requires that the adapter be dedicated to one or the other, not both

HPS on SP1: EUI and EUIH

EUIH is an experimental implementation of EUI providing:
. high performance (bandwidth and latency)
. high reliability (fault detection, network fault recovery)
. partition management for multiple users
. a single parallel task per compute node

The major performance difference between EUIH and EUI is the improved message-passing latency, as described in [4]:

Benchmark data for EUIH from IBM (RS/6000 model 370 nodes with TB0 adapters and the HPS):

There are two options in the EUIH implementation, which may have different message-passing latency: polling or interrupt-driven. The interrupt-driven implementation is advantageous for applications that use non-blocking (i.e., asynchronous) sends and receives. On the other hand, interrupts are relatively expensive, so performance may be worse, particularly if many short messages are communicated. Long messages generally suffer much less, as they typically trigger just one interrupt.

Performance data for the default implementation using polling:

. Message latency, measured from the time the source task issues an MP_BSEND of a zero-length message until the destination task returns from a previously issued matching MP_BRECV, is 30 microseconds.
. This latency is almost entirely accounted for by software path length (i.e., network latency is negligible); half of it is incurred on the sending side and half on the receiving side. Thus an MP_BSEND of a zero-byte message typically consumes some 15 microseconds.
. Effective bandwidth is about 8.5 MB/s (0.12 microseconds/byte). While data is being transferred at the maximum rate, both the sending processor and the receiving processor are fully occupied (i.e., there is no overlap).
. Thus, to a first approximation, message transport time is 50 microseconds plus 0.12 microseconds per byte transferred (e.g., an 8192-byte message costs roughly 50 + 0.12 x 8192, or about 1033 microseconds).
. Internally, each task-to-task virtual communication channel implemented by EUIH has an 8192-byte buffer. Therefore, at most 8192 bytes of data will be sent from a given source to a given destination before the destination posts a matching receive.

For the interrupt-driven implementation:

. Interrupts cost approximately 100 microseconds each. However, the interrupt-driven version of EUI-H attempts to minimize the number of interrupts. In particular, interrupts are disabled whenever the EUI-H code is executing; they are enabled only while user code is executing. So, for example, if a message arrives while a blocking receive (MP_BRECV) is executing, no interrupt will be generated at the destination. The latency of the interrupt-driven EUIH is slightly worse than the 30 microseconds quoted above; several additional microseconds are required for enabling and disabling interrupts.

4. HPS on SP2

. hardware specification: 63 microseconds latency, 35 MB/s bandwidth when using the "user space" option [1]

Switch communication using IP

IP communication is the default mode because it allows shared usage of an adapter by multiple processes on a node. Any application may communicate over the HPS by opening a standard AIX socket and specifying the appropriate IP addresses; a minimal sketch is given below. Additional system facilities based on IP, such as NFS and AFS, may also be configured to use the HPS. With the possible exception of applications that depend on network-specific functions (LAN broadcasts, for example), most applications work over the HPS without modification.
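To make the point concrete, here is a minimal sketch of IP communication over the switch using nothing but the ordinary BSD sockets API. There is no HPS-specific call anywhere; the only switch-specific element is the destination address, for which we assume a hypothetical host name "node01-css0" that resolves to the node's switch interface address (the actual naming convention is set by the SP2 system administrator).

    /* Sketch: IP over the HPS through an ordinary AIX (BSD) socket.
     * Nothing here is switch-specific except the host name, which is
     * assumed to resolve to the destination node's switch interface
     * ("node01-css0" is hypothetical; site conventions vary).       */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    int main(void)
    {
        struct hostent *hp;
        struct sockaddr_in addr;
        int sd;

        hp = gethostbyname("node01-css0");      /* switch interface   */
        if (hp == NULL) { fprintf(stderr, "unknown host\n"); return 1; }

        sd = socket(AF_INET, SOCK_STREAM, 0);   /* a standard socket  */
        if (sd < 0) { perror("socket"); return 1; }

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port   = htons(5000);          /* arbitrary port     */
        memcpy(&addr.sin_addr, hp->h_addr_list[0], hp->h_length);

        if (connect(sd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");                  /* server not shown   */
            return 1;
        }
        write(sd, "ping", 4);   /* this traffic now rides on the HPS  */
        close(sd);
        return 0;
    }

An unmodified socket application is pointed at the switch simply by using switch interface addresses instead of Ethernet addresses, which is exactly why IP-based facilities such as NFS and AFS can be configured over the HPS.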
Switch communication in User Space mode

The new user space communication mode, uCSSCI, is intended for parallel applications that require maximum communication performance: lowest latency and highest bandwidth. To achieve this higher performance, however, a node must be dedicated to one parallel application at a time. Only a single AIX process on a node may use the user space communication library, and nodes must be allocated to the parallel application through the Resource Manager by the Parallel Environment (including MPL, POE and pdbx) or by PVMe. The Resource Manager should group nodes into named resource pools to separate user space nodes from shared-usage nodes (this is configured by your SP2 system administrator).

Communication Subsystem Support (CSS)

The CSS component of AIX PSSP contains HPS adapter diagnostics (for both Adapter-1 and Adapter-2), switch initialization and fault handling software, the device driver and configuration methods (config/unconfig), and parallel application programming interfaces. At boot time, CSS configures the HPS adapter, including execution of a Power On Self Test (POST diagnostics). In addition to the POST diagnostics, a comprehensive set of adapter diagnostics is available on-line. CSS performs switch initialization as part of system startup, and automatic switch reconfiguration following a switch fault. SP2 nodes or frames which are not operable are removed from the configuration and reported through the System Data Repository to the Resource Manager and the SP System Monitor. Most faults and detected errors are retried successfully, transparently to the application.

Base support for parallel application task communication is provided by CSS and its Communication Interface (CSS_CI) component. Two communication libraries are provided by CSS_CI: network based and user based. In the network based (or IP) library, UDP/IP calls are made at the kernel level, supporting IP communication through Ethernet or HPS adapters. The user based (or user space) native communication library maximizes the performance of parallel applications, since communication occurs directly between the application and the SP2 HPS; no kernel calls are made. The IBM AIX Parallel Environment (including MPL, POE, VT and pdbx) and PVMe directly support this approach. Use of CSS_CI for both IP and user based communications provides improved application reliability through full message retry, and significantly improved performance through low-latency task-to-task communication (less than 40 microseconds) on both switch adapters. Communication bandwidth improvements are also possible through the HPS Adapter-2 (> 30 MB/s).

5. Programming interfaces on SP2

There are three layers of programming interfaces for message passing on SP2 using the HPS:

1) The lowest, high-performance level. MPL is the native message-passing library provided by the vendor. Like similar native libraries on other parallel machines, it is the most flexible and highest-performance layer supporting explicit message passing on SP2 and the HPS. Its sole purpose is to exploit the maximum bandwidth of the HPS in a concurrent program; portability to other platforms is not a consideration. Notice that MPL still offers a kind of portability, in the sense that one programming interface supports three different transport protocols (IP on non-HPS networks, IP on HPS, User Space on HPS) on an SP2 with Ethernet, token ring or HPS, or on an IBM RS/6000 workstation cluster. For applications requiring maximum communication performance, we recommend using MPL with the User Space option. A minimal MPL sketch follows.
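The sketch below shows the flavor of native MPL code in C: each task learns the task count and its own id, then task 0 sends a blocking message that task 1 receives. The C entry points mpc_environ, mpc_bsend and mpc_brecv are assumed here as the C counterparts of the Fortran MP_BSEND/MP_BRECV names quoted in section 3; their exact signatures should be verified against the AIX Parallel Environment manuals.

    /* Minimal MPL sketch (C binding assumed; verify the mpc_* names
     * and signatures against the Parallel Environment manuals).
     * Run under POE on two tasks.                                   */
    #include <stdio.h>

    int main(void)
    {
        int ntasks, taskid;
        int msg, src, type, nbytes;

        mpc_environ(&ntasks, &taskid);  /* task count and my task id  */

        if (taskid == 0) {
            msg = 42;
            mpc_bsend(&msg, sizeof(msg), 1, 99);  /* blocking send    */
        } else if (taskid == 1) {
            src = 0; type = 99;                   /* from task 0      */
            mpc_brecv(&msg, sizeof(msg), &src, &type, &nbytes);
            printf("task 1 received %d (%d bytes)\n", msg, nbytes);
        }
        return 0;
    }

Note that nothing in the source selects IP or User Space: that choice is made through POE at link or run time, as described in section 6 below.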
2) The portable, intermediate level. Standard message passing interfaces are developed either by the vendor or by third parties, with the major purpose of achieving highly portable programs across multiple parallel platforms. The major software of this kind:

. MPI - the message-passing interface specified by the MPI Forum. There are currently two implementations on SP2: MPI-F from IBM and MPICH from ANL/MSU.
. PVM - Parallel Virtual Machine. PVMe is the IBM implementation of PVM, which was originally developed at ORNL/UTK. It uses MPL underneath and thus can run under all three protocols.
. P4 - a portable message-passing package from ANL.
. Chameleon - another lightweight, portable message-passing system from ANL. It is built on MPL, MPI or P4.

Because portability and a generic message-passing framework are the first consideration, this layer pays a greater or lesser performance penalty when ported to a specific platform.

3) The highest level.

. HPF - on SP2, HPF uses the data-parallel model on the MIMD architecture to achieve concurrent programs with high portability, ease of programming, and implicit message passing through data parallelism. It often has to sacrifice some degree of performance, and is therefore not suitable for all applications or for building software tools.
. Fortran-M - a programming framework that supports task parallelism and a modular programming paradigm in sequential and parallel Fortran programs.

6. MPL (Message Passing Library) on SP2

MPL is the lowest-level parallel message passing library provided on the IBM SP2. It is a programming interface for converting a serial Fortran 77, C or C++ program into a parallel SP2 application via subroutine calls. MPL provides a rich and diverse set of subroutines for coding simple operations, such as task-to-task message passing, as well as the more advanced operations required for highly complex communications. At either compilation or execution, the user specifies whether parallel task communication in MPL occurs via the IP protocol through the Ethernet or HPS adapters (if configured for IP), or via User Space access to the HPS for maximum performance. Based on the user's selection, POE links (statically at compile time or dynamically at runtime) the corresponding communication library into the MPL program.

Like most commonly used message-passing libraries, such as MPI, PVM, P4 and CMMD, MPL contains the following major communication primitives:

. a point-to-point message passing library
. a collective communications library (CCL)
. synchronous and asynchronous communications (a sketch illustrating the asynchronous case follows the IP/US comparison below)

MPL (formerly EUI) is IBM's message-passing interface to the HPS. POE is required as the parallel operating environment to control the running of MPL programs. Any high-level programming interface that makes use of Switch/US uses POE.

What is the difference between IP and US communications?

MPL tasks can communicate with each other using either the Internet Protocol (IP) library or the User Space (US) library. Some differences between these two libraries:

. IP communication can occur over Ethernet or over the High Performance Switch; US communication can be conducted only over the High Performance Switch.
. Only one US task can use the switch adapter on a node. When the task starts, it reserves (dedicates) the adapter to itself; no other US task can use the adapter for the duration of the task which dedicated it.
. IP permits multiple tasks using IP communication to share the switch adapter on a node simultaneously. Additionally, IP will share the adapter with a US task.
. IP bandwidth is approximately 7-8 MB/s; US bandwidth is approximately 31-35 MB/s.
. IP is recommended for interactive nodes where multiple users are using a node at the same time. US is recommended when nodes are allocated by the batch system, or when nodes have been reserved for users performing benchmarking.
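As noted in the list of MPL primitives above, both synchronous (blocking) and asynchronous (non-blocking) communications are provided. The following sketch illustrates the asynchronous case in standard MPI (covered in section 8; MPL offers analogous non-blocking calls): the Isend/Irecv calls return immediately, computation proceeds while the messages are in flight, and the Wait calls block only when the data is actually needed.

    /* Sketch: overlapping computation with non-blocking message
     * passing, in standard MPI; MPL has analogous primitives.
     * Assumes exactly two tasks.                                   */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        double out[1024], in[1024];
        MPI_Request sreq, rreq;
        MPI_Status  sst, rst;
        int rank, peer, i;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        peer = 1 - rank;                        /* the other task    */

        for (i = 0; i < 1024; i++) out[i] = rank + i;  /* payload    */

        /* Post the communication first; these return immediately.  */
        MPI_Irecv(in,  1024, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &rreq);
        MPI_Isend(out, 1024, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &sreq);

        /* ... useful computation may proceed here, overlapped with
           the transfer ... */

        /* Block only when the received data is actually needed.    */
        MPI_Wait(&rreq, &rst);
        MPI_Wait(&sreq, &sst);

        MPI_Finalize();
        return 0;
    }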
How does MPL perform on the SP2?

IP communication over the High Performance Switch has a bandwidth of approximately 7-8 MB/s. US communication over the High Performance Switch has a bandwidth of approximately 31-35 MB/s. Bandwidth may be affected by competition from other users.

7. PVMe

PVMe is an IBM product which enables PVM users to exploit the User Space communication library of the SP2 High Performance Switch. This means that PVM codes can approach the performance of MPL codes running US over the switch. PVM codes do not need to be rewritten, or even recompiled, but simply relinked against the PVMe libraries instead of the usual PVM libraries.

How does PVM perform on the SP2?

Although the public domain (ORNL) PVM can communicate over the SP2 High Performance Switch, it cannot take advantage of the SP2 User Space communication library; PVM must communicate via the IP protocol. There is a significant difference in bandwidth between the two methods. MPL jobs running US over the High Performance Switch can achieve throughput in the 31-35 MB/s range; PVM running IP over the switch achieves between 1 and 2 MB/s, and PVM running IP over Ethernet usually half of that or less.

8. MPI (MPI-F, the IBM implementation; MPICH, the ANL implementation)

. MPI is a message-passing library interface specification (not a specific implementation).
. MPI was designed by a broad-based group of parallel computer vendors and incorporates features of other systems.
. MPI will eventually be an IBM product (and a product of other vendors as well).
. There are currently two efficient (MPL-based) implementations of MPI available on the SPx:
  . IBM Research (MPI-F)
  . ANL/MCS (MPICH)

9. Existing Benchmarks for SP2

1) NAS benchmarks for SP2 are available:
. Kernel benchmarks
  . Embarrassingly Parallel
  . 3D FFT PDE
  . Integer Sort
  . Conjugate Gradient
  . Multigrid
. Simulated application benchmarks
  . LU decomposition
  . Scalar pentadiagonal
  . Block tridiagonal

2) Benchmarks for SP2 nodes from IBM.

3) HPS benchmarking on SP1/SP2 from ANL: a simple "ping-pong" test between two nodes (a sketch of such a test is given at the end of this section). These tests use the MPL message passing library. The performance of IP over Ethernet, IP over the HPS, and User Space over the HPS is measured for both short messages (0-1200 bytes) and long messages (100000-1200000 bytes).

4) Other benchmarking results: performance of MPI message-passing primitives over a workstation cluster. Two basic kinds of communication performance are tested, point-to-point and collective; both bandwidth and latency are measured for these communication primitives.
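For reference, the skeleton of such a ping-pong test is sketched below in standard MPI (the ANL tests use MPL, but the structure is identical): task 0 bounces a message off task 1 many times; half of the average round-trip time gives the latency for zero-length messages and the bandwidth for long ones. The message size and repetition count are illustrative values.

    /* Sketch of a two-node ping-pong test in standard MPI (the ANL
     * tests use MPL; the structure is the same). Run on two tasks. */
    #include <mpi.h>
    #include <stdio.h>

    #define NBYTES 100000   /* message size under test (example)    */
    #define REPS   100      /* round trips averaged over (example)  */

    int main(int argc, char **argv)
    {
        static char buf[NBYTES];
        double t0, t1, rtt;
        MPI_Status st;
        int rank, i;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);       /* start tasks together  */
        t0 = MPI_Wtime();
        for (i = 0; i < REPS; i++) {
            if (rank == 0) {               /* ping ...              */
                MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
            } else {                       /* ... pong              */
                MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0) {
            rtt = (t1 - t0) / REPS;        /* average round trip    */
            printf("one-way %g s, bandwidth %g MB/s\n",
                   rtt / 2, NBYTES / (rtt / 2) / 1e6);
        }
        MPI_Finalize();
        return 0;
    }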
10. Methodologies for benchmarking HPS and the SP2 communication subsystem

(This is a separate section from the review part of this paper.) Based on the above introduction and on the message-passing programming interfaces available on SP2, there are four major benchmarking classifications:

1) Benchmarking at different programming levels

Different levels of programming interfaces present the end user or software developer with different programming capabilities in terms of ease of programming, problem expressiveness and declaration, and flexibility of control. There are always trade-offs among implementation performance, ease of programming, and portability. For a system software developer in an area such as data mining, the choice of message passing library on SP2 should weigh all these factors in the context of the specific system software under development.

From the performance benchmarking point of view, two levels of message-passing libraries on SP2 must be considered:

. The native message-passing library (MPL). This is the lowest level of abstraction of a programmable SP2 provided by the vendor. It should exploit the maximum performance of the underlying hardware and give the most flexible control over the SP2 system, though it is less expressive.
. The portable explicit message passing interfaces. Besides the native library, more and more vendors are starting to support one or more portable message passing interfaces on their machines, such as MPI, PVM, Linda and Express. In addition, there are active third-party implementations of these portable libraries on different platforms. It is preferable to choose the vendor's implementation, as it should deliver higher performance while the programming interface remains identical. On SP2, MPI-F and PVMe are the two choices.

2) Benchmarking different communication primitives

Basically, there are two types of communication service in a message-passing library: point-to-point and collective.

Point-to-point communication involves two communicating processes in the form of various modes of SEND and RECEIVE operations, which fall into the following categories:
. synchronous and asynchronous send/receive (also called blocking/non-blocking)
. synchronous and asynchronous data exchange (e.g., swap, shift, scan, etc.)
. contiguous and non-contiguous vector data

Collective communication involves a group of processors collectively performing an atomic (internally synchronized) group operation, such as broadcast, gather, scatter, reduction, multicast, or all-to-all broadcast. Collective communication consists of three subclasses: one-to-all, all-to-one, and all-to-all. (A sketch of timing one collective primitive is given at the end of this section.)

3) Benchmarking with different performance-related parameters

. message size, buffer size, number of buffers (or messages)
. machine size
. latency (message size = 0) vs. bandwidth
. number of primitive calls
. number of repeated iterations; maximum/minimum and average measurements
. protocols (IP/Ethernet, IP/Switch, US/Switch, wide vs. thin nodes, etc.)

4) Benchmarking application-specific communication patterns

In most situations, a specific application exhibits dominant, typical communication patterns in its message-passing operations. For example, a sort-oriented application using merge-sort operators may require frequent run-time redistribution of partially sorted data in an arbitrary number of dimensions under a given data decomposition scheme; this calls for benchmarking the algorithms and techniques developed to minimize the amount of data exchanged. For the APT data-mining applications, certain operations identified by APT/NPAC can be benchmarked before the actual development effort, to provide performance input for the system architects when comparing alternative algorithmic approaches; these benchmarks can be implemented using the performance results from 1)-3) above.
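As referenced in 2) above, here is a sketch of timing one collective primitive, again in standard MPI: it times a broadcast over a range of message sizes, and the knobs listed in 3) (message size, repetition count, machine size) appear explicitly. The same harness measures gather, scatter, reduction, etc. by substituting the collective call; the size and repetition constants are illustrative.

    /* Sketch: timing a collective primitive (broadcast) over a range
     * of message sizes, in standard MPI. Other collectives can be
     * measured by replacing the MPI_Bcast call.                     */
    #include <mpi.h>
    #include <stdio.h>

    #define MAXBYTES (1 << 20)   /* largest message (example value)  */
    #define REPS     50          /* repetitions per size (example)   */

    int main(int argc, char **argv)
    {
        static char buf[MAXBYTES];
        double t0, t1;
        int rank, nprocs, nbytes, i;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);   /* machine size    */

        for (nbytes = 1; nbytes <= MAXBYTES; nbytes *= 4) {
            MPI_Barrier(MPI_COMM_WORLD);          /* synchronize     */
            t0 = MPI_Wtime();
            for (i = 0; i < REPS; i++)
                MPI_Bcast(buf, nbytes, MPI_CHAR, 0, MPI_COMM_WORLD);
            t1 = MPI_Wtime();
            if (rank == 0)
                printf("%3d procs %8d bytes: %g s per broadcast\n",
                       nprocs, nbytes, (t1 - t0) / REPS);
        }
        MPI_Finalize();
        return 0;
    }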
Hochschild, "EUIH: An Experimental EUI Implementation (preliminary) Version 1.06.3), Internal Implementation Notes, IBM T. J. Watson Research Center, Sept., 1993. [6] C. B. Stunkel, D. G. Shea, etc. "The SP2 Communication Subsystem", Technical Report, IBM T.J. Watson Research Center, August, 1994. [7] IBM 9076 Scalable POWERparallel Systems, SP2 Administration Guide SH26-2486-01, 1995. [8] Natawut Nupairoj and L.M. Ni, "Benchmarking of Multicast Communication Services," Technical Report, Dept. of Computer Science, MSU, April, 1995. [8] Natawut Nupairoj and L.M. Ni, "Performance Evaluation of Some MPI Implementations on Workstation Clusters', Technical Report, Dept. of Computer Science, MSU, Dec. 1994.