The High Performance Switch and its Programming Interfaces on IBM SP2

Gang Cheng, Marek Podgorny (gcheng,marek@npac.syr.edu)
Northeast Parallel Architectures Center
Syracuse University

1. Introduction

The High Performance Switch (HPS) is the central component of the IBM 9076 SP2 system: it provides the high-bandwidth, low-latency communication subsystem of the SP2. Based on our experience using the SP2, and on material collected from other references, this report describes the HPS and the programming interfaces built on it. We focus on the transport protocols, message passing libraries, programming interfaces and software layers, as well as the hardware configuration of the HPS. This is a review report whose major purpose is to give software developers a general yet detailed picture of the HPS and of the parallel programming environments available on the SP2. The report will also serve as a base understanding of the SP2 for NPAC in designing technical strategies for benchmarking the SP2 communication subsystem for APT's data mining software and applications. Some internals of the HPS hardware configuration and its supporting software are given to help in understanding the HPS from the architectural point of view.

2. Background information about HPS on SP1 and SP2

The IBM SPx has undergone several significant upgrades during the past two years, since its introduction to the market in 1992. The literature uses a number of terms that can be confusing, as some of them apply to only one of the SPx platforms. Here is what we found:

SP1: EUI (TB0), EUIH (also referred to as MPL/p), HPS Adapter-1
SP2: MPL, HPS Adapter-2, User Space (uCSSCI)

(1) Terms used for SP1

EUI is the reference name of the transport protocol on the HPS of the SP1. It was also the mnemonic for the first release of MPL that was available on the SP1. EUI-H is an experimental upgrade of EUI that was later renamed to become MPL in the SP2 environment. EUI-H was also called the light-weight or lightspeed EUI; it was a research tool that IBM later discontinued on the SP2. HPS Adapter-1 was shipped with the SP1.

(2) Terms used for SP2

MPL (Message Passing Library) is the component of the IBM AIX Parallel Environment that enables the user to write parallel applications. As its name indicates, the MPL routines enable message passing between individual tasks of a parallel application, both for communication of data and for synchronization of the tasks' operations. The enhanced HPS Adapter-2 is new to the SP2 system. The two communication modes supported on the SP2 and in MPL are AIX sockets with TCP/IP, and a new high-performance user space library. We describe the major differences between the HPS on SP2 and SP1, and the transport protocols supported, in later sections.

3. Hardware Configuration Features of HPS on SP2

The HPS on SP2 provides the internal (hardwired) message passing fabric that connects all of the SP processors together, allowing any-to-any internode connection and letting all processors send messages simultaneously. It is a packet-switched (as opposed to circuit-switched), multistage network to which switches can be added as the system is scaled upward. The peak bandwidth of the overall HPS is 40 MB/s, bidirectional. The HPS supports a multi-user environment: multiple jobs may run simultaneously over the switch, so no single user monopolizes it. With built-in error detection, it supports path redundancy, which permits routes to be generated even when there are faulty components (e.g., nodes) in the system.
When the HPS on SP2 is referred to in general, we are actually talking about three major components: the Switch Adapters, the Communication Links, and the Switch Board, as illustrated in Figure 1 for the interconnection of two nodes. They perform different functions in node-network and network-network (switching) communications, and they have different peak bandwidths and constraints due to electrical and physical configuration limits. The overall 40 MB/s HPS peak bandwidth is the minimum of the three, set by the communication links and the switch board.

Figure 1. Diagram of HPS components connecting two nodes

  ------------                    --------------                    ------------
 |   Node 1   |  communication   |              |  communication   |   Node 2   |
 |  (CPU,RAM) |      link        | Switch Board |      link        |  (CPU,RAM) |
 | HPS Adapter|<---------------->|   40 MB/s    |<---------------->| HPS Adapter|
 | 20-80 MB/s |     40 MB/s      |              |     40 MB/s      | 20-80 MB/s |
  ------------                    --------------                    ------------

Communication Link: All communication in SP2 uses point-to-point bidirectional communication links that consist of two channels, each carrying data in opposite directions to their respective output and input port pairs. For each channel, the actual transmission is via a set of 10 signal lines: 8 for data, one for the tag and one for the token. A major enhancement of the SP2 Switch is the incorporation of clock distribution into the switch communication links. In the SP1 Switch, clock distribution is accomplished via a separate clock redrive network; in SP2, each link is additionally capable of carrying redriven clock signals in both directions.

Switch Adapters: The SP2 switch adapter card connects a processor's Micro Channel to both the output and input ports of the switch network via an ASIC chip called the MSMU (Memory & Switch Management Unit). There are two versions: HPS Adapter-1 (40 MB/s) and HPS Adapter-2 (80 MB/s), with one adapter per SP node. SP2 networks are bidirectional multistage interconnection networks in which each communication link comprises two channels carrying data in opposite directions. For all SP configurations there are at least FOUR usable paths between each pair of nodes, except for node pairs that are directly attached to the same switch element.

Switch board: A switch board contains 8 logical switch chips, implemented as 16 physical chips for reliability reasons, with one switch board per SP frame. The switch operates at 40 MHz, providing a peak bandwidth of 40 MB/s over each of the two byte-wide channels of a communication link (40 MHz x 1 byte per cycle = 40 MB/s per channel). Its hardware latency is 500 ns.

Protocols:

. IP (Internet Protocol) - the default; permits shared usage of an HPS Adapter-1 or Adapter-2 by multiple processes.
. US CSS (User Space Communication Subsystem) - intended for parallel applications that require maximum communication performance. Only one process per node may use US communications. Any programming interface that makes use of Switch/US uses POE.
. LightSpeed (i.e., EUI-H on SP1; discontinued).

HPS Adapter-2 enhancements:

. an onboard microprocessor to offload protocol processing from the node CPU, reducing communication overhead
. error checking - message CRC generation and checking (on board)
. switch failure recovery
. multiplexing to permit simultaneous IP and US communications on the same node; HPS Adapter-1 requires that the adapter be dedicated to one or the other, not both

HPS on SP1: EUI and EUIH

EUIH is an experimental implementation of EUI providing:
. high performance (bandwidth and latency)
. high reliability (fault detection, network fault recovery)
. partition management for multiple users
. a single parallel task per compute node

The major performance difference between EUIH and EUI is the improved message-passing latency, as described in [4]:

Benchmark data for EUIH from IBM (RS/6000 model 370 nodes with TB0 adapters and the HPS):

There are two options in the EUIH implementation, which may have different message-passing latency: polling or interrupt-driven. The interrupt-driven implementation is advantageous for applications that use non-blocking (i.e., asynchronous) sends and receives. On the other hand, interrupts are relatively expensive, so performance may be worse, particularly if many short messages are communicated. Long messages generally suffer much less, as they typically trigger just one interrupt.

Performance data for the default implementation using polling:

. Message latency, measured from the time the source task issues an MP_BSEND of a zero-length message until the destination task returns from a previously issued matching MP_BRECV, is 30 microseconds.
. This latency is almost entirely accounted for by software path length (i.e., network latency is negligible); half of it is incurred on the sending side and half on the receiving side. Thus an MP_BSEND of a zero-byte message typically consumes some 15 microseconds.
. Effective bandwidth is about 8.5 MB/s (0.12 microseconds/byte). While data is being transferred at the maximum rate, both the sending processor and the receiving processor are fully occupied (i.e., there is no overlap).
. Thus, to a first approximation, message transport time is 50 microseconds plus 0.12 microseconds per byte transferred (e.g., an 8192-byte message costs roughly 50 + 0.12 x 8192, or about 1033 microseconds).
. Internally, each task-to-task virtual communication channel implemented by EUIH has an 8192-byte buffer. Therefore, at most 8192 bytes of data will be sent from a given source to a given destination before the destination posts a matching receive.

For the interrupt-driven implementation:

. Interrupts cost approximately 100 microseconds each. However, the interrupt-driven version of EUI-H attempts to minimize the number of interrupts. In particular, interrupts are disabled whenever the EUI-H code is executing; they are enabled only while user code is executing. So, for example, if a message arrives while a blocking receive (MP_BRECV) is executing, no interrupt will be generated at the destination. The latency of the interrupt-driven EUIH is slightly worse than the 30 microseconds quoted above; several additional microseconds are required for enabling and disabling interrupts.

4. HPS on SP2

. hardware specification: 63 microseconds latency, 35 MB/s bandwidth when using the "user space" option [1]

Switch communication using IP

IP communication is the default mode because it allows shared usage of an adapter by multiple processes on a node. Any application may communicate over the HPS by opening a standard AIX socket and specifying the appropriate IP addresses; a minimal sketch is given below. Additional system facilities based on IP, such as NFS and AFS, may also be configured to use the HPS. With the possible exception of applications that depend on network-specific functions (LAN broadcasts, for example), most applications work over the HPS without modification.
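To make the point concrete, here is a minimal sketch of IP communication over the switch using nothing but the ordinary BSD sockets API. There is no HPS-specific call anywhere; the only switch-specific element is the destination address, for which we assume a hypothetical host name "node01-css0" that resolves to the node's switch interface address (the actual naming convention is set by the SP2 system administrator).

    /* Sketch: IP over the HPS through an ordinary AIX (BSD) socket.
     * Nothing here is switch-specific except the host name, which is
     * assumed to resolve to the destination node's switch interface
     * ("node01-css0" is hypothetical; site conventions vary).       */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    int main(void)
    {
        struct hostent *hp;
        struct sockaddr_in addr;
        int sd;

        hp = gethostbyname("node01-css0");      /* switch interface   */
        if (hp == NULL) { fprintf(stderr, "unknown host\n"); return 1; }

        sd = socket(AF_INET, SOCK_STREAM, 0);   /* a standard socket  */
        if (sd < 0) { perror("socket"); return 1; }

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port   = htons(5000);          /* arbitrary port     */
        memcpy(&addr.sin_addr, hp->h_addr_list[0], hp->h_length);

        if (connect(sd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");                  /* server not shown   */
            return 1;
        }
        write(sd, "ping", 4);   /* this traffic now rides on the HPS  */
        close(sd);
        return 0;
    }

An unmodified socket application is pointed at the switch simply by using switch interface addresses instead of Ethernet addresses, which is exactly why IP-based facilities such as NFS and AFS can be configured over the HPS.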
Switch communication in User Space mode

The new user space communication mode, uCSSCI, is intended for parallel applications that require maximum communication performance: lowest latency and highest bandwidth. To achieve this higher performance, however, a node must be dedicated to one parallel application at a time. Only a single AIX process on a node may use the user space communication library, and nodes must be allocated to the parallel application through the Resource Manager by the Parallel Environment (including MPL, POE and pdbx) or by PVMe. The Resource Manager should group nodes into named resource pools to separate user space nodes from shared-usage nodes (this is configured by your SP2 system administrator).

Communication Subsystem Support (CSS)

The CSS component of AIX PSSP contains HPS adapter diagnostics (for both Adapter-1 and Adapter-2), switch initialization and fault handling software, the device driver and configuration methods (config/unconfig), and parallel application programming interfaces. At boot time, CSS configures the HPS adapter, including execution of a Power On Self Test (POST diagnostics). In addition to the POST diagnostics, a comprehensive set of adapter diagnostics is available on-line. CSS performs switch initialization as part of system startup, and automatic switch reconfiguration following a switch fault. SP2 nodes or frames which are not operable are removed from the configuration and reported through the System Data Repository to the Resource Manager and the SP System Monitor. Most faults and detected errors are retried successfully, transparently to the application.

Base support for parallel application task communication is provided by CSS and its Communication Interface (CSS_CI) component. Two communication libraries are provided by CSS_CI: network based and user based. In the network based (or IP) library, UDP/IP calls are made at the kernel level, supporting IP communication through Ethernet or HPS adapters. The user based (or user space) native communication library maximizes the performance of parallel applications, since communication occurs directly between the application and the SP2 HPS; no kernel calls are made. The IBM AIX Parallel Environment (including MPL, POE, VT and pdbx) and PVMe directly support this approach. Use of CSS_CI for both IP and user based communications provides improved application reliability through full message retry, and significantly improved performance through low-latency task-to-task communication (less than 40 microseconds) on both switch adapters. Communication bandwidth improvements are also possible through the HPS Adapter-2 (> 30 MB/s).

5. Programming interfaces on SP2

There are three layers of programming interfaces for message passing on SP2 using the HPS:

1) The lowest, high-performance level. MPL is the native message-passing library provided by the vendor. Like similar native libraries on other parallel machines, it is the most flexible and highest-performance layer supporting explicit message passing on SP2 and the HPS. Its sole purpose is to exploit the maximum bandwidth of the HPS in a concurrent program; portability to other platforms is not a consideration. Notice that MPL still offers a kind of portability, in the sense that one programming interface supports three different transport protocols (IP on non-HPS networks, IP on HPS, User Space on HPS) on an SP2 with Ethernet, token ring or HPS, or on an IBM RS/6000 workstation cluster. For applications requiring maximum communication performance, we recommend using MPL with the User Space option. A minimal MPL sketch follows.
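The sketch below shows the flavor of native MPL code in C: each task learns the task count and its own id, then task 0 sends a blocking message that task 1 receives. The C entry points mpc_environ, mpc_bsend and mpc_brecv are assumed here as the C counterparts of the Fortran MP_BSEND/MP_BRECV names quoted in section 3; their exact signatures should be verified against the AIX Parallel Environment manuals.

    /* Minimal MPL sketch (C binding assumed; verify the mpc_* names
     * and signatures against the Parallel Environment manuals).
     * Run under POE on two tasks.                                   */
    #include <stdio.h>

    int main(void)
    {
        int ntasks, taskid;
        int msg, src, type, nbytes;

        mpc_environ(&ntasks, &taskid);  /* task count and my task id  */

        if (taskid == 0) {
            msg = 42;
            mpc_bsend(&msg, sizeof(msg), 1, 99);  /* blocking send    */
        } else if (taskid == 1) {
            src = 0; type = 99;                   /* from task 0      */
            mpc_brecv(&msg, sizeof(msg), &src, &type, &nbytes);
            printf("task 1 received %d (%d bytes)\n", msg, nbytes);
        }
        return 0;
    }

Note that nothing in the source selects IP or User Space: that choice is made through POE at link or run time, as described in section 6 below.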
2) The portable, intermediate level. Standard message passing interfaces are developed either by the vendor or by third parties, with the major purpose of achieving highly portable programs across multiple parallel platforms. The major software of this kind:

. MPI - the message-passing interface specified by the MPI Forum. There are currently two implementations on SP2: MPI-F from IBM and MPICH from ANL/MSU.
. PVM - Parallel Virtual Machine. PVMe is the IBM implementation of PVM, which was originally developed at ORNL/UTK. It uses MPL underneath and thus can run under all three protocols.
. P4 - a portable message-passing package from ANL.
. Chameleon - another lightweight, portable message-passing system from ANL. It is built on MPL, MPI or P4.

Because portability and a generic message-passing framework are the first consideration, this layer pays a greater or lesser performance penalty when ported to a specific platform.

3) The highest level.

. HPF - on SP2, HPF uses the data-parallel model on the MIMD architecture to achieve concurrent programs with high portability, ease of programming, and implicit message passing through data parallelism. It often has to sacrifice some degree of performance, and is therefore not suitable for all applications or for building software tools.
. Fortran-M - a programming framework that supports task parallelism and a modular programming paradigm in sequential and parallel Fortran programs.

6. MPL (Message Passing Library) on SP2

MPL is the lowest-level parallel message passing library provided on the IBM SP2. It is a programming interface for converting a serial Fortran 77, C or C++ program into a parallel SP2 application via subroutine calls. MPL provides a rich and diverse set of subroutines for coding simple operations, such as task-to-task message passing, as well as the more advanced operations required for highly complex communications. At either compilation or execution, the user specifies whether parallel task communication in MPL occurs via the IP protocol through the Ethernet or HPS adapters (if configured for IP), or via User Space access to the HPS for maximum performance. Based on the user's selection, POE links (statically at compile time or dynamically at runtime) the corresponding communication library into the MPL program.

Like most commonly used message-passing libraries, such as MPI, PVM, P4 and CMMD, MPL contains the following major communication primitives:

. a point-to-point message passing library
. a collective communications library (CCL)
. synchronous and asynchronous communications (a sketch illustrating the asynchronous case follows the IP/US comparison below)

MPL (formerly EUI) is IBM's message-passing interface to the HPS. POE is required as the parallel operating environment to control the running of MPL programs. Any high-level programming interface that makes use of Switch/US uses POE.

What is the difference between IP and US communications?

MPL tasks can communicate with each other using either the Internet Protocol (IP) library or the User Space (US) library. Some differences between these two libraries:

. IP communication can occur over Ethernet or over the High Performance Switch; US communication can be conducted only over the High Performance Switch.
. Only one US task can use the switch adapter on a node. When the task starts, it reserves (dedicates) the adapter to itself; no other US task can use the adapter for the duration of the task which dedicated it.
. IP permits multiple tasks using IP communication to share the switch adapter on a node simultaneously. Additionally, IP will share the adapter with a US task.
. IP bandwidth is approximately 7-8 MB/s; US bandwidth is approximately 31-35 MB/s.
. IP is recommended for interactive nodes where multiple users are using a node at the same time. US is recommended when nodes are allocated by the batch system, or when nodes have been reserved for users performing benchmarking.
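As noted in the list of MPL primitives above, both synchronous (blocking) and asynchronous (non-blocking) communications are provided. The following sketch illustrates the asynchronous case in standard MPI (covered in section 8; MPL offers analogous non-blocking calls): the Isend/Irecv calls return immediately, computation proceeds while the messages are in flight, and the Wait calls block only when the data is actually needed.

    /* Sketch: overlapping computation with non-blocking message
     * passing, in standard MPI; MPL has analogous primitives.
     * Assumes exactly two tasks.                                   */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        double out[1024], in[1024];
        MPI_Request sreq, rreq;
        MPI_Status  sst, rst;
        int rank, peer, i;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        peer = 1 - rank;                        /* the other task    */

        for (i = 0; i < 1024; i++) out[i] = rank + i;  /* payload    */

        /* Post the communication first; these return immediately.  */
        MPI_Irecv(in,  1024, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &rreq);
        MPI_Isend(out, 1024, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &sreq);

        /* ... useful computation may proceed here, overlapped with
           the transfer ... */

        /* Block only when the received data is actually needed.    */
        MPI_Wait(&rreq, &rst);
        MPI_Wait(&sreq, &sst);

        MPI_Finalize();
        return 0;
    }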
How does MPL perform on the SP2?

IP communication over the High Performance Switch has a bandwidth of approximately 7-8 MB/s. US communication over the High Performance Switch has a bandwidth of approximately 31-35 MB/s. Bandwidth may be affected by competition from other users.

7. PVMe

PVMe is an IBM product which enables PVM users to exploit the User Space communication library of the SP2 High Performance Switch. This means that PVM codes can approach the performance of MPL codes running US over the switch. PVM codes do not need to be rewritten, or even recompiled, but simply relinked against the PVMe libraries instead of the usual PVM libraries.

How does PVM perform on the SP2?

Although the public domain (ORNL) PVM can communicate over the SP2 High Performance Switch, it cannot take advantage of the SP2 User Space communication library; PVM must communicate via the IP protocol. There is a significant difference in bandwidth between the two methods. MPL jobs running US over the High Performance Switch can achieve throughput in the 31-35 MB/s range; PVM running IP over the switch achieves between 1 and 2 MB/s, and PVM running IP over Ethernet usually half of that or less.

8. MPI (MPI-F, the IBM implementation; MPICH, the ANL implementation)

. MPI is a message-passing library interface specification (not a specific implementation).
. MPI was designed by a broad-based group of parallel computer vendors and incorporates features of other systems.
. MPI will eventually be an IBM product (and a product of other vendors as well).
. There are currently two efficient (MPL-based) implementations of MPI available on the SPx:
  . IBM Research (MPI-F)
  . ANL/MCS (MPICH)

9. Existing Benchmarks for SP2

1) NAS benchmarks for SP2 are available:
. Kernel benchmarks
  . Embarrassingly Parallel
  . 3D FFT PDE
  . Integer Sort
  . Conjugate Gradient
  . Multigrid
. Simulated application benchmarks
  . LU decomposition
  . Scalar pentadiagonal
  . Block tridiagonal

2) Benchmarks for SP2 nodes from IBM.

3) HPS benchmarking on SP1/SP2 from ANL: a simple "ping-pong" test between two nodes (a sketch of such a test is given at the end of this section). These tests use the MPL message passing library. The performance of IP over Ethernet, IP over the HPS, and User Space over the HPS is measured for both short messages (0-1200 bytes) and long messages (100000-1200000 bytes).

4) Other benchmarking results: performance of MPI message-passing primitives over a workstation cluster. Two basic kinds of communication performance are tested, point-to-point and collective; both bandwidth and latency are measured for these communication primitives.
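For reference, the skeleton of such a ping-pong test is sketched below in standard MPI (the ANL tests use MPL, but the structure is identical): task 0 bounces a message off task 1 many times; half of the average round-trip time gives the latency for zero-length messages and the bandwidth for long ones. The message size and repetition count are illustrative values.

    /* Sketch of a two-node ping-pong test in standard MPI (the ANL
     * tests use MPL; the structure is the same). Run on two tasks. */
    #include <mpi.h>
    #include <stdio.h>

    #define NBYTES 100000   /* message size under test (example)    */
    #define REPS   100      /* round trips averaged over (example)  */

    int main(int argc, char **argv)
    {
        static char buf[NBYTES];
        double t0, t1, rtt;
        MPI_Status st;
        int rank, i;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);       /* start tasks together  */
        t0 = MPI_Wtime();
        for (i = 0; i < REPS; i++) {
            if (rank == 0) {               /* ping ...              */
                MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
            } else {                       /* ... pong              */
                MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0) {
            rtt = (t1 - t0) / REPS;        /* average round trip    */
            printf("one-way %g s, bandwidth %g MB/s\n",
                   rtt / 2, NBYTES / (rtt / 2) / 1e6);
        }
        MPI_Finalize();
        return 0;
    }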
10. Methodologies for benchmarking HPS and the SP2 communication subsystem

(This is a separate section from the review part of this paper.) Based on the above introduction and on the message-passing programming interfaces available on SP2, there are four major benchmarking classifications:

1) Benchmarking at different programming levels

Different levels of programming interfaces present the end user or software developer with different programming capabilities in terms of ease of programming, problem expressiveness and declaration, and flexibility of control. There are always trade-offs among implementation performance, ease of programming, and portability. For a system software developer in an area such as data mining, the choice of message passing library on SP2 should weigh all these factors in the context of the specific system software under development.

From the performance benchmarking point of view, two levels of message-passing libraries on SP2 must be considered:

. The native message-passing library (MPL). This is the lowest level of abstraction of a programmable SP2 provided by the vendor. It should exploit the maximum performance of the underlying hardware and give the most flexible control over the SP2 system, though it is less expressive.
. The portable explicit message passing interfaces. Besides the native library, more and more vendors are starting to support one or more portable message passing interfaces on their machines, such as MPI, PVM, Linda and Express. In addition, there are active third-party implementations of these portable libraries on different platforms. It is preferable to choose the vendor's implementation, as it should deliver higher performance while the programming interface remains identical. On SP2, MPI-F and PVMe are the two choices.

2) Benchmarking different communication primitives

Basically, there are two types of communication service in a message-passing library: point-to-point and collective.

Point-to-point communication involves two communicating processes in the form of various modes of SEND and RECEIVE operations, which fall into the following categories:
. synchronous and asynchronous send/receive (also called blocking/non-blocking)
. synchronous and asynchronous data exchange (e.g., swap, shift, scan, etc.)
. contiguous and non-contiguous vector data

Collective communication involves a group of processors collectively performing an atomic (internally synchronized) group operation, such as broadcast, gather, scatter, reduction, multicast, or all-to-all broadcast. Collective communication consists of three subclasses: one-to-all, all-to-one, and all-to-all. (A sketch of timing one collective primitive is given at the end of this section.)

3) Benchmarking with different performance-related parameters

. message size, buffer size, number of buffers (or messages)
. machine size
. latency (message size = 0) vs. bandwidth
. number of primitive calls
. number of repeated iterations; maximum/minimum and average measurements
. protocols (IP/Ethernet, IP/Switch, US/Switch, wide vs. thin nodes, etc.)

4) Benchmarking application-specific communication patterns

In most situations, a specific application exhibits dominant, typical communication patterns in its message-passing operations. For example, a sort-oriented application using merge-sort operators may require frequent run-time redistribution of partially sorted data in an arbitrary number of dimensions under a given data decomposition scheme; this calls for benchmarking the algorithms and techniques developed to minimize the amount of data exchanged. For the APT data-mining applications, certain operations identified by APT/NPAC can be benchmarked before the actual development effort, to provide performance input for the system architects when comparing alternative algorithmic approaches; these benchmarks can be implemented using the performance results from 1)-3) above.
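As referenced in 2) above, here is a sketch of timing one collective primitive, again in standard MPI: it times a broadcast over a range of message sizes, and the knobs listed in 3) (message size, repetition count, machine size) appear explicitly. The same harness measures gather, scatter, reduction, etc. by substituting the collective call; the size and repetition constants are illustrative.

    /* Sketch: timing a collective primitive (broadcast) over a range
     * of message sizes, in standard MPI. Other collectives can be
     * measured by replacing the MPI_Bcast call.                     */
    #include <mpi.h>
    #include <stdio.h>

    #define MAXBYTES (1 << 20)   /* largest message (example value)  */
    #define REPS     50          /* repetitions per size (example)   */

    int main(int argc, char **argv)
    {
        static char buf[MAXBYTES];
        double t0, t1;
        int rank, nprocs, nbytes, i;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);   /* machine size    */

        for (nbytes = 1; nbytes <= MAXBYTES; nbytes *= 4) {
            MPI_Barrier(MPI_COMM_WORLD);          /* synchronize     */
            t0 = MPI_Wtime();
            for (i = 0; i < REPS; i++)
                MPI_Bcast(buf, nbytes, MPI_CHAR, 0, MPI_COMM_WORLD);
            t1 = MPI_Wtime();
            if (rank == 0)
                printf("%3d procs %8d bytes: %g s per broadcast\n",
                       nprocs, nbytes, (t1 - t0) / REPS);
        }
        MPI_Finalize();
        return 0;
    }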
Hochschild, "EUIH: An Experimental EUI Implementation (preliminary) Version 1.06.3), Internal Implementation Notes, IBM T. J. Watson Research Center, Sept., 1993. [6] C. B. Stunkel, D. G. Shea, etc. "The SP2 Communication Subsystem", Technical Report, IBM T.J. Watson Research Center, August, 1994. [7] IBM 9076 Scalable POWERparallel Systems, SP2 Administration Guide SH26-2486-01, 1995. [8] Natawut Nupairoj and L.M. Ni, "Benchmarking of Multicast Communication Services," Technical Report, Dept. of Computer Science, MSU, April, 1995. [8] Natawut Nupairoj and L.M. Ni, "Performance Evaluation of Some MPI Implementations on Workstation Clusters', Technical Report, Dept. of Computer Science, MSU, Dec. 1994.