SP2 High Performance Switch Overview

The basic architecture of the SP2, in terms of parallel applications, is that of a distributed-memory, message-passing parallel processor. It was felt that a shared-memory paradigm would not scale, while maintaining performance, to the large numbers of nodes required by high-end users addressing problems such as the Grand Challenge applications.

Overview

The SP2 connects to outside applications, data servers, archivers, high-speed networks, and other services. These external connections consist of open communications standards such as Ethernet, FDDI, FCS, ATM, etc.

The switch provides the internal message-passing fabric that connects all of the SP2 nodes (processors) together in a way that potentially allows all processors to be sending messages simultaneously. The hardware to support this connectivity consists of two basic elements: the switch board and the communications adapter (High Performance Switch Adapter versions 1 and 2). There is one adapter per node and one switch board unit per rack. The switch board unit contains 8 logical switch chips (16 physical chips, for reliability reasons discussed later) and provides the connectivity of each of the nodes to the switch fabric as well as the rack-to-rack connectivity.

Switch Fabric Objectives

As a start, the switch fabric needs to be scalable from tens up to thousands of nodes. To meet that objective, a multistage network was chosen. The multistage network increases the amount of switching capability in a granular fashion as the number of processors grows: with this topology, switch stages are added as the system grows to keep the amount of bandwidth available to each processor constant.
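
As a rough illustration of this growth rule, the sketch below computes the number of stages and switch elements needed for several node counts. The 8-port element size and the omega-style arrangement are assumptions for illustration, not exact SP2 part counts.

    #include <stdio.h>

    /* Illustrative sketch: stages and switch elements for a multistage
     * (omega-style) network built from k-port crossbar switch elements.
     * The element size and counts are assumptions, not SP2 part counts. */
    int main(void)
    {
        const int k = 8;                      /* ports per switch element (assumed) */
        int nodes[] = { 16, 64, 512, 4096 };

        for (int i = 0; i < 4; i++) {
            int n = nodes[i];
            int stages = 0;
            for (long reach = 1; reach < n; reach *= k)
                stages++;                                /* ceil(log_k(N)) stages   */
            int elements = stages * ((n + k - 1) / k);   /* N/k elements per stage  */
            printf("%5d nodes: %d stage(s), %4d switch elements\n",
                   n, stages, elements);
        }
        return 0;
    }

Adding stages as the node count grows is what keeps the bandwidth available to each processor constant, at the cost of more switch elements per stage.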

One measure of bandwidth that is useful for comparing machine designs is bisectional bandwidth. This is the most common measure of aggregate bandwidth for parallel machines and is loosely defined as follows: Define a plane which separates the parallel system into two parts containing an equal number of nodes. This plane intersects some number of network links. Bisectional bandwidth is the total possible bandwidth crossing this plane through these links.

This term is often used to assess the scalability of a topology. For crossbars, hypercubes, the SP2 Switch, and most multistage networks, bisectional bandwidth scales linearly with the number of nodes in the system. For a 2-dimensional mesh bisectional bandwidth scales with the square root of the number of nodes. For a ring, bisectional bandwidth remains constant as nodes are added. Since the effective bandwidth per node for this measurement is the aggregate divided by the number of nodes, the mesh and ring provide reduced capability as the system grows. The SP2 system maintains constant bisectional bandwidth per processor independent of the size of the machine.
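
A small worked comparison makes the scaling behavior concrete. The sketch below assumes an arbitrary per-link bandwidth and tabulates aggregate and per-node bisectional bandwidth for a multistage network, a 2-dimensional mesh, and a ring; the numbers are illustrative only.

    #include <math.h>
    #include <stdio.h>

    /* Illustrative sketch: how bisectional bandwidth scales with system
     * size for three topologies.  'link_bw' is an assumed per-link
     * bandwidth used only to make the comparison concrete. */
    int main(void)
    {
        const double link_bw = 40.0;               /* MB/s per link (assumed) */
        int sizes[] = { 16, 64, 256, 1024 };

        printf("%6s %12s %10s %8s\n", "nodes", "multistage", "2-D mesh", "ring");
        for (int i = 0; i < 4; i++) {
            int n = sizes[i];
            double multistage = (n / 2.0) * link_bw;   /* scales with N       */
            double mesh = sqrt((double)n) * link_bw;   /* scales with sqrt(N) */
            double ring = 2.0 * link_bw;               /* constant            */
            printf("%6d %12.1f %10.1f %8.1f   (per node: %5.1f %6.2f %6.3f)\n",
                   n, multistage, mesh, ring,
                   multistage / n, mesh / n, ring / n);
        }
        return 0;
    }

The per-node column shows the point made above: the multistage network holds per-processor bandwidth constant, while the mesh and ring dilute it as nodes are added.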

For fine-grain parallel applications, support for short messages with low latency and minimal message overhead is needed. A PIO (programmed I/O) short-message capability was developed for this requirement: the processor writes the message it wants to send directly into the fabric, and it is delivered to the other side. For long messages a DMA capability is needed: the processor sets up the transfer, returns to other work, and the message is then transferred.
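
A minimal sketch of the send-side decision is shown below. The threshold value and the adapter entry points are hypothetical; they stand in for whatever interface the communication subsystem actually exposes.

    #include <stdio.h>
    #include <stddef.h>

    /* Illustrative sketch of choosing PIO versus DMA by message length.
     * The threshold and stubbed adapter calls are assumptions for
     * illustration, not the SP2 communication subsystem API. */
    #define PIO_THRESHOLD 256   /* bytes; assumed crossover point */

    static void adapter_pio_send(size_t len, int dest)   /* hypothetical hook */
    {
        printf("PIO: %zu bytes pushed directly into the fabric to node %d\n", len, dest);
    }

    static void adapter_dma_setup(size_t len, int dest)  /* hypothetical hook */
    {
        printf("DMA: %zu-byte transfer to node %d described to the adapter\n", len, dest);
    }

    static void send_message(size_t len, int dest)
    {
        if (len <= PIO_THRESHOLD)
            adapter_pio_send(len, dest);   /* short: lowest latency, no setup cost   */
        else
            adapter_dma_setup(len, dest);  /* long: processor returns to useful work */
    }

    int main(void)
    {
        send_message(64, 3);     /* fine-grain message -> PIO */
        send_message(65536, 3);  /* bulk message       -> DMA */
        return 0;
    }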

Another critical characteristic of the system is multi-user support. A system that has a large number of nodes should be able to be subdivided at times to support many users. To support multiple users, and because the low-latency fine-grain requirement demands that messages come from user space (rather than through kernel calls), hardware protection between partitions or between jobs is required. The protection is direct hardware protection, permitting latencies to decrease over time while keeping the functional characteristics intact.
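
One way such hardware protection can be pictured is a per-packet key check at the receiving adapter, sketched below. The field names, key values, and the check itself are assumptions for illustration, not a description of the SP2 hardware registers.

    #include <stdio.h>
    #include <stdbool.h>

    /* Illustrative sketch: each arriving packet carries a job/partition
     * key that the adapter compares against the key installed for this
     * node's job, discarding packets from other jobs without kernel
     * involvement.  All names and values are assumed. */
    struct rx_packet {
        unsigned key;        /* partition/job key carried in the packet (assumed) */
        unsigned src_node;
    };

    static unsigned node_key = 0x5A;   /* key installed for the job on this node */

    static bool accept_packet(const struct rx_packet *p)
    {
        return p->key == node_key;     /* mismatch -> packet rejected */
    }

    int main(void)
    {
        struct rx_packet ok  = { 0x5A, 7 };
        struct rx_packet bad = { 0x31, 9 };
        printf("packet from node %u: %s\n", ok.src_node,
               accept_packet(&ok) ? "delivered to user space" : "discarded");
        printf("packet from node %u: %s\n", bad.src_node,
               accept_packet(&bad) ? "delivered to user space" : "discarded");
        return 0;
    }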

Another requirement for a multiuser system is fairness of message delivery. That is, one job cannot be allowed to monopolize the fabric. We need a multiuser system in which multiple jobs can run simultaneously with a guarantee that they all make progress. For that reason we picked a packet-switched network.

Message flow control methods are commonly divided into circuit-switched and packet-switched methods. In circuit-switching flow-control, messages traverse a previously configured (and reserved) circuit in the network. A control packet or packets completes the configuration. The circuit remains reserved for transmitting message packets until it is torn down (unreserved) by another control packet. In some implementations the tail of a message packet unreserves the circuit.

Circuit-switched systems have two characteristics that we found undesirable. First, the circuit is reserved through the entire network for the duration of the message. This uses up more of the communication bandwidth than is necessary. Second, and more important, the circuit-switched conflict-resolution mechanism is to have the node that loses the arbitration wait some period of time before retrying the connection. This has the potential of starving a particular node in a high-traffic environment and was deemed absolutely unacceptable.

In packet-switching protocols, packets are self-routing. Each message packet contains a header that incorporates the routing information, and each switch element interprets this routing information to select the proper output port for that switch. Packet-switching techniques can be further characterized based upon when and how the packets are forwarded between switches. Many distributed networks use a method known as store-and-forward, in which a switch element receives the entire packet before attempting to forward the packet to the next switch on the path to the destination node. Store-and-forward packet latency is therefore dependent upon the packet length multiplied by the path length.
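
The self-routing header can be pictured as a list of output-port selections, one per stage, that each switch element consumes in turn. The sketch below uses an assumed packet layout for illustration, not the actual SP2 packet format.

    #include <stdio.h>

    /* Illustrative sketch of self-routing: the header carries one route
     * byte per switch stage, and each switch element consumes the next
     * byte to pick its output port.  The layout is assumed. */
    struct packet {
        int hops;               /* number of route bytes remaining   */
        unsigned char route[8]; /* output port to take at each stage */
        int next;               /* index of the route byte to consume */
    };

    /* Executed by each switch element as the header arrives. */
    static int select_output_port(struct packet *p)
    {
        int port = p->route[p->next++];
        p->hops--;
        return port;
    }

    int main(void)
    {
        struct packet p = { 3, { 2, 5, 1 }, 0 };
        while (p.hops > 0)
            printf("switch stage forwards packet on output port %d\n",
                   select_output_port(&p));
        return 0;
    }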

Cut-through packet-switching methods were devised to mitigate the effect of path length on total packet latency. In cut-through networks, the switch element examines the packet header's route information to determine the proper output port and immediately forwards the packet if the output port is not busy. In the event that the selected output port is busy, the packet is blocked and must be buffered in some fashion.
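
The first-order latency difference between the two forwarding methods can be seen from a simple calculation, sketched below for an uncongested path with assumed link bandwidth, packet length, and header length.

    #include <stdio.h>

    /* Illustrative first-order latency comparison (uncongested path).
     * Store-and-forward: each of H switches receives the whole packet
     * before forwarding, so latency ~ H * L / B.
     * Cut-through: only the header is delayed per switch, so latency
     * ~ H * (Lh / B) + L / B.  All numbers below are assumed. */
    int main(void)
    {
        const double B  = 40e6;   /* link bandwidth, bytes/s (assumed) */
        const double L  = 4096;   /* packet length, bytes (assumed)    */
        const double Lh = 16;     /* header length, bytes (assumed)    */

        for (int hops = 1; hops <= 8; hops *= 2) {
            double saf = hops * (L / B);
            double ct  = hops * (Lh / B) + L / B;
            printf("%d hop(s): store-and-forward %.1f us, cut-through %.1f us\n",
                   hops, saf * 1e6, ct * 1e6);
        }
        return 0;
    }

With cut-through, latency grows only slowly with path length, which is the motivation given above.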

Cut-through methods can be further classified according to the selected method of packet buffering. Virtual cut-through methods require each switch element to either forward or buffer the entire packet. Thus, like store-and-forward, virtual cut-through requires switches to contain buffers large enough to store at least one packet, and flow control is packet based.

Wormhole routing is a more common form of cut-through, and is the method chosen for the SP2 Switch. Unlike virtual cut-through, switch elements utilizing wormhole routing need not buffer an entire packet when that packet cannot proceed. Instead, buffering occurs on a sub-packet basis. For instance, in the SP2 Switch, buffering occurs on a byte basis. As a result, blocked packets may be stored across multiple switch elements if the switch element that received the blocked packet header cannot store the entire packet in its own buffers.

Wormhole routing switch elements commonly contain buffers at each input port to minimize the number of switches required to completely buffer a blocked packet, thereby minimizing the number of communication links reserved for that packet. In addition, the SP2 switch element contains a unique central buffer called the central queue which dynamically allocates space for blocked packet bytes from any input port. The use of a central buffer allows an input port receiving more than its share of blocked packets to utilize more than its share of total buffer space inside the switch element. To highlight the use of the central queue, we have termed our flow control method buffered wormhole routing.
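
The behavior of such a shared central buffer can be sketched as follows. The buffer size, per-port bookkeeping, and acceptance policy are assumptions for illustration, not the actual switch-chip design.

    #include <stdio.h>

    /* Illustrative sketch of a shared central buffer in the spirit of the
     * central queue: space for blocked packet bytes is allocated on demand
     * to any input port, so a port under heavy blocking can use more than
     * a fixed per-port share.  Sizes and policy are assumed. */
    #define CENTRAL_BYTES 1024
    #define NPORTS 8

    static int used_total;             /* bytes of central buffer in use */
    static int used_by_port[NPORTS];   /* per-port usage, for reporting  */

    /* Try to buffer 'n' blocked bytes arriving on 'port'; returns bytes
     * actually accepted (the rest stays upstream, holding the link). */
    static int central_buffer_accept(int port, int n)
    {
        int room = CENTRAL_BYTES - used_total;
        int take = n < room ? n : room;
        used_total += take;
        used_by_port[port] += take;
        return take;
    }

    int main(void)
    {
        /* Port 2 hits sustained blocking and absorbs most of the buffer;
         * port 5 still gets space because allocation is dynamic. */
        printf("port 2 buffered %d bytes\n", central_buffer_accept(2, 700));
        printf("port 5 buffered %d bytes\n", central_buffer_accept(5, 200));
        printf("port 2 buffered %d more bytes\n", central_buffer_accept(2, 300));
        printf("total in use: %d of %d bytes\n", used_total, CENTRAL_BYTES);
        return 0;
    }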

To summarize, the particular network we have is a multistage omega network using buffered-wormhole-routing packet switching. If a message is traveling from node A to node B and another message traveling from node C to node D needs to cross its path, the messages can cut through each other and share the fabric; one message does not block another for long periods of time. If there is congestion on an outgoing path, there is buffering on each of the chips such that messages are buffered and queued within the switch chip for fair delivery in a round-robin fashion, packet by packet.
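
The round-robin service of queued packets at an output port can be sketched as below; the queue occupancy and port count are assumed, and the point is only that every input with waiting traffic is served in turn.

    #include <stdio.h>

    /* Illustrative round-robin service of packets queued for one output
     * port, so that no single input (and hence no single job) can
     * monopolize the link.  Queue contents are assumed. */
    #define NPORTS 4

    static int queued[NPORTS] = { 3, 1, 0, 2 };  /* packets waiting per input */

    int main(void)
    {
        int remaining = 6, port = 0;
        while (remaining > 0) {
            if (queued[port] > 0) {
                queued[port]--;
                remaining--;
                printf("output link carries one packet from input %d\n", port);
            }
            port = (port + 1) % NPORTS;   /* advance round-robin pointer */
        }
        return 0;
    }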

Ease of job scheduling placed another requirement on the switch fabric: a uniform topology. The SP2 switch topology is uniform, which is to say that, as an omega network, the fabric presents the same distance for message traffic from any point in the fabric to any other point. Because all nodes are equidistant, algorithms place no requirements on node selection or topology. This means that the scheduler does not need to worry about the physical location of the specific nodes selected for a job. I/O has the same flexibility: I/O server nodes can be located anywhere on the fabric.

To obtain an efficient message-passing implementation we have designed "optimistic" protocols in which the data is sent on the assumption that it will arrive and that buffers will be available at the receiver. Rendezvous protocols also exist for very long messages, in which an application can reserve space and then send the message at a later time.
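
A minimal sketch of the choice between the two protocol styles is shown below. The crossover length and the message exchange are assumptions for illustration, not the actual SP2 message-passing library.

    #include <stdio.h>
    #include <stddef.h>

    /* Illustrative sketch of optimistic (eager) versus rendezvous sends.
     * The limit and the printed exchange are assumed. */
    #define EAGER_LIMIT 16384   /* bytes; assumed crossover to rendezvous */

    static void protocol_send(size_t len)
    {
        if (len <= EAGER_LIMIT) {
            /* Optimistic: send the data immediately, assuming the receiver
             * has buffer space for unexpected messages. */
            printf("eager: %zu bytes sent without a handshake\n", len);
        } else {
            /* Rendezvous: reserve space at the receiver first, then send. */
            printf("rendezvous: request-to-send for %zu bytes\n", len);
            printf("rendezvous: receiver reserves buffer, returns clear-to-send\n");
            printf("rendezvous: %zu bytes transferred into reserved space\n", len);
        }
    }

    int main(void)
    {
        protocol_send(2048);      /* short message -> optimistic protocol */
        protocol_send(1 << 20);   /* long message  -> rendezvous protocol */
        return 0;
    }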

The fabric must be reliable as viewed by the application. That is, the application user should not need to worry about retransmitting messages. The architecture provides end-to-end message protection: checking codes are appended to messages and are used by the receiving node to detect errors.

An adapter-to-adapter acknowledgement protocol is used, based on positive acknowledgement: if a message is sent from A to B, an acknowledgement that it has been received is sent back from B to A. Until the acknowledgement is received, the communication subsystem retains ownership of the location of the message.
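
The positive-acknowledgement rule can be sketched as follows; the structure, sequence numbering, and timeout handling are assumptions for illustration rather than the actual communication subsystem.

    #include <stdio.h>
    #include <stdbool.h>

    /* Illustrative sketch: the sender keeps ownership of the message
     * buffer until the receiving adapter's acknowledgement arrives, and
     * retransmits from that buffer if it does not.  Details are assumed. */
    struct pending_msg {
        int  seq;        /* sequence number carried in the packet         */
        bool acked;      /* set when the matching acknowledgement arrives */
        int  retries;    /* retransmissions so far                        */
    };

    static void on_ack(struct pending_msg *m, int seq)
    {
        if (m->seq == seq) {
            m->acked = true;            /* buffer may now be released */
            printf("seq %d acknowledged; buffer released to the user\n", seq);
        }
    }

    static void on_timeout(struct pending_msg *m)
    {
        if (!m->acked) {
            m->retries++;               /* resend from the retained buffer */
            printf("seq %d timed out; retransmission #%d\n", m->seq, m->retries);
        }
    }

    int main(void)
    {
        struct pending_msg m = { 42, false, 0 };
        on_timeout(&m);     /* acknowledgement lost or late    */
        on_ack(&m, 42);     /* acknowledgement finally arrives */
        return 0;
    }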

The multi-stage fabric can also provide redundant paths between nodes. This allows faulty components to be eliminated from the resource list for the route generator, and routes to be generated even when there are faulty components in the system.

The nodes are omega cross-coupled within the board so that there are multiple paths from inside the fabric to the outside. In a four-frame configuration there are also four cables between each pair of frames. Although these are shown as logically connected to the same chip, they are physically on different chips, so even a single chip failure does not isolate two racks from each other, much less two nodes. There are four paths from, for instance, a node on the left side to a node on the right side, carried over four separate wires, so four physical cables or four physical switch chips would have to fail before the nodes were isolated.
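
Route generation over these redundant paths can be sketched as a simple filter: any candidate path that crosses a component marked faulty is excluded, and the remaining paths are used. The path table and chip numbering below are assumed for illustration.

    #include <stdio.h>
    #include <stdbool.h>

    /* Illustrative sketch of route generation with faulty components
     * removed from the resource list.  Path contents are assumed. */
    #define NPATHS 4

    static bool chip_faulty[16];   /* switch chips flagged by diagnostics */

    /* Candidate paths from node A to node B, as lists of chip ids (assumed). */
    static const int paths[NPATHS][2] = { {1, 9}, {3, 11}, {5, 13}, {7, 15} };

    int main(void)
    {
        chip_faulty[11] = true;    /* one intermediate chip has failed */

        for (int p = 0; p < NPATHS; p++) {
            bool usable = !chip_faulty[paths[p][0]] && !chip_faulty[paths[p][1]];
            printf("path %d via chips %2d,%2d: %s\n", p,
                   paths[p][0], paths[p][1],
                   usable ? "added to route table" : "excluded (faulty chip)");
        }
        return 0;
    }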

For switch clocking, the entire fabric (switch boards and adapters) is frequency synchronous. All components receive clocking from the same master oscillator but are not phase synchronized; that is, precisely measured cables are not required. Phase synchronization is achieved by inserting variable delay in the paths based on initialization measurements.

Since there is one oscillator driving the entire system, it could be a single point of system failure. To eliminate this possibility, a redundant clock tree under System Monitor/Supervisor System control was implemented.

The list of characteristics outlined fulfills the objectives of the scalable parallel family. The switch continues to evolve as do the other components of the POWERparallel Systems (processors, software, I/O, etc.) to provide a consistent growth path for the parallel applications.


Copyright 1994 IBM Corporation