Full HTML for Scripted Foilset: CPS615 Lecture on Computer Architectures and Networks

Given by Geoffrey C. Fox in the Delivered Lectures of CPS615, Basic Simulation Track for Computational Science, on 12 September 1996. Foils prepared 11 November 1996.
Summary of Material


This continues the computer architecture discussion with
  • MIMD and SIMD with distributed shared memory
  • MetaComputers
  • Special Purpose Architectures
  • Granularity, with technological changes forcing larger grain sizes
  • Overview of Communication Networks with
    • Switches versus topologies versus buses
    • Typical values in today's machines

Table of Contents for full HTML of CPS615-Lecture on Computer Architectures and Networks

1 Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science, Fall Semester 1996 -- Lecture of September 12, 1996

2 Abstract of Sept 12 1996 CPS615 Lecture
3 Parallel Computer Architecture Control Structure
4 Some Major Hardware Architectures - MIMD
5 MIMD Distributed Memory Architecture
6 Some Major Hardware Architectures - SIMD
7 SIMD (Single Instruction Multiple Data) Architecture
8 Some Major Hardware Architectures - Mixed
9 Some MetaComputer Systems
10 Comments on Special Purpose Devices
11 The GRAPE N-Body Machine
12 Why isn't GRAPE a Perfect Solution?
13 Granularity of Parallel Components - I
14 Granularity of Parallel Components - II
15 Switch and Bus based Architectures
16 Classes of Communication Networks
17 Examples of Interconnection Topologies
18 Useful Concepts in Communication Systems
19 Communication Performance of Some MPP's
20 Implication of Hardware Performance
21 Latency and Bandwidth of a Network
22 Transfer Time in Microseconds for both Shared Memory Operations and Explicit Message Passing
23 Latency/Bandwidth Space for 0-byte message (latency) and 1 MB message (bandwidth)

Foil 1 Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science, Fall Semester 1996 -- Lecture of September 12, 1996

Geoffrey Fox
NPAC
Room 3-131 CST
111 College Place
Syracuse NY 13244-4100


Foil 2 Abstract of Sept 12 1996 CPS615 Lecture

This continues the computer architecture discussion with
  • MIMD and SIMD with distributed shared memory
  • MetaComputers
  • Special Purpose Architectures
  • Granularity, with technological changes forcing larger grain sizes
  • Overview of Communication Networks with
    • Switches versus topologies versus buses
    • Typical values in today's machines


Foil 3 Parallel Computer Architecture Control Structure

SIMD -- lockstep synchronization
  • Each processor executes the same instruction stream
MIMD -- each processor executes an independent instruction stream
MIMD synchronization can take several forms
  • Simplest: program-controlled message passing
  • "Flags" (barriers, semaphores) in memory -- the typical shared-memory construct, as in the locks seen in Java Threads (a lock is sketched below)
  • Special hardware -- as in caches and their coherency (coordination between nodes)
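
As a concrete illustration of the shared-memory "flag" style of synchronization, here is a minimal sketch (not taken from the lecture; it assumes POSIX threads, and the thread count and loop length are arbitrary) in which a mutex plays the role of the lock protecting a shared counter updated by MIMD-style threads.

/* Minimal sketch (illustrative): a shared-memory "flag" -- here a
   POSIX mutex -- protecting a counter that several threads update. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long counter = 0;                          /* shared state */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);                /* acquire the "flag" */
        counter++;                                /* critical section  */
        pthread_mutex_unlock(&lock);              /* release it        */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);           /* expect 400000 */
    return 0;
}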


Foil 4 Some Major Hardware Architectures - MIMD

MIMD Distributed Memory
  • This is now best illustrated by a collection of computers on a network (i.e. a metacomputer)
MIMD with logically shared memory but usually physically distributed. The latter is sometimes called distributed shared memory.
  • In the near future, ALL formal (closely coupled) MPPs will be distributed shared memory
  • Note that all computers (e.g. the current MIMD distributed-memory IBM SP2) allow any node to get at any memory, but this is done indirectly -- you send a message (as sketched below)
  • In future "closely-coupled" machines, there will be built-in hardware supporting the function that any node can directly address all the memory of the system
  • This distributed shared memory architecture is currently of great interest to (a major challenge for) parallel compilers
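
To make the "you send a message" point concrete, here is a minimal sketch of a remote read done by explicit message passing. MPI is used purely as an illustration (the foils do not name a particular message-passing library), and the value transferred is an arbitrary assumption.

/* Minimal sketch (illustrative, not the foil's own code): on a
   distributed-memory MIMD machine, node 1 "reads" a word that lives in
   node 0's memory indirectly, by exchanging an explicit message. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double value = 0.0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 3.14159;                          /* datum owned by node 0 */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        printf("node 1 fetched %f from node 0\n", value);
    }

    MPI_Finalize();
    return 0;
}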


Foil 5 MIMD Distributed Memory Architecture

A special case of this is a network of workstations (NOW's) or personal computers (metacomputer)
Issues include:
  • Node - CPU, Memory
  • Network - Bandwidth, Latency
  • Hardware Enhanced Access to distributed Memory


Foil 6 Some Major Hardware Architectures - SIMD

SIMD -- Single Instruction Multiple Data -- can have logically distributed or shared memory
  • Examples are the CM-1 and CM-2 from Thinking Machines
  • and the AMT DAP and MasPar, which are currently focused entirely on accelerating parts of database indexing
  • This architecture is of decreasing interest, as it has reduced functionality without a significant cost advantage compared to MIMD machines
  • The cost of synchronization in MIMD machines is not high!
  • The main interest of SIMD is flexible bit arithmetic, as the processors are "small"; but as transistor densities get higher this also becomes less interesting, since full-function 64-bit CPUs use only a small fraction of the silicon of a modern computer


Foil 7 SIMD (Single Instruction Multiple Data) Architecture

CM-2 -- 64K processors with 1-bit arithmetic -- hypercube network, plus a broadcast network that can also combine, and a "global OR" network
MasPar (DECmpp) -- 16K processors with 4-bit (MP-1) or 32-bit (MP-2) arithmetic, a fast two-dimensional mesh and a slower general switch for communication


Foil 8 Some Major Hardware Architectures - Mixed

One can also have a heterogeneous compound architecture (metacomputer) obtained by an arbitrary combination of MIMD or SIMD, sequential or parallel machines.
Metacomputers can vary from full collections of several hundred PCs/set-top boxes on the (future) World Wide Web to a CRAY C-90 connected to a CRAY T3D
This is a critical future architecture which is intrinsically distributed memory, as multi-vendor heterogeneity implies that one cannot have special hardware-enhanced shared memory
  • Note that this can be a MIMD collection of SIMD machines if one has a set of MasPars on a network
  • One can think of the human brain as a SIMD machine; a group of people is then such a MIMD collection of SIMD processors


Foil 9 Some MetaComputer Systems

Cluster of workstations or PC's
Heterogeneous MetaComputer System


Foil 10 Comments on Special Purpose Devices

One example is an associative memory -- SIMD or MIMD -- or content-addressable memories
This is an example of a special-purpose "signal" processing machine which can in fact be built from "conventional" SIMD or MIMD architectures
This type of machine is not so popular, as most applications are not dominated by computations for which good special-purpose devices can be designed
If only 10% of a problem is, say, "track-finding" or some other special-purpose processing, then who cares if you reduce that 10% by a factor of 100?
  • You have only sped up the system by a factor of 1.1, not by 100 (see the arithmetic sketched below)!
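
This is just Amdahl's law; the small sketch below (the 10% fraction and the factor of 100 are the foil's own numbers) checks the arithmetic.

/* Worked check of the claim above: speeding up 10% of the work by a
   factor of 100 speeds up the whole job by only about 1.1. */
#include <stdio.h>

int main(void)
{
    double f = 0.10;     /* fraction of the run that is "track-finding" */
    double s = 100.0;    /* speedup of that fraction on the special device */

    /* Amdahl's law: overall speedup = 1 / ((1 - f) + f / s) */
    double overall = 1.0 / ((1.0 - f) + f / s);
    printf("overall speedup = %.2f\n", overall);   /* prints about 1.11 */
    return 0;
}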


Foil 11 The GRAPE N-Body Machine

N-body problems (e.g. Newton's laws for one million stars in a globular cluster) can have successful special-purpose devices
See the GRAPE (GRAvity PipE) machine (Sugimoto et al., Nature 345, page 90, 1990)
  • The essential reason is that such problems need much less memory per floating point unit than most problems
  • Globular cluster: 10^6 computations per datum stored
  • Finite element iteration: a few computations per datum stored
  • The rule of thumb is that one needs one gigabyte of memory per gigaflop of computation in general problems, and this general design puts most of the cost into memory, not into the CPU.
Note that GRAPE uses EXACTLY the same parallel algorithm that one finds in the books (e.g. Solving Problems on Concurrent Processors) for N-body problems on classic distributed-memory MIMD machines -- the direct pairwise force sum sketched below
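
For reference, a minimal sketch of that direct force sum follows. It is an illustration, not GRAPE's actual pipeline code; the particle count, softening parameter and unit choice (G = 1) are assumptions. The point to notice is that each stored particle is touched O(N) times per step, which is why so little memory is needed per floating point unit.

/* Minimal sketch of the direct O(N^2) gravitational force sum that
   GRAPE hardwires.  N, the softening eps2 and G = 1 are assumptions. */
#include <math.h>

#define N 1024                                 /* number of bodies */

void accelerations(const double x[N][3], const double m[N], double a[N][3])
{
    const double eps2 = 1.0e-6;                /* softening avoids 1/0 */
    for (int i = 0; i < N; i++) {
        a[i][0] = a[i][1] = a[i][2] = 0.0;
        for (int j = 0; j < N; j++) {          /* every pair: O(N^2) work */
            if (j == i) continue;
            double dx = x[j][0] - x[i][0];
            double dy = x[j][1] - x[i][1];
            double dz = x[j][2] - x[i][2];
            double r2 = dx*dx + dy*dy + dz*dz + eps2;
            double inv_r3 = 1.0 / (r2 * sqrt(r2));
            a[i][0] += m[j] * dx * inv_r3;     /* G taken as 1 */
            a[i][1] += m[j] * dy * inv_r3;
            a[i][2] += m[j] * dz * inv_r3;
        }
    }
}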


Foil 12 Why isn't GRAPE a Perfect Solution?

GRAPE will execute the classic O(N^2) (parallel) N-body algorithm, BUT this is not the algorithm used in most such computations
Rather, there are the O(N) or O(N log N) so-called "fast multipole" algorithms, which use a hierarchical approach
  • On one million stars, fast multipole is a factor of 100-1000 faster than the GRAPE algorithm
  • Fast multipole works in most but not all N-body problems (in globular clusters, extreme heterogeneity makes the direct O(N^2) method most attractive)
So special-purpose devices cannot usually take advantage of nifty new algorithms!


Foil 13 Granularity of Parallel Components - I

Coarse-grain: The task is broken into a handful of pieces, each executed by a powerful processor.
  • Pieces and processors may be heterogeneous.
  • Computation/communication ratio very high -- typical of networked metacomputing
Medium-grain: Tens to a few thousand pieces, typically executed by microprocessors.
  • Processors typically run the same code (SPMD style).
  • Computation/communication ratio often hundreds or more.
  • Typical of MIMD parallel systems such as the SP2, CM5, Paragon, T3D


Foil 14 Granularity of Parallel Components - II

Fine-grain: Thousands to perhaps millions of small pieces, executed by very small, simple processors (several per chip) or through pipelines.
  • Processors often have instructions broadcast to them.
  • Computation/communication ratio often near unity.
  • Typical of SIMD, but seen in a few MIMD systems such as Kogge's Execube, Dally's J-Machine or the commercial Myrinet (Seitz)
  • This is going to be very important in future petaflop architectures, as the dense chips of the year 2003 onwards favor this Processor-in-Memory architecture
  • There will be so many transistors in future chips that the "small processors" of the future will be similar to today's high-end microprocessors
  • As chips get denser, it is not realistic to put processors and memories on separate chips, as the granularities become too big
Note that a machine of given granularity can be used on algorithms of the same or finer granularity


Foil 15 Switch and Bus based Architectures

(Figures: a switch-based architecture and a bus-based architecture)


Foil 16 Classes of Communication Networks

The last major architectural feature of a parallel machine is the network or design of hardware/software connecting processors and memories together.
Bus: All processors (and memory) connected to a common bus or busses.
  • Memory access fairly uniform, but not very scalable due to contention
  • Bus machines can be NUMA if memory consists of directly accessed local memory as well as memory banks accessed by the bus. The bus-accessed memories can be the local memories of other processors
Switching Network: Processors (and memory) connected to routing switches like in telephone system.
  • Switches might have queues and "combining logic", which improve functionality but increase latency.
  • Switch settings may be determined by message headers or preset by controller.
  • Connections can be packet-switched (messages no longer than some fixed size) or circuit-switched (connection remains as long as needed)
  • Usually NUMA, blocking, often scalable and upgradable


Foil 17 Examples of Interconnection Topologies

Two-dimensional grid, binary tree, complete interconnect and 4D hypercube.
Communication (operating system) software ensures that the system appears fully connected even if the physical connections are partial
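
As a small illustration of one of these topologies, the sketch below (not part of the original foil) lists the neighbours of a node in a d-dimensional hypercube: node k is wired to the d nodes whose numbers differ from k in exactly one bit. The example node number is an arbitrary choice.

/* Minimal sketch: neighbours of a node in a 4D hypercube
   (16 nodes, degree 4). */
#include <stdio.h>

int main(void)
{
    int d = 4;                 /* dimension of the hypercube */
    int node = 5;              /* example node, binary 0101 */

    for (int bit = 0; bit < d; bit++)
        printf("node %d is linked to node %d\n", node, node ^ (1 << bit));
    /* prints neighbours 4, 7, 1 and 13 */
    return 0;
}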


Foil 18 Useful Concepts in Communication Systems

Useful terms include:
Scalability: Can network be extended to very large systems? Related to wire length (synchronization and driving problems), degree (pinout)
Fault Tolerance: How easily can system bypass faulty processor, memory, switch, or link? How much of system is lost by fault?
Blocking: Some communication requests may not get through, due to conflicts caused by other requests.
Nonblocking: All communication requests succeed. Sometimes just applies as long as no two requests are for same memory cell or processor.
Latency (delay): Maximal time for nonblocked request to be transmitted.
Bandwidth: Maximal total rate (MB/sec) of system communication, or subsystem-to-subsystem communication. Sometimes determined by cutsets, which cut all communication between subsystems. Often useful in providing lower bounds on time needed for task.
Wormhole Routing: Intermediate switch nodes do not wait for the full message but allow it to pass through in small packets


Foil 19 Communication Performance of Some MPP's

From Aspects of Computational Science, Editor Aad van der Steen, published by NCF
System                 Communication Speed      Computation Speed
                       Mbytes/sec (per link)    Mflops/sec (per node)
IBM SP2                       40                       267
Intel iPSC860                  2.8                      60
Intel Paragon                200                        75
Kendall Square KSR-1          17.1                      40
Meiko CS-2                   100                       200
Parsytec GC                   20                        25
TMC CM-5                      20                       128
Cray T3D                     150                       300


Foil 20 Implication of Hardware Performance

tcomm = 4 or 8 / (Speed in Mbytes/sec)
  • as there are 4 or 8 bytes in a floating point word
tfloat = 1 / (Speed in Mflops/sec)
Thus tcomm / tfloat is just 4 x (Computation Speed) divided by (Communication Speed) for 4-byte words
tcomm / tfloat is 26.7, 85, 1.5, 9.35, 8, 5, 25.6, 8 for the machines SP2, iPSC860, Paragon, KSR-1, Meiko CS-2, Parsytec GC, TMC CM-5, and Cray T3D respectively (checked in the sketch below)
Latency makes the situation worse for small messages, and the ratios double for the 64-bit arithmetic natural on large problems!
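
A small sketch reproducing those ratios from the per-link and per-node speeds in the Foil 19 table, assuming 4-byte words:

/* Recompute tcomm/tfloat = 4 * (Mflops per node) / (Mbytes/sec per link)
   for the machines in the Foil 19 table. */
#include <stdio.h>

int main(void)
{
    const char  *name[]  = { "SP2", "iPSC860", "Paragon", "KSR-1",
                             "Meiko CS-2", "Parsytec GC", "TMC CM-5", "Cray T3D" };
    const double comm[]  = {  40.0,  2.8, 200.0, 17.1, 100.0, 20.0,  20.0, 150.0 };
    const double flops[] = { 267.0, 60.0,  75.0, 40.0, 200.0, 25.0, 128.0, 300.0 };

    for (int i = 0; i < 8; i++)
        printf("%-12s tcomm/tfloat = %5.1f\n", name[i], 4.0 * flops[i] / comm[i]);
    return 0;
}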


Foil 21 Latency and Bandwidth of a Network

Transmission time for a message of n bytes:
T(n) = T0 + T1 * n, where
T0 is the latency; it contains a term proportional to the number of hops, plus a term representing the interrupt processing time at the start of the message and the time for the communication network and processor to synchronize
T0 = TS + Td * (number of hops)
T1 is the inverse bandwidth -- it can be made small if the pipe is of large size.
In practice TS and T1 are the most important terms and Td is unimportant
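
A minimal sketch of this timing model follows. The numbers plugged in (TS = 50 microseconds, Td = 1 microsecond per hop, 100 Mbytes/sec bandwidth, 3 hops) are illustrative assumptions, not measurements from the lecture.

/* Transfer time T(n) = T0 + T1*n with T0 = TS + Td*hops.
   At B Mbytes/sec, T1 = 1/B microseconds per byte. */
#include <stdio.h>

double transfer_time_us(double n_bytes, double TS_us, double Td_us,
                        int hops, double bandwidth_MBps)
{
    double T0 = TS_us + Td_us * hops;     /* startup latency */
    double T1 = 1.0 / bandwidth_MBps;     /* microseconds per byte */
    return T0 + T1 * n_bytes;
}

int main(void)
{
    printf("8-byte message: %8.1f us\n", transfer_time_us(8.0, 50.0, 1.0, 3, 100.0));
    printf("1 MB message:   %8.1f us\n", transfer_time_us(1.0e6, 50.0, 1.0, 3, 100.0));
    return 0;
}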


Foil 22 Transfer Time in Microseconds for both Shared Memory Operations and Explicit Message Passing

Dongarra and Dunigan: Message-Passing Performance of Various Computers, August 1995


Foil 23 Latency/Bandwidth Space for 0-byte message (latency) and 1 MB message (bandwidth)

Square blocks indicate shared memory copy performance
Dongarra and Dunigan: Message-Passing Performance of Various Computers, August 1995

© Northeast Parallel Architectures Center, Syracuse University, npac@npac.syr.edu
