New CPS615 Foils -- B 28 August 95
Computational Science CPS615 Simulation Track Overview Foilsets B 1995
Geoffrey Fox, Syracuse University, NPAC, 3-131 CST, 111 College Place, Syracuse NY 13244

Abstract of CPS615 Foilsets B 1995
Parallel computer and network architecture
Overview of issues including synchronization, granularity and three classes of architectures
More details on networks
More details on system architectures

Overview of Parallel Hardware Architecture
Three major basic hardware architectures:
MIMD distributed memory
MIMD with logically shared memory but perhaps physically distributed; the latter is sometimes called distributed shared memory
SIMD with logically distributed or shared memory
One interesting hardware architecture ......... associative memory -- SIMD or MIMD or content addressable memories ..... whose importance deserves further study and integration with mainstream HPCCI
Also have heterogeneous compound architectures (metacomputers) obtained by arbitrary combination of the above. Metacomputers can vary from full collections of several hundred PCs/settop boxes on the (future) World Wide Web to a CRAY C-90 connected to a CRAY T3D

Examples of the Three Current Concurrent Supercomputer Architectures
This list stems from 1990

Parallel Computer Architecture Issues
Two critical issues are:
Memory structure -- distributed, shared, cached, and heterogeneous mixtures
Control and synchronization
SIMD -- lockstep synchronization
MIMD -- synchronization can take several forms:
Simplest: program-controlled message passing
"Flags" in memory -- the typical shared memory construct
Special hardware -- as in cache and its coherency (coordination between nodes)

General Types of Synchronization
An implied constraint on the ordering of operations. In distributed memory, messages are often used to synchronize. In shared memory, software semaphores, hardware locks, or other schemes are needed. [Denning, P. 1989]
Synchronous: objects (processors, network, tasks) proceed together in lockstep fashion
Asynchronous: objects may proceed at somewhat different speeds, sometimes with occasional synchronization points (barriers) to allow slower objects to catch up

Granularity of Parallel Components
Coarse-grain: the task is broken into a handful of pieces, each executed by powerful processors. Pieces and processors may be heterogeneous. Computation/communication ratio very high -- typical of networked metacomputing
Medium-grain: tens to a few thousand pieces, typically executed by microprocessors. Processors typically run the same code (SPMD style). Computation/communication ratio often hundreds or more. Typical of MIMD parallel systems such as SP2, CM5, Paragon, T3D
Fine-grain: thousands to perhaps millions of small pieces, executed by very small, simple processors (several per chip) or through pipelines. Processors typically have instructions broadcast to them. Computation/communication ratio often near unity. Typical of SIMD but seen in a few MIMD systems such as Dally's J Machine or commercial Myrinet (Seitz)
Note that a machine of one type can be used on algorithms of the same or finer granularity

Types of Parallel Memory Architectures -- Logical Structure
Shared (Global): there is a global memory space, accessible by all processors. Processors may also have some local memory. Algorithms may use global data structures efficiently. However, "distributed memory" algorithms may still be important as memory is NUMA (nonuniform access times)
Distributed (Local, Message-Passing): all memory is associated with processors. To retrieve information from another processor's memory a message must be sent there. Algorithms should use distributed data structures.
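To make the message-passing model concrete, here is a minimal sketch (added for illustration, not from the original foils), assuming C and the MPI message-passing library: each process owns its own memory, and node 0 can only obtain node 1's value through an explicit send and receive.

    /* Minimal sketch of the distributed-memory model: a processor cannot
     * read another processor's memory directly; the owner must send it.
     * MPI is assumed here purely as an example message-passing library. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        double mine, theirs = 0.0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which processor am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processors?  */

        mine = 100.0 + rank;                    /* data owned by this node */

        if (size >= 2) {
            if (rank == 1)
                MPI_Send(&mine, 1, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD);
            else if (rank == 0) {
                MPI_Recv(&theirs, 1, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD, &status);
                printf("node 0 fetched %.1f from node 1 via a message\n", theirs);
            }
        }

        MPI_Finalize();
        return 0;
    }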
Types of Parallel Memory Architectures -- Physical Characteristics
Uniform: all processors take the same time to reach all memory locations.
Nonuniform (NUMA): memory access is not uniform, so a given processor takes a different time to get data from each memory bank. This is natural for distributed memory machines but is also true in most modern shared memory machines.
DASH (Hennessy at Stanford) is the best known example of such a virtual shared memory machine, which is logically shared but physically distributed. ALEWIFE from MIT is a similar project.
TERA (from Burton Smith) is a uniform-memory-access, logically shared memory machine.
Most NUMA machines these days have two memory access times: local memory (divided into registers, caches, etc.) and nonlocal memory, with little or no difference in access time among the different nonlocal memories.

Diagrams of Shared and Distributed Memories

Classes of Communication Network
Classes of networks include:
Bus: all processors (and memory) connected to a common bus or busses. Memory access fairly uniform, but not very scalable due to contention. Bus machines can be NUMA if memory consists of directly accessed local memory as well as memory banks accessed over the bus; the bus-accessed memories can be local memories on other processors.
Switching Network: processors (and memory) connected to routing switches, as in the telephone system. Switches might have queues and "combining logic", which improve functionality but increase latency. Switch settings may be determined by message headers or preset by a controller. Connections can be packet-switched (messages no longer than some fixed size) or circuit-switched (connection remains as long as needed). Usually NUMA, blocking, often scalable and upgradable.

Survey of Issues in Communication Networks

Glossary of Useful Concepts in Communication Systems
Useful terms include:
Scalability: can the network be extended to very large systems? Related to wire length (synchronization and driving problems) and degree (pinout).
Fault tolerance: how easily can the system bypass a faulty processor, memory, switch, or link? How much of the system is lost by a fault?
Blocking: some communication requests may not get through, due to conflicts caused by other requests.
Nonblocking: all communication requests succeed. Sometimes this applies only as long as no two requests are for the same memory cell or processor.
Latency (delay): maximal time for a nonblocked request to be transmitted.
Bandwidth: maximal total rate (MBytes/sec) of system communication, or subsystem-to-subsystem communication. Sometimes determined by cutsets, which cut all communication between subsystems. Often useful in providing lower bounds on the time needed for a task.

Switch and Bus based Architectures

Point to Point Networks (Store and Forward) -- I
Processors are directly connected to only certain other processors, and must go multiple hops to get to additional processors. Also called store-and-forward. Usually distributed memory.
Examples: ring, mesh, torus, hypercube -- as in the original Caltech/JPL machines -- and binary tree.
Usually NUMA, nonblocking, scalable, upgradable.
Processor connectivity is modeled by a graph in which nodes represent processors and edges represent connections between processors (a small sketch of this graph view follows below).
Much theoretical work here, but it is now obsolete (irrelevant): even if this hardware is used, pipelining masks the transmission time, which ends up not showing a term proportional to the distance travelled.
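As an illustration of the graph view (an added sketch; the ring size N = 8 is chosen arbitrarily), the following C program builds the adjacency matrix of a ring and counts store-and-forward hops from node 0 with a breadth-first search, recovering the familiar result that an N-node ring has degree 2 and diameter N/2.

    /* Sketch: model a point-to-point network as a graph.
     * Nodes are processors; an edge means a direct physical link.
     * Here an N-node ring is built and breadth-first search counts
     * hops from node 0; the largest hop count is the diameter. */
    #include <stdio.h>

    #define N 8                      /* number of processors (illustrative) */

    int main(void)
    {
        int adj[N][N] = {{0}};       /* adjacency matrix of the ring        */
        int dist[N];                 /* hops from node 0; -1 = unvisited    */
        int queue[N], head = 0, tail = 0;
        int i, j, diameter = 0;

        /* Ring: node i is linked to (i-1) mod N and (i+1) mod N, degree 2 */
        for (i = 0; i < N; i++) {
            adj[i][(i + 1) % N] = 1;
            adj[(i + 1) % N][i] = 1;
        }

        for (i = 0; i < N; i++) dist[i] = -1;
        dist[0] = 0;
        queue[tail++] = 0;

        while (head < tail) {        /* breadth-first search from node 0 */
            i = queue[head++];
            for (j = 0; j < N; j++)
                if (adj[i][j] && dist[j] < 0) {
                    dist[j] = dist[i] + 1;
                    queue[tail++] = j;
                }
        }

        for (i = 0; i < N; i++) {
            printf("node 0 -> node %d : %d hops\n", i, dist[i]);
            if (dist[i] > diameter) diameter = dist[i];
        }
        printf("ring of %d nodes: degree 2, diameter %d (= N/2)\n", N, diameter);
        return 0;
    }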
Examples of Interconnection Topologies
Two-dimensional grid, binary tree, complete interconnect and 4D hypercube. Communication (operating system) software ensures that the system appears fully connected even if the physical connections are partial.

Degree and Diameter of Ring and Mesh (Torus) Architectures
The mesh seems to have a poor diameter, but Dally and Seitz emphasize that it makes up for this by being easy to build physically (circuits are two dimensional). Thus it can have fewer links, but since they are physically short they can be made "thick" (very high bandwidth).

Degree and Diameter of Hypercube and Tree Architectures
The hypercube compromises so that both degree and diameter grow modestly. High degree implies hard to build -- low diameter improves communication.
The tree has a bottleneck at the root (top). A fat tree addresses this, as in the CM-5, by increasing the bandwidth on links as one goes up the tree.

Point to Point Networks (Store and Forward) -- II
Hardware may handle only a single hop, or multiple hops as in the routing chips on the Paragon.
Software may mask hardware limitations so one sees full connectivity even if it is physically limited.
Note that multiple hops always lead to poorer latency, as this is travel time per bit. However, we can keep bandwidth high even with multiple hops by increasing the "size" of the channels, e.g. transmitting several bits simultaneously.
Software can hide latency by pipelining -- doing other things while bits are in transit. This is circuit switching if done at a low level.
Partial connectivity is handled by supplying a software layer that handles routing -- this is familiar on the Internet.
Latency is related to the graph diameter, among many other factors.
The graph may also represent the calculations that need to be performed, and the information exchange required.

Latency and Bandwidth of a Network
Transmission time for a message of n bytes:
    T(n) = T0 + T1 * n
where T0 is the latency; it contains a term proportional to the number of hops plus a term Ts representing the interrupt processing time at the beginning and end for the communication network and processor to synchronize:
    T0 = Ts + Td * (number of hops)
T1 is the inverse bandwidth -- it can be made small if the pipe is of large size. In practice Ts and T1 are the most important terms and Td is unimportant (a small numerical sketch of this model follows after this group of foils).

Transfer Time in Microseconds for both Shared Memory Operations and Explicit Message Passing
Dongarra and Dunigan: Message-Passing Performance of Various Computers, August 1995

Latency/Bandwidth Space for 0-byte message (latency) and 1 MByte message (bandwidth). Square blocks indicate shared memory copy performance.
Dongarra and Dunigan: Message-Passing Performance of Various Computers, August 1995

Switches versus Processor Networks
Note that processor networks such as the torus or hypercube can be used to build switches if one does not put processors on the internal nodes but rather "just" switch chips. Thus switches are discussed with the same trade-offs -- store-and-forward, circuit-switched -- as processor networks.
Switch based machines typically have the same delay (travel time) between any two processors.
In processor networks, some machines can be nearer to each other if fewer hops are needed, BUT in all modern machines the low level hardware essentially makes all these architectures the SAME.
There are only two times of importance, corresponding to whether or not there is DATA LOCALITY:
Time to access memory on processor -- this further divides into time to get to main DRAM and time to get to cache
Time to access memory off processor
Here "time" covers both latency and bandwidth.
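The following added sketch simply evaluates the transmission-time model above for a few message sizes. The parameter values are invented for illustration and describe no particular machine, but they show why Ts dominates short messages, T1*n dominates long ones, and the per-hop term Td hardly matters.

    /* Sketch of the model T(n) = Ts + Td*hops + T1*n.
     * All parameter values below are made up purely for illustration. */
    #include <stdio.h>

    int main(void)
    {
        double Ts = 50.0e-6;        /* startup (software) latency, seconds */
        double Td = 0.5e-6;         /* extra delay per hop, seconds        */
        double T1 = 1.0 / 30.0e6;   /* inverse bandwidth: 30 MBytes/sec    */
        int    hops = 8;            /* path length in the network          */
        long   sizes[] = {0, 100, 10000, 1000000};
        int    i;

        for (i = 0; i < 4; i++) {
            double T = Ts + Td * hops + T1 * (double)sizes[i];
            printf("n = %7ld bytes:  T = %10.1f microseconds\n",
                   sizes[i], T * 1.0e6);
        }
        /* For short messages Ts dominates; for long messages T1*n dominates.
           The per-hop term Td*hops is negligible in both regimes, which is
           why modern machines can largely ignore network distance. */
        return 0;
    }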
Circuit Switched Networks
Rather than have messages travel a single hop at a time, sometimes a circuit is first established from sender to receiver, and then data are transmitted along the circuit. Similar to establishing a phone connection.
Can result in significantly lower communication overhead, as latency is deterministic once the circuit is established.
If the circuit is blocked, many options are possible, including:
Retry, perhaps with an alternative circuit or after a random delay.
Wormhole routing: the message travels like a wagon-train along the path; it stops when blocked but stays in the circuit.
Virtual cut-through: the message dumps into memory where forward progress is blocked, then retries. A blend of pure circuit-switching and store-and-forward.
At high levels of message traffic, performance is sometimes severely degraded.

Let's Return to General Parallel Architectures in more detail

Overview of Computer Architecture Issues
There are many issues, both between architectures and internal to each architecture.
The situation is confused, as one needs to understand:
"theoretical" arguments, e.g. fat-tree v. torus
technology trade-offs now, e.g. optical v. copper links
technology as a function of time
whether a particular implementation is done right, both in basic hardware and as a system with software
All that perhaps matters is what either the user or the high level software (compiler/message passing system) sees.

Some Global Computer Architecture Issues
A significant problem is that there is essentially no application software and essentially no machines have been sold.
Real world architectural issue -- what problems will produce substantial sales; should one optimize for these?
So maybe simulation machines will "ride on" the commercial database and multimedia server (Web server) markets.
Historically this created problems, as with the IBM 3090, which was a business machine adapted for science and very poor at science. Currently the SP2 is good for science and doing well in business applications.
System integration issues:
Does software allow hardware to realize its potential?
Hardware and software reliability
(Parallel) I/O for graphics and file access
I/O interfaces -- HIPPI, FCS, ATM ... ATM is most important in the real world, but HIPPI is the scientific standard?

Two General Real World Architectural Issues
Why is MPP always behind the best workstations?
It takes n extra years to build a new architecture; therefore MPP uses sequential technologies which are n years behind.
n=2 implies roughly a factor of 3 in cost/performance (microprocessor cost/performance improves by very roughly 1.6-1.8x per year, so a two-year lag compounds to about a factor of 3).
Must reduce the delay n by matching the parallel design as much as possible to:
Commercial processor chips (use workstation, not custom, nodes)
Commercial communication (use ATM, not custom networks?)
Commercial software (use the Web, not custom MPP software)
Memory per node (SIMD and MIMD)
Some say that current machines waste transistors -- activity only occurs in 1% of the transistors at a time?
Others say one needs a large memory to run UNIX and to increase the grain size as problems become complicated; further, communication needs decrease as grain size increases.

MIMD Distributed Memory Architecture
A special case of this is a network of workstations (NOWs) or personal computers.
Issues include:
Node -- CPU, memory
Network -- bandwidth and latency, both as delivered by the hardware and as actually seen through software (a minimal ping-pong measurement sketch follows below)
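A hedged sketch of how such software latency and bandwidth figures are usually measured: a ping-pong test between two nodes, here written with MPI as an assumed library; the message sizes and repetition count are arbitrary choices.

    /* Ping-pong sketch: node 0 sends a message, node 1 echoes it back, and
     * half the round-trip time is the one-way software latency at that size.
     * A 0-byte message probes latency; a 1 MByte message probes bandwidth. */
    #include <stdio.h>
    #include <mpi.h>

    #define REPS 1000

    int main(int argc, char **argv)
    {
        static char buf[1000000];            /* up to 1 MByte messages */
        long   sizes[] = {0, 1000, 1000000};
        int    rank, nprocs, i, r;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        if (nprocs < 2) { MPI_Finalize(); return 0; }   /* need two nodes */

        for (i = 0; i < 3; i++) {
            int n = (int)sizes[i];
            double t0 = MPI_Wtime();
            for (r = 0; r < REPS; r++) {
                if (rank == 0) {             /* send, then wait for the echo */
                    MPI_Send(buf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
                } else if (rank == 1) {      /* echo the message back        */
                    MPI_Recv(buf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
                    MPI_Send(buf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
                }
            }
            if (rank == 0)
                printf("n = %7d bytes: %8.1f microseconds one way\n",
                       n, (MPI_Wtime() - t0) / (2.0 * REPS) * 1.0e6);
        }
        MPI_Finalize();
        return 0;
    }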
Some MIMD Architecture Issues
Choice of node is a dominant issue?
RISC nodes: SGI Challenge, IBM SP-2, Convex, Cray T3D with MIPS, IBM, HP and Digital workstation chips respectively
Special purpose nodes -- CM-5, Paragon, Meiko CS-2 -- out of favor as non-competitive
"Old" nodes -- nCUBE-2
Small nodes: Caltech Mosaic (which is the basis of Myrinet), J Machine (Dally at MIT), Execube (Kogge -- Loral, ex IBM)
Network topology as described is not very important today?
Theoretical issues are obscured by technology and implementation.
Details of the network can be and are hidden from the user and compiler by simple, elegant (message passing) software, including collective communication primitives (broadcast, reduction, etc.).
However, we still have two major types of network:
Distributed shared memory -- physically distributed memory but hardware support for shared memory, as in the SGI and Convex machines
Pure distributed memory, as in the IBM SP-2 or a network of workstations

SIMD (Single Instruction Multiple Data) Architecture
CM-2 -- 64K processors with 1-bit arithmetic; hypercube network, broadcast network that can also combine, "global or" network
Maspar, DECmpp -- 16K processors with 4-bit (MP-1) or 32-bit (MP-2) arithmetic; two-dimensional mesh and general switch
Execube -- 16-bit processors with 8 integrated into an IBM 4 Mbit memory chip; SIMD or MIMD or both; 512 processors on an IBM RS6000 with a three-dimensional mesh

SIMD Architecture Issues
Node: 1-bit, 4-bit, 16-bit, 32-bit?
Interconnect issues are similar to MIMD.
Critical point -- is SIMD more cost effective than MIMD on a significant number of problems?
What types of problems run well on SIMD? Take a problem -- e.g. SOR for PDE solution -- that runs well on SIMD: is the MP-2 more cost effective than the CM-5, Paragon, SP-1?
Need to compare SIMD and MIMD machines at the "same technology" stage.
The current consensus is the opposite of 1990 -- namely, MIMD now dominates. SIMD (AMT DAP and Maspar) is aimed at market niches: defense signal processing; business index sorting.

Shared Memory Architecture
Some machines have a "true" shared memory, i.e. every location is "equally accessible" by each node: the original Ultracomputer, Burton Smith's Tera, some bus based systems.
Others, such as the BBN Butterfly and Kendall Square KSR-2, have non-uniform access times but all memory locations are accessible.

The General Structure of a full sized CRAY C-90
Each CPU -- two pipes; each pipe does one add and one multiply per clock period. Cycle time 4 nanoseconds.

The General Structure of a NEC SX-3
Classic vector supercomputer; each CPU has up to 4 functional units. Cycle time 2.9 nanoseconds.

Comparison of MIMD and SIMD Parallelism seen on Classic Vector Supercomputers
      DO 1 i = 1, alotoftimes
         ... many thousands of lines of detailed calculations such as:
         DO 2 y = 1,4
    2       a(y) = a(y) + b(y) * c(y)
    1 CONTINUE
Note that the MIMD parallelism is "larger" than the SIMD parallelism, as the MIMD size reflects the number of grid points, particles, etc. (a C rendering of these two loop levels appears at the end of this group of foils).

Shared versus Distributed Memory
Expected architectures of the future will be:
Physically distributed but with hardware support for shared memory, for tightly coupled MPPs such as the future IBM SP-X, Convex Exemplar, SGI (combined with Cray)
Physically distributed but without hardware support -- NOWs and COWs -- the World Wide Web as a metacomputer
Essentially all problems run efficiently on distributed memory, BUT software is easier to develop on a shared memory machine.
Some shared memory issues:
Cost-performance: additional hardware (functionality, network bandwidth) is needed to support shared memory
Scaling: can you build very big shared memory machines? Yes, for NUMA distributed shared memory.
Compiler challenges for distributed shared memory are difficult and a major focus of academic and commercial work.
This is not practically important now, as a 32-node KSR-2 (from the past) or SGI Power Challenge (cost ~< $2M) is already at the high end of the important commercial market.
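An added C rendering of the loop nest on the MIMD/SIMD comparison foil above (the foil's Fortran is the original; all names and sizes here are illustrative) makes the two levels explicit: the long outer loop over grid points is what an MIMD machine divides among processors, while the short inner loop is what a SIMD or vector unit executes in lockstep.

    /* Two levels of parallelism in the vector-supercomputer loop:
     * outer loop over grid points -> MIMD (block decomposition by rank),
     * inner length-4 loop         -> SIMD / vector lockstep execution.
     * rank, nprocs and ALOTOFTIMES are illustrative stand-ins. */
    #include <stdio.h>

    #define ALOTOFTIMES 1000000       /* number of grid points / particles */

    void compute(int rank, int nprocs, double a[4], double b[4], double c[4])
    {
        int chunk = (ALOTOFTIMES + nprocs - 1) / nprocs;  /* my block size  */
        int first = rank * chunk;
        int last  = first + chunk < ALOTOFTIMES ? first + chunk : ALOTOFTIMES;
        int i, y;

        for (i = first; i < last; i++) {      /* outer loop: MIMD parallelism */
            /* ... many lines of detailed calculation per grid point ... */
            for (y = 0; y < 4; y++)           /* inner loop: SIMD parallelism */
                a[y] = a[y] + b[y] * c[y];
        }
    }

    int main(void)
    {
        double a[4] = {0}, b[4] = {1, 1, 1, 1}, c[4] = {2, 2, 2, 2};
        compute(0, 4, a, b, c);               /* act as processor 0 of 4 */
        printf("a[0] after this processor's share: %.0f\n", a[0]);
        return 0;
    }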
What will happen in the year 2015 with 0.05 micron feature size and Petaflop Supercomputers using CMOS
Only superconducting technology can possibly do better.
Will probably need optical interconnects.

CMOS Technology and Parallel Processor Chip Projections
See Chapter 5 of Petaflops Report -- July 94

Processor Chip Requirements for a Petaflop Machine Using 0.05 Micron Technology
See Chapter 5 of Petaflops Report -- July 94

Three Designs for a Year 2015 Petaflops machine with 0.05 micron technology
See Chapter 6 of Petaflops Report -- July 94

The Global Shared Memory Category I Petaflop Architecture
See Chapter 6 of Petaflops Report -- July 94
This shared memory design is the natural evolution of systems such as the Cray-2,3 or Cray C-90.

Category II Petaflop Architecture -- Network of microprocessors
See Chapter 6 of Petaflops Report -- July 94
This architecture generalizes the current IBM SP-2 type of system and, unlike Category I, requires data locality for the up to 40,000 CPUs to be able to function efficiently with minimum communication overhead.

Category III Petaflop Design -- Processor in Memory (PIM)
See Chapter 6 of Petaflops Report -- July 94
This design is an extrapolation of systems such as the J Machine (Dally), Execube (Loral) or Mosaic (Seitz). It features CPU and memory integrated on the chip (PIM). Unlike such systems today, in the year 2015 such PIM designs have substantial memory per processor.

Necessary Latency to Support Three Categories
This is the formal latency needed to ensure that one can reach the design goals of the previous table. 1 picosecond (Category I) cannot be attained. You solve this by having N concurrent streams -- then each of them needs a latency of only N picoseconds. E.g., if the output rate of an arithmetic unit is 1 result per nanosecond, then each of the 400 one-teraflop CPUs in the Category I design must have 1000 arithmetic units running at full speed to reach a full system performance of 0.4 petaflops.
See Chapter 6 of Petaflops Report -- July 94

Chip Density Projections to year 2013
Extrapolated from SIA Projections to year 2007 -- See Chapter 6 of Petaflops Report -- July 94

DRAM Chip count for Construction of a Petaflop computer in year 2013 using 64 Gbit memory parts
Extrapolated from SIA Projections to year 2007 -- See Chapter 6 of Petaflops Report -- July 94

Memory Chip Bandwidth in Gigabytes/sec
Extrapolated from SIA Projections to year 2007 -- See Chapter 6 of Petaflops Report -- July 94

Power and I/O Bandwidth (I/O Connections) per Chip through the year 2013
Extrapolated from SIA Projections to year 2007 -- See Chapter 6 of Petaflops Report -- July 94

Clock Speed and I/O Speed in megabytes/sec per pin through year 2013
Extrapolated from SIA Projections to year 2007 -- See Chapter 6 of Petaflops Report -- July 94

Rules for Making Hypercube Network Topologies
A hypercube is 2^d computers arranged on the corners of a cube in d dimensions, each connected to its d nearest neighbors.
Label a hypercube vertex e1 e2 e3 e4 ... ed, where each ek takes the value 0 or 1 depending on whether the vertex is at the "top" or "bottom" of that axis.
Think of the hypercube as the unit cube in d dimensions with the origin at the bottom left-hand corner.
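A small added sketch of this labeling rule: with nodes labeled by d-bit strings, the d nearest neighbors of a node are obtained by flipping one bit at a time (d = 3 here, chosen only for illustration).

    /* Sketch: hypercube labels and nearest neighbors.
     * A node of a d-dimensional hypercube is labeled by the d bits
     * e1 e2 ... ed; its d neighbors differ from it in exactly one bit,
     * obtained by XOR-ing the label with 1<<k. */
    #include <stdio.h>

    #define D 3                       /* dimension of the hypercube       */

    static void print_bits(int x)     /* print a label as D binary digits */
    {
        int k;
        for (k = D - 1; k >= 0; k--)
            putchar(((x >> k) & 1) ? '1' : '0');
    }

    int main(void)
    {
        int node, k, nodes = 1 << D;  /* 2^d nodes in all */

        for (node = 0; node < nodes; node++) {
            printf("node ");
            print_bits(node);
            printf(" has neighbors:");
            for (k = 0; k < D; k++) { /* flip each axis in turn */
                printf(" ");
                print_bits(node ^ (1 << k));
            }
            printf("\n");
        }
        return 0;
    }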
Mapping of Hypercubes into Three Dimensional Meshes
In the string e1 e2 e3 e4 ... ed, one can write d = d1 + d2 + d3.
Any set of d1 binary indices can be mapped into a one-dimensional line with periodic (wrap-around) connections; the remaining d2 + d3 indices give the other two dimensions.
This decomposes the hypercube into a mesh with 2^d1 x 2^d2 x 2^d3 nodes.
So the hypercube includes lots of meshes, e.g. a d=6 hypercube contains 4 by 4 by 4, 8 by 8, 4 by 16, 64 by 1 meshes, etc.
So we can study one dimension without loss of generality!

Mapping of Hypercubes into One Dimensional Systems
d = 1 -- 2 nodes; d = 2 -- 4 nodes; d = 3 -- 8 nodes
The one-dimensional mapping can be thought of as being for one-dimensional problem solving, or for a one-dimensional layout of the chips forming the hypercube.
So at Hadrian's Wall we saw a one-dimensional layout of processors (people) to solve a one-dimensional problem.
However, computers need to be laid out in two or three dimensional worlds, and we can use the same ideas. We consider one dimension, as the discussion in two or three dimensions is similar: each axis (dimension) corresponds to a distinct subset of the binary indices.

Hypercube Versus Mesh Topologies
Hypercube: a mesh with equivalent wire length.
So one can let the long hypercube wires have "local" stops, and the hypercube becomes a "thick mesh".
With good routing technologies, a thick mesh is better than a hypercube?

Switches Versus Networks versus Fat Meshes
We can, in the diagram below, choose the staging point X to be: wires, as in the hypercube, or switch (routing network) elements.
The routing elements can be separate from or identical with the processor. Conversely, the "processor" always routes messages in a hypercube. This routing element can be attached to the CPU, as in a hypercube, or be an uncoupled processor -- routing chip pair, as in a switch.
Note that modern systems often have a separate CPU for routing (dedicated special purpose VLSI) and for calculation -- see the Intel Paragon.

Basic Parallel Computer Architecture
Shared and Hierarchical Memory Computers
Homogeneous and Hierarchical Memory Multicomputers
Hierarchical Hypercube -- Spatial and Temporal Decomposition
Homogeneous Hypercube -- Purely Spatial Decomposition
Cache Coherency -- cannot be manipulated independently

Hybrid Distributed/Shared Memory Architectures
Consider "nodes" = printed circuit boards (Paragon) or a group of printed circuit boards sharing a bus (SGI Challenge), where the N nodes are connected as a distributed memory architecture.
Inside each node, one may have n = 2, 4, 8, ... processors sharing a single memory. Should one use this shared memory, or just consider the total machine as distributed with N x n nodes?
This is seen today (1995) as a network of SGI Challenge systems (i.e. a metacomputer where a conventional HIPPI network joins a set of shared memory machines), or as a specialized torus network connecting Paragon nodes consisting of multiple i860s.

The INTEL Delta MIMD Distributed Memory Machine
Now we consider an example of a real system -- the Intel Touchstone Delta.
This was perhaps the first real parallel supercomputer used in serious work.
It is already obsolete and has been replaced by a larger Intel Paragon, which has a similar architecture.

Delta System Overview
Computational Characteristics: Peak speed 30.8 GFlops (DP); Total memory 8.2 GBytes; Disk capacity 95 GBytes
System Characteristics: Processors consist of 572 heterogeneous nodes, including 513 i860 compute processors; System architecture -- 16x36 two-dimensional mesh
Physical Characteristics: System size -- 16 ft. long, 5 ft. high, 3 ft. deep; Cooling method -- air cooled
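As a quick added consistency check (not on the original foils), the quoted peak speed follows from numbers given elsewhere in this section: 513 i860 compute nodes times the 60 MFlops double-precision peak per node quoted later on the i860 foil.

    /* Check the quoted Delta peak: 513 compute nodes x 60 MFlops (DP) each. */
    #include <stdio.h>

    int main(void)
    {
        int    nodes = 513;          /* i860 compute processors            */
        double per_node = 60.0e6;    /* DP peak per node, flops per second */

        printf("system DP peak = %.1f GFlops\n",
               nodes * per_node / 1.0e9);    /* prints 30.8 */
        return 0;
    }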
Communications Characteristics: Interconnect network -- two-dimensional mesh; bidirectional data flow; packet-switched network of routing chips. Internode communications -- 60-80 microsecond software latency; 28 MBytes/sec bandwidth

Delta System Architecture
513 compute nodes; 30.8 GFlops (double-precision peak performance), configured as a 16x32 compute mesh plus 1 compute node
8.2 GBytes memory; 95 GBytes disk space

Delta System Hardware Configuration
513 compute nodes using i860 processors with 16 MBytes memory each
9 service nodes using 80386 processors with 8 MBytes memory
32 disk I/O nodes using 80386 processors with 8 MBytes memory, each controlling two 1.5 GByte disks
9 tape I/O nodes using 80386 processors with 8 MBytes memory: 7 with two Exabyte 5 GByte tape drives each, 2 with two Exabyte 2 GByte tape drives each
2 Ethernet nodes using 80386 processors with 8 MBytes memory
2 HIPPI interfaces using i860 processors with 32 MBytes memory

Characteristics of the INTEL i860 Compute Node
RISC engine + 2 FPUs + graphics accelerator; 40 MHz clock
Parallel integer and floating-point processing units; parallel multiplier and adder units within the floating-point unit
Pipelined floating-point processing units (3-stage pipe)
80 MFlops (SP) peak, 60 MFlops (DP) peak, 33 MIPS (integer); all transcendental functions done in software
16 MBytes main memory (expandable to 64 MBytes); 160 MBytes/sec peak DRAM access rate
Very hard to build good compilers for the i860!

Delta System Communication Network
Physically independent bidirectional two-dimensional mesh

Delta Communication Network (cont'd)
System-wide communications fabric: handles all inter-node communication; handles all I/O communication
Interconnect supports arbitrary expansion increments
Automatically routes messages without interrupting intermediate compute nodes
Nearly equal performance between any two nodes, independent of the distance between them
The programmer can ignore the details of how messages move along the interconnect network
Mesh routing chips have 5 bidirectional channels each; the mesh routing chip delivers messages from sending to receiving nodes
Developed at Caltech by Prof. Charles Seitz under DARPA sponsorship
Interconnect supports 28 MBytes/sec node-to-node bandwidth
Allows node fault isolation

Road Map to Paul Messina's Chapter 2 of Parallel Computing Works -- I
1) ILLIAC
2) ICL DAP => AMT DAP
3) Goodyear MPP => LORAL ASPRO (used on the E2C plane); IBM Federal Systems Execube (Kogge, now at Notre Dame)
4) Denelcor* HEP, founded by Burton Smith => Tera
5) Cray XMP/22, characterized by shared memory and powerful nodes
6) Ultracomputer (NYU) => IBM RP3
7) Cosmic Cube (Mark I) => Intel, FPS*; Mark II Hypercube => Ametek (Symult)*, nCUBE; Mark III Hypercube

Road Map to Paul Messina's Chapter 2 of Parallel Computing Works -- II
8) Sequent, Encore*
9) Alliant* => Cedar* (Illinois)
10) Multiflow* wide word => superscalar (compilers are good); Cydrome*
11) Intel iPSC1,2 -- first commercial MIMD distributed memory; nCUBE; FPS* T Series; Symult (Ametek)*

Road Map to Paul Messina's Chapter 2 of Parallel Computing Works -- III
12) Transputers (open nCUBE) -- embedded systems; Inmos, Meiko, Parsytec, Transtech etc.
13) CM-2 -- real SIMD systems: Maspar (DECmpp), AMT DAP
14) nCUBE-2 -- powerful hypercubes: Intel Touchstone iPSC/860 (Gamma, Delta) (the i860 chip appeared in many add-on systems, such as transputer boards)

Road Map to Paul Messina's Chapter 2 of Parallel Computing Works -- IV
15) Intel Delta (528 nodes) -- MIMD but not hypercube (mesh) => commercial epsilon (Paragon)
16) BBN* TC2000 (Butterfly) -- series of elegant but slow shared memory machines
17) Cray C-90 -- very successful "conventional" supercomputer
18) CM-5 -- new MIMD distributed memory; signaled the "death" of SIMD as a major player and was the first of a set of "serious" commercial machines: Paragon; Fujitsu ($80M system is world's most powerful, Jan. '94); Meiko CS-2; Cray T3D; IBM SP-1,2; Convex Exemplar (shared memory, hierarchical)
19) DASH -- virtual shared memory; Alewife; KSR; Tera -- true UMA (Uniform Memory Access)

Performance of High End Machines Years 1940-2000
Performance of High End Machines Years 1980-2000