New CPS615 Foils -- 25 March 95

CPS615 -- Base Course for the Simulation Track of Computational Science
Fall Semester 1995 -- Foilsets A
Geoffrey Fox
NPAC, Room 3-131 CST, 111 College Place, Syracuse NY 13244-4100

Contents of Foilsets A of CPS615 Computational Science
  Technology Driving Forces for HPCC
  Overview of What and Why is Computational Science
    (Needs to be expanded with further remarks on the Information track and degree/certificate requirements)
  Elementary Discussion of Parallel Computing in the "real world" -- Hadrian's Wall example
  Sequential Computer Architecture

The Technology Driving Forces for HPCC
  Effect of Feature Size on Performance
  Growing Logic Chip Density
  Trends in Feature and Die Size as a Function of Time

Supercomputer Memory Sizes and Trends in RAM Density
  RAM density increases by about a factor of 50 every 8 years
  Supercomputers in 1992 have memory sizes around 32 gigabytes (giga = 10^9)
  Supercomputers in the year 2000 should have memory sizes around 1.5 terabytes (tera = 10^12)

Comparison of Trends in RAM Density and CPU Performance Increases
  Computer performance is increasing faster than RAM density

National Roadmap for Semiconductor Technology -- 1992
  See Chapter 5 of the Petaflops Report -- July 95

CMOS Technology and Parallel Processor Chip Projections
  See Chapter 5 of the Petaflops Report -- July 95

What and Why is Computational Science?
  Parallelism implies major changes, which have significant educational implications:
    Different machines
    New types of computers
    New libraries
    Rewritten applications
    Totally new fields able to use computers ...
    ==> Need new educational initiatives
  Computational Science will be a nucleus for the phase transition and will accelerate the use of parallel computers in the real world

Program in Computational Science -- Implemented within the Current Academic Framework

Program in Information Age Computational Science -- Implemented within the Current Academic Program

Elementary Discussion of Parallel Computing
  Single nCUBE2 CPU chip
  64-node nCUBE board -- each node is a CPU and 6 memory chips; the CPU chip integrates communication channels with the floating-point, integer, and logical CPU functions
  CM-5 in the NPAC machine room -- 32-node CM-5 with the old CM-2 diskvault in the foreground

Basic METHODOLOGY of Parallel Computing
  Domain decomposition is simple, but general and extensible to many more nodes
  All successful concurrent machines with
    Many nodes
    High performance (this excludes dataflow)
  have obtained their parallelism from "data parallelism" or "domain decomposition"
  A problem is an algorithm applied to a data set; it obtains concurrency by acting on the data concurrently, as in the sketch below.
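To make domain decomposition concrete, here is a minimal C sketch; it is an illustration under assumed values, not code from the course. A one-dimensional array of NCELLS cells is split into equal blocks across NODES processors, and each node updates only the cells it owns; the nodes are simulated by an outer loop on one CPU, and a three-point average stands in for a real application.

    /* Minimal sketch of data-parallel domain decomposition.
     * NCELLS and NODES are assumed values; the "nodes" are simulated
     * by an ordinary loop rather than real processors. */
    #include <stdio.h>

    #define NCELLS 16                 /* total problem size (assumed)   */
    #define NODES   4                 /* number of processors (assumed) */

    int main(void)
    {
        double u_old[NCELLS + 2], u_new[NCELLS + 2]; /* one ghost cell per end */
        int node, i;

        for (i = 0; i < NCELLS + 2; i++)             /* simple initial data */
            u_old[i] = (double) i;

        for (node = 0; node < NODES; node++) {       /* each node, "in parallel" */
            int lo = node * (NCELLS / NODES) + 1;    /* first cell this node owns */
            int hi = lo + NCELLS / NODES - 1;        /* last cell this node owns  */

            /* Local update: a three-point average over owned cells only.
             * On a real machine, u_old[lo-1] and u_old[hi+1] would arrive
             * from the neighbouring nodes by message passing. */
            for (i = lo; i <= hi; i++)
                u_new[i] = (u_old[i - 1] + u_old[i] + u_old[i + 1]) / 3.0;

            printf("node %d owns cells %d..%d\n", node, lo, hi);
        }
        return 0;
    }

On a distributed-memory machine such as the nCUBE2, each node would store only its own block plus the two ghost cells; the decomposition arithmetic is unchanged.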
The three architectures considered here differ as follows:
  MIMD Distributed Memory -- processing and data distributed
  MIMD Shared Memory -- processing distributed but data shared
  SIMD Distributed Memory -- synchronous processing on distributed data

Concurrent Computation as a Mapping Problem - I
  Two different types of mappings in physical spaces; both are static:
  a) Seismic migration with domain decomposition on 4 nodes
  b) Universe simulation with irregular data but a static 16-node decomposition; this problem would be best served by a dynamic irregular decomposition

Concurrent Computation as a Mapping Problem - II
  A different type of mapping -- a very dynamic case without any underlying physical space:
  c) Computer chess with a dynamic game tree decomposed onto 4 nodes

Concurrent Computation as a Mapping Problem - III
  Finite element mesh from NASTRAN (mesh shown only in the upper half)
  A simple equal-area decomposition, and the corresponding poor workload balance
  The decomposition after annealing (one particularly good but nonoptimal decomposition), and its excellent workload balance

Parallel Processing and Society
  The fundamental principles behind the use of concurrent computers are identical to those used in society -- in fact, they are partly why society exists.
  If a problem is too large for one person, one does not hire a SUPERman, but rather puts together a team of ordinary people...
  cf. the construction of Hadrian's Wall

Concurrent Construction of a Wall Using N = 8 Bricklayers
  Decomposition by vertical sections
  Domain decomposition is the key to parallelism
  Need "large" subdomains: l >> l_overlap

Quantitative Speed-Up Analysis for the Construction of Hadrian's Wall

Amdahl's Law for Real-World Parallel Processing
  AMDAHL's LAW, or "too many cooks spoil the broth"
  Says that the speedup S = eN is small if the efficiency e is small, or, for Hadrian's Wall, equivalently, that S is small if the wall length l is small
  But this is irrelevant, as we do not need parallel processing unless the problem is big!

Pipelining -- Another Parallel Processing Strategy for Hadrian's Wall
  "Pipelining", or decomposition by horizontal sections, is in general less effective and leads to less parallelism
  (N = the number of bricklayers must be less than the number of layers of bricks)

Hadrian's Wall Illustrates that the Topology of the Processor Must Include the Topology of the Problem
  Hadrian's Wall is one-dimensional
  Humans represent a flexible processor node that can be arranged in different ways for different problems
  The lesson for computing:
    The original MIMD machines used a hypercube topology
    The hypercube includes several topologies, including all meshes; it is a flexible concurrent computer that can tackle a broad range of problems
    Current machines use interconnect structures different from the hypercube but preserve this capability

General Speed-Up Analysis Comparing the Computer and Hadrian's Wall Cases
  Comparison of the complete problem to the subproblems formed in domain decomposition

The Case of Programming a Hypercube
  Each node runs software that is similar to the sequential code, e.g., FORTRAN with the geometry and boundary-value sections changed

Hadrian's Wall Illustrating an Irregular but Homogeneous Problem
  The geometry is irregular, but each brick takes about the same amount of time to lay.
  Decomposing the wall for an irregular geometry involves equalizing the number of bricks per mason, not the length of wall per mason; a sketch of this balancing follows.
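As a concrete (and hypothetical) illustration of equalizing bricks rather than length, the C sketch below partitions contiguous wall sections among masons with a simple greedy rule; the section data and the number of masons are assumed. The foils' heavier tools (simulated annealing, spectral bisection, neural networks) attack the same balancing problem for irregular meshes in two or three dimensions.

    /* Hypothetical sketch: balance an irregular but homogeneous wall by
     * giving each mason contiguous sections containing roughly equal
     * numbers of bricks.  Section data and NMASONS are assumed. */
    #include <stdio.h>

    #define NSECTIONS 10
    #define NMASONS    4

    int main(void)
    {
        /* bricks in each unit-length section of an irregular wall */
        int bricks[NSECTIONS] = { 40, 60, 30, 70, 50, 50, 20, 80, 60, 40 };
        int total = 0, cum = 0, load = 0;
        int s, mason = 0, start = 0;

        for (s = 0; s < NSECTIONS; s++)
            total += bricks[s];

        for (s = 0; s < NSECTIONS; s++) {
            cum  += bricks[s];
            load += bricks[s];
            /* close this mason's range once the cumulative brick count
             * passes the mason's ideal share of the total */
            if (mason < NMASONS - 1 && cum * NMASONS >= (mason + 1) * total) {
                printf("mason %d: sections %d..%d, %d bricks\n",
                       mason, start, s, load);
                mason++;
                start = s + 1;
                load  = 0;
            }
        }
        printf("mason %d: sections %d..%d, %d bricks\n",
               mason, start, NSECTIONS - 1, load);
        return 0;
    }

With the data shown, the four masons receive 130, 120, 150, and 100 bricks -- roughly balanced despite the very unequal lengths of wall they cover.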
Some Problems are Inhomogeneous
  Illustrated by an inhomogeneous Hadrian's Wall with decoration
  The fundamental entities (bricks, gargoyles) are of different complexity
  The best decomposition is dynamic
  Inhomogeneous problems run on concurrent computers but require dynamic assignment of work to nodes and strategies to optimize this (we use neural networks, simulated annealing, spectral bisection, etc.)

Global and Local Parallelism Illustrated by Hadrian's Wall
  Global parallelism:
    Break up the domain
    The amount of parallelism is proportional to the size of the problem (and is usually large)
    The unit is a bricklayer or a computer node
  Local parallelism:
    Perform in parallel the local operations in the processing of the basic entities
    e.g., for Hadrian's problem, use two hands, one for the brick and one for the mortar, while for the computer case, do an addition at the same time as a multiplication
    Local parallelism is limited but useful
  Local and global parallelism should both be exploited

Parallel I/O Illustrated by Concurrent Brick Delivery for Hadrian's Wall
  The bandwidth of the trucks and roads matches that of the masons
  Disk (input/output) technology is better matched to several modest-power processors than to a single sequential supercomputer
  Concurrent computers are natural for databases and transaction analysis

Nature's Concurrent Computers
  At the finest resolution, a collection of neurons sending and receiving messages by axons and dendrites
  At a coarser resolution, society is a collection of brains sending and receiving messages by sight and sound
  An ant hill is a collection of ants (smaller brains) sending and receiving messages by chemical signals
  Lesson: all Nature's computers use message passing, with several different architectures

Comparison of Concurrent Processing in Society and Computing
  Problems are large -- use domain decomposition
  Overheads are edge effects
  The topology of the processor should match that of the domain; a processor with a rich, flexible node and topology matches most domains
  Regular homogeneous problems are easiest, but irregular or inhomogeneous problems can also be handled
  One can use both local and global parallelism
  One can handle concurrent calculation and I/O
  Nature always uses message passing, as in parallel computers (at the lowest level)

Sequential Computer Architecture

Instruction Flow in a Simple Machine Pipeline
  Three instructions are shown overlapped, each starting one clock cycle after the last; a worked timing estimate follows below

Examples of Superpipelined (a) and Superscalar (b) Machine Pipelines
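To quantify the overlap in a simple machine pipeline, here is a standard textbook estimate, not taken from the foils; the stage count k and instruction count n are free parameters.

    % Idealized k-stage pipeline executing n instructions,
    % one stage per clock cycle.
    \[
      T_{\mathrm{serial}} = nk, \qquad
      T_{\mathrm{pipe}} = k + (n - 1), \qquad
      S = \frac{T_{\mathrm{serial}}}{T_{\mathrm{pipe}}}
        = \frac{nk}{k + n - 1} \;\longrightarrow\; k
      \quad (n \to \infty).
    \]

For the three overlapped instructions shown on the foil, with an assumed four-stage pipeline (k = 4, n = 3), T_pipe = 4 + 2 = 6 cycles against 12 cycles unpipelined, a speedup of 2; superpipelined and superscalar designs raise this further by deepening the pipeline or issuing several instructions per cycle.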