Parallel Machines

Two theories of parallel execution: "Two heads are better than one" and "Too many cooks spoil the soup." Both can happen in practice. Keeping all the processors busy and accessing local memory means a good collaboration; synchronization and accessing nonlocal memory mean the chefs are fighting over the pots.

Generalities about modern parallel processors:
- Individual processors are reasonably fast
- Local memory access costs one to ten cycles
- Nonlocal access costs tens to hundreds of cycles
- Synchronization costs hundreds to thousands of cycles

Parallel computers allow several CPUs to contribute to a computation simultaneously. For our purposes, a parallel computer has three types of parts:
- Processors
- Memory modules
- Communication / synchronization network

Key points:
- All processors must be busy for peak speed.
- Local memory is directly connected to each processor.
- Accessing local memory is much faster than accessing other memory.
- Synchronization is expensive, but necessary for correctness.

Distributed Memory Machines

Conceptually, the Paragon is very similar to the CM-5, except that:
- Bandwidth is higher
- Latency is longer
- The network topology is a two-dimensional torus

To program these machines (a message-passing sketch appears after the Parallel Algorithms discussion below):
- Divide the problem to minimize the number of messages while retaining parallelism
- Convert all references to global structures into references to local pieces
- Optimization: pack messages together to reduce fixed overhead (almost always needed)
- Optimization: carefully schedule messages (usually done by the library)

Every processor has a memory that the others cannot access.

Advantages:
- Can be scalable
- Can hide the latency of communication

Disadvantages:
- Hard to program
- Program and O/S (and sometimes data) must be replicated

Shared-Memory Machines

The interconnection network shown in the accompanying figure is actually for the BBN Butterfly, but the C-90 is in the same spirit.

- These machines share data by direct access.
- Potentially conflicting accesses must be protected by synchronization (illustrated in the second sketch below).
- Simultaneous access to the same memory bank will cause contention, degrading performance.
- Some access patterns will collide in the network (or bus), also causing contention.
- Many machines have caches at the processors.
- All these features make it profitable for each processor to concentrate on one area of memory that the others access infrequently.

All processors access the same memory.

Advantages:
- Easy to program (correctly)
- Can share code and data among processors

Disadvantages:
- Hard to program (optimally)
- Not scalable, due to bandwidth limitations

Distributed Shared Memory Machines

These combine the advantages of shared and distributed memory, and many hierarchical designs are appearing:
- Typically, "nodes" with 4 to 32 processors
- Each processor has a local cache
- Processors within a node access shared memory
- Nodes can get data from or put data to other nodes' memories

Parallel Algorithms

A parallel algorithm is a collection of tasks and a partial ordering between them. This is just one view of parallel programs (and algorithms).

Note: the tasks mentioned here may not be the tasks you are used to. Here, "task" means "a sequence of operations that is scheduled atomically and can run to completion once started." That is, a task is the computation between synchronization points. For example, an MPI process is a collection of tasks.

Design goals:
- Match tasks to the available processors (exploit parallelism).
- Minimize ordering (avoid unnecessary synchronizations).
- Utilize the remaining ordering (exploit locality).
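To make the distributed-memory programming steps above concrete, here is a minimal sketch of one relaxation step on a block-distributed 1-D array, written in C with MPI. The decomposition, the array names, and the local size (NLOCAL, u, unew) are illustrative choices, not taken from these notes; the point is that every reference is to a local piece and that boundary values travel in one packed message per neighbor rather than one message per element.

```c
/* Sketch: one step of a 1-D relaxation on a distributed array.
 * Each rank owns a local slice plus two "ghost" cells that mirror
 * its neighbors' boundary values.  NLOCAL, u, and unew are
 * illustrative names, not from the notes. */
#include <mpi.h>

#define NLOCAL 1000             /* points owned by this processor */

int main(int argc, char **argv)
{
    int rank, size;
    double u[NLOCAL + 2], unew[NLOCAL + 2];   /* +2 ghost cells */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    for (int i = 0; i < NLOCAL + 2; i++) u[i] = (double)i;

    /* Exchange boundary values: one message per neighbor per step
     * (packing reduces the fixed per-message overhead). */
    MPI_Sendrecv(&u[NLOCAL], 1, MPI_DOUBLE, right, 0,
                 &u[0],      1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[1],          1, MPI_DOUBLE, left,  1,
                 &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Update interior points using only local (and ghost) data. */
    for (int i = 1; i <= NLOCAL; i++)
        unew[i] = 0.5 * (u[i - 1] + u[i + 1]);

    MPI_Finalize();
    return 0;
}
```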
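For the shared-memory case, the sketch below (C with POSIX threads) shows the discipline described earlier: each thread works mostly on its own slice of a shared array, and acquires a lock only for the single update that genuinely conflicts. The thread count, array size, and function names are illustrative, not part of the original notes.

```c
/* Sketch: several threads accumulate into one shared total.
 * Private work needs no synchronization; the one shared update is
 * protected by a mutex and done once per thread, not per element.
 * NTHREADS, N, worker, etc. are illustrative names. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N        1000000

static double data[N];
static double total = 0.0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    long id = (long)arg;
    double partial = 0.0;

    /* Private work on a contiguous slice: good locality, no locks. */
    for (long i = id * (N / NTHREADS); i < (id + 1) * (N / NTHREADS); i++)
        partial += data[i];

    /* Shared update: potentially conflicting, so it is protected. */
    pthread_mutex_lock(&lock);
    total += partial;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (long i = 0; i < N; i++) data[i] = 1.0;
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    printf("total = %f\n", total);
    return 0;
}
```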
Sources of parallelism:
- Data parallelism: updating array elements simultaneously.
- Functional parallelism: conceptually different tasks which combine to solve the problem.
- Speculative parallelism: temporarily ignore the partial ordering, and repair later if this causes problems.
This list is not exhaustive.

Data Parallelism in Algorithms

Data-parallel algorithms exploit the parallelism inherent in many large data structures: many elements in the structure mean many update operations.

Example: red-black relaxation, where all red points can be updated in parallel (a sketch appears at the end of these notes).

Analysis:
- Scalable parallelism
- Synchronization and locality may be suspect

Functional Parallelism in Algorithms

Functional parallelism exploits the parallelism between the parts of many systems: many pieces to work on mean many independent operations.

Example: aeroelasticity (wing design), where CFD and CSM (and others, if need be) can be evaluated in parallel.

Analysis:
- Parallelism limited
- Synchronization probably good

Parallel Languages

A parallel language provides an executable notation for implementing a parallel algorithm.

Design criteria:
- How are parallel operations defined? Static tasks vs. dynamic tasks vs. implicit operations
- How is data shared between tasks? Explicit communication/synchronization vs. shared memory
- How is the language implemented? Low-overhead runtime systems vs. optimizing compilers

Usually a language reflects a particular type of parallelism.

Data-Parallel Languages

Data parallelism originated on SIMD machines, but the concept has since been generalized and implemented on many architectures; the definition is still somewhat fuzzy. HPF is largely based on data parallelism. Other data-parallel languages of note: C*, Dataparallel C (derived from C*), NESL, Fortran D, Fortran 90D, and Vienna Fortran.

Data-parallel languages provide an abstract, machine-independent model of parallelism:
- Fine-grain parallel operations, such as element-wise operations on arrays
- Shared data in large, global arrays with mapping "hints"
- Implicit synchronization between operations
- Partially explicit communication from operation definitions

Advantages:
- Global operations are conceptually simple
- Easy to program (particularly for certain scientific applications)

Disadvantages:
- Unproven compilers

Data-Parallel Languages, cont.

Data parallelism is (an attempt at) a high-level language. High-level languages have inherent advantages and disadvantages:
- It's easier to concentrate on the high-level ideas. This means you can write the program faster, and you're more likely to get it right.
- The underlying system takes over the low-level details. This means you can't micro-manage the execution, and you can't always tell what will happen.

Abstractions like data parallelism split the work between the programmer and the compiler.

Programmer's task: solve the problem in this model.
- Concentrate on high-level structure and concepts
- Aggregate operations on large data structures
- Data in global arrays with mapping information

Compiler's task: map the conceptual (massive) parallelism onto the physical (finite) machine.
- Fill in the grungy details
- Guided by user "hints" such as mapping information
- Optimize computation and communication

Message-Passing Systems

For comparison to data-parallel languages:
- The program is based on relatively coarse-grain tasks
- Each task has a separate address space and a processor number
- Data is shared by explicit messages
- Point-to-point and collective communication patterns
- Examples: MPI, PVM, Occam

Advantages: close to the hardware.
Disadvantages: many low-level details.
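As a small illustration of the two communication patterns just named, the following C sketch uses one MPI point-to-point exchange and one collective reduction. It assumes at least two processes; the variable names and the tag value are arbitrary choices made for the example.

```c
/* Sketch: point-to-point vs. collective communication in MPI.
 * Rank 0 sends one value to rank 1, then all ranks combine their
 * local results with a collective reduction.  Requires >= 2 ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double local = 1.0, global = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Point-to-point: explicit sender, receiver, and message tag. */
    if (rank == 0)
        MPI_Send(&local, 1, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(&local, 1, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    /* Collective: every rank participates; the library schedules
     * the underlying messages. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
```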
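Finally, returning to the red-black relaxation example from the Data Parallelism in Algorithms discussion: the sketch below expresses one sweep in C with an OpenMP parallel loop rather than in a data-parallel language. The grid size, array name, and coloring convention are illustrative assumptions; the essential point is that every point of one color can be updated independently.

```c
/* Sketch: one sweep of red-black relaxation on an N x N grid.
 * Points are colored like a checkerboard; each point of one color
 * reads only points of the other color, so every half-sweep can be
 * updated fully in parallel.  N and u are illustrative names. */
#include <omp.h>

#define N 512

static double u[N][N];

void red_black_sweep(void)
{
    /* color 0 = one parity class ("red"), color 1 = the other ("black") */
    for (int color = 0; color < 2; color++) {
        #pragma omp parallel for
        for (int i = 1; i < N - 1; i++)
            for (int j = 1 + (i + color) % 2; j < N - 1; j += 2)
                u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                  u[i][j - 1] + u[i][j + 1]);
    }
}
```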