GCF Talk U - Lewis/NASA 3/9/92

Overview of High Performance Computing for Physical Sciences
Geoffrey C. Fox
gcf@nova.npac.syr.edu
315.443.2163

Mardi Gras Conference: Concurrent Computing in the Physical Sciences
Louisiana State University, Baton Rouge, LA
February 18, 1993

Performance of Computers as a Function of Time

Performance of QCD Machines

When will Parallel Computing Take Over?
This is a phase transition.
It needs headroom (Carver Mead), which is large (a factor of 10?) due to the large new software investment.
nCUBE-1 and CM-2 were comparable in cost performance to conventional supercomputers: enough to show that "Parallel Computing Works", but not enough to take over!
Do Intel Paragon, CM-5, DECmpp (MP-2), IBM SP-1 have enough headroom?

Parallelism Implies
Different machines
New types of computers
New libraries
Rewritten applications
Totally new fields able to use computers ....
==> Need new educational initiatives

Computational Science
Will be a nucleus for the phase transition and will accelerate the use of parallel computers in the real world.
A Program in Computational Science can be implemented within the current academic framework.

We have learnt that Parallel Computing Works!
Data Parallelism - the universal form of scaling parallelism
Functional Parallelism - important, but typically gives modest speedup; critical in multidisciplinary applications
On any machine architecture:
Distributed Memory MIMD
Distributed Memory SIMD
Shared Memory - this affects the programming model
The architecture affects generality: SIMD suits ~50% of academic problems but <50% of commercial ones.

METHODOLOGY
The simple but general methodology, extensible to many more nodes, is domain decomposition.
All successful concurrent machines with many nodes and high performance (this excludes dataflow) have obtained parallelism from "Data Parallelism" or "Domain Decomposition".
A problem is an algorithm applied to a data set; we obtain concurrency by acting on the data concurrently.
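To make the domain decomposition idea concrete, here is a minimal sketch (mine, not from the talk) of dividing a data set into blocks, one per node, with each node applying the algorithm only to the data it owns. The node number and node count would really come from the machine's runtime system; here they are illustrative constants.

C     Minimal sketch of domain decomposition / owner-computes.
C     NPROC and MYNODE are illustrative assumptions; a real node
C     program would obtain them from the system.
      PROGRAM DECOMP
      INTEGER N, NPROC, MYNODE, NLOCAL, IFIRST, ILAST, I
      PARAMETER (N = 1000, NPROC = 4)
      REAL X(N)
C     Pretend this node's number was obtained from the runtime
      MYNODE = 0
C     Each node owns a contiguous block of N/NPROC points
      NLOCAL = N / NPROC
      IFIRST = MYNODE * NLOCAL + 1
      ILAST  = IFIRST + NLOCAL - 1
C     Owner-computes: this node updates only the data it owns
      DO 10 I = IFIRST, ILAST
         X(I) = 0.0
   10 CONTINUE
      END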
The three architectures considered here differ as follows:
MIMD Distributed Memory - processing and data distributed
MIMD Shared Memory - processing distributed
SIMD Distributed Memory - synchronous processing on distributed data

Sideways View of Concurrent Computing Mappings

Sample Uses of Concurrent Computers - 4 Nodes

Parallel Computing Trends to Year 2000
Hardware trends imply that all computers (PCs ---> supercomputers) will be parallel by the year 2000.
Up to now parallel computers have come from small start-up companies (except the Intel Supercomputer Division).
Recently Cray, Convex (HP), Digital and IBM have announced massively parallel computing initiatives.
There are several architectures, but only one - the distributed memory MIMD multicomputer - is known to scale from one to very many processors.
Software is the challenge and could prevent/delay the hardware trend that suggests parallelism will be the mainline computer architecture: first we must get systems software correct, then applications software.

39 National Grand Challenges
39 National Grand Challenges (II)
39 National Grand Challenges (III)
39 National Grand Challenges (IV)

Most Grand Challenges are:
Partial differential equations
Particle dynamics
and multidisciplinary integration
Image processing
Some: Visualization, Artificial Intelligence
Not much: Network simulation, Economic (and other complex system) modeling, Scheduling, Manufacturing, Education, Entertainment, Information processing, BMC3IS (Command and Control in military war), Decision support in global economic war

ACTION Motivation
Industrial and government problems:
Are a factor of 10 to 100 larger (up to 10^6 lines)
Have a longer lifetime
Have less accessible expertise
The parallel hardware must be much better (a factor of >10 "headroom" in Carver Mead's model) than the sequential competition. This will happen in the next two years.
We must have better software to port and maintain large parallel codes -- one can only convert once!
Are software issues the same for academic and industrial production codes? Probably not. So we need to integrate computer science (software) research and development with experience from industrial codes.

Applications of Interest to Industry IA
Applications of Interest to Industry IB
Applications of Interest to Industry IC

Pluses and Minuses of Parallel Computing
We know parallel versions of the basic algorithms currently being used in most kernels.
There are some counterexamples:
1) Parallelize incomplete LU factorization for irregular sparse matrices
2) Develop a new neural network to predict and hence avoid hurricanes, or avoid computer failures ....
3) Optimal preconditioning for 3D partial differential equations is not known in many applications --- to be expected, as there is little experience.
One can't use a computer of performance 10P without substantial experience on a computer with performance P.

PCETECH -- A Result of:
The study of the implementation of ACTION: what can a small enterprise like ACTION offer much bigger industries?
The study of the role of applications in CRPC: what can a relatively small group of computer scientists (CRPC) offer the larger national HPCCI?
The study of the relation of CRPC and the NSF Supercomputer Centers, which are again much larger
==> Need to concentrate in areas where CRPC's (ACTION's) unique capabilities have high value.
PCETECH includes and expands the technology development components of CRPC and ACTION.

The Life History of an HPC Idea

Core Enabling Software Technologies
PVM, Express, Linda
Einstein (CSW)
ISIS (Cornell)
High Performance Fortran Compiler
High Performance C, C++ Compiler
HPF Extensions - PARTI
Parallel / Distributed Computing Runtime Tools
ADIFOR
AVS and Extensions
MOVIE (Syracuse)
High Performance Fortran Interpreter
Televirtuality / Virtual Reality
OS
Parallel Database - Oracle 7.0
OO Database
Integration Methodology for the above software
Heterogeneous Network Target

Core Enabling Algorithms and Components
Mesh Generation
SCALAPACK
Sparse Matrix Solvers - templates and libraries (direct and iterative)
Particle Dynamics Kernels - templates and libraries (O(N^2) to fast multipole)
Optimization Methodology and Templates:
Linear programming
Non-linear programming
Scheduling (neural-net, parallel) templates

Portable Scalable Languages
We may have to rewrite our code, but we want the resulting program to run on "all" current and future (anticipated) machines.
No compelling new languages: e.g. OCCAM does not solve enough problems to warrant adoption.
Adapt existing languages: Fortran, C++, ADA, C, LISP, PROLOG, BASIC .......... COBOL?
These can map onto individual sequential or parallel computers, or networks of these (distributed computing).

Problem Architectures

Software Problem Architecture Interactions

Why build on existing languages - especially Fortran / C?
Experience of users
Good sequential and node compilers exist
Migration of existing codes
It is hard to build a new language, as one needs all the features/libraries of Fortran / C implemented well. This is why one uses Fortran 90 and not APL.
Fortran and C are good (OK) for a class of applications.

Fortran (C) Plus Message Passing
The natural MIMD model of programming, used in the vast majority of current successful MIMD programming.
Each node has a program which controls and calculates its own data (the owner-computes "rule"):
      DO i running over local data
         CALL READ (for non-local data)
         CALCULATE all of i's data
In the simplest form there is one program per node of the computer.

Fortran (C) Plus Message Passing
Advantages:
Portable to MIMD distributed and shared memory machines - should scale to future machines with some difficulties (overlapping communication with itself and with calculation)
Available now
Will be industry standards - Express, PICL, Linda now
"All problems" can be written in it
Disadvantages:
Can be "hard work" for the user, who must insert communication calls
Optimizations are not portable
Only MIMD

Data Parallel Fortran
e.g. a program represented as a sequence of functions operating on arrays:
      DIMENSION A(100,100), B(100,100), C(100,100)
      A = B + C
      A = B + (C shifted in x, y)
      FORALL (i, j) A(i,j) = B(i,j) + C(i-1,j) + ......
in APL or Fortran 90-like syntax, or with explicit loops.
All data is owned by a particular processor.
Owner-computes rule: for an expression A(i,j) = ........., the processor holding A(i,j) calculates the expression, after communicating any needed data if the r.h.s. contains data not owned by that processor.
Similar concepts exist in C, C++, LISP, ADA .......
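The "shift" line above can be written concretely with the standard Fortran 90 circular-shift intrinsic. A minimal fragment (array names taken from the example above; the initializations are mine, added only so the fragment runs):

C     Minimal Fortran 90 fragment for the data parallel example above.
C     Every element of A is computed independently; CSHIFT is the
C     standard circular-shift intrinsic, with DIM=1 playing the role
C     of a shift in x.
      REAL A(100,100), B(100,100), C(100,100)
      B = 1.0
      C = 2.0
C     Elementwise sum: each element can be computed concurrently
      A = B + C
C     "B plus C shifted in x"
      A = B + CSHIFT(C, SHIFT=1, DIM=1)
      END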
Data Parallel Fortran (C, C++, LISP, ADA)
Advantages:
Very portable and scalable to SIMD and MIMD
Should eventually be able to handle ~all synchronous and loosely synchronous problems, including ones that only run well on MIMD
An industry standard will be adopted: "High Performance Fortran", with hope for consensus ~December 1992; it includes CMFortran, FortranD .....
Disadvantages:
Need to wait for a good compiler
Not all problems can be expressed in it

The Origins of High Performance Fortran

Strategy for HPF
Program in an SPMD data parallel style:
The user writes code with (conceptually) a single thread of control and globally accessible data.
The program is annotated with assertions giving information about desired data locality and/or distribution.
The compiler generates code implementing data and work distribution from the user-written SPMD application:
for MIMD - the implementation is multi-threaded message passing with local data, with generated synchronization and send/receive code
for SIMD - the implementation is parallel code with communication optimized by compiler placement of data
for vectors - the implementation is vectorized code optimized for the vector units
for RISC - the implementation is pipelined superscalar code with compiler-generated cache management

Model for Scalable Languages

Dynamically Triangulated Random Surfaces
String theories are candidates for a TOE (Theory of Everything). They provide a plausible quantum theory of gravity.
The basic objects are tiny one-dimensional "strings" (rather than points). Dynamics is described by a two-dimensional space-time world-sheet (rather than a world line).
Calculations involve integrating over all possible world-sheets, i.e. all 2-d surfaces embedded in some higher dimensional space (3-d, 4-d, 26-d, ....).
This is done numerically by a Monte Carlo integration over different triangulations of the surface, produced dynamically using the Metropolis algorithm.
Current research centers on finding good discretizations of the continuum theories, and on trying to find critical points with divergent correlation lengths, so that the details of the discrete lattice structure become irrelevant and we can extract continuum results (cf. lattice QCD).
Interesting critical points are phase transitions between a smooth phase and a crumpled phase of the surface. The order and critical exponents of this crumpling transition are still uncertain.
We have done the largest simulations so far on 2-d systems, but these are only 576-node meshes. Critical slowing down is a major problem - we need better algorithms.
The same techniques are used to model real fluctuating surfaces, e.g. cell membranes, lipid bilayers, phase boundaries.

Update Strategies for DTRS
N=36 Wireframe DTRS
N=144 Lambda=1.5 Colored DTRS
N=144 Lambda=1 Colored DTRS

Computational Issues and Code Performance Optimization
The program for random surfaces is very irregular, with many linked lists and pointer trees to be traversed. The original code performed very poorly. We completely rewrote and simplified the code and gained a substantial speed-up.
It still performed poorly on some RISC processors such as the Intel i860. Optimal use of cache memory is vital for processors such as the i860, which have a small cache memory and for which the time to access external memory is much greater than the time to perform a floating point operation.
Due to the dynamic nature of the code, the neighbors of a site which are used for the update of that site are constantly changing.
This means we cannot make good use of the cache unless we keep a new array which constantly readjusts pointers to neighbors, so that data for neighboring sites is kept in neighboring regions of memory.
Other optimizations on the i860 and the use of non-IEEE arithmetic gave us a substantial improvement, but still nowhere near the 80 Mflops peak performance!
The program performs much better on more recent processors (IBM RS/6000, Hewlett-Packard).

Sequential Computer Architecture

Performance of Random Surface Code

Parallel Algorithms
Independent Parallelism
We have so far used independent parallelism on MIMD machines and workstation networks.
Due to strong critical slowing down this can only be used for small meshes, since producing an independent mesh for each processor takes so long. We need to parallelize over a single simulation.

Parallel Random Surfaces
Load Balancing - Load balancing is poor on SIMD machines, since the number of links per node can range from 3 to 20.
Dynamic Domain Decomposition - The dynamic mesh means that neighboring points will change during the simulation and eventually be on distant processors, requiring substantial non-local communication. We need dynamic load balancing to preserve local communications (cf. the dynamic adjustment of arrays to keep neighboring points in fast cache memory on sequential machines).
Detailed Balance - The most difficult problem is to preserve detailed balance. A site cannot be updated at the same time as the other sites on which its value depends. For a regular 2-d grid, use a black/red checkerboard update. For a dynamic, irregular grid, identifying independent sites is a time-consuming problem which must be redone before every iteration.
Irregular Adaptive Grids for PDEs - This problem has much in common with parallel algorithms for solving PDEs on irregular, adaptive grids (e.g. CFD).
Implementation - We are currently developing parallel code for the CM-5 and Intel Delta.

Parallel Grid Generation

What is Fortran 90?
FORTRAN 77 is entirely contained within Fortran 90.
The language goal is to "modernize Fortran, so that it may continue its long history as a scientific and engineering programming language".
Major language additions include:
array operations
numeric computation facilities
parameterized intrinsic data types
modularization of data and procedures
pointer facilities
many more intrinsic functions
It is a significantly larger language than FORTRAN 77.

HPF Language Features - 1
Data Distribution Directives (locality of reference):
TEMPLATE directive
ALIGN and REALIGN directives
DISTRIBUTE and REDISTRIBUTE directives
DYNAMIC directive
PROCESSORS and processor VIEW directives
Parallel Statements:
FORALL statement and construct, PURE procedures
INDEPENDENT directive
Extended Intrinsic Functions and Standard Library:
system inquiry functions
computational functions
distribution inquiry functions

HPF Language Features - 2
EXTRINSIC Procedures (escape from the global name space model):
EXTRINSIC directive
Fortran 90 binding for the EXTRINSIC interface
Parallel I/O Statements: none
Changes in Sequence and Storage Association:
restrictions
SEQUENCE directive
In addition, a subset of HPF suitable for earlier implementation is defined.

FORTRAN-90D: The First Implementation of HPF (NPAC, Syracuse University)
Current Status
Alpha version: demonstrated at SUPERCOMPUTING '92 in Minneapolis, MN
one- and two-dimensional BLOCK, CYCLIC and BLOCK-CYCLIC distributions
FORALL statement
full run-time support for regular communications
PARTI (ICASE) primitives for irregular communications
intrinsic library
Now running on nCUBE2, iPSC/860 and networks of SUN workstations
Beta release: June 1993
Alliance with industry to develop a commercial HPF compiler, optimized for specific architectures
Next target: heterogeneous networks

Fortran90D Performance on Gaussian Elimination

Gaussian Elimination: Example of Fortran90D Source Code

Fortran90D Interface

HPF Directives
Directives take the syntactic form of structured Fortran comments.
A program that contains directives is expected to generate the same results whether the directives are processed or not, assuming that the directives are used correctly.
HPF directives are consistent with Fortran 90: if HPF were to be adopted as part of a future Fortran standard, only the comment character and the directive prefix would have to be removed from HPF directives.
Directives follow either free source form or fixed source form.

Data Alignment
Specifies the lowest-level abstract alignment of array elements.
A TEMPLATE is an abstract, very fine-grain virtual processor space.
The template labels physically independent objects, e.g. gluons, fluid particles.
Align degrees of freedom, e.g. gluon components, to template members.
Use for machine-independent alignment between structures.

Data Distribution
Specifies the coarse-grain partitioning of data.
PROCESSORS is a coarse-grain processor grid, presumably in 1-1 correspondence with physical processors.
Use for machine-dependent partitioning of data structures.
The compiler will normally use the "owner-computes rule".

FORALL
Elemental array assignment statement (similar to array expressions).
FORALL is not a general parallel construct!
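A small illustrative fragment (the array sizes and the 4x4 processor shape are my own choices, not from the talk) showing how the directives above combine with a FORALL. The owner of each A(I,J) computes it, and the compiler generates any communication needed for the shifted references to B:

      REAL A(100,100), B(100,100)
!HPF$ TEMPLATE T(100,100)
!HPF$ ALIGN A(I,J) WITH T(I,J)
!HPF$ ALIGN B(I,J) WITH T(I,J)
!HPF$ PROCESSORS P(4,4)
!HPF$ DISTRIBUTE T(BLOCK,BLOCK) ONTO P
C     Elemental assignment over the interior of the array
      FORALL (I = 2:99, J = 2:99)
     $   A(I,J) = 0.25 * (B(I-1,J) + B(I+1,J) + B(I,J-1) + B(I,J+1))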
Functions without side effects may be called from FORALL; other functions may be allowed as well.

DO INDEPENDENT
An assertion that the iterations of the DO may be executed in any order, or concurrently.
If the assertion is false, the compiler is free to take any action it deems necessary.
There are no restrictions on the loop body, and no way to perform explicit synchronization.

HPF Library
Much of HPF's parallel "power" is buried in its library.
Personal note: Fortran 90D was motivated by the similarities between Caltech message-passing "collective" communication and the Fortran 90 intrinsics.
18 non-elemental (true array) Fortran 90 intrinsics, e.g. CSHIFT, SUM
7 new HPF intrinsics, e.g. alignment, template and distribution inquiries
4 new HPF reduction (combine) functions
11 new HPF combine-scatter functions
22 new HPF parallel prefix functions
2 new HPF parallel sorting functions
3 other new HPF parallel functions
7 HPF local intrinsics
74 library routines (==> ~1000 routines when written in clumsy F77)

Intrinsic Library: HPF Library, Fortran 90, Local Routine Intrinsics

Does HPF need a different type of compiler?
HPF makes much greater use of runtime optimizations than the traditional sequential or parallel compiler.
Fortran 77:
      DO 1 I = 1, N
    1 A(I) = FUNC(B(I), B(I +/- 1), C(I), C(I +/- 1), .....)
Traditionally the compiler arranges access to these elements.
HPF:
      A = FUNC(B, C)
This invokes an ALREADY OPTIMIZED parallel implementation of FUNC.
In many cases, the HPF compiler need not know anything about parallelization except interfaces, data layout, etc.

HPF Interpreter
Runtime optimizations are acceptable (and perhaps best) when the "grain size" (i.e. the total number of elements in the array divided by Nproc) is large:
==> the "startup" cost of runtime procedure calls is small
==> one can use an interpreter rather efficiently (cf. APL and *LISP on the CM-2)
There is only ONE data parallelism ==> the same runtime library can support C, C++, ADA, APL and an HPF interpreter.

Benchmarking Set (Fortran90D and Fortran 77)

HPF/FORTRAN-90D Benchmark Suite

The final draft of the HPF Language Specification is version 1.0 - DRAFT, dated January 25, 1993.
Anonymous FTP:
minerva.npac.syr.edu (directory HPFF)
titan.cs.rice.edu (directory public/HPFF/draft)
think.com (directory public/HPFF)
ftp.gmd.de (directory hpf-europe)
theory.tc.cornell.edu (directory pub)
Electronic Mail from the Softlib server at Rice University. This can be accessed in two ways:
A. Send electronic mail to softlib@cs.rice.edu with "send hpf-v10.ps" in the message body. The report is sent as a PostScript file.
B. Send electronic mail to softlib@cs.rice.edu with "send hpf-v10.tarZ" in the message body. The report is sent as a uuencoded compressed tar file containing the LaTeX source.
Hardcopy: send requests to Theresa Chatman, CITI/CRPC, Box 1892, Rice University, Houston, TX 77251. There is a charge of $50.00 for this report to cover copying and mailing costs.

Request for Public Comment on High Performance Fortran
HPFF encourages reviewers to submit comments as soon as possible, with a deadline of March 1, 1993 for consideration. Send comments to hpff-comments@cs.rice.edu.
To facilitate the processing of comments, we request that separate comment messages be submitted for each chapter of the document and that the chapter be clearly identified in the "Subject:" line of the message. Comments about general overall impressions of the HPF document should be labeled as Chapter 1.
All comments on the language specification become the property of Rice University.
If email access is impossible for comment responses, hardcopy may be sent to:
HPF Comments, c/o Theresa Chatman, CITI/CRPC, Box 1892, Rice University, Houston, TX 77251
HPFF plans to process the feedback received at a meeting in March. Best efforts will be made to reply to comments submitted.

Application Structure Versus Fortran90D Features

What applications does HPF support?
A somewhat related issue: what problems are suitable for HPF and make good use of MIMD architectures, i.e. run well on MIMD but poorly on SIMD?

Classification of Problems (Fox, 1988)

Current HPF can do I .....
Current HPF can do II .....

Embarrassingly Parallel Problems
The basic entities can be very complicated, but they can be calculated independently. Use the INDEPENDENT command in HPF.
Search databases for given text (ELIXIR for DECmpp) (can be SIMD)
Analyse separately 10^7 events recorded from Synchrotron (SSC) collisions (definitely a MIMD "farm")
Calculate matrix elements for (chemistry) Hamiltonians (may need MIMD): MOPAC, GAUSSIAN. The reaction scattering rate work within C3P at Caltech is "calculate matrix elements" plus "multiply/solve/find eigenvalues of matrix".
Quantum Monte Carlo: independent "walkers"
Parallel histogram (database) storage

HPF can express using FORALL:
Particle dynamics with cutoff forces (V_ij = 0 for |r_i - r_j| > d) and irregular spatial structure: CHARMM, AMBER ..............
Unstructured finite element meshes, where the data structure involves arrays of pointers. However, new distribution primitives are needed to produce good performance. It is known how to optimize such unstructured problems for both static and adaptive cases.
The performance of SIMD or autonomous SIMD (Maspar) versus MIMD is unclear for these applications.

Classic Irregular Mesh Application

HPF Can Express Simulation of Spin Models of Magnets
Simple models reproduce the quantitative behavior of real magnetic materials with complex interactions.
They are fundamental to the development of the theory of phase transitions and critical phenomena, which has very wide applicability.
The computational algorithms and techniques can be applied to other fields, e.g.:
Monte Carlo simulation of lattice models of quantum field theory and quantum gravity
simulation of polymers, proteins, cell membranes
simulated annealing and other "physical" methods for global optimization
neural networks
condensed matter problems, e.g. superconductors
Since the models are simple and exact results are known in some cases (e.g. the 2-d Ising model), these are used as a testing ground for new computational techniques, e.g.:
Monte Carlo algorithms
measurements and data analysis
random number generators

New Monte Carlo Algorithms
The Metropolis Algorithm
A major problem in Monte Carlo simulation is critical slowing down. Standard algorithms (e.g. Metropolis) do single-site, local updates. The number of iterations needed to change the large-scale structure and produce a new uncorrelated configuration diverges as a power of the lattice size at a critical point.
Cluster Algorithms
New multi-spin, non-local algorithms (Swendsen-Wang, Wolff) rapidly change the large-scale structure by identifying clusters of sites to be updated at once, greatly reducing critical slowing down. They are currently only applicable to a limited class of models.
Ongoing research includes:
Extensions to frustrated spin models (e.g. spin glasses) where critical slowing down is extreme
Precise measurements of autocorrelations and dynamic critical exponents to help understand the dynamics of the new algorithms
Application of the new algorithms to simulation of spin models, e.g.
the O(3) model and the fully frustrated Ising model
Parallel cluster algorithms

Simulated Tempering
A new method of making small changes in temperature while keeping the system in equilibrium. Applications include:
Allowing tunneling between states at first order phase transitions (e.g. the random field Ising model)
Global optimization (a la simulated annealing)

Magnetization for Random Field Ising Model -- Metropolis
Magnetization for Random Field Ising Model -- Simulated Annealing
Wolff cluster (bands shown in yellow) for the 3-state Potts model at Tc
Swendsen-Wang clusters (boundaries shown in black) for the 3-state Potts model at Tc
Cluster Algorithm versus Metropolis

Parallel Algorithms
Metropolis Algorithm
A regular, local algorithm: local communication and perfect load balance using standard domain decomposition. Efficiently parallelizable on SIMD and MIMD machines.
Swendsen-Wang Algorithm
An irregular, non-local, multiple cluster algorithm. The main computational task is to identify clusters - connected component labeling, also used in image processing. The clusters are highly irregular (fractal) in shape, and of all sizes (usually including one large cluster covering much of the lattice). Efficient parallel component labeling is a very difficult problem - poor load balancing, lots of non-local communication.
Wolff Algorithm
An irregular, non-local, single cluster algorithm. Standard domain decomposition gives very poor load balance and very limited parallelism.
Traditional simple, regular, local algorithms parallelize efficiently. Newer irregular, adaptive, non-local algorithms are better sequentially but often more difficult to parallelize.

Parallel Cluster Algorithms
Parallel Component Labeling
Simple SIMD component labeling algorithms are very inefficient, taking of order L iterations to label an L^2 lattice.
These algorithms can be used effectively on MIMD machines, however: use fast sequential algorithms to label sub-lattices, and the simple SIMD algorithm to match clusters across processors. This takes of order sqrt(P) iterations for an array of P processors. Good efficiencies are obtained if L >> P.
On SIMD machines (CM-2, Maspar), multi-grid algorithms using regular (hypercube) communication, and multi-scale algorithms using general non-local communication, do the labeling in of order log(L) iterations. These are inefficient due to poor load balancing and irregular clusters.
Independent Parallelism
On coarse-grain MIMD machines or workstation networks one can parallelize over independent simulations with different random number streams. This is the only way to parallelize the Wolff algorithm efficiently.
On massively parallel machines, one can use a combination of domain decomposition and independent parallelism for the Swendsen-Wang algorithm, e.g. run 8 independent simulations of 64 nodes each on a 512-node nCUBE.

MIMD algorithm on nCUBE-1
SIMD algorithms on CM-2
Autocorrelation Plots for Cluster Algorithms

HPF probably cannot express well:
Multiple phase problems such as
Particle in the cell
Unstructured multigrid
One needs to express the "alignment" between phases (different levels of multigrid; particles and grid for PIC) and then optimize the distribution and redistribution between phases.
Irregular problems of this type are only suitable for MIMD. Very irregular cases with mismatches between phases perform poorly even on MIMD. I do not know how to express - let alone optimize - such problems.
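By contrast, the regular, local Metropolis update discussed above is exactly the kind of loosely synchronous computation HPF handles well. A hedged sketch (declarations, halo convention and the UPDATE function are my illustrative placeholders, not code from the talk) of the black/red checkerboard sweep for a 2-d spin lattice:

C     Hedged sketch: red/black checkerboard Metropolis sweep.  Sites
C     of one color have no same-color neighbors, so a whole color can
C     be updated at once without violating detailed balance.  SPIN
C     carries a halo (indices 0 and L+1) assumed filled by boundary
C     conditions; UPDATE stands for the Metropolis accept/reject step
C     (in real HPF it would need an explicit PURE interface).
      INTEGER L, I, J, ICOLOR
      PARAMETER (L = 64)
      REAL SPIN(0:L+1, 0:L+1), UPDATE
      DO 20 ICOLOR = 0, 1
         FORALL (I = 1:L, J = 1:L, MOD(I+J, 2) .EQ. ICOLOR)
     $      SPIN(I,J) = UPDATE(SPIN(I,J), SPIN(I-1,J), SPIN(I+1,J),
     $                         SPIN(I,J-1), SPIN(I,J+1))
   20 CONTINUE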
Late Breaking Results
Michael Warren (with John Salmon) has reformulated the Barnes-Hut parallel algorithm.
The new parallel algorithm appears suitable for incorporation into an extended HPF compiler as a generalization of the PARTI methods used to parallelize irregular grids.

Large N-Body Calculations (Quinn, Salmon, Warren)
Clustering algorithms (Appel, Barnes-Hut, Greengard) can replace an M-body force calculation by one using the center of mass of the cluster.
Naive calculation: 0.5 N^2 t(2-particle), where t(2-particle) is the time for one pairwise force evaluation.
Clustering: (20-50) N log N t(2-particle) (best when N > 1000).
(For N = 10^6 this is of order 10^9 instead of 5 x 10^11 pairwise force evaluations.)
One can do O(10,000) particles with the O(N^2) algorithm; a one to two orders of magnitude larger problem is possible with the clustering algorithm.

Hierarchical Decomposition

nCUBE Speed-Up of the Barnes-Hut Algorithm

17 x 10^6 body simulation, diameter 250 Mpc (Quinn, Salmon, Warren, Zurek)
The largest "galaxy" halo (137,000 bodies) from the 8.8M body simulation (Warren, Fullagar, Quinn, Zurek)
8M bodies, 10 Mpc diameter: final state with ~700 resolved "galaxies" (Warren, Quinn, Zurek)
Smooth Particle Hydro simulation of the collision of two "main sequence stars", 137,000 bodies (Warren, Davies)

Expressing Problem Structure in Language
The key idea behind data-parallel languages - and perhaps all good languages - is that the language expresses the problem and not the machine architecture.
A different approach is needed to express functional parallelism.
Use "object-oriented" features when the problem has natural objects; do not use "object-oriented" features when the objects are artifacts of the target machine.
Data and functional (object) parallel paradigms are both needed in many problems - especially multidisciplinary ones.
The map of Problem ---> Computer is performed in two or more stages.

From Problem to Machine Space as a Mapping

Different Granularities of Decomposition I
We can divide problems and machines into interconnected modules - and we do this at different granularities. We can consider two limiting cases.
(I) High performance, low latency links
Machine
o Classic MPP such as the Intel Paragon
Problem
o QCD: components are gluon fields; links are local
Software model
o High Performance Fortran
o Currently only regular structure
o Eventually all such tightly coupled, loosely synchronous problems
o Fortran + message passing

Different Granularities of Decomposition II

Compiler/Interpreter Tradeoffs at Different Levels of Granularity
One can divide a problem into "highly optimized" versus "flexible" at different levels:
HPF Compiler - the component is a full HPF program; glue full programs together on a bus
HPF Interpreter - the component is a single MATLAB/APL/HPF array statement; glue data parallel coarse-grain array statement "function" calls together

The Mapping of Heterogeneous MetaProblems onto Heterogeneous MetaComputer Systems

Mapping of Irregular Grid

Finding General Maps for FortranD/HPF

Three Physical Optimization Methods for Allocating Data to Multicomputer Nodes
Three methods using physical analogies (a hedged sketch of the first appears after this list):
1. Simulated Annealing Algorithm (SAA) - Formulate the optimization as finding the ground state of a physical system. Use Monte Carlo to equilibrate and reduce the temperature.
2. Bold Neural Network (BNN) - A mean field approximation to the same analogy as in SAA, to find quick minima. Related to "eigenvalue bisection" methods.
3. Genetic Algorithm (GA) - Maximize the fitness (minimize the cost function) of an evolving population of structures (candidate solutions).
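A hedged sketch of the simulated annealing allocator (all names, the cooling schedule and step count are my own illustrative choices, not the actual CRPC code). MAP(I) gives the node assigned to data point I; COST is the execution-time "Hamiltonian" (one possible form is sketched at the end of this document); RAND stands for a uniform random number generator on [0,1):

      SUBROUTINE ANNEAL(MAP, NPOINT, NNODE, NSTEP)
C     Hedged sketch of simulated annealing for data allocation:
C     propose moving one point to another node, accept with the
C     Metropolis probability exp(-dE/T), and slowly lower T.
C     COST and RAND are placeholders supplied by the user.
      INTEGER NPOINT, NNODE, NSTEP, MAP(NPOINT)
      INTEGER IPT, INODE, IOLD, ISTEP
      REAL T, EOLD, ENEW, COST, RAND
      EXTERNAL COST, RAND
      T = 1.0
      EOLD = COST(MAP, NPOINT, NNODE)
      DO 30 ISTEP = 1, NSTEP
C        Propose: move a randomly chosen point to a random node
         IPT   = 1 + INT(RAND() * NPOINT)
         INODE = 1 + INT(RAND() * NNODE)
         IOLD  = MAP(IPT)
         MAP(IPT) = INODE
         ENEW = COST(MAP, NPOINT, NNODE)
C        Metropolis: always accept downhill moves, sometimes uphill
         IF (ENEW .LE. EOLD .OR.
     $       RAND() .LT. EXP((EOLD - ENEW) / T)) THEN
            EOLD = ENEW
         ELSE
            MAP(IPT) = IOLD
         END IF
C        Cooling schedule: reduce the temperature geometrically
         T = 0.9999 * T
   30 CONTINUE
      END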
Finite Element Mesh

Graph Contraction

Some Results of Load Balancing Studies I
Some Results of Load Balancing Studies II

Physics Analogy for Load Balancing
A complex system S_HLSoft governed by a Hamiltonian = Execution Time

PHYSICS ANALOGY FOR STATIC AND DYNAMIC LOAD BALANCING

Definition of Temperature for a Complex System

Particle dynamics problem on a four node system

Energy Structure in Physics Analogy with Multiple Minima

Phase Transitions in Physical Model -- Scattered versus Domain Decompositions
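The slides above only name the Hamiltonian ("execution time"). For concreteness, here is a hedged sketch of one common form, usable as the COST routine in the annealing sketch earlier: the sum over nodes of (work per node)^2 penalizes load imbalance, and a term proportional to the number of cut links counts communication. The 1-d chain connectivity, the weight MU and the node limit are illustrative assumptions, not the form used in the studies above.

      REAL FUNCTION COST(MAP, NPOINT, NNODE)
C     Hedged sketch of a load-balancing "Hamiltonian": squared work
C     per node penalizes imbalance; MU is added for each cut link.
C     Connectivity here is a 1-d chain (point I talks to I+1) purely
C     for illustration; real studies use general graphs.
C     Assumes NNODE <= 1024.
      INTEGER NPOINT, NNODE, MAP(NPOINT), I
      REAL WORK(1024), MU, E
      PARAMETER (MU = 0.1)
      DO 10 I = 1, NNODE
         WORK(I) = 0.0
   10 CONTINUE
C     Accumulate the work (here, one unit per point) on each node
      DO 20 I = 1, NPOINT
         WORK(MAP(I)) = WORK(MAP(I)) + 1.0
   20 CONTINUE
      E = 0.0
      DO 30 I = 1, NNODE
         E = E + WORK(I)**2
   30 CONTINUE
C     Communication: charge MU for every link cut by the mapping
      DO 40 I = 1, NPOINT - 1
         IF (MAP(I) .NE. MAP(I+1)) E = E + MU
   40 CONTINUE
      COST = E
      END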