GCF Talk U - Lewis/NASA 3/9/92

Overview of High Performance Computing for Physical Sciences
Geoffrey C. Fox
gcf@nova.npac.syr.edu
315.443.2163

Mardi Gras Conference: Concurrent Computing in the Physical Sciences
Louisiana State University, Baton Rouge, LA
February 18, 1993

Performance of Computers as a Function of Time

Performance of QCD Machines

When will Parallel Computing Take Over?
This is a phase transition.
It needs headroom (Carver Mead), which is large (a factor of 10?) due to the large new software investment.
nCUBE-1 and CM-2 were comparable in cost performance to conventional supercomputers: enough to show that "Parallel Computing Works", but not enough to take over!
Do Intel Paragon, CM-5, DECmpp (MP-2), IBM SP-1 have enough headroom?

Parallelism Implies
Different machines
New types of computers
New libraries
Rewritten applications
Totally new fields able to use computers ....
==> Need new educational initiatives

Computational Science
Will be a nucleus for the phase transition and will accelerate the use of parallel computers in the real world.
A Program in Computational Science can be implemented within the current academic framework.

We have learnt that Parallel Computing Works!
Data Parallelism - the universal form of scaling parallelism
Functional Parallelism - important, but typically gives modest speedup; critical in multidisciplinary applications
On any machine architecture:
Distributed Memory MIMD
Distributed Memory SIMD
Shared Memory - this affects the programming model
The architecture affects generality: SIMD suits ~50% of academic problems but <50% of commercial ones.

METHODOLOGY
The simple but general methodology, extensible to many more nodes, is domain decomposition.
All successful concurrent machines with many nodes and high performance (this excludes dataflow) have obtained parallelism from "Data Parallelism" or "Domain Decomposition".
A problem is an algorithm applied to a data set; we obtain concurrency by acting on the data concurrently.
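To make the domain decomposition idea concrete, here is a minimal sketch (mine, not from the talk) of dividing a data set into blocks, one per node, with each node applying the algorithm only to the data it owns. The node number and node count would really come from the machine's runtime system; here they are illustrative constants.

C     Minimal sketch of domain decomposition / owner-computes.
C     NPROC and MYNODE are illustrative assumptions; a real node
C     program would obtain them from the system.
      PROGRAM DECOMP
      INTEGER N, NPROC, MYNODE, NLOCAL, IFIRST, ILAST, I
      PARAMETER (N = 1000, NPROC = 4)
      REAL X(N)
C     Pretend this node's number was obtained from the runtime
      MYNODE = 0
C     Each node owns a contiguous block of N/NPROC points
      NLOCAL = N / NPROC
      IFIRST = MYNODE * NLOCAL + 1
      ILAST  = IFIRST + NLOCAL - 1
C     Owner-computes: this node updates only the data it owns
      DO 10 I = IFIRST, ILAST
         X(I) = 0.0
   10 CONTINUE
      END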
The three architectures considered here differ as follows:
MIMD Distributed Memory - processing and data distributed
MIMD Shared Memory - processing distributed
SIMD Distributed Memory - synchronous processing on distributed data

Sideways View of Concurrent Computing Mappings

Sample Uses of Concurrent Computers - 4 Nodes

Parallel Computing Trends to Year 2000
Hardware trends imply that all computers (PCs ---> supercomputers) will be parallel by the year 2000.
Up to now parallel computers have come from small start-up companies (except the Intel Supercomputer Division).
Recently Cray, Convex (HP), Digital and IBM have announced massively parallel computing initiatives.
There are several architectures, but only one - the distributed memory MIMD multicomputer - is known to scale from one to very many processors.
Software is the challenge and could prevent/delay the hardware trend that suggests parallelism will be the mainline computer architecture: first we must get systems software correct, then applications software.

39 National Grand Challenges
39 National Grand Challenges (II)
39 National Grand Challenges (III)
39 National Grand Challenges (IV)

Most Grand Challenges are:
Partial differential equations
Particle dynamics
and multidisciplinary integration
Image processing
Some: Visualization, Artificial Intelligence
Not much: Network simulation, Economic (and other complex system) modeling, Scheduling, Manufacturing, Education, Entertainment, Information processing, BMC3IS (Command and Control in military war), Decision support in global economic war

ACTION Motivation
Industrial and government problems:
Are a factor of 10 to 100 larger (up to 10^6 lines)
Have a longer lifetime
Have less accessible expertise
The parallel hardware must be much better (a factor of >10 "headroom" in Carver Mead's model) than the sequential competition. This will happen in the next two years.
We must have better software to port and maintain large parallel codes -- one can only convert once!
Are software issues the same for academic and industrial production codes? Probably not. So we need to integrate computer science (software) research and development with experience from industrial codes.

Applications of Interest to Industry IA
Applications of Interest to Industry IB
Applications of Interest to Industry IC

Pluses and Minuses of Parallel Computing
We know parallel versions of the basic algorithms currently being used in most kernels.
There are some counterexamples:
1) Parallelize incomplete LU factorization for irregular sparse matrices
2) Develop a new neural network to predict and hence avoid hurricanes, or avoid computer failures ....
3) Optimal preconditioning for 3D partial differential equations is not known in many applications --- to be expected, as there is little experience.
One can't use a computer of performance 10P without substantial experience on a computer with performance P.

PCETECH -- A Result of:
The study of the implementation of ACTION: what can a small enterprise like ACTION offer much bigger industries?
The study of the role of applications in CRPC: what can a relatively small group of computer scientists (CRPC) offer the larger national HPCCI?
The study of the relation of CRPC and the NSF Supercomputer Centers, which are again much larger
==> Need to concentrate in areas where CRPC's (ACTION's) unique capabilities have high value.
PCETECH includes and expands the technology development components of CRPC and ACTION.

The Life History of an HPC Idea

Core Enabling Software Technologies
PVM, Express, Linda
Einstein (CSW)
ISIS (Cornell)
High Performance Fortran Compiler
High Performance C, C++ Compiler
HPF Extensions - PARTI
Parallel / Distributed Computing Runtime Tools
ADIFOR
AVS and Extensions
MOVIE (Syracuse)
High Performance Fortran Interpreter
Televirtuality / Virtual Reality
OS
Parallel Database - Oracle 7.0
OO Database
Integration Methodology for the above software
Heterogeneous Network Target

Core Enabling Algorithms and Components
Mesh Generation
SCALAPACK
Sparse Matrix Solvers - templates and libraries (direct and iterative)
Particle Dynamics Kernels - templates and libraries (O(N^2) to fast multipole)
Optimization Methodology and Templates:
Linear programming
Non-linear programming
Scheduling (neural-net, parallel) templates

Portable Scalable Languages
We may have to rewrite our code, but we want the resulting program to run on "all" current and future (anticipated) machines.
No compelling new languages: e.g. OCCAM does not solve enough problems to warrant adoption.
Adapt existing languages: Fortran, C++, ADA, C, LISP, PROLOG, BASIC .......... COBOL?
These can map onto individual sequential or parallel computers, or networks of these (distributed computing).

Problem Architectures

Software Problem Architecture Interactions

Why build on existing languages - especially Fortran / C?
Experience of users
Good sequential and node compilers exist
Migration of existing codes
It is hard to build a new language, as one needs all the features/libraries of Fortran / C implemented well. This is why one uses Fortran 90 and not APL.
Fortran and C are good (OK) for a class of applications.

Fortran (C) Plus Message Passing
The natural MIMD model of programming, used in the vast majority of current successful MIMD programming.
Each node has a program which controls and calculates its own data (the owner-computes "rule"):
      DO i running over local data
         CALL READ (for non-local data)
         CALCULATE all of i's data
In the simplest form there is one program per node of the computer.

Fortran (C) Plus Message Passing
Advantages:
Portable to MIMD distributed and shared memory machines - should scale to future machines with some difficulties (overlapping communication with itself and with calculation)
Available now
Will be industry standards - Express, PICL, Linda now
"All problems" can be written in it
Disadvantages:
Can be "hard work" for the user, who must insert communication calls
Optimizations are not portable
Only MIMD

Data Parallel Fortran
e.g. a program represented as a sequence of functions operating on arrays:
      DIMENSION A(100,100), B(100,100), C(100,100)
      A = B + C
      A = B + (C shifted in x, y)
      FORALL (i, j) A(i,j) = B(i,j) + C(i-1,j) + ......
in APL or Fortran 90-like syntax, or with explicit loops.
All data is owned by a particular processor.
Owner-computes rule: for an expression A(i,j) = ........., the processor holding A(i,j) calculates the expression, after communicating any needed data if the r.h.s. contains data not owned by that processor.
Similar concepts exist in C, C++, LISP, ADA .......
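The "shift" line above can be written concretely with the standard Fortran 90 circular-shift intrinsic. A minimal fragment (array names taken from the example above; the initializations are mine, added only so the fragment runs):

C     Minimal Fortran 90 fragment for the data parallel example above.
C     Every element of A is computed independently; CSHIFT is the
C     standard circular-shift intrinsic, with DIM=1 playing the role
C     of a shift in x.
      REAL A(100,100), B(100,100), C(100,100)
      B = 1.0
      C = 2.0
C     Elementwise sum: each element can be computed concurrently
      A = B + C
C     "B plus C shifted in x"
      A = B + CSHIFT(C, SHIFT=1, DIM=1)
      END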
Data Parallel Fortran (C, C++, LISP, ADA)
Advantages:
Very portable and scalable to SIMD and MIMD
Should eventually be able to handle ~all synchronous and loosely synchronous problems, including ones that only run well on MIMD
An industry standard will be adopted: "High Performance Fortran", with hope for consensus ~December 1992; it includes CMFortran, FortranD .....
Disadvantages:
Need to wait for a good compiler
Not all problems can be expressed in it

The Origins of High Performance Fortran

Strategy for HPF
Program in an SPMD data parallel style:
The user writes code with (conceptually) a single thread of control and globally accessible data.
The program is annotated with assertions giving information about desired data locality and/or distribution.
The compiler generates code implementing data and work distribution from the user-written SPMD application:
for MIMD - the implementation is multi-threaded message passing with local data, with generated synchronization and send/receive code
for SIMD - the implementation is parallel code with communication optimized by compiler placement of data
for vectors - the implementation is vectorized code optimized for the vector units
for RISC - the implementation is pipelined superscalar code with compiler-generated cache management

Model for Scalable Languages

Dynamically Triangulated Random Surfaces
String theories are candidates for a TOE (Theory of Everything). They provide a plausible quantum theory of gravity.
The basic objects are tiny one-dimensional "strings" (rather than points). Dynamics is described by a two-dimensional space-time world-sheet (rather than a world line).
Calculations involve integrating over all possible world-sheets, i.e. all 2-d surfaces embedded in some higher dimensional space (3-d, 4-d, 26-d, ....).
This is done numerically by a Monte Carlo integration over different triangulations of the surface, produced dynamically using the Metropolis algorithm.
Current research centers on finding good discretizations of the continuum theories, and on trying to find critical points with divergent correlation lengths, so that the details of the discrete lattice structure become irrelevant and we can extract continuum results (cf. lattice QCD).
Interesting critical points are phase transitions between a smooth phase and a crumpled phase of the surface. The order and critical exponents of this crumpling transition are still uncertain.
We have done the largest simulations so far on 2-d systems, but these are only 576-node meshes. Critical slowing down is a major problem - we need better algorithms.
The same techniques are used to model real fluctuating surfaces, e.g. cell membranes, lipid bilayers, phase boundaries.

Update Strategies for DTRS
N=36 Wireframe DTRS
N=144 Lambda=1.5 Colored DTRS
N=144 Lambda=1 Colored DTRS

Computational Issues and Code Performance Optimization
The program for random surfaces is very irregular, with many linked lists and pointer trees to be traversed. The original code performed very poorly. We completely rewrote and simplified the code and gained a substantial speed-up.
It still performed poorly on some RISC processors such as the Intel i860. Optimal use of cache memory is vital for processors such as the i860, which have a small cache memory and for which the time to access external memory is much greater than the time to perform a floating point operation.
Due to the dynamic nature of the code, the neighbors of a site which are used for the update of that site are constantly changing.
This means we cannot make good use of the cache unless we keep a new array which constantly readjusts pointers to neighbors, so that data for neighboring sites is kept in neighboring regions of memory.
Other optimizations on the i860 and the use of non-IEEE arithmetic gave us a substantial improvement, but still nowhere near the 80 Mflops peak performance!
The program performs much better on more recent processors (IBM RS/6000, Hewlett-Packard).

Sequential Computer Architecture

Performance of Random Surface Code

Parallel Algorithms
Independent Parallelism
We have so far used independent parallelism on MIMD machines and workstation networks.
Due to strong critical slowing down this can only be used for small meshes, since producing an independent mesh for each processor takes so long. We need to parallelize over a single simulation.

Parallel Random Surfaces
Load Balancing - Load balancing is poor on SIMD machines, since the number of links per node can range from 3 to 20.
Dynamic Domain Decomposition - The dynamic mesh means that neighboring points will change during the simulation and eventually be on distant processors, requiring substantial non-local communication. We need dynamic load balancing to preserve local communications (cf. the dynamic adjustment of arrays to keep neighboring points in fast cache memory on sequential machines).
Detailed Balance - The most difficult problem is to preserve detailed balance. A site cannot be updated at the same time as the other sites on which its value depends. For a regular 2-d grid, use a black/red checkerboard update. For a dynamic, irregular grid, identifying independent sites is a time-consuming problem which must be redone before every iteration.
Irregular Adaptive Grids for PDEs - This problem has much in common with parallel algorithms for solving PDEs on irregular, adaptive grids (e.g. CFD).
Implementation - We are currently developing parallel code for the CM-5 and Intel Delta.

Parallel Grid Generation

What is Fortran 90?
FORTRAN 77 is entirely contained within Fortran 90.
The language goal is to "modernize Fortran, so that it may continue its long history as a scientific and engineering programming language".
Major language additions include:
array operations
numeric computation facilities
parameterized intrinsic data types
modularization of data and procedures
pointer facilities
many more intrinsic functions
It is a significantly larger language than FORTRAN 77.

HPF Language Features - 1
Data Distribution Directives (locality of reference):
TEMPLATE directive
ALIGN and REALIGN directives
DISTRIBUTE and REDISTRIBUTE directives
DYNAMIC directive
PROCESSORS and processor VIEW directives
Parallel Statements:
FORALL statement and construct, PURE procedures
INDEPENDENT directive
Extended Intrinsic Functions and Standard Library:
system inquiry functions
computational functions
distribution inquiry functions

HPF Language Features - 2
EXTRINSIC Procedures (escape from the global name space model):
EXTRINSIC directive
Fortran 90 binding for the EXTRINSIC interface
Parallel I/O Statements: none
Changes in Sequence and Storage Association:
restrictions
SEQUENCE directive
In addition, a subset of HPF suitable for earlier implementation is defined.

FORTRAN-90D: The First Implementation of HPF (NPAC, Syracuse University)
Current Status
Alpha version: demonstrated at SUPERCOMPUTING '92 in Minneapolis, MN
one- and two-dimensional BLOCK, CYCLIC and BLOCK-CYCLIC distributions
FORALL statement
full run-time support for regular communications
PARTI (ICASE) primitives for irregular communications
intrinsic library
Now running on nCUBE2, iPSC/860 and networks of SUN workstations
Beta release: June 1993
Alliance with industry to develop a commercial HPF compiler, optimized for specific architectures
Next target: heterogeneous networks

Fortran90D Performance on Gaussian Elimination

Gaussian Elimination: Example of Fortran90D Source Code

Fortran90D Interface

HPF Directives
Directives take the syntactic form of structured Fortran comments.
A program that contains directives is expected to generate the same results whether the directives are processed or not, assuming that the directives are used correctly.
HPF directives are consistent with Fortran 90: if HPF were to be adopted as part of a future Fortran standard, only the comment character and the directive prefix would have to be removed from HPF directives.
Directives follow either free source form or fixed source form.

Data Alignment
Specifies the lowest-level abstract alignment of array elements.
A TEMPLATE is an abstract, very fine-grain virtual processor space.
The template labels physically independent objects, e.g. gluons, fluid particles.
Align degrees of freedom, e.g. gluon components, to template members.
Use for machine-independent alignment between structures.

Data Distribution
Specifies the coarse-grain partitioning of data.
PROCESSORS is a coarse-grain processor grid, presumably in 1-1 correspondence with physical processors.
Use for machine-dependent partitioning of data structures.
The compiler will normally use the "owner-computes rule".

FORALL
Elemental array assignment statement (similar to array expressions).
FORALL is not a general parallel construct!
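A small illustrative fragment (the array sizes and the 4x4 processor shape are my own choices, not from the talk) showing how the directives above combine with a FORALL. The owner of each A(I,J) computes it, and the compiler generates any communication needed for the shifted references to B:

      REAL A(100,100), B(100,100)
!HPF$ TEMPLATE T(100,100)
!HPF$ ALIGN A(I,J) WITH T(I,J)
!HPF$ ALIGN B(I,J) WITH T(I,J)
!HPF$ PROCESSORS P(4,4)
!HPF$ DISTRIBUTE T(BLOCK,BLOCK) ONTO P
C     Elemental assignment over the interior of the array
      FORALL (I = 2:99, J = 2:99)
     $   A(I,J) = 0.25 * (B(I-1,J) + B(I+1,J) + B(I,J-1) + B(I,J+1))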
Functions without side effects may be called from FORALL; other functions may be allowed as well.

DO INDEPENDENT
An assertion that the iterations of the DO may be executed in any order, or concurrently.
If the assertion is false, the compiler is free to take any action it deems necessary.
There are no restrictions on the loop body, and no way to perform explicit synchronization.

HPF Library
Much of HPF's parallel "power" is buried in its library.
Personal note: Fortran 90D was motivated by the similarities between Caltech message-passing "collective" communication and the Fortran 90 intrinsics.
18 non-elemental (true array) Fortran 90 intrinsics, e.g. CSHIFT, SUM
7 new HPF intrinsics, e.g. alignment, template and distribution inquiries
4 new HPF reduction (combine) functions
11 new HPF combine-scatter functions
22 new HPF parallel prefix functions
2 new HPF parallel sorting functions
3 other new HPF parallel functions
7 HPF local intrinsics
74 library routines (==> ~1000 routines when written in clumsy F77)

Intrinsic Library: HPF Library, Fortran 90, Local Routine Intrinsics

Does HPF need a different type of compiler?
HPF makes much greater use of runtime optimizations than the traditional sequential or parallel compiler.
Fortran 77:
      DO 1 I = 1, N
    1 A(I) = FUNC(B(I), B(I +/- 1), C(I), C(I +/- 1), .....)
Traditionally the compiler arranges access to these elements.
HPF:
      A = FUNC(B, C)
This invokes an ALREADY OPTIMIZED parallel implementation of FUNC.
In many cases, the HPF compiler need not know anything about parallelization except interfaces, data layout, etc.

HPF Interpreter
Runtime optimizations are acceptable (and perhaps best) when the "grain size" (i.e. the total number of elements in the array divided by Nproc) is large:
==> the "startup" cost of runtime procedure calls is small
==> one can use an interpreter rather efficiently (cf. APL and *LISP on the CM-2)
There is only ONE data parallelism ==> the same runtime library can support C, C++, ADA, APL and an HPF interpreter.

Benchmarking Set (Fortran90D and Fortran 77)

HPF/FORTRAN-90D Benchmark Suite

The final draft of the HPF Language Specification is version 1.0 - DRAFT, dated January 25, 1993.
Anonymous FTP:
minerva.npac.syr.edu (directory HPFF)
titan.cs.rice.edu (directory public/HPFF/draft)
think.com (directory public/HPFF)
ftp.gmd.de (directory hpf-europe)
theory.tc.cornell.edu (directory pub)
Electronic Mail from the Softlib server at Rice University. This can be accessed in two ways:
A. Send electronic mail to softlib@cs.rice.edu with "send hpf-v10.ps" in the message body. The report is sent as a PostScript file.
B. Send electronic mail to softlib@cs.rice.edu with "send hpf-v10.tarZ" in the message body. The report is sent as a uuencoded compressed tar file containing the LaTeX source.
Hardcopy: send requests to Theresa Chatman, CITI/CRPC, Box 1892, Rice University, Houston, TX 77251. There is a charge of $50.00 for this report to cover copying and mailing costs.

Request for Public Comment on High Performance Fortran
HPFF encourages reviewers to submit comments as soon as possible, with a deadline of March 1, 1993 for consideration. Send comments to hpff-comments@cs.rice.edu.
To facilitate the processing of comments, we request that separate comment messages be submitted for each chapter of the document and that the chapter be clearly identified in the "Subject:" line of the message. Comments about general overall impressions of the HPF document should be labeled as Chapter 1.
All comments on the language specification become the property of Rice University.
If email access is impossible for comment responses, hardcopy may be sent to:
HPF Comments, c/o Theresa Chatman, CITI/CRPC, Box 1892, Rice University, Houston, TX 77251
HPFF plans to process the feedback received at a meeting in March. Best efforts will be made to reply to comments submitted.

Application Structure Versus Fortran90D Features

What applications does HPF support?
A somewhat related issue: what problems are suitable for HPF and make good use of MIMD architectures, i.e. run well on MIMD but poorly on SIMD?

Classification of Problems (Fox, 1988)

Current HPF can do I .....
Current HPF can do II .....

Embarrassingly Parallel Problems
The basic entities can be very complicated, but they can be calculated independently. Use the INDEPENDENT command in HPF.
Search databases for given text (ELIXIR for DECmpp) (can be SIMD)
Analyse separately 10^7 events recorded from Synchrotron (SSC) collisions (definitely a MIMD "farm")
Calculate matrix elements for (chemistry) Hamiltonians (may need MIMD): MOPAC, GAUSSIAN. The reaction scattering rate work within C3P at Caltech is "calculate matrix elements" plus "multiply/solve/find eigenvalues of matrix".
Quantum Monte Carlo: independent "walkers"
Parallel histogram (database) storage

HPF can express using FORALL:
Particle dynamics with cutoff forces (V_ij = 0 for |r_i - r_j| > d) and irregular spatial structure: CHARMM, AMBER ..............
Unstructured finite element meshes, where the data structure involves arrays of pointers. However, new distribution primitives are needed to produce good performance. It is known how to optimize such unstructured problems for both static and adaptive cases.
The performance of SIMD or autonomous SIMD (Maspar) versus MIMD is unclear for these applications.

Classic Irregular Mesh Application

HPF Can Express Simulation of Spin Models of Magnets
Simple models reproduce the quantitative behavior of real magnetic materials with complex interactions.
They are fundamental to the development of the theory of phase transitions and critical phenomena, which has very wide applicability.
The computational algorithms and techniques can be applied to other fields, e.g.:
Monte Carlo simulation of lattice models of quantum field theory and quantum gravity
simulation of polymers, proteins, cell membranes
simulated annealing and other "physical" methods for global optimization
neural networks
condensed matter problems, e.g. superconductors
Since the models are simple and exact results are known in some cases (e.g. the 2-d Ising model), these are used as a testing ground for new computational techniques, e.g.:
Monte Carlo algorithms
measurements and data analysis
random number generators

New Monte Carlo Algorithms
The Metropolis Algorithm
A major problem in Monte Carlo simulation is critical slowing down. Standard algorithms (e.g. Metropolis) do single-site, local updates. The number of iterations needed to change the large-scale structure and produce a new uncorrelated configuration diverges as a power of the lattice size at a critical point.
Cluster Algorithms
New multi-spin, non-local algorithms (Swendsen-Wang, Wolff) rapidly change the large-scale structure by identifying clusters of sites to be updated at once, greatly reducing critical slowing down. They are currently only applicable to a limited class of models.
Ongoing research includes:
Extensions to frustrated spin models (e.g. spin glasses) where critical slowing down is extreme
Precise measurements of autocorrelations and dynamic critical exponents to help understand the dynamics of the new algorithms
Application of the new algorithms to simulation of spin models, e.g.
the O(3) model and the fully frustrated Ising model
Parallel cluster algorithms

Simulated Tempering
A new method of making small changes in temperature while keeping the system in equilibrium. Applications include:
Allowing tunneling between states at first order phase transitions (e.g. the random field Ising model)
Global optimization (a la simulated annealing)

Magnetization for Random Field Ising Model -- Metropolis
Magnetization for Random Field Ising Model -- Simulated Annealing
Wolff cluster (bands shown in yellow) for the 3-state Potts model at Tc
Swendsen-Wang clusters (boundaries shown in black) for the 3-state Potts model at Tc
Cluster Algorithm versus Metropolis

Parallel Algorithms
Metropolis Algorithm
A regular, local algorithm: local communication and perfect load balance using standard domain decomposition. Efficiently parallelizable on SIMD and MIMD machines.
Swendsen-Wang Algorithm
An irregular, non-local, multiple cluster algorithm. The main computational task is to identify clusters - connected component labeling, also used in image processing. The clusters are highly irregular (fractal) in shape, and of all sizes (usually including one large cluster covering much of the lattice). Efficient parallel component labeling is a very difficult problem - poor load balancing, lots of non-local communication.
Wolff Algorithm
An irregular, non-local, single cluster algorithm. Standard domain decomposition gives very poor load balance and very limited parallelism.
Traditional simple, regular, local algorithms parallelize efficiently. Newer irregular, adaptive, non-local algorithms are better sequentially but often more difficult to parallelize.

Parallel Cluster Algorithms
Parallel Component Labeling
Simple SIMD component labeling algorithms are very inefficient, taking of order L iterations to label an L^2 lattice.
These algorithms can be used effectively on MIMD machines, however: use fast sequential algorithms to label sub-lattices, and the simple SIMD algorithm to match clusters across processors. This takes of order sqrt(P) iterations for an array of P processors. Good efficiencies are obtained if L >> P.
On SIMD machines (CM-2, Maspar), multi-grid algorithms using regular (hypercube) communication, and multi-scale algorithms using general non-local communication, do the labeling in of order log(L) iterations. These are inefficient due to poor load balancing and irregular clusters.
Independent Parallelism
On coarse-grain MIMD machines or workstation networks one can parallelize over independent simulations with different random number streams. This is the only way to parallelize the Wolff algorithm efficiently.
On massively parallel machines, one can use a combination of domain decomposition and independent parallelism for the Swendsen-Wang algorithm, e.g. run 8 independent simulations of 64 nodes each on a 512-node nCUBE.

MIMD algorithm on nCUBE-1
SIMD algorithms on CM-2
Autocorrelation Plots for Cluster Algorithms

HPF probably cannot express well:
Multiple phase problems such as
Particle in the cell
Unstructured multigrid
One needs to express the "alignment" between phases (different levels of multigrid; particles and grid for PIC) and then optimize the distribution and redistribution between phases.
Irregular problems of this type are only suitable for MIMD. Very irregular cases with mismatches between phases perform poorly even on MIMD. I do not know how to express - let alone optimize - such problems.
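By contrast, the regular, local Metropolis update discussed above is exactly the kind of loosely synchronous computation HPF handles well. A hedged sketch (declarations, halo convention and the UPDATE function are my illustrative placeholders, not code from the talk) of the black/red checkerboard sweep for a 2-d spin lattice:

C     Hedged sketch: red/black checkerboard Metropolis sweep.  Sites
C     of one color have no same-color neighbors, so a whole color can
C     be updated at once without violating detailed balance.  SPIN
C     carries a halo (indices 0 and L+1) assumed filled by boundary
C     conditions; UPDATE stands for the Metropolis accept/reject step
C     (in real HPF it would need an explicit PURE interface).
      INTEGER L, I, J, ICOLOR
      PARAMETER (L = 64)
      REAL SPIN(0:L+1, 0:L+1), UPDATE
      DO 20 ICOLOR = 0, 1
         FORALL (I = 1:L, J = 1:L, MOD(I+J, 2) .EQ. ICOLOR)
     $      SPIN(I,J) = UPDATE(SPIN(I,J), SPIN(I-1,J), SPIN(I+1,J),
     $                         SPIN(I,J-1), SPIN(I,J+1))
   20 CONTINUE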
Late Breaking Results
Michael Warren (with John Salmon) has reformulated the Barnes-Hut parallel algorithm.
The new parallel algorithm appears suitable for incorporation into an extended HPF compiler as a generalization of the PARTI methods used to parallelize irregular grids.

Large N-Body Calculations (Quinn, Salmon, Warren)
Clustering algorithms (Appel, Barnes-Hut, Greengard) can replace an M-body force calculation by one using the center of mass of the cluster.
Naive calculation: 0.5 N^2 t(2-particle), where t(2-particle) is the time for one pairwise force evaluation.
Clustering: (20-50) N log N t(2-particle) (best when N > 1000).
(For N = 10^6 this is of order 10^9 instead of 5 x 10^11 pairwise force evaluations.)
One can do O(10,000) particles with the O(N^2) algorithm; a one to two orders of magnitude larger problem is possible with the clustering algorithm.

Hierarchical Decomposition

nCUBE Speed-Up of the Barnes-Hut Algorithm

17 x 10^6 body simulation, diameter 250 Mpc (Quinn, Salmon, Warren, Zurek)
The largest "galaxy" halo (137,000 bodies) from the 8.8M body simulation (Warren, Fullagar, Quinn, Zurek)
8M bodies, 10 Mpc diameter: final state with ~700 resolved "galaxies" (Warren, Quinn, Zurek)
Smooth Particle Hydro simulation of the collision of two "main sequence stars", 137,000 bodies (Warren, Davies)

Expressing Problem Structure in Language
The key idea behind data-parallel languages - and perhaps all good languages - is that the language expresses the problem and not the machine architecture.
A different approach is needed to express functional parallelism.
Use "object-oriented" features when the problem has natural objects; do not use "object-oriented" features when the objects are artifacts of the target machine.
Data and functional (object) parallel paradigms are both needed in many problems - especially multidisciplinary ones.
The map of Problem ---> Computer is performed in two or more stages.

From Problem to Machine Space as a Mapping

Different Granularities of Decomposition I
We can divide problems and machines into interconnected modules - and we do this at different granularities. We can consider two limiting cases.
(I) High performance, low latency links
Machine
o Classic MPP such as the Intel Paragon
Problem
o QCD: components are gluon fields; links are local
Software model
o High Performance Fortran
o Currently only regular structure
o Eventually all such tightly coupled, loosely synchronous problems
o Fortran + message passing

Different Granularities of Decomposition II

Compiler/Interpreter Tradeoffs at Different Levels of Granularity
One can divide a problem into "highly optimized" versus "flexible" at different levels:
HPF Compiler - the component is a full HPF program; glue full programs together on a bus
HPF Interpreter - the component is a single MATLAB/APL/HPF array statement; glue data parallel coarse-grain array statement "function" calls together

The Mapping of Heterogeneous MetaProblems onto Heterogeneous MetaComputer Systems

Mapping of Irregular Grid

Finding General Maps for FortranD/HPF

Three Physical Optimization Methods for Allocating Data to Multicomputer Nodes
Three methods using physical analogies (a hedged sketch of the first appears after this list):
1. Simulated Annealing Algorithm (SAA) - Formulate the optimization as finding the ground state of a physical system. Use Monte Carlo to equilibrate and reduce the temperature.
2. Bold Neural Network (BNN) - A mean field approximation to the same analogy as in SAA, to find quick minima. Related to "eigenvalue bisection" methods.
3. Genetic Algorithm (GA) - Maximize the fitness (minimize the cost function) of an evolving population of structures (candidate solutions).
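A hedged sketch of the simulated annealing allocator (all names, the cooling schedule and step count are my own illustrative choices, not the actual CRPC code). MAP(I) gives the node assigned to data point I; COST is the execution-time "Hamiltonian" (one possible form is sketched at the end of this document); RAND stands for a uniform random number generator on [0,1):

      SUBROUTINE ANNEAL(MAP, NPOINT, NNODE, NSTEP)
C     Hedged sketch of simulated annealing for data allocation:
C     propose moving one point to another node, accept with the
C     Metropolis probability exp(-dE/T), and slowly lower T.
C     COST and RAND are placeholders supplied by the user.
      INTEGER NPOINT, NNODE, NSTEP, MAP(NPOINT)
      INTEGER IPT, INODE, IOLD, ISTEP
      REAL T, EOLD, ENEW, COST, RAND
      EXTERNAL COST, RAND
      T = 1.0
      EOLD = COST(MAP, NPOINT, NNODE)
      DO 30 ISTEP = 1, NSTEP
C        Propose: move a randomly chosen point to a random node
         IPT   = 1 + INT(RAND() * NPOINT)
         INODE = 1 + INT(RAND() * NNODE)
         IOLD  = MAP(IPT)
         MAP(IPT) = INODE
         ENEW = COST(MAP, NPOINT, NNODE)
C        Metropolis: always accept downhill moves, sometimes uphill
         IF (ENEW .LE. EOLD .OR.
     $       RAND() .LT. EXP((EOLD - ENEW) / T)) THEN
            EOLD = ENEW
         ELSE
            MAP(IPT) = IOLD
         END IF
C        Cooling schedule: reduce the temperature geometrically
         T = 0.9999 * T
   30 CONTINUE
      END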
Finite Element Mesh

Graph Contraction

Some Results of Load Balancing Studies I
Some Results of Load Balancing Studies II

Physics Analogy for Load Balancing
A complex system S_HLSoft governed by a Hamiltonian = Execution Time

PHYSICS ANALOGY FOR STATIC AND DYNAMIC LOAD BALANCING

Definition of Temperature for a Complex System

Particle dynamics problem on a four node system

Energy Structure in Physics Analogy with Multiple Minima

Phase Transitions in Physical Model -- Scattered versus Domain Decompositions
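The slides above only name the Hamiltonian ("execution time"). For concreteness, here is a hedged sketch of one common form, usable as the COST routine in the annealing sketch earlier: the sum over nodes of (work per node)^2 penalizes load imbalance, and a term proportional to the number of cut links counts communication. The 1-d chain connectivity, the weight MU and the node limit are illustrative assumptions, not the form used in the studies above.

      REAL FUNCTION COST(MAP, NPOINT, NNODE)
C     Hedged sketch of a load-balancing "Hamiltonian": squared work
C     per node penalizes imbalance; MU is added for each cut link.
C     Connectivity here is a 1-d chain (point I talks to I+1) purely
C     for illustration; real studies use general graphs.
C     Assumes NNODE <= 1024.
      INTEGER NPOINT, NNODE, MAP(NPOINT), I
      REAL WORK(1024), MU, E
      PARAMETER (MU = 0.1)
      DO 10 I = 1, NNODE
         WORK(I) = 0.0
   10 CONTINUE
C     Accumulate the work (here, one unit per point) on each node
      DO 20 I = 1, NPOINT
         WORK(MAP(I)) = WORK(MAP(I)) + 1.0
   20 CONTINUE
      E = 0.0
      DO 30 I = 1, NNODE
         E = E + WORK(I)**2
   30 CONTINUE
C     Communication: charge MU for every link cut by the mapping
      DO 40 I = 1, NPOINT - 1
         IF (MAP(I) .NE. MAP(I+1)) E = E + MU
   40 CONTINUE
      COST = E
      END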