GCF Talk 93A
Complex Systems and Parallel Computing
Australian International Conference on Complex Systems
Australian National University, Canberra, Australia, December 14-16, 1992
Geoffrey C. Fox
gcf@nova.npac.syr.edu
http://www.npac.syr.edu
315.443.2163

The Three Themes of the Lecture: Parallel Computers and Complex Systems
  Parallel Computers for the Simulation of Complex Systems
    teraflop performance
    virtual reality
    software standards
    the "global metacomputer"
  Complex Systems for the Theory of Computer and Network Architecture
    physical analogies: space, time, temperature, phase transitions
    problem architecture
    a theory of parallel computing?
  Complex Systems for New Computational Methodologies
    physical computation
    neural networks ...
    simulated annealing and tempering

Issues in Parallel Computers for the Simulation of Complex Systems
  Can one use the 1 to 10 TERAFLOP machines which will be available in the middle to late 1990's?
  What software do you need?
  What about virtual reality and distributed metacomputers?

Standard Performance Graph: heading to 1 to 10 teraflops by the year 2000

When will parallel computing take over?
  When it has enough headroom, e.g. a factor of ten better performance/cost-performance than existing technology
  The definition of headroom is muddied because parallel machines are much harder to use, so it is hard to decide on their value to a given organization
  The nCUBE-1 and CM-2 were comparable to, or perhaps a factor of 2 to 4 better than, conventional supercomputers in cost-performance
    enough to show that "Parallel Computing Works", but not enough to take over
  The DECmpp, IBM SP1, Intel Paragon, CM-5 and nCUBE-2 are (perhaps) a factor of ten better than competitive sequential (vector) machines
    they can take over?

The President's High Performance Computing and Communication Initiative (HPCCI)
  $2.9 billion over 5 years
  The Grand Challenges: enabled by teraflop computers, and important to the economy or to fundamental research
    global warming - NOAA
    oil reservoir and environmental simulation - DOE
    structural and aerodynamic calculations - NASA
    Earth observing satellite data analysis - NASA
    human genome - NIH, DOE
    quantum chromodynamics - fundamental physics
    gravitational waves from black holes - fundamental physics
    molecular modeling - fundamental chemistry
  Nearly all have industrial payoff, but technology transfer is NOT funded by the HPCCI

Challenges and Status of Parallel Computing
  Hardware trends imply that all computers (PCs through supercomputers) will be parallel by the year 2000
  Up to now parallel computers have come from small start-up companies (except the Intel Supercomputer Division)
  Recently Cray, Convex (HP), Digital and IBM have announced massively parallel computing initiatives
  Several architectures "work", but only one - the distributed-memory MIMD multicomputer - is known to scale from one to very many processors on general problems
  Software is the challenge, and could prevent or delay the hardware trend that suggests parallelism will become the mainline computer architecture
    first we must get the systems software correct, then the applications software

High Performance Fortran Overview
  Data Parallelism
    Fortran 90 provides a number of features, notably array syntax, that make it possible for a compiler to detect parallelism (a small sketch follows this overview). However, with existing compiler technology, new language features defining data dependency and locality are necessary to exploit the full capability of concurrent systems.
  High Performance
    addressed primarily by the proposed data distribution features and new intrinsic functions
  Code Tuning
    optional, explicit message passing support for MIMD machines
  Validation
    a benchmarking suite is being prepared by NPAC, Syracuse University
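To make the data-parallelism point concrete, here is a minimal Fortran 90 sketch (illustrative, not from the talk; the array names and operations are arbitrary). Each statement below is a whole-array operation with no loop-carried dependence, so a compiler is free to execute it in parallel:

      PROGRAM ARRAYS
      IMPLICIT NONE
      REAL :: A(10000), B(10000), C(10000)
      CALL RANDOM_NUMBER(A)
      CALL RANDOM_NUMBER(B)
      C = A + B                     ! elementwise add over the whole arrays
      WHERE (C > 1.0) C = C - 1.0   ! masked data-parallel assignment
      PRINT *, SUM(C)               ! reduction intrinsic
      END PROGRAM ARRAYS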
HPF Computational Model
  The underlying intuition behind the HPF computational model is that
    an operation on two or more data objects is likely to be carried out much faster if they all reside on the same processor, and
    it may be possible to carry out many such operations concurrently if they can be performed on different processors
  The main mechanism for achieving parallelism in HPF is its data partition annotation
  Distribution of control is derived from the data distribution, using the owner-computes rule: each assignment is executed by the processor(s) that own the assigned variable. (However, a compiler may relax the owner-computes rule in order to improve performance.)

Example of Fortran-90D source code: Gaussian elimination
  (The slide's source code is not reproduced in this text; a hedged sketch appears below, after the directives are described.)

HPF Directives
  An HPF directive has the form of a comment line:
    !HPF$ hpf-directive
  A goal of the design is that HPF directives should be consistent with Fortran 90 syntax in the following strong sense: if HPF were adopted as part of a future Fortran standard, the only change necessary to convert an HPF program would be to remove the comment character and directive prefix from each directive.
  There is a clear separation between directives that serve as specification statements and directives that serve as executable statements (in the sense of the Fortran standard). Specification statements are carried out on entry to a program unit; after these, executable statements are carried out.

Data Alignment and Distribution Directives
  The model of the allocation of data objects to processor memories is a two-level mapping of data objects to abstract processors. Data objects (typically array elements) are first aligned with an abstract indexing space called a template; a template is then distributed onto a rectilinear arrangement of abstract processors. The implementation then uses the same number, or perhaps some smaller number, of physical processors to implement these abstract processors. This mapping of abstract processors to physical processors is implementation dependent.
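Referring back to the Gaussian elimination slide above: a minimal, hedged sketch of what such Fortran-90D/HPF source might look like, using the FORALL statement and the directives just described. The array names, the absence of pivoting and the BLOCK distribution are illustrative assumptions, not the original slide's code:

      PROGRAM GAUSS
        INTEGER, PARAMETER :: N = 512
        REAL :: A(N,N), B(N)
        INTEGER :: K
!HPF$   TEMPLATE T(N,N)
!HPF$   ALIGN A(I,J) WITH T(I,J)
!HPF$   ALIGN B(I) WITH T(I,1)
!HPF$   DISTRIBUTE T(BLOCK, *)
        CALL RANDOM_NUMBER(A)   ! stand-in initialization
        CALL RANDOM_NUMBER(B)
        DO K = 1, N-1
           ! Eliminate column K below the diagonal (no pivoting, and the
           ! subdiagonal entries are left stale rather than zeroed, for
           ! brevity). Each (I,J) update is independent, so the FORALL
           ! exposes the data parallelism; the BLOCK distribution of T
           ! keeps whole rows on one processor.
           FORALL (I = K+1:N) B(I) = B(I) - A(I,K) / A(K,K) * B(K)
           FORALL (I = K+1:N, J = K+1:N) &
                A(I,J) = A(I,J) - A(I,K) / A(K,K) * A(K,J)
        END DO
      END PROGRAM GAUSS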
Examples of Alignments (1)
      REAL, ARRAY (N, N) :: X3, X4
!HPF$ TEMPLATE B (N, N)
!HPF$ ALIGN X3 (I, J) WITH B (I, J-1)
!HPF$ ALIGN X4 (I, J) WITH B (I-1, J+2)

Examples of Distributions (1)
!HPF$ TEMPLATE A (N, N), B (N, N), C (N, N)
!HPF$ DISTRIBUTE A (BLOCK, *)
!HPF$ DISTRIBUTE B (*, BLOCK)
!HPF$ DISTRIBUTE C (BLOCK, BLOCK)

For More Information on HPF
  The final draft of the HPF proposal is available via anonymous ftp from
    minerva.npac.syr.edu (public/HPFF)
    titan.cs.rice.edu (public/HPFF/draft)
    ftp.gmd.de (hpf-Europe)
    theory.tc.cornell.edu (pub)
    think.com (public/HPFF)
  E-mail servers
    softlib@cs.rice.edu (send hpf-vxx.ps)
    netlib@research.att.com (send hpf-vxx.ps from hpff)
    netlib@ornl.gov (send hpf-vxx.ps from hpf)
  HPFF mailing lists
    hpff-request@cs.rice.edu (to join)
    hpff@cs.rice.edu (to submit)
    hpff-info@cs.rice.edu (for general information)
    hpff-comments@cs.rice.edu (to submit comments on the HPF draft)

FORTRAN-90D: the First Implementation of HPF (NPAC, Syracuse University)
  Current status
    alpha version demonstrated at SUPERCOMPUTING '92 in Minneapolis, MN
      one- and two-dimensional BLOCK, CYCLIC and BLOCK-CYCLIC distributions
      FORALL statement
      full run-time support for regular communications
      PARTI (ICASE) primitives for irregular communications
      intrinsic library
    now running on the nCUBE-2, the iPSC/860 and networks of Sun workstations
    beta release: June 1993
    alliance with industry to develop a commercial HPF compiler optimized for specific architectures
    next target: heterogeneous networks

Common software is needed for the heterogeneous local area network (Ethernet - FDDI - HIPPI - FCS ...)

Importance of MetaProblems
  High Performance Fortran is only suitable for data parallel problems whose data structure is an array
    it will be extended, e.g. to trees or irregular graphs
  But complex systems can be complicated! For instance, image processing, data fusion and multidisciplinary design have several modules; each can be data parallel

Hybrid Problem Structure for Command and Control
  Si: satellite with image processing, Kalman filter and stereo match
  FCi: fire control platform with a loosely synchronous target-weapon pairing algorithm
  E: overall asynchronous expert system

The Mapping of Heterogeneous Problems onto Heterogeneous Computer Systems
  The analogy between metaproblems and metacomputers

SIMCITY is an interesting PC-based complex system simulation. One can use teraflop computers to make such "games" more realistic.

Hierarchy of Complexity
  Virtual class, e.g. a weather or cortex model
  Virtual classroom
  Virtual laboratory: a realistic SIMCITY, combining several classes
  Virtual school ...
Experience virtual classes ---> a virtual school with "Virtual Reality"

Implementation of Complex System Simulation
  Complex system simulation requires
    ===> support of many adaptively linked modules
    ===> an interpreter, not a compiler
    ===> a link to virtual reality

Possible Implementation: AVS as a System Integration Tool
  Basic features
    dataflow model for distributed computing
    AVS program = a network of AVS modules, controlled by the AVS kernel via the RPC mechanism
    AVS module = a C/Fortran program conforming to the AVS interface specification
    strong visualization and 3D graphics support
    visual editing tools for AVS networks
  Highlights
    easy to use, and easy to port existing applications to the AVS framework
    growing popularity in the research community
    growing volume of public domain modules, maintained by the International AVS Center at the NCSC
  Flaws
    poor interactivity (best for complex, static visualization, not for simulation or animation)
    lack of an organizing principle for bookkeeping dataflow modules (not an object-oriented system)
    lack of support for High Performance Computing
    not Open Systems software (two competing commercial products, AVS and Explorer (SGI); two public domain systems, apE and Khoros)

Parallel AVS - a Planned Project at NPAC
  Forms of parallelism in the AVS model
    intramodular data parallelism (e.g. image processing or PDE modules) under a single AVS kernel
    intermodular functional parallelism (single kernel)
    multi-user/kernel/department functional parallelism
  NPAC approach
    develop a 'standard' library of data parallel modules in High Performance Fortran for regular problems (matrix algebra, PDEs, imaging)
    develop hierarchical/clustering algorithms for the automatic decomposition of large AVS dataflow networks into networks of independent AVS subsystems
    unify the existing tools for user-level hierarchical functional organization with the automated tools for network decomposition
    unify the top-down approach for single-user network decomposition with the bottom-up approach for multi-user network integration

Architecture of the Parallel AVS System

VR Operating Shells

Components of the Proposed Televirtuality Server at NPAC
  modern parallel systems (CM-5, nCUBE-2, DECmpp, Intel Touchstone)
  High Performance Fortran and Concurrent C++ software environment (compilers, interpreters, runtime support)
  parallel Oracle + SGI front-ends
  expert system (RT/AI from Intellisys, Inc.)
  clusters of DEC/IBM/Sun workstations on an Ultra-Net / FCS / HIPPI / ATM (?) based gigabit/sec LAN
  visualization front-ends (AVS/Explorer, Khoros/apE)
  multimedia front-ends
  VR front-ends (BOOM, Convolvotron, DataGloves, PROvision and Sense8 systems, new consumer VR products)
  MOVIE as the integration/infrastructure software

A Theory of Parallel Computing Based on Complex Systems
  Use complex systems to describe problems, computers, networks, and their interaction
  Computing as a set of mapping problems
    a hierarchy of mapping problems
    we would like to optimize the overall mapping
  Complex systems can give a theory of computing
    both problems and computers are collections of homogeneous or heterogeneous basic entities, with some sort of static or dynamic connection between them
    modeling involves mapping one "complex problem" into another
    simulation or parallel computing involves the map: complex problem ---> complex computer
    what parameters of these two underlying complex systems control the effectiveness of the map?
  Parallel computing is "just" an optimization problem, even if we can't agree on what to optimize
    execution time - the main focus of the HPCC community?
    user happiness - the main focus of the software engineering community

Concurrent Computation as a Mapping Problem - I
  Two different types of mappings in physical spaces; both are static
  (a) seismic migration with domain decomposition on 4 nodes
  (b) a Universe simulation with irregular data but a static 16-node decomposition; this problem would really be best served by a dynamic irregular decomposition

Concurrent Computation as a Mapping Problem - II
  A different type of mapping - a very dynamic case without any underlying physical space
  (c) computer chess with a dynamic game tree decomposed onto 4 nodes

Computation as a Map of a Set of Complex Systems

Domain Decomposition and Complex Systems?
  All successful parallel computers with
    many nodes (massively parallel) and
    high performance (this excludes architectures like dataflow)
  have obtained their parallelism from "data parallelism" or "domain decomposition"
  This is just the mapping of the data (the "spatial domain") of the complex problem onto the spatial structure of the complex computer, whether the architecture is SIMD or MIMD, distributed or shared memory

Physical Analogy for the Complex Computer
  Space is either
    the number of nodes of the parallel computer, or, at a different resolution,
    the number of memory locations
  Time is either
    measured by the cycle time of a node, or, at a different resolution (on CISC computers),
    measured in machine instructions

The Physical Space/Time Analogy for a General Problem (the correspondence implied by the theory of complex systems)
  Problems have a data domain - "SPACE" - and the computation involves work done on the elements of this domain; label this work by "TIME"
  Examples
    simulation of a 3D physical system
      SPACE = "SPACE", TIME = "TIME"
      e.g. the wave equation in seismic exploration, or the motion of missiles after launch
    full matrix operations (inversion, multiplication)
      matrix elements = "SPACE"
      the label of the rows eliminated or multiplied = "TIME"
    Gauss-Seidel (SOR) iterative approach to solving sparse matrix problems
      the set of nonzero matrix elements = "SPACE"
      the iteration count = "TIME"

Some Temporal Properties of Computation

General Space-Time Complex System Picture for the Problem-to-Computer Mapping
  e.g. on a sequential computer, all aspects of the problem are mapped onto "TIME" in the computer
  Seismograph: the time dependence of an earthquake is mapped onto the spatial extent of the recording paper

Computer Languages and Space-Time Properties
  High Performance Fortran
      DIMENSION A(10000), B(10000)
      A = A + B
    Space (the elements of the arrays A and B) in the problem is mapped into space (i.e. the arrays are preserved) in the virtual problem
    ---> the compiler can recognize the data parallelism
    ---> good parallel performance on all machines
  Fortran 77
      DO 1 I = 1, 10000
    1 A(I) = A(I) + B(I)
    Space in the problem is mapped into time (successive iterations of the DO loop) in the virtual problem
    ---> the compiler has to "undo" this to recognize the parallelism, and generally fails
    ---> good languages preserve problem architecture

Information Dimension d of a General Complex System
  d = the geometric dimension for local problems
  d ~ 2 for computer chips (Rent's Rule)
  d = 1 for ALL long range force problems, whatever their geometric dimension

Performance of a Parallel Computer
  hierarchical multicomputer: spatial and temporal decomposition
  homogeneous multicomputer: purely spatial decomposition
  shared or hierarchical memory computer

Comparison of Cache and Distributed Memory Communication Overhead
  Extension of the space-time picture to treat hierarchical memory, caches etc.
  Both distributed and hierarchical memory need DATA LOCALITY in memory references
    minimize communication
    minimize cache misses
  hierarchical memory ----> "time"
  distributed memory ----> "space"
  All overheads are proportional to 1/(space-time volume)^(1/d) with combined space and time decomposition (cf. BLAS-3 for matrices); the scaling is written out below
  High Performance Fortran expresses data locality ==> it will improve the performance of sequential machines as well
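Written out explicitly, this is a hedged reconstruction of the scaling just quoted, in the spirit of Fox's earlier grain-size analyses (the constant and the precise definition of the grain size n, the space-time volume of problem held per node, are assumptions here):

\[
  f_C \;\approx\; \mathrm{const} \cdot \frac{t_{\mathrm{comm}}}{t_{\mathrm{calc}}} \cdot \frac{1}{n^{1/d}}
\]

where f_C is the fractional overhead (communication for distributed memory, cache misses for hierarchical memory), t_comm/t_calc is the hardware ratio of communication time to calculation time, and d is the information dimension of the problem. Larger grains, and decompositions that block space and time together (as BLAS-3 does for matrices), drive the overhead down.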
Space-Time Decompositions for the Concurrent One-Dimensional Wave Equation

Typical Example of Mapping an Irregular Bunch of Grid Points onto 8 Nodes

Use of Physical Optimization in High Performance Fortran
  The problem is to find MAP in DISTRIBUTE (MAP) in Fortran D / HPF (High Performance Fortran)
    a user (compiler) generated data decomposition
  The user (compiler) supplies
    what is to be distributed
    the "graph" - the algorithmic connection between the entities to be distributed
  These entities are particles, grid points, matrix elements ... depending on the problem

Physics Analogy for Load Balancing
  Label the data points (the members of the problem's spatial domain) by i
  Each i ---> a particle moving in a space of discrete points (the number of points in this space = the number of nodes in the machine), with the topology defined by the interconnection of the nodes
  Let the problem be defined by the parameters
    execution time C_i for each i
    communication time V_ij between data points i and j (V_ij is often zero; it is nonzero only if i and j are linked in the computational graph)
  This complex system is governed by a Hamiltonian H = execution time (a hedged form of H is written out at the end of this section)
  We need the ground state of H

Decomposition of an Arch onto 16 Processors in a Hypercube (from Chrisochoides)

Physics Analogy for Static and Dynamic Load Balancing
  The "particles", and the "springs"/forces between them, including the effect of the cost of redistribution

General Definition of the Temperature T_S of a Complex System
  In a dynamic load balancing problem (a time dependent system), simulated annealing is performed concurrently, on the same computer ensemble used to simulate the system
  Suppose the "optimal" annealing strategy is used. There is then a minimum temperature T_S to which it is possible to "cool" the system before the time spent annealing exceeds the gain from the improved load balance
  Note that T_S is independent of the speed of the computer - it is a property of the system being simulated
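Collecting the terms above into one formula: a hedged reconstruction of the load-balancing energy in the least-squares style of this physical-optimization approach (the relative weight λ and the machine distance function d(·,·) are assumptions of this sketch):

\[
  H \;=\; \sum_{\mathrm{nodes}\;n} \Big( \sum_{i \in n} C_i \Big)^{2}
      \;+\; \lambda \sum_{i,j} V_{ij}\; d\big(\mathrm{node}(i),\, \mathrm{node}(j)\big)
\]

The first term penalizes load imbalance (it is smallest when the work Σ C_i is equal on every node); the second charges each communication V_ij by the distance between the nodes holding i and j. For the dynamic case one would add a further term for the redistribution cost mentioned above. Simulated annealing then seeks the ground state of H.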
Particle Dynamics Problem on a Four-Node System

Block Scattered and Domain Decomposition Distributions Illustrated

Instantaneous Energy Distribution for the Time Dependent Domain Decomposition and Block Scattered Distributions
  Note that the scattered position is independent of time

Time Averaged Energy for the Adaptive Particle Dynamics Problem
  Note that domain decomposition can end up worse than block scattered after averaging

A General Theory of Computation
  The particle-process analogy does not apply to
    asynchronous problems
      databases
      event driven simulations
      phone network traffic
      computer chess and other AI
    loosely synchronous problems where the fundamental entities move between nodes on a microscopic time scale (such problems can be called microscopically dynamic)
  Example - global sums, as used in
      DO 1 I = 1, LIMIT1
         A(I) = 0
         DO 2 J = 1, LIMIT2
    2    A(I) = A(I) + B(I,J)
    1 CONTINUE
    where I and J are distributed over the nodes of a hypercube
  HISTORICALLY, ONE OF THE MOTIVATIONS FOR THIS RESEARCH WAS TO "AUTOMATE" THE KNOWN FOLD ALGORITHM
  FOLD CALCULATES MULTIPLE GLOBAL SUMS BY SUITABLY INTERLEAVING TREES OVER THE HYPERCUBE

The String Formalism for Dynamic Computations
  Originally we used the analogy
    process or "entity" ---> particle
    label by p ---> position x_p(t)
  Now generalize this to
    process or "fundamental thread of computation" ---> string or worldline
    label by p ---> the time ordered set {x_p(t)}
  In this new formalism our loosely synchronous, microscopically static problems are represented by slowly varying strings, as follows.

Loosely Synchronous Static and Adaptive Problems in the String Picture
  Particles p1 and p2 interact with an attractive "communication force", carried by messages (packets), which are the quanta of the force
  So loosely synchronous, microscopically static problems can be described by strings that vary slowly with time; it is sufficient to slice, every now and then, at a fixed time, and treat the strings as particles

An Initial Approach to Computational String Dynamics, or Equivalently the Construction of the Energy Function
  The true problem is to minimize, in the time direction, the length of the string of maximum length. This minimax problem can be treated directly in "full string dynamics", but here we follow the particle analogy and replace it by a least squares formulation
  Again, replace the exact constraints (e.g. a node can only do one or a few things at a time) by least squares constraints
    this is in the spirit of Hopfield and Tank
  New terms appear in the energy (in field theory language) corresponding to the creation and annihilation of strings

Full String Dynamics as an Interacting Field Theory
  One uses as dynamical variables the complete strings {x_p(t)}, or, probably better, specifies them with a neural formalism inspired by Hopfield and Tank's traveling salesperson work:
    S_p(x,t) = 0 if p is not at x at time t
    S_p(x,t) = 1 if p is at x at time t
  with the constraint Σ_x S_p(x,t) = 1 (one string in the field per thread p)
  The associated dynamics is that of interacting strings with the usual repulsive and attractive forces, plus creation and annihilation (a hedged energy sketch follows)
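A hedged sketch of how such a least-squares energy might be assembled in the S_p(x,t) variables; the particular terms and the penalty weights λ_1, λ_2 are assumptions of this illustration, not the talk's formula:

\[
E[S] \;=\; \sum_{t}\sum_{x}\Big(\sum_{p} C_p\, S_p(x,t)\Big)^{2}
\;+\; \lambda_1 \sum_{t}\sum_{p,q}\sum_{x,x'} V_{pq}\, d(x,x')\, S_p(x,t)\, S_q(x',t)
\;+\; \lambda_2 \sum_{p,t}\Big(\sum_{x} S_p(x,t)-1\Big)^{2}
\]

The first term is the least-squares load balance at each time slice, the second charges for communication between interacting threads placed on distant nodes (d is the machine distance), and the third is the least-squares replacement of the exact constraint that each thread is at exactly one node. Creation and annihilation of strings would add further terms.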
Complex Systems Suggest New Computational Methodologies
  cellular automata
  physical computation
    neural networks
    genetic algorithms
    simulated annealing
    simulated tempering

Physical Optimization and Computation: Approaches and their Fields of Origin
  Example of physical computation
    cellular automata - complex systems
  Physical (pertaining to Nature) optimization methods and their associated fields
    genetic algorithms - evolution
    simulated annealing - physics (statistical)
    neural networks - biology (low level)
    information theory (maximum entropy) - electrical engineering
    ...
    elastic networks - physics (deterministic)
    deterministic annealing - physics (deterministic)
  These can be compared with
    heuristics - the particular problem
    combinatorial optimization - mathematics
    expert systems - computer science (high level reasoning)

Genetic Algorithms for Data Decomposition (work of Nashat Mansour)
  Three major genetic operators
  A system is specified by a string of (binary) digits and evolves with a set of transitions generated by the genetic operators

MultiScale Methods in Parallel Data Decomposition
  Parallel algorithms for
    genetic algorithms
    simulated annealing
    neural networks
  are not "trivial", and are not identical to the sequential algorithms
  Large problems (the "real world") require multiscale algorithms, just as multigrid is faster than Gauss-Seidel for large PDEs (take --> log N_node, d = 2, 3)

Results of Various Physical Optimization Methods for Data Decomposition
  This compares parallel genetic algorithm (PGA), parallel simulated annealing (PSA), parallel neural network (PNN) and a wonderful heuristic (PRSB - recursive spectral bisection)

A Similar but Larger Problem
  This comes from the work of Nashat Mansour

Some Overall Questions Relevant in Classifying Optimization Problems and Methods
  What is the shape of the objective function? Correlated or uncorrelated minima?
  Do you need the exact global minimum, or an approximate minimum?
  In the approximate case, do you want the solution always to be within some tolerance of the exact solution, or only to be within tolerance on the average?

Two Types of Global Minimum and their Relation to Local Minima
  Local minima in "physics" problems are closely correlated with the true minimum
  "Computer science" grand challenges in optimization might look different

Typical Formalism for Physical Optimization
  Minimize E = E(parameters y)
    y can be continuous or discrete, or a mix
  Introduce a fake temperature T and set β = 1/T
    often T ~ the distance scale at which you look at the problem
  Probability of state y: P(y) = exp(-βE(y))/Z, where Z = Σ_y exp(-βE(y))
  As β → ∞, the minimum-E state (the ground state) dominates
  Find y(T) by minimizing the free energy F = E - TS = -(1/β) log Z

Global and Local Minima in the Temperature Dependent Free Energy
  Annealing tracks the global minimum by initializing the search at temperature T with the minima found at temperature T + dT
  (Figure: movement at a given temperature; the global minimum moving with decreasing temperature)

Comparison of Physical Optimization Methods
  simulated annealing: find y(T) by Monte Carlo, as a mean over the configurations at that temperature (a code sketch follows this comparison)
  neural networks: y is discrete; find y by the mean field approximation
  elastic net: y is discrete, but use an improved mean field including some or all of the constraints
  deterministic annealing: leave the important y out of the statistical sum and find them by simple iterative optimization
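A minimal sketch of the Monte Carlo step behind simulated annealing, for a toy discrete system. The 1D Ising-chain energy, single-spin move set and geometric cooling schedule are illustrative assumptions, not the data-decomposition application discussed in this talk:

      PROGRAM ANNEAL
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 64
      INTEGER :: S(N), I, STEP
      REAL :: T, DE, R
      CALL RANDOM_SEED()
      DO I = 1, N                          ! random initial configuration
         CALL RANDOM_NUMBER(R)
         S(I) = MERGE(1, -1, R > 0.5)
      END DO
      T = 5.0                              ! start hot
      DO STEP = 1, 200000
         CALL RANDOM_NUMBER(R)
         I = 1 + INT(R * N)                ! pick a random spin
         DE = DELTAE(I)                    ! energy change if it flips
         CALL RANDOM_NUMBER(R)
         IF (DE <= 0.0 .OR. R < EXP(-DE / T)) S(I) = -S(I)  ! Metropolis
         T = T * 0.99997                   ! slow geometric cooling
      END DO
      PRINT *, 'final energy =', ENERGY()
      CONTAINS
         REAL FUNCTION DELTAE(I)
            ! E = -sum_i S(i)*S(i+1) on a periodic chain, so flipping
            ! S(I) changes E by 2*S(I)*(left neighbour + right neighbour)
            INTEGER, INTENT(IN) :: I
            DELTAE = 2.0 * REAL(S(I) * (S(1 + MODULO(I - 2, N)) &
                                      + S(1 + MODULO(I, N))))
         END FUNCTION DELTAE
         REAL FUNCTION ENERGY()
            INTEGER :: J
            ENERGY = 0.0
            DO J = 1, N
               ENERGY = ENERGY - REAL(S(J) * S(1 + MODULO(J, N)))
            END DO
         END FUNCTION ENERGY
      END PROGRAM ANNEAL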
Some Applications of Deterministic Annealing
  general, travelling salesman, quadratic assignment - Durbin and Willshaw, Frean, Yuille, Simic
  scheduling - Peterson and Soderberg (high school classes), Johnston (Hubble Space Telescope)
  track finding - Rose, (Fox, Gurewitz), Ohlsson, Peterson, Yuille
  robot path planning, navigation - Fox
  character recognition - Hinton
  image analysis - Geiger, Girosi
  clustering, vector quantization (coding), electronic packaging - Rose et al.

Simulated Tempering - a New Approach to Monte Carlo Optimization / Simulated Annealing
  A new method, Simulated Tempering, by Marinari (Rome, Syracuse) and Parisi (Rome)
  Conventional Monte Carlo methods do not work well for the Random Field Ising Model (RFIM)
    spins s_i live on a 3D lattice; the s_i = ±1 are to be determined
    fixed magnetic field h_i = |h| r_i, with |h| = 1 and r_i = ±1 at random with probability 1/2

Conventional Simulated Annealing and its Problems for the Random Field Ising Model
  What sort of NP-complete optimization problems are like the RFIM?
  Normally in simulated annealing one chooses a sequence β_m = 1/T_m, increasing with m = 0, 1, 2, ...
  At each β_m the system is equilibrated according to the weight function P_m ∝ exp(-β_m E)
  Monte Carlo gets stuck in local minima (which differ from the true minima by approximately the square root of the system size)
  This local-minimum problem is particularly severe for the RFIM, which has several very widely separated, sharp "almost-global" minima
  Traditional annealing copes with the case where the local minima are clustered around a single global minimum

Key Idea in the Tempering Approach
  Add β to the list of dynamical variables; β is chosen out of the ladder {β_m} (β_1 < β_2 < β_3 < β_4 < β_5, increasing) with probability given by
    Prob(β_m, {s}) ∝ exp(-β_m E({s}) + g_m)
  Now the system can move up in temperature and jump over barriers
  One does not have to decide on an a priori annealing schedule, i.e. the list of β_m and the computer time to be spent at each β_m. Goodbye!

Many Choices - Which is Best When?
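Returning to the tempering step above: with the joint weight Prob(β_m, {s}), the update of the temperature index m is itself an ordinary Metropolis move. A minimal sketch, assuming a user-supplied ladder BETA and offsets G (the proposal of m ± 1 follows Marinari and Parisi, but the code details are illustrative):

      SUBROUTINE TEMPER_STEP(M, E, BETA, G, NBETA)
      ! One Metropolis update of the temperature index M under the
      ! joint weight exp(-BETA(M)*E + G(M)). E is the current
      ! configuration energy; BETA and G are assumed to be supplied
      ! by the surrounding tempering program.
      IMPLICIT NONE
      INTEGER, INTENT(INOUT) :: M
      INTEGER, INTENT(IN) :: NBETA
      REAL, INTENT(IN) :: E, BETA(NBETA), G(NBETA)
      INTEGER :: MNEW
      REAL :: R, DLOGP
      CALL RANDOM_NUMBER(R)
      MNEW = M + MERGE(1, -1, R > 0.5)        ! propose a neighbouring beta
      IF (MNEW < 1 .OR. MNEW > NBETA) RETURN  ! reject moves off the ladder
      DLOGP = -(BETA(MNEW) - BETA(M)) * E + (G(MNEW) - G(M))
      CALL RANDOM_NUMBER(R)
      IF (R < EXP(MIN(0.0, DLOGP))) M = MNEW  ! accept with min(1, e^DLOGP)
      END SUBROUTINE TEMPER_STEP

Between such moves the spins {s} are updated by ordinary Metropolis sweeps at the current β_m; the offsets g_m are tuned (roughly g_m ≈ β_m F(β_m), the free energy at that temperature) so that every rung of the β ladder is visited with comparable probability.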