CPS615 -- Base Course for the Simulation Track of Computational Science
Fall Semester 1996 -- Introduction to Driving Technology and HPCC Current Status and Futures
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100

Abstract of The Current Status and Futures of HPCC
Overview of Course Itself! -- and then introductory material on basic curricula
Overview of National Program -- The Grand Challenges
Overview of Technology Trends leading to petaflop performance in year 2007 (hopefully)
Overview of Syracuse and National programs in computational science
Parallel Computing in Society
Why Parallel Computing works
Simple Overview of Computer Architectures SIMD MIMD Distributed (shared memory) Systems ... PIM ... Quantum Computing
General Discussion of Message Passing and Data Parallel Programming Paradigms and a comparison of languages

Basic Course CPS615 Contact Points
Instructor: Geoffrey Fox gcf@npac.syr.edu 3154432163 Room 3-131 CST
Backup: Nancy McCracken njm@npac.syr.edu 3154434687 Room 3-234 CST
NPAC Administrative support: Nora Downey-Easter nora@npac.syr.edu 3154431722 Room 3-206 CST
CPS615 Powers that be above can be reached at cps615ad@npac.syr.edu
CPS615 Students can be reached by mailing cps615@npac.syr.edu
Homepage will be: http://www.npac.syr.edu/projects/cps615fall96
See my paper SCCS 736 as an overview of HPCC status

Course Organization
Graded on the basis of approximately 8 homeworks, each due the Thursday of the week following the day it is given out (Tuesday or Thursday)
Plus one modest-sized project at the end of class -- must involve "real" running parallel code!
No finals or written exams
All material will be placed on the World Wide Web (WWW)
Preference given to work returned on the Web

Basic Structure of Complete CPS615 Base Course on Computational Science Simulation Track -- I
Overview of National Scene -- Why is High Performance Computing Important
Grand Challenges
What is Computational Science -- The Program at Syracuse
Basic Technology Situation -- Increasing density of transistors on a chip
Trends to year 2007 using Moore's Law (see UVC Video)
Elementary Discussion of Parallel Computing including use in society
Why does parallel computing always "work" in principle
Computer Architecture -- Parallel and Sequential
Network Interconnections, SIMD v. MIMD, Distributed Shared Memory, vectorization contrasted with parallelism
Basic Structure of Complete CPS615 Base Course on Computational Science Simulation Track -- II
Simple base example -- Laplace's Equation
How does parallel computing work
This is followed by two sections -- software technologies and applications -- which are interspersed with each other and with "algorithm" modules
Programming Models -- Message Passing and Data Parallel Computing -- MPI and HPF (Fortran 90)
Some remarks on parallel compilers
Remarks on use of parallel Java
Some real applications analysed in detail: Chemistry, CFD, Earthquake prediction, Statistical Physics

Basic Structure of Complete CPS615 Base Course on Computational Science Simulation Track -- III
This introduction is followed by a set of "vignettes" discussing problem classes which illustrate parallel programming and parallel algorithms
Ordinary Differential Equations
N body Problem by both the O(N^2) and "fast multipole" O(N) methods
Numerical Integration including adaptive methods
Floating Point Arithmetic
Monte Carlo Methods including Random Numbers
Full Matrix Algebra as in Computational Electromagnetism and Computational Chemistry
Partial Differential Equations implemented as sparse matrix problems (as in Computational Fluid Dynamics)
Iterative Algorithms from Gauss Seidel to Conjugate Gradient
Finite Element Methods

Three Major Markets -- Logic, ASIC, DRAM
Overall Roadmap Technology Characteristics from SIA (Semiconductor Industry Association) Report 1994
L=Logic, D=DRAM, A=ASIC, mP = microprocessor
Chip and Package Characteristics
Overall Roadmap Technology Characteristics from SIA (Semiconductor Industry Association) Report 1994
Fabrication Characteristics
Overall Roadmap Technology Characteristics from SIA (Semiconductor Industry Association) Report 1994
Electrical Design and Test Metrics
Overall Roadmap Technology Characteristics from SIA (Semiconductor Industry Association) Report 1994

Technologies for High Performance Computers
We can choose technology and architecture separately in designing our high performance system
Technology is like choosing ants, people or tanks as the basic units in our society analogy -- or, less frivolously, neurons or brains
In the HPCC arena, we can distinguish current technologies:
COTS (Commercial off-the-shelf) Microprocessors
Custom node computer architectures
More generally these are all CMOS technologies
Near term technology choices include Gallium Arsenide or Superconducting materials as opposed to Silicon
These are faster by a factor of 2 (GaAs) to 300 (Superconducting)
Further term technology choices include DNA (Chemical) or Quantum technologies
It will cost $40 Billion for the next industry investment in CMOS plants, and this huge investment makes it hard for new technologies to "break in"
Architectures for High Performance Computers - I
Architecture is equivalent to organization or design in the society analogy
Different models for society (Capitalism etc.) or different types of groupings in a given society
Businesses or Armies are more precisely controlled/organized than a crowd at the State Fair
We will generalize this to formal (army) and informal (crowds) organizations
We can distinguish formal and informal parallel computers
Informal parallel computers are typically "metacomputers", i.e. a bunch of computers sitting on a department network

Architectures for High Performance Computers - II
Metacomputers are a very important trend which uses similar software and algorithms to conventional "MPP's" but typically has less optimized parameters
In particular network latency is higher and bandwidth is lower for an informal HPC
Latency is the time for a zero length communication -- the start up time
Formal high performance computers are the classic (basic) object of study and are "closely coupled" specially designed collections of compute nodes which have (in principle) been carefully optimized and balanced in the areas of:
Processor (computer) nodes
Communication (internal) Network
Linkage of Memory and Processors
I/O (external network) capabilities
Overall Control or Synchronization Structure

There is no Best Machine!
In society, we see a rich set of technologies and architectures:
Ant Hills
Brains as a bunch of neurons
Cities as an informal bunch of people
Armies as formal collections of people
With several different communication mechanisms with different trade-offs:
One can walk -- low latency, low bandwidth
Go by car -- high latency (especially if you can't park), reasonable bandwidth
Go by air -- higher latency and bandwidth than car
Phone -- high speed at long distance but can only communicate modest material (low capacity)

Quantum Computing - I
Quantum-Mechanical Computers by Seth Lloyd, Scientific American, Oct 95
Chapter 6 of The Feynman Lectures on Computation edited by Tony Hey and Robin Allen, Addison-Wesley, 1996
Quantum Computing: Dream or Nightmare? Haroche and Raimond, Physics Today, August 96, page 51
Basically any physical system can "compute", as one "just" needs a system that gives answers that depend on inputs, and all physical systems have this property
Thus one can build "superconducting", "DNA" or "Quantum" computers exploiting respectively superconducting, molecular or quantum mechanical rules

Quantum Computing - II
For a "new technology" computer to be useful, one needs to be able to conveniently prepare inputs, conveniently program, reliably produce answers (quicker than other techniques), and conveniently read out answers
Conventional computers are built around bit (taking values 0 or 1) manipulation
One can build arbitrarily complex arithmetic if one has some way of implementing NOT and AND
Quantum systems naturally represent bits:
A spin (of say an electron or proton) is either up or down
A hydrogen atom is either in the lowest or (first) excited state etc.

Quantum Computing - III
Interactions between quantum systems can cause "spin-flips" or state transitions and so implement arithmetic
Incident photons can "read" the state of a system and so give I/O capabilities
Quantum "bits" called qubits have another property, as one has not only state |0> and state |1> but also coherent states such as .7071*(|0> + |1>) which are equally in either state
Lloyd describes how such coherent states provide new types of computing capabilities
Natural random numbers, as measuring the state of a qubit gives the answer 0 or 1 randomly with equal probability
As Feynman suggests, qubit based computers are natural for large scale simulation of quantum physical systems -- this is "just" analog computing
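To make the last point about measurement concrete, here is a small classical Fortran 90 sketch (purely illustrative and written for this summary -- it only mimics the statistics of measurement and is in no sense a quantum computer): "measuring" a qubit prepared in the coherent state .7071*(|0> + |1>) returns 0 or 1 with probability given by the squared amplitude, i.e. one half each.

  program measure_qubit
    ! Toy classical illustration: count outcomes of repeated "measurements"
    ! of a qubit prepared in the coherent state 0.7071*(|0> + |1>).
    implicit none
    integer, parameter :: ntrials = 100000
    real :: amplitude0, p0, r
    integer :: i, count0
    amplitude0 = 0.7071        ! amplitude of |0> in the coherent state
    p0 = amplitude0**2         ! probability of reading 0 is the squared amplitude
    count0 = 0
    do i = 1, ntrials
       call random_number(r)   ! uniform random number in [0,1)
       if (r < p0) count0 = count0 + 1
    end do
    print *, 'fraction of 0 outcomes =', real(count0)/real(ntrials)
  end program measure_qubit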
Superconducting Technology -- Past
Superconductors produce wonderful "wires" which transmit picosecond (10^-12 seconds) pulses at near the speed of light
Superconducting transmission is lower power and faster than diffusive electron transmission in CMOS
At about 0.35 micron chip feature size, CMOS transmission time changes from domination by transmission (distance) issues to resistive (diffusive) effects
Niobium used in constructing such superconducting circuits can be processed by fabrication techniques similar to CMOS
Josephson Junctions allow picosecond performance switches
BUT IBM (1969-1983) and Japan (MITI 1981-90) terminated major efforts in this area

Superconducting Technology -- Present
New ideas have resurrected this concept using the RSFQ -- Rapid Single Flux Quantum -- approach
This naturally gives a bit which is 0 or 1 (or in fact n units!)
This gives interesting circuits of similar structure to CMOS systems but with a clock speed of order 100-300 GHz -- a factor of 100 better than CMOS, which will asymptote at around 1 GHz (= one nanosecond cycle time)

Superconducting Technology -- Problems
At least two major problems:
The semiconductor industry will invest some $40B in CMOS "plants" and infrastructure
Currently perhaps $100M a year is going into the superconducting circuit area! How do we "bootstrap" a superconducting industry?
Cannot build memory to match CPU speed, and current designs have superconducting CPU's (with perhaps 256 Kbytes of superconducting memory per processor) but conventional CMOS memory
So compared with current computers one has a thousand times faster CPU, a factor of four smaller cache running at CPU speed, and the same speed basic memory as now
Can such machines perform well -- do we need new algorithms? Can one design new superconducting memories?
Superconducting technology also has a bad "name" due to the IBM termination!
Architecture Classes of High Performance Computers
Sequential or von Neumann Architecture
Vector (Super)computers
Parallel Computers with various architectures classified by Flynn's methodology (this is incomplete as it only discusses control or synchronization structure): SISD, MISD, MIMD, SIMD
Metacomputers

von Neumann Architecture in a Nutshell
Instructions and data are stored in the same memory, for which there is a single link (the von Neumann Bottleneck) to the CPU which decodes and executes instructions
The CPU can have multiple functional units
Memory access can be enhanced by use of caches made from faster memory to allow greater bandwidth and lower latency

Illustration of Importance of Cache
Fig 1.14 of Aspects of Computational Science, Editor Aad van der Steen, published by NCF

Vector Supercomputers in a Nutshell - I
This design enhances performance by noting that many applications calculate "vector-like" operations
Such as c(i)=a(i)+b(i) for i=1...N and N quite large
This allows one to address two performance problems:
Latency in accessing memory (e.g. it could take 10-20 clock cycles between requesting a particular memory location and delivery of the result to the CPU)
A complex operation, e.g. a floating point operation, can take a few machine cycles to complete
They are typified by the Cray 1, XMP, YMP, C-90, CDC Cyber 205, ETA-10 and Japanese Supercomputers from NEC, Fujitsu and Hitachi

Vector Supercomputing in a picture
A pipeline for vector addition looks like:
From Aspects of Computational Science -- Editor Aad van der Steen, published by NCF

Vector Supercomputers in a Nutshell - II
Vector machines pipeline data through the CPU
They are not so popular/relevant as in the past as:
Improved CPU architecture needs fewer cycles than before for each (complex) operation (e.g. 4 now, not ~100 as in the past)
The 8 MHz 8087 of the Cosmic Cube took 160 to 400 clock cycles to do a full floating point operation in 1983
Applications need more flexible pipelines which allow different operations to be executed on consecutive operands as they stream through the CPU
Modern RISC processors (superscalar) can support such complex pipelines as they have far more logic than the CPU's of the past
In fact the excellence of, say, the Cray C-90 is due to its very good memory architecture allowing one to get enough operands to sustain the pipeline.
Most workstation class machines have "good" CPU's but can never get enough data from memory to sustain good performance except for a few cache intensive applications
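To make the vector discussion concrete, here is a minimal Fortran 90 sketch (array size and data are arbitrary) of the c(i)=a(i)+b(i) operation, written both as the explicit loop that a vector unit pipelines and in Fortran 90 array syntax -- the data parallel form that HPF builds on later in the course.

  program vector_add
    implicit none
    integer, parameter :: n = 100000
    real, dimension(n) :: a, b, c
    integer :: i
    call random_number(a)
    call random_number(b)
    ! Explicit loop form: a vector (or superscalar) pipeline streams the
    ! consecutive operands a(i), b(i) through the floating point adder.
    do i = 1, n
       c(i) = a(i) + b(i)
    end do
    ! Equivalent Fortran 90 array syntax: the data parallelism is explicit.
    c = a + b
    print *, 'c(1) =', c(1)
  end program vector_add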
Flynn's Classification of HPC Systems
Very High Speed Computing Systems, Proc. of the IEEE 54(12), 1901-1909 (1966) and Some Computer Organizations and their Effectiveness, IEEE Trans. on Computers C-21, 948-960 (1972) -- both papers by M.J. Flynn
SISD -- Single Instruction stream, Single Data Stream -- i.e. von Neumann Architecture
MISD -- Multiple Instruction stream, Single Data Stream -- Not interesting
SIMD -- Single Instruction stream, Multiple Data Stream
MIMD -- Multiple Instruction stream and Multiple Data Stream -- the dominant parallel system, with a ~one to ~one match of instruction and data streams.

Parallel Computer Architecture -- Memory Structure
Memory Structure of Parallel Machines: Distributed, Shared, Cached, and Heterogeneous mixtures
Shared (Global): There is a global memory space, accessible by all processors. Processors may also have some local memory. Algorithms may use global data structures efficiently. However "distributed memory" algorithms may still be important as memory is NUMA (Nonuniform access times)
Distributed (Local, Message-Passing): All memory is associated with processors. To retrieve information from another processor's memory a message must be sent there. Algorithms should use distributed data structures.

Comparison of Memory Access Strategies
Memory can be accessed directly (analogous to a phone call) as in the red lines below, or indirectly by message passing (green line below)
We show two processors in a MIMD machine for distributed (left) or shared (right) memory architectures

Types of Parallel Memory Architectures -- Physical Characteristics
Uniform: All processors take the same time to reach all memory locations.
Nonuniform (NUMA): Memory access is not uniform, so that it takes a different time for a given processor to get data from each memory bank. This is natural for distributed memory machines but also true in most modern shared memory machines
DASH (Hennessy at Stanford) is the best known example of such a virtual shared memory machine, which is logically shared but physically distributed. ALEWIFE from MIT is a similar project
TERA (from Burton Smith) is a uniform memory access and logically shared memory machine
Most NUMA machines these days have two memory access times: local memory (divided into registers, caches etc.) and nonlocal memory, with little or no difference in access time for different nonlocal memories
This simple two level memory access model gets more complicated in proposed 10 year out "petaflop" designs

Diagrams of Shared and Distributed Memories

Parallel Computer Architecture -- Control Structure
SIMD -- lockstep synchronization; each processor executes the same instruction stream
MIMD -- each processor executes independent instruction streams
MIMD synchronization can take several forms:
Simplest: program controlled message passing
"Flags" (barriers, semaphores) in memory -- a typical shared memory construct, as in the locks seen in Java Threads
Special hardware -- as in caches and their coherency (coordination between nodes)

Some Major Hardware Architectures - MIMD
MIMD Distributed Memory: this is now best illustrated by a collection of computers on a network (i.e. a metacomputer)
MIMD with logically shared memory but usually physically distributed. The latter is sometimes called distributed shared memory.
In the near future, ALL formal (closely coupled) MPP's will be distributed shared memory
Note all computers (e.g. the current MIMD distributed memory IBM SP2) allow any node to get at any memory, but this is done indirectly -- you send a message
In future "closely-coupled" machines, there will be built in hardware supporting the function that any node can directly address all memory of the system
This distributed shared memory architecture is currently of great interest to (and a major challenge for) parallel compilers
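A minimal sketch of this indirect access on a distributed memory machine, using the Fortran binding of MPI (covered later in the course); the value, message tag and node numbers are arbitrary. Node 1 cannot read node 0's memory directly -- it obtains the value by an explicit send and receive.

  program remote_read
    implicit none
    include 'mpif.h'
    integer :: rank, ierr, status(MPI_STATUS_SIZE)
    real :: x
    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    if (rank == 0) then
       x = 3.14                 ! this datum lives in node 0's memory
       call MPI_SEND(x, 1, MPI_REAL, 1, 99, MPI_COMM_WORLD, ierr)
    else if (rank == 1) then
       call MPI_RECV(x, 1, MPI_REAL, 0, 99, MPI_COMM_WORLD, status, ierr)
       print *, 'node 1 received', x, 'from node 0'
    end if
    call MPI_FINALIZE(ierr)
  end program remote_read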
MIMD Distributed Memory Architecture
A special case of this is a network of workstations (NOW's) or personal computers (a metacomputer)
Issues include:
Node -- CPU, Memory
Network -- Bandwidth, Memory
Hardware Enhanced Access to distributed Memory

Some Major Hardware Architectures - SIMD
SIMD -- Single Instruction Multiple Data -- can have logically distributed or shared memory
Examples are the CM-1,2 from Thinking Machines, and the AMT DAP and Maspar which are currently focussed entirely on accelerating parts of database indexing
This architecture is of decreasing interest as it has reduced functionality without a significant cost advantage compared to MIMD machines
The cost of synchronization in MIMD machines is not high!
The main interest of SIMD is flexible bit arithmetic, as the processors are "small"; but as transistor densities get higher this also becomes less interesting, as full function 64 bit CPU's use only a small fraction of the silicon of a modern computer

SIMD (Single Instruction Multiple Data) Architecture
CM2 -- 64K processors with 1 bit arithmetic -- hypercube network, broadcast network, can also combine, "global or" network
Maspar, DECmpp -- 16K processors with 4 bit (MP-1) or 32 bit (MP-2) arithmetic, fast two-dimensional mesh and slower general switch for communication

Some Major Hardware Architectures - Mixed
Also have heterogeneous compound architectures (metacomputers) gotten by arbitrary combinations of MIMD or SIMD, Sequential or Parallel machines.
Metacomputers can vary from full collections of several hundred PC's/Settop boxes on the (future) World Wide Web to a CRAY C-90 connected to a CRAY T3D
This is a critical future architecture which is intrinsically distributed memory, as multi-vendor heterogeneity implies that one cannot have special hardware enhanced shared memory
Note that this can be a MIMD collection of SIMD machines if one has a set of Maspar's on a network
One can think of the human brain as a SIMD machine, and then a group of people is such a MIMD collection of SIMD processors

Some MetaComputer Systems
Cluster of workstations or PC's
Heterogeneous MetaComputer System

Comments on Special Purpose Devices
One example is an Associative memory -- SIMD or MIMD or content addressable memories
This is an example of a special purpose "signal" processing machine which can in fact be built from "conventional" SIMD or "MIMD" architectures
This type of machine is not so popular as most applications are not dominated by computations for which good special purpose devices can be designed
If only 10% of a problem is say "track-finding" or some other special purpose processing, then who cares if you reduce that 10% by a factor of 100? You have only sped up the system by a factor of 1.1, not by 100!

The GRAPE N-Body Machine
N body problems (e.g. Newton's laws for one million stars in a globular cluster) can have successful special purpose devices
See the GRAPE (GRAvity PipE) machine (Sugimoto et al. Nature 345, page 90, 1990)
The essential reason is that such problems need much less memory per floating point unit than most problems:
Globular Cluster: 10^6 computations per datum stored
Finite Element Iteration: A few computations per datum stored
The rule of thumb is that one needs one gigabyte of memory per gigaflop of computation in general problems, and this general design puts most cost into memory, not into the CPU.
Note GRAPE uses EXACTLY the same parallel algorithm that one finds in the books (e.g. Solving Problems on Concurrent Processors) for N-body problems on classic distributed memory MIMD machines
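For reference, here is a sketch of the O(N^2) force kernel that GRAPE pipelines in hardware (Fortran 90; the gravitational constant is set to 1 and the softening parameter eps, which avoids division by zero, is illustrative). Every particle accumulates the acceleration due to every other particle, which is why the method needs so little memory per floating point operation.

  subroutine accelerations(n, x, y, z, mass, ax, ay, az)
    implicit none
    integer, intent(in) :: n
    real, intent(in)  :: x(n), y(n), z(n), mass(n)
    real, intent(out) :: ax(n), ay(n), az(n)
    real, parameter :: eps = 1.0e-4   ! softening to avoid 1/0 (illustrative)
    real :: dx, dy, dz, r2, f
    integer :: i, j
    ax = 0.0; ay = 0.0; az = 0.0
    do i = 1, n
       do j = 1, n
          if (j /= i) then
             dx = x(j) - x(i); dy = y(j) - y(i); dz = z(j) - z(i)
             r2 = dx*dx + dy*dy + dz*dz + eps*eps
             f  = mass(j) / (r2 * sqrt(r2))      ! G set to 1
             ax(i) = ax(i) + f*dx
             ay(i) = ay(i) + f*dy
             az(i) = az(i) + f*dz
          end if
       end do
    end do
  end subroutine accelerations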
Why isn't GRAPE a Perfect Solution?
GRAPE will execute the classic O(N^2) (parallel) N body algorithm BUT this is not the algorithm used in most such computations
Rather there is the O(N) or O(N log N) so called "fast multipole" algorithm which uses a hierarchical approach
On one million stars, fast multipole is a factor of 100-1000 faster than the GRAPE algorithm
Fast multipole works in most but not all N-body problems (in globular clusters, extreme heterogeneity makes the direct O(N^2) method most attractive)
So special purpose devices cannot usually take advantage of new nifty algorithms!

Granularity of Parallel Components - I
Coarse-grain: Task is broken into a handful of pieces, each executed by powerful processors. Pieces and processors may be heterogeneous. Computation/communication ratio very high -- Typical of Networked Metacomputing
Medium-grain: Tens to a few thousands of pieces, typically executed by microprocessors. Processors typically run the same code (SPMD Style). Computation/communication ratio often hundreds or more. Typical of MIMD Parallel Systems such as SP2, CM5, Paragon, T3D

Granularity of Parallel Components - II
Fine-grain: Thousands to perhaps millions of small pieces, executed by very small, simple processors (several per chip) or through pipelines. Processors often have instructions broadcast to them. Computation/communication ratio often near unity. Typical of SIMD but seen in a few MIMD systems such as Kogge's Execube, Dally's J Machine or the commercial Myrinet (Seitz)
This is going to be very important in future petaflop architectures as the dense chips of year 2003 onwards favor this Processor in Memory Architecture
So many "transistors" in future chips that the "small processors" of the "future" will be similar to today's high end microprocessors
As chips get denser, it is not realistic to put processors and memories on separate chips as granularities become too big
Note that a machine of given granularity can be used on algorithms of the same or finer granularity

Classes of Communication Networks
The last major architectural feature of a parallel machine is the network, or design of hardware/software connecting processors and memories together.
Bus: All processors (and memory) connected to a common bus or busses. Memory access fairly uniform, but not very scalable due to contention
Bus machines can be NUMA if memory consists of directly accessed local memory as well as memory banks accessed by the Bus. The Bus accessed memories can be local memories on other processors
Switching Network: Processors (and memory) connected to routing switches like in the telephone system. Switches might have queues and "combining logic", which improve functionality but increase latency. Switch settings may be determined by message headers or preset by a controller. Connections can be packet-switched (messages no longer than some fixed size) or circuit-switched (the connection remains as long as needed). Usually NUMA, blocking, often scalable and upgradable
Switch and Bus based Architectures (figures: Switch, Bus)

Examples of Interconnection Topologies
Two dimensional grid, Binary tree, complete interconnect and 4D Hypercube.
Communication (operating system) software ensures that the system appears fully connected even if the physical connections are partial

Useful Concepts in Communication Systems
Useful terms include:
Scalability: Can the network be extended to very large systems? Related to wire length (synchronization and driving problems) and degree (pinout)
Fault Tolerance: How easily can the system bypass a faulty processor, memory, switch, or link? How much of the system is lost by a fault?
Blocking: Some communication requests may not get through, due to conflicts caused by other requests.
Nonblocking: All communication requests succeed. Sometimes this just applies as long as no two requests are for the same memory cell or processor.
Latency (delay): Maximal time for a nonblocked request to be transmitted.
Bandwidth: Maximal total rate (MB/sec) of system communication, or subsystem-to-subsystem communication. Sometimes determined by cutsets, which cut all communication between subsystems. Often useful in providing lower bounds on the time needed for a task.
Worm Hole Routing: Intermediate switch nodes do not wait for the full message but allow it to pass through in small packets

Communication Performance of Some MPP's
From Aspects of Computational Science, Editor Aad van der Steen, published by NCF

System                  Communication Speed        Computation Speed
                        (Mbytes/sec per link)      (Mflops/sec per node)
IBM SP2                         40                        267
Intel iPSC860                    2.8                       60
Intel Paragon                  200                         75
Kendall Square KSR-1            17.1                       40
Meiko CS-2                     100                        200
Parsytec GC                     20                         25
TMC CM-5                        20                        128
Cray T3D                       150                        300

Implication of Hardware Performance
tcomm = 4 or 8 / (Speed in Mbytes/sec), as there are 4 or 8 bytes in a floating point word
tfloat = 1 / (Speed in Mflops/sec)
Thus tcomm / tfloat is just 4 X Computation Speed divided by Communication Speed
tcomm / tfloat is 26.7, 85, 1.5, 9.35, 8, 5, 25.6, 8 for the machines SP2, iPSC860, Paragon, KSR-1, Meiko CS-2, Parsytec GC, TMC CM-5, and Cray T3D respectively
Latency makes the situation worse for small messages, and tcomm/tfloat doubles for the 64 bit arithmetic natural on large problems!
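The tcomm/tfloat values quoted above are simple arithmetic; the following Fortran 90 fragment reproduces them from the table, assuming 32 bit (4 byte) floating point words.

  program comm_to_float_ratio
    implicit none
    real, dimension(8) :: comm_speed, comp_speed, ratio
    ! Mbytes/sec per link and Mflops/sec per node, in the order
    ! SP2, iPSC860, Paragon, KSR-1, Meiko CS-2, Parsytec GC, TMC CM-5, Cray T3D
    comm_speed = (/  40.0,  2.8, 200.0, 17.1, 100.0, 20.0,  20.0, 150.0 /)
    comp_speed = (/ 267.0, 60.0,  75.0, 40.0, 200.0, 25.0, 128.0, 300.0 /)
    ratio = 4.0 * comp_speed / comm_speed   ! tcomm/tfloat, elementwise
    print *, 'tcomm/tfloat:', ratio
  end program comm_to_float_ratio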
Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of September 5 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100
Abstract of Sept 5 1996 CPS615 Lecture
This starts by considering the analytic form for communication overhead and illustrates its stencil dependence in simple local cases -- stressing the relevance of grain size (a small numerical sketch of this analytic form is given after these lecture abstracts)
The implication for scaling and generalizing from the Laplace example is covered
We covered scaled speedup (fixed grain size) as well as fixed problem size
We noted some useful material was missing and this was continued in the next lecture (Sept 10, 96)
The lecture starts coverage of computer architecture, covering base technologies, with CMOS (covered in an earlier lecture) contrasted to Quantum and Superconducting technology

Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of September 10 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100
Abstract of Sept 10 1996 CPS615 Lecture
This starts by filling in details of communication overhead in parallel processing for the case where the "range" of interaction is large
We show two old examples from Caltech illustrating the correctness of the analytic form
We return to the discussion of computer architectures describing Vector Supercomputers
General relevance of data locality and pipelining
Flynn's classification (MIMD, SIMD etc.)
Memory Structures
Initial issues in the MIMD and SIMD discussion

Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of September 12 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100
Abstract of Sept 12 1996 CPS615 Lecture
This continues the computer architecture discussion with MIMD and SIMD with distributed shared memory
MetaComputers
Special Purpose Architectures
Granularity, with technological changes forcing larger grain sizes
Overview of Communication Networks with switches versus topologies versus buses
Typical values in today's machines

Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of September 24 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100
Abstract of Sept 24 1996 CPS615 Lecture
This continues the discussion of Fortran 90 with a set of overview remarks on each of the key new capabilities of this language
We also comment on the value of Fortran90/HPF in a world that will switch to Java
We digress to discuss a general theory of problem architectures, as this explains such tradeoffs as HPCC v Software Engineering and HPF versus MPI, and the types of applications each software model is designed to address
(Note server errors at the start which confuse the audio)

Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of September 26 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100
Abstract of Sept 26 1996 CPS615 Lecture
This quickly completes the discussion of problem architecture, but rather than continuing the qualitative discussion of HPF applications in the notes, it jumps to a discussion of the HPF language describing:
Basic approach to parallelism with the "owner-computes" rule
Types of new constructs, with TEMPLATE, ALIGN and PROCESSORS described
The lecture started with a description of the Web based Programming Laboratory developed by Kivanc Dincer
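A minimal sketch of the analytic overhead form referred to in the Sept 5 and Sept 10 abstracts, for a two dimensional stencil problem such as Laplace's equation. With grain size n (grid points stored per processor), each processor updates n points but communicates only the roughly 4*sqrt(n) edge points, so the fractional communication overhead falls like (tcomm/tfloat)/sqrt(n); the speedup on p processors is then roughly p/(1 + overhead). The constant in front depends on the stencil and on the floating point work per point, so the numbers below are illustrative only.

  program scaling_sketch
    implicit none
    real :: tcomm_over_tfloat, n, p, overhead, speedup
    tcomm_over_tfloat = 25.0    ! hardware ratio, e.g. from the table given earlier
    p = 256.0                   ! number of processors
    n = 10000.0                 ! grain size: grid points per processor
    overhead = 4.0 * tcomm_over_tfloat / sqrt(n)   ! illustrative stencil constant
    speedup  = p / (1.0 + overhead)
    print *, 'fractional overhead =', overhead, '  speedup =', speedup
  end program scaling_sketch

Note that at fixed grain size n the overhead is independent of p, which is the scaled speedup observation made in the Sept 5 lecture.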
Embarrassingly Parallel Problem Class
Loosely Synchronous, Synchronous or Asynchronous classify problems by their "control" or "synchronization" structure
However there is an important class of problems where this does not matter, as the synchronization overhead -- even if in the difficult asynchronous case -- is irrelevant
This is when the overhead is small or zero. These are "embarrassingly parallel problems" where each component of the decomposed problem is essentially independent
Examples are:
Financial Modelling where each component is calculating some expected value for a particular (possibly Monte Carlo) set of assumptions about the future
OLTP (Online Transaction Processing) where each component is a separate check of a credit card transaction against the account data
Graphics rendering where each component is the calculation of the color of a particular pixel.

CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- HPCC Software Technologies HPF and MPI
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100
Abstract of CPS615 HPCC Software Technologies
We go through the 2D Laplace's Equation with both HPF and MPI for Simple Jacobi Iteration
HPF and Fortran90 are reviewed, followed by MPI
We also discuss the structure of problems, as these determine why and when certain software approaches are appropriate

Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of October 1 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100
Abstract of Oct 1 1996 CPS615 Lecture
This continues the discussion of HPF in the area of distribution and ALIGN statements.
The discussion of ALIGN should be improved, as the audio makes dubious statements about "broadcasting" information.
The distribution discussion includes a reasonable description of block and cyclic distributions and when you should use them.
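As a concrete, deliberately minimal illustration of the HPF approach to the 2D Laplace example (grid size, processor arrangement, boundary values and iteration count below are arbitrary, and convergence testing is omitted): the Jacobi update is written in Fortran 90 array syntax, and the HPF directives assert a BLOCK distribution; the compiler applies the owner-computes rule and generates the edge communication needed by the 5-point stencil.

  program jacobi_laplace
    implicit none
    integer, parameter :: n = 256, niter = 100
    real, dimension(n,n) :: u, unew
  !HPF$ PROCESSORS procs(4,4)
  !HPF$ DISTRIBUTE u(BLOCK,BLOCK) ONTO procs
  !HPF$ ALIGN unew(i,j) WITH u(i,j)
    integer :: k
    u = 0.0
    u(1,:) = 1.0                ! a fixed boundary value on one edge (illustrative)
    do k = 1, niter
       ! each processor updates the interior points it owns (owner-computes)
       unew(2:n-1,2:n-1) = 0.25 * ( u(1:n-2,2:n-1) + u(3:n,2:n-1) &
                                  + u(2:n-1,1:n-2) + u(2:n-1,3:n) )
       u(2:n-1,2:n-1) = unew(2:n-1,2:n-1)
    end do
    print *, 'u(n/2,n/2) =', u(n/2,n/2)
  end program jacobi_laplace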
Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of October 3 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100

Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of October 10 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100

Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of October 15 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100

Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of October 22 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100

Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of October 24 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100

Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of October 31 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100

Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of November 7 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100

Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of November 8 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100

Abstract of Oct 24 1996 CPS615 Lecture
This covers two topics: Monte Carlo Integration for large scale problems, using experimental and theoretical high energy physics as an example
This includes accept-reject methods, uniform weighting and parallel algorithms
Then we complete the HPF discussion with the embarrassingly parallel DO INDEPENDENT discussed in the Monte Carlo case (a small sketch is given after these abstracts)
And HPF2 changes

Abstract of Oct 22 1996 CPS615 Lecture
This starts by finishing the simple overview of statistics
Covering Gaussian Random Numbers, Numerical Generation of Random Numbers both sequentially and in parallel
Then we describe the central limit theorem which underlies the Monte Carlo method
Then it returns to Numerical Integration with the first part of the discussion of Monte Carlo Integration

Abstract of Oct 15 1996 CPS615 Lecture
This finishes the last part of the N body and ODE discussion, focussing on the pipeline data parallel algorithm
Note several foils were changed after the presentation and so the discussion is a little disconnected from the foils at times
We start Numerical Integration with a basic discussion of Newton-Cotes formulae (including the Trapezoidal and Simpson's rules)
We illustrate them pictorially

Abstract of Nov 7 1996 CPS615 Lecture
This completes the MPI general discussion with the basic message passing, collective communication and some advanced features
It then returns to the Laplace Example foilset to show how MPI can be used here
We have previously used this for HPF and performance analysis

Abstract of Nov 8 1996 CPS615 Lecture
This starts the basic module on Partial Differential Solvers with:
Introduction to Classification of Equations
Basic Discretization
Derivation of Sparse Matrix Formulation
Analogies of Iteration with Artificial Time
Start of Explicit Matrix Formulation for Simple Cases
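A minimal sketch of the embarrassingly parallel Monte Carlo idea mentioned in the Oct 24 abstract; the example (estimating pi by counting random points falling inside the quarter circle) and all sizes are illustrative. Each sample is independent of every other, which is exactly what the HPF INDEPENDENT directive asserts; HPF 2.0 additionally provides a REDUCTION clause so the count could be accumulated directly inside the loop.

  program monte_carlo_pi
    implicit none
    integer, parameter :: nsamples = 100000
    real, dimension(nsamples) :: x, y
    integer, dimension(nsamples) :: inside
    integer :: i, hits
    call random_number(x)
    call random_number(y)
  !HPF$ INDEPENDENT
    do i = 1, nsamples          ! every iteration touches only its own element
       if (x(i)*x(i) + y(i)*y(i) <= 1.0) then
          inside(i) = 1
       else
          inside(i) = 0
       end if
    end do
    hits = sum(inside)          ! a reduction, done here outside the loop
    print *, 'pi estimate =', 4.0 * real(hits) / real(nsamples)
  end program monte_carlo_pi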
Abstract of Oct 3 1996 CPS615 Lecture
Tom Haupt on Fortran 90 / HPF

Abstract of Oct 10 1996 CPS615 Lecture
This discusses the solution of systems of ordinary differential equations (ODE) in the context of N squared particle dynamics problems
We start with motivation, with a brief discussion of relevant problems
We go through the basic theory of ODE's including Euler, Runge-Kutta, Predictor-Corrector and Multistep Methods
We begin the discussion of solving N body problems using the classic Connection Machine elegant but inefficient algorithm
Note -- Some foils were expanded to two after the talk and the second one is given without audio in the script

Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of November 14 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100

Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of November 26 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100

Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of December 5 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100

Abstract of Nov 14 1996 CPS615 Lecture
This started with a description of the current Web set-up of CPS615 and other foilsets
Then we started the foilset describing Physical Simulations and the various approaches -- Continuum Physics, Monte Carlo, Quantum Dynamics, and Computational Fluid Dynamics
For CFD we do enough to discuss why viscosity and high Reynolds numbers are critical in air and similar media
We discuss the computation and communication needs of CFD compared to the Laplace equation

Abstract of Nov 26 1996 CPS615 Lecture
This covers essentially all of the finite element method and its solution using the conjugate gradient method
Only the simple 2D Laplace equation using triangular nodes is discussed
We stress the variational method as an optimization method and use this analogy to motivate conjugate gradient as an improved steepest descent approach
We discuss parallel computing issues for both finite element and conjugate gradient

Abstract of Dec 5 1996 CPS615 Lecture
This lecture covers two distinct areas.
Firstly a short discussion of Linear Programming -- what type of problems it's used for, what the equations look like, and basic issues in the difficult use of parallel processing
Then we give an abbreviated discussion of Full Matrix algorithms covering:
The types of applications that use them
Matrix Multiplication including Cannon's algorithm in detail
Use of MPI primitives including communicator groups
Performance Analysis
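As a pointer into the full matrix discussion of the Dec 5 abstract, here is the sequential matrix multiplication kernel in Fortran 90 (sizes arbitrary, result checked against the MATMUL intrinsic). Cannon's algorithm performs the same arithmetic on a two dimensional mesh of processors: each processor holds one block of A, B and C, and the A and B blocks are circulated around the mesh so that every processor eventually sees the blocks it needs -- only the data movement changes, not the kernel below.

  program matmul_sketch
    implicit none
    integer, parameter :: n = 200
    real, dimension(n,n) :: a, b, c
    integer :: i, j, k
    call random_number(a)
    call random_number(b)
    c = 0.0
    do j = 1, n                 ! loop order chosen for Fortran's column-major storage
       do k = 1, n
          do i = 1, n
             c(i,j) = c(i,j) + a(i,k) * b(k,j)
          end do
       end do
    end do
    print *, 'max difference from MATMUL =', maxval(abs(c - matmul(a,b)))
  end program matmul_sketch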