New CPS615 Foils -- 25 March 95

CPS615 -- Base Course for the Simulation Track of Computational Science
Fall Semester 1995 -- Foilsets A
Geoffrey Fox
NPAC, Room 3-131 CST, 111 College Place, Syracuse NY 13244-4100

Contents of Foilsets A of CPS615 Computational Science
  Technology Driving Forces for HPCC
  Overview of What and Why is Computational Science
    (Needs to be expanded with further remarks on the Information track and degree/certificate requirements)
  Elementary Discussion of Parallel Computing in the "real world" -- Hadrian's Wall example
  Sequential Computer Architecture

The Technology Driving Forces for HPCC
  Effect of Feature Size on Performance
  Growing Logic Chip Density
  Trends in Feature and Die Size as a Function of Time

Supercomputer Memory Sizes and Trends in RAM Density
  RAM density increases by about a factor of 50 every 8 years
  Supercomputers in 1992 have memory sizes around 32 gigabytes (giga = 10^9)
  Supercomputers in the year 2000 should have memory sizes around 1.5 terabytes (tera = 10^12)

Comparison of Trends in RAM Density and CPU Performance Increases
  Computer performance is increasing faster than RAM density

National Roadmap for Semiconductor Technology -- 1992
  See Chapter 5 of the Petaflops Report -- July 95

CMOS Technology and Parallel Processor Chip Projections
  See Chapter 5 of the Petaflops Report -- July 95

What and Why is Computational Science?
  Parallelism implies major changes, which have significant educational implications:
    Different machines
    New types of computers
    New libraries
    Rewritten applications
    Totally new fields able to use computers ...
    ==> Need new educational initiatives
  Computational Science will be a nucleus for the phase transition and will accelerate the use of parallel computers in the real world

Program in Computational Science -- Implemented within the Current Academic Framework

Program in Information Age Computational Science -- Implemented within the Current Academic Program

Elementary Discussion of Parallel Computing
  Single nCUBE2 CPU chip
  64-node nCUBE board -- each node is a CPU and 6 memory chips; the CPU chip integrates communication channels with the floating-point, integer, and logical CPU functions
  CM-5 in the NPAC machine room -- 32-node CM-5 with the old CM-2 diskvault in the foreground

Basic METHODOLOGY of Parallel Computing
  Domain decomposition is simple, but general and extensible to many more nodes
  All successful concurrent machines with
    Many nodes
    High performance (this excludes dataflow)
  have obtained their parallelism from "data parallelism" or "domain decomposition"
  A problem is an algorithm applied to a data set; it obtains concurrency by acting on the data concurrently, as in the sketch below.
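To make domain decomposition concrete, here is a minimal C sketch; it is an illustration under assumed values, not code from the course. A one-dimensional array of NCELLS cells is split into equal blocks across NODES processors, and each node updates only the cells it owns; the nodes are simulated by an outer loop on one CPU, and a three-point average stands in for a real application.

    /* Minimal sketch of data-parallel domain decomposition.
     * NCELLS and NODES are assumed values; the "nodes" are simulated
     * by an ordinary loop rather than real processors. */
    #include <stdio.h>

    #define NCELLS 16                 /* total problem size (assumed)   */
    #define NODES   4                 /* number of processors (assumed) */

    int main(void)
    {
        double u_old[NCELLS + 2], u_new[NCELLS + 2]; /* one ghost cell per end */
        int node, i;

        for (i = 0; i < NCELLS + 2; i++)             /* simple initial data */
            u_old[i] = (double) i;

        for (node = 0; node < NODES; node++) {       /* each node, "in parallel" */
            int lo = node * (NCELLS / NODES) + 1;    /* first cell this node owns */
            int hi = lo + NCELLS / NODES - 1;        /* last cell this node owns  */

            /* Local update: a three-point average over owned cells only.
             * On a real machine, u_old[lo-1] and u_old[hi+1] would arrive
             * from the neighbouring nodes by message passing. */
            for (i = lo; i <= hi; i++)
                u_new[i] = (u_old[i - 1] + u_old[i] + u_old[i + 1]) / 3.0;

            printf("node %d owns cells %d..%d\n", node, lo, hi);
        }
        return 0;
    }

On a distributed-memory machine such as the nCUBE2, each node would store only its own block plus the two ghost cells; the decomposition arithmetic is unchanged.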
The three architectures considered here differ as follows:
  MIMD Distributed Memory -- processing and data distributed
  MIMD Shared Memory -- processing distributed but data shared
  SIMD Distributed Memory -- synchronous processing on distributed data

Concurrent Computation as a Mapping Problem - I
  Two different types of mappings in physical spaces; both are static:
  a) Seismic migration with domain decomposition on 4 nodes
  b) Universe simulation with irregular data but a static 16-node decomposition; this problem would be best served by a dynamic irregular decomposition

Concurrent Computation as a Mapping Problem - II
  A different type of mapping -- a very dynamic case without any underlying physical space:
  c) Computer chess with a dynamic game tree decomposed onto 4 nodes

Concurrent Computation as a Mapping Problem - III
  Finite element mesh from NASTRAN (mesh shown only in the upper half)
  A simple equal-area decomposition, and the corresponding poor workload balance
  The decomposition after annealing (one particularly good but nonoptimal decomposition), and its excellent workload balance

Parallel Processing and Society
  The fundamental principles behind the use of concurrent computers are identical to those used in society -- in fact, they are partly why society exists.
  If a problem is too large for one person, one does not hire a SUPERman, but rather puts together a team of ordinary people...
  cf. the construction of Hadrian's Wall

Concurrent Construction of a Wall Using N = 8 Bricklayers
  Decomposition by vertical sections
  Domain decomposition is the key to parallelism
  Need "large" subdomains: l >> l_overlap

Quantitative Speed-Up Analysis for the Construction of Hadrian's Wall

Amdahl's Law for Real-World Parallel Processing
  AMDAHL's LAW, or "too many cooks spoil the broth"
  Says that the speedup S = eN is small if the efficiency e is small, or, for Hadrian's Wall, equivalently, that S is small if the wall length l is small
  But this is irrelevant, as we do not need parallel processing unless the problem is big!

Pipelining -- Another Parallel Processing Strategy for Hadrian's Wall
  "Pipelining", or decomposition by horizontal sections, is in general less effective and leads to less parallelism
  (N = the number of bricklayers must be less than the number of layers of bricks)

Hadrian's Wall Illustrates that the Topology of the Processor Must Include the Topology of the Problem
  Hadrian's Wall is one-dimensional
  Humans represent a flexible processor node that can be arranged in different ways for different problems
  The lesson for computing:
    The original MIMD machines used a hypercube topology
    The hypercube includes several topologies, including all meshes; it is a flexible concurrent computer that can tackle a broad range of problems
    Current machines use interconnect structures different from the hypercube but preserve this capability

General Speed-Up Analysis Comparing the Computer and Hadrian's Wall Cases
  Comparison of the complete problem to the subproblems formed in domain decomposition

The Case of Programming a Hypercube
  Each node runs software that is similar to the sequential code, e.g., FORTRAN with the geometry and boundary-value sections changed

Hadrian's Wall Illustrating an Irregular but Homogeneous Problem
  The geometry is irregular, but each brick takes about the same amount of time to lay.
  Decomposing the wall for an irregular geometry involves equalizing the number of bricks per mason, not the length of wall per mason; a sketch of this balancing follows.
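As a concrete (and hypothetical) illustration of equalizing bricks rather than length, the C sketch below partitions contiguous wall sections among masons with a simple greedy rule; the section data and the number of masons are assumed. The foils' heavier tools (simulated annealing, spectral bisection, neural networks) attack the same balancing problem for irregular meshes in two or three dimensions.

    /* Hypothetical sketch: balance an irregular but homogeneous wall by
     * giving each mason contiguous sections containing roughly equal
     * numbers of bricks.  Section data and NMASONS are assumed. */
    #include <stdio.h>

    #define NSECTIONS 10
    #define NMASONS    4

    int main(void)
    {
        /* bricks in each unit-length section of an irregular wall */
        int bricks[NSECTIONS] = { 40, 60, 30, 70, 50, 50, 20, 80, 60, 40 };
        int total = 0, cum = 0, load = 0;
        int s, mason = 0, start = 0;

        for (s = 0; s < NSECTIONS; s++)
            total += bricks[s];

        for (s = 0; s < NSECTIONS; s++) {
            cum  += bricks[s];
            load += bricks[s];
            /* close this mason's range once the cumulative brick count
             * passes the mason's ideal share of the total */
            if (mason < NMASONS - 1 && cum * NMASONS >= (mason + 1) * total) {
                printf("mason %d: sections %d..%d, %d bricks\n",
                       mason, start, s, load);
                mason++;
                start = s + 1;
                load  = 0;
            }
        }
        printf("mason %d: sections %d..%d, %d bricks\n",
               mason, start, NSECTIONS - 1, load);
        return 0;
    }

With the data shown, the four masons receive 130, 120, 150, and 100 bricks -- roughly balanced despite the very unequal lengths of wall they cover.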
Some Problems are Inhomogeneous
  Illustrated by an inhomogeneous Hadrian's Wall with decoration
  The fundamental entities (bricks, gargoyles) are of different complexity
  The best decomposition is dynamic
  Inhomogeneous problems run on concurrent computers but require dynamic assignment of work to nodes and strategies to optimize this (we use neural networks, simulated annealing, spectral bisection, etc.)

Global and Local Parallelism Illustrated by Hadrian's Wall
  Global parallelism:
    Break up the domain
    The amount of parallelism is proportional to the size of the problem (and is usually large)
    The unit is a bricklayer or a computer node
  Local parallelism:
    Perform in parallel the local operations in the processing of the basic entities
    e.g., for Hadrian's problem, use two hands, one for the brick and one for the mortar, while for the computer case, do an addition at the same time as a multiplication
    Local parallelism is limited but useful
  Local and global parallelism should both be exploited

Parallel I/O Illustrated by Concurrent Brick Delivery for Hadrian's Wall
  The bandwidth of the trucks and roads matches that of the masons
  Disk (input/output) technology is better matched to several modest-power processors than to a single sequential supercomputer
  Concurrent computers are natural for databases and transaction analysis

Nature's Concurrent Computers
  At the finest resolution, a collection of neurons sending and receiving messages by axons and dendrites
  At a coarser resolution, society is a collection of brains sending and receiving messages by sight and sound
  An ant hill is a collection of ants (smaller brains) sending and receiving messages by chemical signals
  Lesson: all Nature's computers use message passing, with several different architectures

Comparison of Concurrent Processing in Society and Computing
  Problems are large -- use domain decomposition
  Overheads are edge effects
  The topology of the processor should match that of the domain; a processor with a rich, flexible node and topology matches most domains
  Regular homogeneous problems are easiest, but irregular or inhomogeneous problems can also be handled
  One can use both local and global parallelism
  One can handle concurrent calculation and I/O
  Nature always uses message passing, as in parallel computers (at the lowest level)

Sequential Computer Architecture

Instruction Flow in a Simple Machine Pipeline
  Three instructions are shown overlapped, each starting one clock cycle after the last; a worked timing estimate follows below

Examples of Superpipelined (a) and Superscalar (b) Machine Pipelines
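To quantify the overlap in a simple machine pipeline, here is a standard textbook estimate, not taken from the foils; the stage count k and instruction count n are free parameters.

    % Idealized k-stage pipeline executing n instructions,
    % one stage per clock cycle.
    \[
      T_{\mathrm{serial}} = nk, \qquad
      T_{\mathrm{pipe}} = k + (n - 1), \qquad
      S = \frac{T_{\mathrm{serial}}}{T_{\mathrm{pipe}}}
        = \frac{nk}{k + n - 1} \;\longrightarrow\; k
      \quad (n \to \infty).
    \]

For the three overlapped instructions shown on the foil, with an assumed four-stage pipeline (k = 4, n = 3), T_pipe = 4 + 2 = 6 cycles against 12 cycles unpipelined, a speedup of 2; superpipelined and superscalar designs raise this further by deepening the pipeline or issuing several instructions per cycle.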