Fox Presentation Spring 96
A Short Overview of HPCC
From Gigaflops to Petaflops; from Tightly Coupled MPPs to the World Wide Web
http://www.npac.syr.edu/users/gcf/cornellhpcc96/index.html
CRPC MRA ReTooling Project, Cornell, May 7, 1996
Geoffrey Fox, NPAC, Syracuse University, 111 College Place, Syracuse NY 13244-4100

Abstract of a Short Overview of HPCC
See SCCS-736 for an overview of parallel computing and SCCS-750 for distributed computing.
We discuss current and near-future architectures as well as the quite different trends expected ten years from now.
The COTS philosophy (see SCCS-758 and SCCS-732 for Web software) dominates both hardware and software, as success demands that niche applications leverage bigger fields.
Latency tolerance will be an essential feature of future algorithms and software.
Data parallelism is essential for success on large machines, but current compilers are struggling.
Coordination or integration software is thriving.

Some Important Trends -- COTS is King!
Everybody now believes in COTS -- commercial off-the-shelf technology -- one must use commercial building blocks for any specialized system, whether it be a DoD weapons program or a high-end supercomputer. These are both niche markets!
COTS for hardware can be applied to a greater or lesser extent.
Gordon Bell's SNAP system says we will only have ATM networks of PCs running Windows NT.
SGI, HP and IBM will take commodity processor nodes but link them with custom switches (with different versions of distributed shared memory support).
COTS for software is less common, but I expect it to become much more common.
HPF producing HTTP rather than MPI, with Java visualization, is an example of software COTS.

Comments on COTS for Hardware
Currently MPPs have COTS processors and specialized networks, but this could reverse.
Pervasive ATM will indeed lead to COTS networks. BUT:
Current microprocessors are roughly near optimal in terms of megaflops per unit area of silicon. BUT:
As explicit parallelism is shunned by modern microprocessors, silicon is used for wasteful speculative execution, with the expectation that future systems will move to 8-way functional parallelism.
Thus one estimates that 250,000 transistors (excluding on-chip cache) is optimal for performance per square mm of silicon; a modern microprocessor is around ten times this size.
Again, simplicity is optimal, but this requires parallelism.
A contrary trend is that memory dominates the use of silicon, and so performance per square mm of silicon is often not relevant.

Supercomputer Architectures in Years 2005-2010 -- I
Conventional (Distributed Shared Memory) Silicon
Clock speed 1 GHz.
4 eight-way parallel complex RISC nodes per chip.
4000 processing chips give over 100 tera(fl)ops (4 x 8 x 4000 at 1 GHz is about 128 Teraflops).
8000 2-Gigabyte DRAM chips give 16 Terabytes of memory.
Note memory per flop is much less than one to one.
Natural scaling says time steps decrease at the same rate as spatial intervals, and so the memory needed goes like (FLOPS in Gigaflops)**0.75 if one Gigaflop requires one Gigabyte of memory (or is it one Teraflop that needs one Terabyte?).
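Spelling out this scaling argument (a sketch of the standard reasoning, assuming an explicit three-dimensional time-stepped simulation run for a fixed wall-clock time):

```latex
% Sketch of the natural-scaling argument for memory versus flops.
% Assume a 3D explicit code with N grid points in each dimension.
\[
\text{Memory} \propto N^{3}, \qquad
\Delta t \propto \Delta x \;\Rightarrow\; \text{time steps} \propto N, \qquad
\text{Flops} \propto N^{3}\times N = N^{4}.
\]
\[
\Rightarrow\; \text{Memory} \propto \text{Flops}^{3/4}.
\]
% With the normalization that 1 Gigaflop needs 1 Gigabyte:
\[
1\ \text{Teraflop} \;\rightarrow\; (10^{3})^{3/4}\ \text{Gigabytes} \approx 180\ \text{Gigabytes},
\]
% which is why memory per flop on these designs is well below one to one.
```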
Supercomputer Architectures in Years 2005-2010 -- II
Superconducting technology is promising, but can it compete with the silicon juggernaut?
One should be able to build a simple 200 GHz superconducting CPU with modest superconducting caches (around 32 Kilobytes).
It must use the same DRAM technology as a silicon CPU?
So there is a tremendous challenge to build latency-tolerant algorithms (as there is over a factor of 100 difference between CPU and memory speed), but the advantage is that a factor of 30-100 less parallelism is needed.

Supercomputer Architectures in Years 2005-2010 -- III
Processor-in-Memory (PIM) architecture is the follow-on to the J-Machine (MIT), Execube (IBM -- Peter Kogge) and Mosaic (Seitz).
It is more interesting in 2007, as the processors become "real" and have a nontrivial amount of memory.
One naturally fetches a complete row (or column) of memory at each access -- perhaps 1024 bits.
In the year 2007 one could take each two-Gigabyte memory chip and alternatively build it as a mosaic of one Gigabyte of memory plus 1000 simple 250,000-transistor CPUs, each running at 1 Gigaflop and each with one Megabyte of on-chip memory.
12000 such chips (the same amount of silicon as in the first design, but perhaps more power) give 12 Terabytes of memory and 12 Petaflops of performance.
This design "extrapolates" specialized DSPs, the GRAPE (a specialized teraflop N-body machine), etc. to a "somewhat specialized" system with a general CPU but a special memory-poor architecture with a particular 2/3D layout.

Comparison of Supercomputer Architectures
Fixing 10-20 Terabytes of memory, we can get:
A 16000-way parallel natural evolution of today's machines, with various architectures from distributed shared memory to a clustered hierarchy; peak performance is 150 Teraflops, with memory systems like today's but worse, with more levels of cache.
A 5000-way parallel superconducting system with 1 Petaflop performance but a terrible imbalance between CPU and memory speeds.
A 12-million-way parallel PIM system with 12 Petaflops performance and a "distributed memory architecture", as off-chip access will have serious penalties.
There are many hybrid and intermediate choices -- these are extreme examples of "pure" architectures.

Algorithm and Software Challenges -- The Latency Agenda!
Current tightly coupled MPPs offer more or less uniform access to off-processor memory, albeit with serious degradation relative to local memory.
Future systems will return to the situation of the 80's, where both data locality and nearest-neighbor access will be essential for good performance.
There is a tremendous reward for latency-tolerant algorithms and for software support for them.
Note we need exactly the same characteristics in metacomputing today.
The World Wide Web is good practice for tomorrow's Petaflop machine!
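As an illustration of the latency agenda, here is a minimal, hypothetical Java sketch (not any particular NPAC code) of a latency-tolerant pattern: issue the slow "remote" boundary fetch first, overlap it with useful local computation, and synchronize only when the remote data is actually needed. The fetchBoundary method and its 50 ms delay are invented stand-ins for a high-latency network or memory access.

```java
// A latency-tolerance sketch: overlap a slow "remote" boundary fetch with
// local interior computation, synchronizing only when the data is needed.
public class LatencyHiding {

    // Stand-in for a high-latency access (network hop, deep memory hierarchy).
    static double[] fetchBoundary() throws InterruptedException {
        Thread.sleep(50);                     // pretend latency
        return new double[] { 1.0, 1.0 };     // two ghost values
    }

    public static void main(String[] args) throws Exception {
        double[] interior = new double[1_000_000];
        final double[][] boundary = new double[1][];

        // Issue the high-latency fetch asynchronously ...
        Thread fetcher = new Thread(() -> {
            try {
                boundary[0] = fetchBoundary();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        fetcher.start();

        // ... and do useful local work while the fetch is in flight.
        for (int i = 1; i < interior.length - 1; i++) {
            interior[i] = 0.5 * (interior[i] + 1.0);
        }

        fetcher.join();                       // wait only when the data is needed
        interior[0] = boundary[0][0];
        interior[interior.length - 1] = boundary[0][1];
        System.out.println("Interior update overlapped with boundary fetch.");
    }
}
```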
Returning to Today - I
Tightly coupled MPPs (SP2, Paragon, CM5, etc.) were distributed memory, but at least at the low end they are becoming hardware-assisted shared memory.
It is unclear how well compilers will support this in a scaling fashion -- we will see how the SGI/Cray systems based on ideas pioneered at Stanford fare!
Note this is an example of COTS at work -- SGI/Sun/.. symmetric multiprocessors (the Power Challenge from SGI) are attractive, as the bus will support up to 16 processors in an elegant shared-memory software world.
Previously such systems were not powerful enough to be interesting.
Clustering such SGI Power Challenge-like systems produces a powerful but difficult-to-program (as it is both distributed and shared memory) heterogeneous system.
Meanwhile Tera Computer will offer true uniform-memory-access shared memory, using ingenious multithreaded software/hardware to hide latency.
It is unclear if this is competitive in cost/performance with the (scruffy) COTS approach.

Returning to Today - II
Trend I -- Hardware-assisted tightly coupled shared-memory MPPs are replacing pure distributed-memory systems.
Trend II -- The World Wide Web and the increasing power of individual workstations are making geographically distributed heterogeneous distributed-memory systems more attractive.
Trend III -- To confuse the issue, the technology trends of the next ten years suggest yet different architecture(s), such as PIM.
Better use scalable, portable software with such conflicting technology/architecture trends, BUT one must address the latency agenda, which isn't clearly portable!

Software Issues/Choices - I
Functional parallelism is supported by the sequential compiler intra-chip in modern superscalar processors, at very small grain size.
Multithreading, as in Java, supports object or small-grain-size functional parallelism within a Java applet -- this can be implemented with message passing, not shared memory.
Data parallelism is the source of most scaling parallelism and is needed for more than a few-way speedup -- HPF is emerging slowly, as hardware/technology/politics (grand versus national challenges) change faster than one can write good complete compilers.
Unfortunately most languages find it hard to express data parallelism in a natural, automatic way, even though it is "obvious in the problem".
One must use domain decomposition with target-architecture-dependent "objects", and so it is inevitably harder than pure object-based parallelism, where the objects come from the problem!

The Sad Story of HPF and Some Applications
The good news -- in 1993, NPAC was selected to partner with two Grand Challenge groups as part of their "computer science" support; both applications were rather suitable for HPF, being regular grids.
The bad news -- both application groups have abandoned HPF, as they couldn't wait for working compilers with appropriate features. But BOTH have switched to using Fortran90!!
Numerical relativity needed an adaptive mesh, and this is supplied by DAGH from Texas, which is like HPF but simpler in overall capability, though it can handle adaptive meshes. One needs to wait for new features in DAGH to handle, say, conjugate gradient, which is automatic in HPF, and one needs to parallelize sophisticated data types (tensors) in Fortran90.
In data assimilation, a NASA Goddard application, they are recoding in Fortran90 plus MPI, as they MUST meet specific performance goals within a year.

Software Issues/Choices - II
MetaProblems exhibit large-grain-size functional parallelism, as in multidisciplinary optimization.
Coordination or integration software addresses this. See SCCS-757 (available from the online NPAC reports), which discusses applications and software approaches for all types of parallelism.
The Java applet model is a natural approach, which can be illustrated by the new VRML 2.0. This will build on the success of SIMNET/DSI -- distributed simulation for DoD -- with a totally distributed set of applets invoked by handlers (interrupts = active messages), loosely integrated over the World Wide Web: a worldwide object parallelism. For example, pressing a door knob invokes a Java applet, as does a missile hitting a tank.
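To make the coordination idea concrete, here is a minimal, hypothetical Java sketch of integration software for a metaproblem: two large-grain "discipline" modules run concurrently, a handler fires as each result arrives (loosely in the spirit of the applet-plus-handler model above), and the coordinator combines the results. The module names, numbers and objective function are invented for illustration and do not correspond to any actual NPAC system.

```java
import java.util.concurrent.*;

// Coordination sketch for a metaproblem (e.g. multidisciplinary optimization):
// independent large-grain modules run in parallel and are loosely integrated
// through completion handlers.
public class MetaProblemCoordinator {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // Two independent discipline modules (stand-ins for real solvers).
        CompletableFuture<Double> aero = CompletableFuture.supplyAsync(
                () -> 0.42 /* drag coefficient from a flow solver */, pool);
        CompletableFuture<Double> structures = CompletableFuture.supplyAsync(
                () -> 1.70 /* peak stress from a structures solver */, pool);

        // Handlers invoked as each module completes -- event-driven coupling.
        aero.thenAccept(d -> System.out.println("aero module done, drag = " + d));
        structures.thenAccept(s -> System.out.println("structures module done, stress = " + s));

        // The coordinator combines both results to drive an outer optimizer.
        double objective = aero.thenCombine(structures, (d, s) -> d + 0.1 * s).get();
        System.out.println("combined objective = " + objective);

        pool.shutdown();
    }
}
```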
Software Issues/Choices - III
Architectures such as PIM emphasize the opportunities available if we could develop software models/compilers which could effectively use substantial parallelism.
Current party-line microprocessors assume that substantial parallelism cannot be easily extracted from programs designed for single chips.
One area we are exploring is how to extract parallelism from, or build it into, Java; we call this the HPJava project (a thread-based sketch of this style of data parallelism is given at the end of this section).
A major new direction in computer science is "Problem Solving Environments", which are domain-specific systems at a "higher level" than compilers -- a "toolkit" of components such as:
Linear algebra solvers
Elliptic or hyperbolic PDE solvers
Particular (adaptive) data structures, such as those for multigrid
Visualization, etc.

A Summary of Application Working Group Activities at PAWS'96, including Point Group and Software Group Interactions
http://www.npac.syr.edu/users/gcf/petasoft96/index.html
Petasoft Meeting, Bodega Bay, June 17, 1996
Geoffrey Fox, NPAC, Syracuse University, 111 College Place, Syracuse NY 13244-4100

Application Participants Included
Geoffrey Fox -- Syracuse University
David Bailey -- NASA Ames
Sid Karin -- UCSD
Jean-Noel Leboeuf -- Oak Ridge National Laboratory
Peter Lyster -- University of Maryland
Andrea Malagoli -- University of Chicago
Michael McGlaun -- Sandia National Laboratory
John Salmon -- California Institute of Technology
Rich Zippel -- Cornell

Peak Supercomputer Performance
For "Conventional" MPP/Distributed Shared Memory Architectures:
Now (1996), peak is 0.1 to 0.2 Teraflops in production centers.
Note both SGI and IBM are changing architectures: IBM from distributed memory to distributed shared memory; SGI from shared memory to distributed shared memory.
In 1999, one will see production 1 Teraflop systems.
In 2003, one will see production 10 Teraflop systems.
In 2007, one will see production 50-100 Teraflop systems.
Memory is roughly 0.25 to 1 Terabyte per 1 Teraflop.
If you are lucky/work hard, realized performance is 30% of peak.
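Returning to the data-parallelism and HPJava discussion above (as promised there), here is a minimal, hypothetical Java sketch of the block domain decomposition that an HPF-style compiler automates: each thread owns a contiguous block of an array and applies the same local operation to its block. The class, array and update are invented for illustration and are not the actual HPJava design.

```java
// Data parallelism by block domain decomposition: each worker thread owns a
// contiguous block of the array and applies the same operation to it.
public class BlockDecomposition {
    public static void main(String[] args) throws InterruptedException {
        final int n = 1_000_000;
        final int nproc = 4;                      // number of "processors"
        final double[] u = new double[n];
        Thread[] workers = new Thread[nproc];

        for (int p = 0; p < nproc; p++) {
            final int lo = p * n / nproc;         // owned block is [lo, hi)
            final int hi = (p + 1) * n / nproc;
            workers[p] = new Thread(() -> {
                for (int i = lo; i < hi; i++) {
                    u[i] = Math.sin(i * 1.0e-6);  // same operation on every owned element
                }
            });
            workers[p].start();
        }
        for (Thread t : workers) {
            t.join();                             // barrier: wait for all blocks
        }
        System.out.println("u[n/2] = " + u[n / 2]);
    }
}
```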