Given by Geoffrey Fox and Peter Kogge at Trip to China on July 12-28, 1996. Foils prepared July 6, 1996
Abstract * Foil Index for this file
This describes some aspects of a national study of the future of HPCC which started with a meeting in February 1994 at Pasadena |
The SIA (Semiconductor Industry Association) projections are used to define feasible memory and CPU scenarios |
We describe hardware architecture with Superconducting and PIM (Processor in Memory) possibilities for the CPU and optics for the interconnect |
The Software situation is captured by notes from a working group at June 96 Bodega Bay meeting |
The role of new algorithms is expected to be very important |
This table of Contents
Abstract
http://www.npac.syr.edu/users/gcf/hpcc96petaflop/index.html |
Presented during Trip to China July 12-28, 1996 |
Uses material from Peter Kogge -- Notre Dame |
Geoffrey Fox |
NPAC |
Syracuse University |
111 College Place |
Syracuse NY 13244-4100 |
Feb. '94: Pasadena Workshop on Enabling Technologies for Petaflops Computing Systems |
March '95: Petaflops Workshop at Frontiers '95 |
Aug. '95: Bodega Bay Workshop on Applications |
PETA online: http://cesdis.gsfc.nasa.gov/petaflops/peta.html |
Jan. '96: NSF Call for 100 TF "Point Designs" |
April '96: Oxnard Petaflops Architecture Workshop (PAWS) on Architectures |
June '96: Bodega Bay Petaflops Workshop on System Software |
Oct. '96: Workshop at Frontiers '96 |
I find the study interesting not only in its results but also in its methodology of several intense workshops combined with general discussions at national conferences |
Exotic technologies such as "DNA Computing" and Quantum Computing do not seem relevant on this timescale |
Note clock speeds will NOT improve much in the future but density of chips will continue to improve at roughly the current exponential rate over the next 10-20 years |
Superconducting technology is currently seriously limited by the lack of a memory technology that matches its factor of 100-1000 faster CPU processing |
The current project views software as perhaps the hardest problem |
All proposed designs have VERY deep memory hierarchies which are a challenge to algorithms, compilers and even communication subsystems |
The major need for high-end performance computers comes from government (both civilian and military) applications |
Government must develop systems using commercial suppliers but can NOT rely on traditional industry applications to motivate them |
So currently the Petaflop initiative is thought of as an applied development project, whereas HPCC was mainly a research endeavour |
Why does one need a petaflop (10^15 operations per second) computer? |
These are problems where quite viscous liquids (oil, pollutants) percolate through the ground |
Very sensitive to details of material |
The most important problems are already solved at some level, but most solutions are insufficient and need improvement in various respects: |
Oil Reservoir Simulation |
Geological variation occurs down to the pore size of the rock - almost 10^-6 metres - model this (statistically) |
Want to calculate flow between wells which are about 400 metres apart |
10^3 x 10^3 x 10^2 = 10^8 grid elements |
30 species |
10^4 time steps |
300 separate cases need to be considered |
3x10^9 words of memory per case |
10^12 words total if all cases considered in parallel |
10^19 floating point operations |
3 hours on a petaflop computer |
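To make the arithmetic behind these figures explicit, here is a minimal sketch; the roughly 1000 floating point operations per unknown per time step is an assumed cost chosen for illustration, not a number from the foils.

```python
# Rough arithmetic behind the oil reservoir estimate (grid, species, steps
# and cases are from the foil; flops_per_unknown_step is an assumption).
grid_elements = 1e3 * 1e3 * 1e2          # 10^8 grid elements
species       = 30
time_steps    = 1e4
cases         = 300
flops_per_unknown_step = 1000            # assumed cost of the local update

words_per_case  = grid_elements * species            # ~3e9 words of memory
words_all_cases = words_per_case * cases             # ~1e12 words in parallel
total_flops = words_per_case * time_steps * cases * flops_per_unknown_step  # ~1e19

petaflop = 1e15                                      # operations per second
hours = total_flops / petaflop / 3600
print(f"memory/case ~ {words_per_case:.1e} words, total ~ {words_all_cases:.1e} words")
print(f"work ~ {total_flops:.1e} flops -> {hours:.1f} hours at 1 Pflop/s")
```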
For "Conventional" MPP/Distributed Shared Memory Architecture |
Now (1996) Peak is 0.1 to 0.2 Teraflops in Production Centers |
In 1999, one will see production 1 Teraflop systems |
In 2003, one will see production 10 Teraflop Systems |
In 2007, one will see production 50-100 Teraflop Systems |
Memory is Roughly 0.25 to 1 Terabyte per 1 Teraflop |
If you are lucky/work hard: Realized performance is 30% of Peak |
Extrapolated from SIA Projections to year 2007 -- See Chapter 6 of Petaflops Report -- July 94 |
Huge Problem Size |
Real Time Computations |
Indirect Needs for Petaflops technology |
Conventional (Distributed Shared Memory) Silicon |
Note Memory per Flop is much less than one to one |
Natural scaling says time steps decrease at the same rate as spatial intervals, and so the memory needed goes like (FLOPS in Gigaflops)^0.75 |
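A minimal sketch of this scaling rule, assuming a 3D time-dependent problem (compute ~ N^4 while memory ~ N^3) and an illustrative normalization of about 0.5 GB at 1 Gflop/s; the baseline is my assumption, not a number from the study.

```python
# "Natural scaling" sketch: a 3D time-dependent problem has ~N^3 grid points
# and ~N time steps, so compute ~ N^4 while memory ~ N^3, i.e.
# memory ~ compute**0.75.  The 0.5 GB at 1 Gflop/s baseline is assumed.
def memory_terabytes(flops, words_at_1_gigaflop=0.0625e9):
    gigaflops = flops / 1e9
    words = words_at_1_gigaflop * gigaflops ** 0.75
    return words * 8 / 1e12            # 8-byte words -> terabytes

for f in (1e12, 1e14, 1e15):           # 1 Tflop/s, 100 Tflop/s, 1 Pflop/s
    print(f"{f:.0e} flop/s -> ~{memory_terabytes(f):.2f} TB")
```

With this normalization a petaflop machine needs of order 16 TB, in the same range as the 10-20 Terabytes of memory used later in the talk.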
Superconducting Technology is promising but can it compete with the silicon juggernaut? |
Should be able to build a simple 200 GHz Superconducting CPU with modest superconducting caches (around 32 Kilobytes) |
Must use same DRAM technology as for silicon CPU ? |
So there is a tremendous challenge to build latency tolerant algorithms (as there is over a factor of 100 difference in CPU and memory speed), but the advantage is a factor of 30-100 less parallelism needed |
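A back-of-envelope illustration of why this is such a challenge; the DRAM rate, the one-flop-per-cycle assumption and the blocked-kernel reuse model are all illustrative assumptions, not figures from the study.

```python
# Rough view of latency tolerance for a 200 GHz CPU fed from conventional DRAM.
import math

cpu_hz           = 200e9      # superconducting CPU clock
flops_per_cycle  = 1          # assume one operation per cycle
dram_words_per_s = 2e9        # assumed sustained DRAM rate in words/s

reuse_needed = cpu_hz * flops_per_cycle / dram_words_per_s
print(f"Need ~{reuse_needed:.0f} flops per word of DRAM traffic to keep the CPU busy")

# A blocked (tiled) kernel such as matrix multiply does order-b flops per
# word moved when b x b tiles stay in cache; the largest b for which three
# tiles of 8-byte words fit in the 32 KB superconducting cache:
cache_bytes = 32 * 1024
b_max = int(math.sqrt(cache_bytes / (3 * 8)))
print(f"A 32 KB cache supports tiles of size b ~ {b_max}, i.e. only order-{b_max} flops per word")
```

Under these assumptions the cache-resident reuse falls well short of the required ~100 flops per word, which is exactly the gap latency tolerant algorithms must close.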
Processor in Memory (PIM) Architecture is a follow on to the J machine (MIT), Execube (IBM -- Peter Kogge) and Mosaic (Seitz) |
One could take each year-2007 two gigabyte memory chip and alternatively build it as a mosaic of processor-plus-memory macros |
12000 such chips (the same amount of Silicon as in the first design but perhaps more power) gives: |
Chip            EXECUBE           AD SHARC       TI MVP             MIT MAP          Terasys PIM
First Silicon   1993              1994           1994               1996             1993
Storage         0.5 MB            0.5 MB         0.05 MB            0.128 MB         0.016 MB
Peak            50 Mips           120 Mflops     2000 Mops          800 Mflops       625 M bit ops
MB/Perf.        0.01 MB/Mip       0.005 MB/MF    0.000025 MB/Mop    0.00016 MB/MF    0.000026 MB/bit op
Organization    16-bit            Single CPU     1 CPU, 4 DSP's     4 Superscalar    1024 16-bit ALU's
                SIMD/MIMD CMOS    and Memory                        CPU's
Charts: MB per cm^2, MF per cm^2, and MB/MF ratios |
Fixing 10-20 Terabytes of Memory, we can get |
A 16000-way parallel natural evolution of today's machines with various architectures from distributed shared memory to clustered hierarchy |
A 5000-way parallel Superconducting system with 1 Petaflop performance but a terrible imbalance between CPU and memory speeds |
A 12 million-way parallel PIM system with 12 petaflop performance and a "distributed memory architecture", as off-chip access will have serious penalties (rough arithmetic for these extremes is sketched after this list) |
There are many hybrid and intermediate choices -- these are extreme examples of "pure" architectures |
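The rough arithmetic behind these three extremes, under stated assumptions: the per-CPU superconducting rate and the number and speed of PIM nodes per chip are illustrative values chosen only to reproduce the totals quoted above, not figures from the foils.

```python
# Rough arithmetic behind the three "pure" year-2007 designs, fixing
# 10-20 TB of memory built from 2 GB memory chips.
total_memory_tb = 16
chips = total_memory_tb * 1024 / 2          # 2 GB per chip -> ~8000 chips
print(f"~{chips:.0f} two-gigabyte chips hold {total_memory_tb} TB")

# Conventional DSM: ~16000-way parallel evolution of today's MPPs
print("Conventional DSM: ~16000 processors")

# Superconducting: 5000 CPUs at an assumed 200 Gflop/s each -> ~1 Pflop/s,
# with only DRAM-speed memory behind each CPU
print(f"Superconducting: ~{5000 * 200e9 / 1e15:.0f} Pflop/s from 5000 CPUs")

# PIM: ~12000 chips, each split into an assumed 1000 processor+memory nodes
# at ~1 Gflop/s -> ~12 million-way parallelism and ~12 Pflop/s
pim_nodes = 12000 * 1000
print(f"PIM: {pim_nodes:,} nodes, ~{pim_nodes * 1e9 / 1e15:.0f} Pflop/s")
```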
Current tightly coupled MPP's offer more or less uniform (if seriously degraded) access to off-processor memory |
Future systems will return to the situation of the 80's, where both data locality and nearest neighbor access will be essential for good performance |
Tremendous reward for Latency Tolerant algorithms and software support for this |
Note we need exactly the same characteristics in MetaComputing today |
World Wide Web is a good practice for Tomorrow's PetaFlop Machine! |
Aspects of the rapid march to Pflop/s systems need to be decoupled from long-term R&D in architectures, software and high-level programming models. |
Designs need to expose the low-level execution model to the programmer. |
Access to memory is the key design consideration for Pflop/s systems. |
Enhanced message passing facilities are needed, including support for latency management, visualization and performance tools. |
Research is needed in novel algorithms that explore a wider range of the latency/bandwidth/memory space. |
Define or speculate on novel applications that may be enabled by Pflops technology. |
Engage several broader application communities to identify and quantify the merits of the various Pflop/s designs. |
Need to develop a "common" set of applications with which to drive and evaluate various hardware and software approaches |
Superconducting distributed memory |
Distributed shared memory |
A more aggressive PIM design |
Latency tolerant algorithms will be essential |
Hardware mechanisms for hiding and/or tolerating latency. |
Language facilities (such as programmer-directed prefetching) for managing latency (see the sketch after this list). |
Need a "hierarchy little language" to describe latency structure |
Software tools, such as visualization facilities, for measuring and managing latency. |
Performance studies of distributed SMPs to study latency issues. |
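As a concrete illustration of the programmer-directed prefetching item above, here is a minimal sketch of software-managed latency hiding: the next block of data is fetched on a helper thread while the current block is processed. fetch_block and process_block are hypothetical stand-ins for a slow memory level and a user kernel, not any particular machine interface.

```python
# Double-buffered prefetching: overlap the fetch of block i+1 with the
# processing of block i.
from concurrent.futures import ThreadPoolExecutor

def fetch_block(i):
    """Stand-in for a high-latency fetch from a distant memory level."""
    return list(range(i * 1000, (i + 1) * 1000))

def process_block(block):
    return sum(block)

def process_all(num_blocks):
    total = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_block, 0)              # prefetch block 0
        for i in range(num_blocks):
            block = pending.result()                       # wait only if the fetch is late
            if i + 1 < num_blocks:
                pending = pool.submit(fetch_block, i + 1)  # prefetch the next block
            total += process_block(block)                  # compute overlaps the fetch
    return total

print(process_all(8))
```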
On some original MPP's (e.g. hypercubes) one worried about the nearest neighbor structure of algorithms so one could match the geometric structure of computers and problems. |
This became unnecessary and all that matters on most machines today is data locality |
The year 2005 machines seem different, as machine geometry becomes relevant, especially in PIM designs, with physical transit time important as clock speeds increase. |
Today we see the standard trade-off between managing a deep memory hierarchy and managing distributed memory: |
Most people believe that it is easier to manage hierarchy (in an "automatic" fashion as in a compiler) than it is to manage distributed memory |
PIM gives us a classic distributed memory system as the base design, whereas most other proposals (including Superconducting) give a shared memory as the base design with the hierarchy as the issue. |
It is difficult to compare the various point designs, because the application targets of each individual proposal are so different. |
Research is thus needed to analyze the suitability of these systems for a handful of common grand challenge problems. |
Result: a set of quantitative requirements for various classes of systems. |
We made a start on this with a set of four "Petaflop kernels" |
The joint meeting of Application and Software Groups listed issues |
Two classes of Problem |
Need to prefetch irregular data |
Move regular and irregular data between levels of memory |
The Compiler User Interface needs hints and directives |
These need a "Hierarchy Little Language" to express the memory hierarchy at all levels |
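Purely as a sketch of what such a "Hierarchy Little Language" might express, here is a hypothetical declarative description of memory levels that compilers and tools could consume; the level names, capacities, latencies and bandwidths are invented for illustration.

```python
# Hypothetical machine description: one record per memory level.
memory_hierarchy = [
    {"level": "registers",   "capacity_bytes": 2 * 1024,     "latency_cycles": 1,     "bandwidth_gb_s": 400},
    {"level": "L1 cache",    "capacity_bytes": 32 * 1024,    "latency_cycles": 2,     "bandwidth_gb_s": 200},
    {"level": "L2 cache",    "capacity_bytes": 4 * 1024**2,  "latency_cycles": 20,    "bandwidth_gb_s": 50},
    {"level": "local DRAM",  "capacity_bytes": 2 * 1024**3,  "latency_cycles": 400,   "bandwidth_gb_s": 4},
    {"level": "remote node", "capacity_bytes": 16 * 1024**4, "latency_cycles": 20000, "bandwidth_gb_s": 0.5},
]

def deepest_level_holding(bytes_needed):
    """Pick the fastest level whose capacity covers a given working set."""
    for lvl in memory_hierarchy:
        if lvl["capacity_bytes"] >= bytes_needed:
            return lvl["level"]
    return "out of core"

print(deepest_level_holding(100 * 1024))   # -> "L2 cache"
```

A compiler, runtime or performance tool could read such a description to decide block sizes, prefetch distances and data placement, which is the role the foil assigns to the little language.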
Need (to investigate) Threads and Parallel Compilers for latency tolerance |
Need Latency Tolerant BLAS and higher level Capabilities |
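A minimal sketch of the idea behind latency tolerant BLAS, assuming NumPy only for the per-tile products: a tiled matrix multiply performs order-b flops per word moved, so working on cache-resident b x b tiles hides slow memory. This illustrates the principle, not a proposed library interface.

```python
# Blocked (tiled) matrix multiply: each loaded tile is reused b times,
# raising the flops-per-word ratio and tolerating memory latency.
import numpy as np

def blocked_matmul(A, B, b=64):
    n = A.shape[0]                      # assume n is a multiple of b
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

n = 256
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(blocked_matmul(A, B), A @ B)
```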
Performance should be monitored (with no software overhead) in hardware |
Resource Management in presence of complex memory hierarchy |
Need to study separately the issues for |
As always in year 2005 need latency-tolerant Operating Systems! |
Unclear message on need for Protection |
Need support for multithreading and sophisticated scheduling |
Scaling of Concurrent I/O ? |
Message Passing |
Need to be able to restart separate parts of the (large) hardware independently |
Need to support performance evaluation with different levels of memory hierarchy exposed |
Need Storage management and garbage collection |
Need good job and resource management tools |
Checkpoint/restart only needed at natural synchronization points |
June 17-18 1996 |
Geoffrey Fox, Chair |
Andrew Chien, Vice-Chair |
Define a "clean" model for machine architecture |
Define a low level "Program Execution Model" (PEM) which allows one to describe movement of information and computation in the machine |
On top of the low level PEM, one can build a hierarchical (layered) software model |
One can program at each layer of the software and augment it by "escaping" to a lower level to improve performance |
One needs distributed and shared memory constructs in the PEM |
One should look at extending HPF directives to refer to memory hierarchy |
It is interesting to look at adding directives to high level software systems such as those based on objects |
One needs (performance) predictability in the lowest level PEM |
One needs layered software tools to match layered execution software |
It is possible that support of existing software (teraApps) may not be the emphasis |
MPI represents the structure of machines (such as the original Caltech Hypercube) with just two levels of memory |
This addresses the intra-processor memory hierarchy as well as inter-processor data movement |
Figure: 5 Memory Hierarchy levels in each Processor (Level 1 Cache, Level 2 Cache, ...) with Data Movement between them |
1) Deep Memory Hierarchies present New Challenges to High performance Implementation of programs |
2) There are two dimensions of memory hierarchy management |
3) One needs a machine "mode" which supports a predictable and controllable memory system, leading to communication and computation with the same characteristics |
4) One needs a low level software layer which provides direct control of the machine (memory hierarchy etc.) by a user program |
5) One needs a layered (hierarchical) software model which supports an efficient use of multiple levels of abstraction in a single program. |
6) One needs a set of software tools which match the layered software (programming model) |
This is not really a simple stack but a set of complex relations between layers with many interfaces and modules |
Interfaces are critical (for composition across layers) |
The layered software model (figure): higher level abstractions nearer to the application domain enable the next 10000 users, while increasing machine detail, control and management serve the first 100 pioneer users. The layers, from high level to low level, are: |
Application Specific Problem Solving Environment |
Coarse Grain Coordination Layer e.g. AVS |
Massively Parallel Modules -- such as DAGH HPF F77 C HPC++ |
Fortran or C plus generic message passing (get, put) and generic memory hierarchy and locality control |
Assembly Language plus specific (to architecture) data movement, shared memory and cache control |
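A toy sketch of programming at one layer and "escaping" to a lower one (point 5 and the layer figure above): a high-level operation has a portable default that a pioneer user can later override with a machine-specific kernel. The registry mechanism and the layer names used here are invented for illustration.

```python
# Layered implementations of one operation, with "escape to a lower layer"
# expressed as registering a more machine-specific kernel.
_kernels = {}

def register(name, layer):
    """Register an implementation of `name` provided at a given layer."""
    def wrap(fn):
        _kernels.setdefault(name, {})[layer] = fn
        return fn
    return wrap

def call(name, *args, prefer=("assembly-level", "message-passing", "hpf-module")):
    """Use the lowest available layer in preference order."""
    impls = _kernels.get(name, {})
    for layer in prefer:
        if layer in impls:
            return impls[layer](*args)
    raise NotImplementedError(name)

@register("dot", "hpf-module")             # portable high-level version
def dot_portable(x, y):
    return sum(a * b for a, b in zip(x, y))

# A pioneer user could later add @register("dot", "assembly-level") with a
# hand-tuned, architecture-specific kernel -- no change at the call sites.
print(call("dot", [1, 2, 3], [4, 5, 6]))   # -> 32
```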
7) One needs hardware systems which can be used by system software developers to support software prototyping using emulation and simulation of proposed petacomputers. This will evaluate scaling of the software and reduce risk |
8) One needs a benchmark set of applications which can be used to evaluate the candidate petacomputer architectures and their software |
9) One needs to analyse carefully both the benchmark and other petaApplications to derive requirements on the performance and capacity of essential subsystems (I/O, communication, computation including interlayer issues in the hierarchy) of the petacomputer |
10) The PetaComputer should be designed by a cross-disciplinary team including software, application and hardware expertise |
11) For the initial "pioneer" users, portability and tool/systems software robustness will be less important concerns than performance |
12) The broad set of users may require efficient support of current (by then legacy) programming interfaces such as MPI HPF HPC++ |
13) The petaComputer offers an opportunity to explore a new software model unconstrained by legacy interfaces and codes |
14) Fault Resilience is essential in a large scale petaComputer |