HPCC Futures Topic 2: A Possible PetaFlop Initiative

Given by Geoffrey Fox and Peter Kogge on a trip to China, July 12-28, 1996. Foils prepared July 6, 1996

This describes some aspects of a national study of the future of HPCC, which started with a meeting in February 1994 at Pasadena
The SIA (Semiconductor Industry Association) projections are used to define feasible memory and CPU scenarios
We describe hardware architectures with superconducting and PIM (Processor in Memory) possibilities for the CPU and optics for the interconnect
The software situation is captured by notes from a working group at the June 96 Bodega Bay meeting
The role of new algorithms is expected to be very important

Table of Contents for full HTML of HPCC Futures Topic 2: A Possible PetaFlop Initiative


1 Status of "Classic" HPCC -- June 1996
Futures-2: Petaflops and Real Software in 2007?

2 Abstract of HPCC Futures 2: PetaFlop in 2007!
3 Petaflop Chain of Events
4 Overall Remarks on the March to PetaFlops - I
5 Overall Remarks on the March to PetaFlops - II
6 Petaflop Performance for Flow in Porous Media?
7 Target Flow in Porous Media Problem (Glimm - Petaflop Workshop)
8 NASA's Projection of Memory and Computational Requirements up to Petaflops for Aerospace Applications
9 Peak Supercomputer Performance
10 Pasadena Architectures
11 Chip Density Projections to year 2013
12 Clock Speed and I/O Speed in megabytes/sec per pin through year 2013
13 PetaFlops Applications
14 Supercomputer Architectures in Years 2005-2010 -- I
15 Supercomputer Architectures in Years 2005-2010 -- II
16 Supercomputer Architectures in Years 2005-2010 -- III
17 Current PIM Chips
18 New "Strawman" PIM Processing Node Macro
19 "Strawman" Chip Floorplan
20 SIA-Based PIM Chip Projections
21 Comparison of Supercomputer Architectures
22 Algorithm and Software Challenges -- The Latency Agenda!
23 Overall Suggestions -- I
24 Overall Suggestions - II
25 Other Suggested Point Designs
26 Latency Research Is Needed
27 Geometric Structure of Problems and Computers
28 Memory Hierarchy versus Distribution
29 Needed Algorithm/Application Evaluations
30 Application Oriented Software Issues -- April 24,1996
31 Language Related Issues
32 Library and Tool Issues
33 Operating System Issues - I
34 Operating System Issues - II
35 "Initial" Findings of the "Implementation" Subgroup at PetaSoft 96
36 Initial Thoughts I
37 Initial Thoughts II
38 The MPI Program Execution Model
39 The PetaSoft Program Execution Model
40 Findings 1) and 2) -- Memory Hierarchy
41 Findings 3) and 4) -- Using Memory Hierarchy
42 Findings 5) and 6) -- Layered Software
43 The Layered Software Model
44 Some Examples of a Layered Software System
45 Finding 7) Testbeds
46 Findings 8) and 9) Applications
47 Findings 10) to 14) General Points




Foil 1 Status of "Classic" HPCC -- June 1996
Futures-2: Petaflops and Real Software in 2007?

http://www.npac.syr.edu/users/gcf/hpcc96petaflop/index.html
Presented during a trip to China, July 12-28, 1996
Uses material from Peter Kogge -- Notre Dame
Geoffrey Fox
NPAC
Syracuse University
111 College Place
Syracuse NY 13244-4100

Foil 2 Abstract of HPCC Futures 2: PetaFlop in 2007!

This describes some aspects of a national study of the future of HPCC, which started with a meeting in February 1994 at Pasadena
The SIA (Semiconductor Industry Association) projections are used to define feasible memory and CPU scenarios
We describe hardware architectures with superconducting and PIM (Processor in Memory) possibilities for the CPU and optics for the interconnect
The software situation is captured by notes from a working group at the June 96 Bodega Bay meeting
The role of new algorithms is expected to be very important

Foil 3 Petaflop Chain of Events

Feb. '94: Pasadena Workshop on Enabling Technologies for Petaflops Computing Systems
March '95: Petaflops Workshop at Frontiers '95
Aug. '95: Bodega Bay Workshop on Applications
PETA online: http://cesdis.gsfc.nasa.gov/petaflops/peta.html
Jan. '96: NSF Call for 100 TF "Point Designs"
April '96: Oxnard Petaflops Architecture Workshop (PAWS) on Architectures
June '96: Bodega Bay Petaflops Workshop on System Software
Oct. '96: Workshop at Frontiers '96

Foil 4 Overall Remarks on the March to PetaFlops - I

I find the study interesting not only in its results but also in its methodology of several intense workshops combined with general discussions at national conferences
Exotic technologies such as "DNA Computing" and Quantum Computing do not seem relevant on this timescale
Note that clock speeds will NOT improve much in the future, but the density of chips will continue to improve at roughly the current exponential rate over the next 10-20 years
Superconducting technology is currently seriously limited by the lack of an appropriate memory technology that matches the factor of 100-1000 faster CPU processing
The current project views software as perhaps the hardest problem

Foil 5 Overall Remarks on the March to PetaFlops - II

All proposed designs have VERY deep memory hierarchies which are a challenge to algorithms, compilers and even communication subsystems
The major need for high-end performance computers comes from government (both civilian and military) applications
  • DoE ASCI (study of aging of nuclear weapons) and Weather/Climate prediction are two examples
Government must develop systems using commercial suppliers but NOT rely on traditional industry applications to motivate them
So currently the Petaflop initiative is thought of as an applied development project, whereas HPCC was mainly a research endeavour

Foil 6 Petaflop Performance for Flow in Porous Media?

Why does one need a petaflop (10^15 operations per second) computer?
These are problems where quite viscous liquids (oil, pollutants) percolate through the ground
Very sensitive to details of the material
Most important problems are already solved at some level, but most solutions are insufficient and need improvement in various respects:
  • under-resolution of solution details, averaging of local variations and under-representation of physical details
  • rapid solutions to allow efficient exploration of system parameters
  • robust and automated solution, to allow integration of results in high level decision, design and control functions
  • inverse problems (history match) to reconstruct missing data require multiple solutions of the direct problem

Foil 7 Target Flow in Porous Media Problem (Glimm - Petaflop Workshop)

Oil Reservoir Simulation
Geological variation occurs down to the pore size of the rock - about 10^-6 metres - model this (statistically)
Want to calculate flow between wells which are about 400 metres apart
10^3 x 10^3 x 10^2 = 10^8 grid elements
30 species
10^4 time steps
300 separate cases need to be considered
3x10^9 words of memory per case
10^12 words total if all cases are considered in parallel
10^19 floating point operations
3 hours on a petaflop computer
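
These figures are straightforward to check. A minimal C sketch of the arithmetic (the value of roughly 10^3 flops per unknown per time step is an assumption chosen to reproduce the 10^19 total quoted above; the other numbers are taken from the foil):

    #include <stdio.h>

    int main(void) {
        double cells     = 1e3 * 1e3 * 1e2; /* 10^8 grid elements                 */
        double species   = 30.0;            /* unknowns per grid element          */
        double steps     = 1e4;             /* time steps per case                */
        double cases     = 300.0;           /* separate cases to be considered    */
        double flops_per = 1e3;             /* assumed flops per unknown per step */

        double words_per_case = cells * species;                            /* ~3x10^9 */
        double words_total    = words_per_case * cases;                     /* ~10^12  */
        double flops_total    = words_per_case * steps * cases * flops_per; /* ~10^19  */
        double hours_at_pflop = flops_total / 1e15 / 3600.0;                /* ~3 hours */

        printf("memory per case  : %.1e words\n", words_per_case);
        printf("memory, all cases: %.1e words\n", words_total);
        printf("total operations : %.1e flops\n", flops_total);
        printf("time at 1 Pflop/s: %.1f hours\n", hours_at_pflop);
        return 0;
    }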

Foil 8 NASA's Projection of Memory and Computational Requirements up to Petaflops for Aerospace Applications
(Critical information is in the foil image.)

Foil 9 Peak Supercomputer Performance

For "Convential" MPP/Distributed Shared Memory Architecture
Now(1996) Peak is 0.1 to 0.2 Teraflops in Production Centers
  • Note both SGI and IBM are changing architectures:
  • IBM Distributed Memory to Distributed Shared Memory
  • SGI Shared Memory to Distributed Shared Memory
In 1999, one will see production 1 Teraflop systems
In 2003, one will see production 10 Teraflop Systems
In 2007, one will see production 50-100 Teraflop Systems
Memory is Roughly 0.25 to 1 Terabyte per 1 Teraflop
If you are lucky/work hard: Realized performance is 30% of Peak

Foil 10 Pasadena Architectures
(Critical information is in the foil image.)

Foil 11 Chip Density Projections to year 2013
(Critical information is in the foil image.)
Extrapolated from SIA Projections to year 2007 -- See Chapter 6 of Petaflops Report -- July 94

Foil 12 Clock Speed and I/O Speed in megabytes/sec per pin through year 2013
(Critical information is in the foil image.)
Extrapolated from SIA Projections to year 2007 -- See Chapter 6 of Petaflops Report -- July 94

Foil 13 PetaFlops Applications

Huge Problem Size
  • Astrophysics, particle physics
  • Oil Reservoir, ground water modeling
  • Quantum chemical studies (e.g. of AIDS viruses)
  • Genome search problems
Real Time Computations
  • Global Weather
  • 3D Heart modeling "in the operating room"
  • Global image database queries
  • Video image fusion with virtual environments
Indirect Needs for Petaflops technology
  • Ubiquitous, keyboardless computers
  • Virtual reality, educational "what if"
  • High volume consumer applications (video games)

Foil 14 Supercomputer Architectures in Years 2005-2010 -- I

Conventional (Distributed Shared Memory) Silicon
  • Clock Speed 1 GHz
  • 4 eight-way parallel Complex RISC nodes per chip
  • 4000 Processing chips gives over 100 tera(fl)ops
  • 8000 2-Gigabyte DRAM chips give 16 Terabytes of Memory
Note Memory per Flop is much less than one to one
Natural scaling says time steps decrease at the same rate as spatial intervals, and so the memory needed goes like (FLOPS in Gigaflops)**.75
  • If One Gigaflop requires One Gigabyte of memory (Or is it one Teraflop that needs one Terabyte?)
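
A minimal C sketch of this 3/4-power rule, assuming (one reading of the bullet above) that it is normalized to 1 Gigabyte of memory per Gigaflop at the low end:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        /* performance points from 1 Gflop/s up to 1 Pflop/s, in Gigaflops */
        double gflops[] = { 1.0, 1e3, 1e5, 1e6 };

        for (int i = 0; i < 4; i++) {
            /* memory in Gigabytes under the (FLOPS in Gigaflops)**0.75 rule */
            double gbytes = pow(gflops[i], 0.75);
            printf("%10.0f Gflop/s -> %8.0f GB (%.1f TB)\n",
                   gflops[i], gbytes, gbytes / 1e3);
        }
        return 0;
    }

Under that assumption the rule gives roughly 6 Terabytes at 100 Teraflops and roughly 32 Terabytes at 1 Petaflop, the same order of magnitude as the 16 Terabytes quoted above.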

Foil 15 Supercomputer Architectures in Years 2005-2010 -- II

Superconducting Technology is promising, but can it compete with the silicon juggernaut?
One should be able to build a simple 200 GHz Superconducting CPU with modest superconducting caches (around 32 Kilobytes)
Must use the same DRAM technology as for a silicon CPU?
So there is a tremendous challenge to build latency-tolerant algorithms (as there is over a factor of 100 difference between CPU and memory speed), but with the advantage that a factor of 30-100 less parallelism is needed

Foil 16 Supercomputer Architectures in Years 2005-2010 -- III

Processor in Memory (PIM) Architecture is a follow-on to the J machine (MIT), Execube (IBM -- Peter Kogge) and Mosaic (Seitz)
  • More interesting in 2007 as processors will be "real" and have a nontrivial amount of memory
  • Naturally fetch a complete row (column) of memory at each access - perhaps 1024 bits
In the year 2007 one could take each two-gigabyte memory chip and alternatively build it as a mosaic of
  • One Gigabyte of Memory
  • 1000 simple 250,000-transistor CPU's, each running at 1 Gigaflop and each with one megabyte of on-chip memory
12000 chips (the same amount of silicon as in the first design but perhaps more power) gives:
  • 12 Terabytes of Memory
  • 12 Petaflops performance
  • This design "extrapolates" specialized DSP's, the GRAPE (specialized teraflop N-body machine) etc. to a "somewhat specialized" system with a general CPU but a special memory-poor architecture with a particular 2/3D layout
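
The aggregate figures follow directly from the per-chip numbers above; a minimal C sketch of the arithmetic:

    #include <stdio.h>

    int main(void) {
        double chips          = 12000.0;  /* PIM chips in the system     */
        double cpus_per_chip  = 1000.0;   /* simple CPU's per chip       */
        double gflop_per_cpu  = 1.0;      /* Gigaflops per CPU           */
        double gbyte_per_chip = 1.0;      /* Gigabyte of memory per chip */

        double pflops = chips * cpus_per_chip * gflop_per_cpu / 1e6; /* 12 Petaflops */
        double tbytes = chips * gbyte_per_chip / 1e3;                /* 12 Terabytes */

        printf("%.0f Petaflops, %.0f Terabytes, %.0f-way parallel\n",
               pflops, tbytes, chips * cpus_per_chip);
        return 0;
    }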

Foil 17 Current PIM Chips

Chip          First Silicon   Storage    Peak            MB/Perf.              Organization
EXECUBE       1993            0.5 MB     50 Mips         0.01 MB/Mip           16-bit SIMD/MIMD CMOS
AD SHARC      1994            0.5 MB     120 Mflops      0.005 MB/MF           Single CPU and Memory
TI MVP        1994            0.05 MB    2000 Mops       0.000025 MB/Mop       1 CPU, 4 DSP's
MIT MAP       1996            0.128 MB   800 Mflops      0.00016 MB/MF         4 Superscalar CPU's
Terasys PIM   1993            0.016 MB   625 M bit-ops   0.000026 MB/bit-op    1024 16-bit ALU's

Foil 18 New "Strawman" PIM Processing Node Macro
(Critical information is in the foil image.)

Foil 19 "Strawman" Chip Floorplan
(Critical information is in the foil image.)

Foil 20 SIA-Based PIM Chip Projections

(Charts in image: MB per cm2, MF per cm2, and MB/MF ratios.)

Foil 21 Comparison of Supercomputer Architectures

Fixing 10-20 Terabytes of Memory, we can get:
16000-way parallel natural evolution of today's machines with various architectures from distributed shared memory to clustered hierarchy
  • Peak Performance is 150 Teraflops with memory systems like today's but worse with more levels of cache
5000-way parallel Superconducting system with 1 Petaflop performance but a terrible imbalance between CPU and memory speeds
12 million-way parallel PIM system with 12 Petaflops performance and a "distributed memory architecture", as off-chip access will have serious penalties
There are many hybrid and intermediate choices -- these are extreme examples of "pure" architectures

Foil 22 Algorithm and Software Challenges -- The Latency Agenda!

Current tightly coupled MPP's offer more or less uniform access to off-processor memory, albeit with serious degradation
Future systems will return to the situation of the 80's, where both data locality and nearest-neighbor access will be essential for good performance
There is a tremendous reward for latency-tolerant algorithms and software support for this (a sketch of communication/computation overlap follows)
Note we need exactly the same characteristics in MetaComputing today
The World Wide Web is good practice for Tomorrow's PetaFlop Machine!
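
As one concrete illustration of latency tolerance (not taken from the foils), the standard MPI idiom is to start nonblocking transfers, compute on data already in hand, and only then wait. A minimal C sketch:

    #include <mpi.h>
    #include <stdio.h>

    #define N 1000000

    int main(int argc, char **argv) {
        int rank, size;
        static double halo_out[N], halo_in[N], interior[N];
        MPI_Request reqs[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int right = (rank + 1) % size, left = (rank + size - 1) % size;

        /* 1. Start the halo exchange but do not wait for it.             */
        MPI_Irecv(halo_in,  N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(halo_out, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

        /* 2. Overlap: update interior points that need no remote data.   */
        for (int i = 0; i < N; i++)
            interior[i] = 0.5 * interior[i];

        /* 3. Only now block until the remote data has arrived; halo_in   */
        /*    would then be used for the boundary update (omitted).       */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        MPI_Finalize();
        return 0;
    }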

Foil 23 Overall Suggestions -- I

Aspects of the rapid march to Pflop/s systems need to be decoupled from long-term R&D in architectures, software and high-level programming models.
Designs need to expose the low-level execution model to the programmer.
Access to memory is the key design consideration for Pflop/s systems.
Enhanced message passing facilities are needed, including support for latency management, visualization and performance tools.
  • Expect that initial Petaflop applications will be built using "brute force" and not use "seamless" high level tools
  • These brute-force experiments will guide high level tool design

Foil 24 Overall Suggestions - II

Research is needed in novel algorithms that explore a wider range of the latency/bandwidth/memory space.
Define or speculate on novel applications that may be enabled by Pflops technology.
Engage several broader application communities to identify and quantify the merits of the various Pflop/s designs.
Need to develop a "common" set of applications with which to drive and evaluate various hardware and software approaches
  • We made a start with five Petaflop kernels

Foil 25 Other Suggested Point Designs

Superconducting distributed memory
  • A conventional distributed memory design where each node consists of a superconducting CPU, superconducting memory and conventional DRAM memory.
Distributed shared memory
  • How far can the emerging distributed shared memory concept be extended? Can a Pflop/s system be constructed from one or a number of DSM systems?
A more aggressive PIM design
  • Essentially a general-purpose GRAPE where we use half the silicon for as many "optimal" (250,000-transistor, according to Kogge) CPU's as we can

Foil 26 Latency Research Is Needed

Latency tolerant algorithms will be essential
  • Today the most extreme latency issues are seen in networks of workstations and other such metacomputers.
Hardware mechanisms for hiding and/or tolerating latency.
Language facilities (such as programmer-directed prefetching) for managing latency (see the sketch after this list).
Need a "hierarchy little language" to describe latency structure
Software tools, such as visualization facilities, for measuring and managing latency.
Performance studies of distributed SMPs to study latency issues.
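
A minimal C sketch of programmer-directed prefetching, using the GCC __builtin_prefetch intrinsic as a stand-in for whatever facility a petaflops language would provide; the prefetch distance of 16 elements is an illustrative tuning parameter, not something from the foils:

    /* Gather y[i] = x[idx[i]] with upcoming indirect accesses prefetched,
       so the loads for iteration i+16 are in flight while we work on i.  */
    void gather_prefetch(double *y, const double *x,
                         const int *idx, int n) {
        for (int i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&x[idx[i + 16]], 0 /* read */, 1);
            y[i] = x[idx[i]];
        }
    }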

Foil 27 Geometric Structure of Problems and Computers

On some original MPP's (e.g. hypercubes) one worried about the nearest-neighbor structure of algorithms, so that one could match the geometric structure of computers and problems.
This became unnecessary and all that matters on most machines today is data locality
  • e.g. on machines like IBM SP-2, on processor and off processor memory access times are very different but there is no difference between the different processors once you go off CPU
The year 2005 machines seem different, as machine geometry becomes relevant, especially in PIM designs where physical transit time is important as clock speeds increase.

Foil 28 Memory Hierarchy versus Distribution

Today we see the standard trade-off between
  • Memory Hierarchy as in a Shared Memory SGI Power Challenge
  • Distributed Memory as in IBM SP-2
  • And indeed a merging of these ideas with Distributed Shared Memory
Most people believe that it is easier to manage hierarchy (in an "automatic" fashion as in a compiler) than it is to manage distributed memory
PIM gives us a classic distributed memory system as the base design, whereas most other proposals (including Superconducting) give a shared memory as the base design with hierarchy as the issue.
  • Will increased memory hierarchy make its management so much harder that the trade-off will change?
  • Does management of year 2005 distributed memory get much harder than in the systems we are familiar with today?

Foil 29 Needed Algorithm/Application Evaluations

It is difficult to compare the various point designs, because the application targets of each individual proposal are so different.
  • Also these each have rather modest funding and can not be expected to satisfy all our requirements
Research is thus needed to analyze the suitability of these systems for a handful of common grand challenge problems.
Result: a set of quantitative requirements for various classes of systems.
We made a start on this with a set of four "Petaflop kernels"

Foil 30 Application Oriented Software Issues -- April 24, 1996

The joint meeting of Application and Software Groups listed issues
  • Language Issues including Hierarchy Little Language
  • Library and Tool Issues
  • Operating System Issues
Two classes of Problem
  • Classic SPMD are in some sense the hardest as they have most restrictive latency and bandwidth requirements
  • Multidisciplinary applications are a loosely coupled mix of SPMD programs
    • Loose coupling stresses functionality but not performance of the software

Foil 31 Language Related Issues

Need to prefetch irregular data
Move regular and irregular data between levels of memory
The Compiler User Interface needs hints and directives
These need a "Hierarchy Little Language" to express the memory hierarchy at all levels (a hypothetical sketch follows this list)
  • This will allow one to get portability by expressing each architecture as a parameterization of its memory structure with such things as cache size etc.
  • Compare to the register declaration in C; SCM and LCM on the CDC 7600; DISTRIBUTE in HPF
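
No such language is defined in these foils; purely as an illustration of the idea, a hypothetical directive syntax might let a portable C kernel name the memory levels it cares about while a per-machine description binds those names to sizes and latencies:

    /* HYPOTHETICAL "Hierarchy Little Language" directives -- these pragmas
       are invented for illustration and are not part of any real compiler.
       A per-machine description might define the named levels, e.g.
         level L1   size=32KB  latency=2
         level L2   size=4MB   latency=20
         level DRAM size=2GB   latency=200                                  */

    #define NB 64   /* block size; ideally derived from the L1 description */

    void blocked_copy(double *dst, const double *src, int n) {
    #pragma hll place(src, dst : DRAM)        /* hypothetical: where data lives   */
        for (int jb = 0; jb < n; jb += NB) {
    #pragma hll stage(src[jb : jb + NB] : L1) /* hypothetical: stage block in L1  */
            for (int j = jb; j < jb + NB && j < n; j++)
                dst[j] = src[j];
        }
    }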

Foil 32 Library and Tool Issues

Need (to investigate) threads and parallel compilers for latency tolerance
Need latency-tolerant BLAS and higher-level capabilities (see the blocked-kernel sketch after this list)
  • FFT, Linear Algebra, Adaptive Mesh, Collective Data Movement
Performance should be monitored (with no software overhead) in hardware
  • Need to incorporate the "Hierarchy Little Language":
  • "Pablo"-like software gather and visualization
Resource Management in presence of complex memory hierarchy
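
Cache blocking is the classic technique behind a latency-tolerant BLAS: operate on sub-blocks small enough to stay in a fast level of the hierarchy so that each word fetched is reused many times. A minimal C sketch (the block size is an illustrative assumption; in practice it is tuned to the cache in question):

    #define NB 64  /* block size, assumed to fit three NB x NB tiles in cache */

    /* C += A * B for n x n row-major matrices, blocked for cache reuse. */
    void dgemm_blocked(int n, const double *A, const double *B, double *C) {
        for (int ib = 0; ib < n; ib += NB)
            for (int kb = 0; kb < n; kb += NB)
                for (int jb = 0; jb < n; jb += NB)
                    /* multiply one tile; all three tiles stay cache-resident */
                    for (int i = ib; i < ib + NB && i < n; i++)
                        for (int k = kb; k < kb + NB && k < n; k++) {
                            double aik = A[i * n + k];
                            for (int j = jb; j < jb + NB && j < n; j++)
                                C[i * n + j] += aik * B[k * n + j];
                        }
    }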

Foil 33 Operating System Issues - I

Need to study separately the issues for
  • "Micromemory" -- as little as 1 megabyte of memory per processor in PIM
  • General Hierarchical and Heterogeneous (distributed shared memory) Systems
As always in year 2005 need latency-tolerant Operating Systems!
Unclear message on need for Protection
Need support for multithreading and sophisticated scheduling
Scaling of Concurrent I/O ?
  • Need to understand this at teraflop performance level first!
  • Can one use COMA (Cache Only Memory Architecture) ideas for I/O?

Foil 34 Operating System Issues - II

Message Passing
  • Clearly need buffer management, flow control, protocols etc.
  • But what are the appropriate mechanisms -- need to revisit the comparison of active messages versus MPI etc.
Need to be able to restart separate parts of the (large) hardware independently
Need to support performance evaluation with different levels of memory hierarchy exposed
Need Storage management and garbage collection
  • PIM micromemory different from hierarchical shared memory
Need good job and resource management tools
Checkpoint/restart is only needed at natural synchronization points

Foil 35 "Initial" Findings of the "Implementation" Subgroup at PetaSoft 96

June 17-18 1996
Geoffrey Fox, Chair
Andrew Chien, Vice-Chair

Foil 36 Initial Thoughts I

Define a "clean" model for machine architecture
  • Memory hierarchy including caches and geomterical (distributed) effects
Define a low level "Program Execution Model" (PEM) which allows one to describe movement of information and computation in the machine
  • This can be thought of as "MPI"/assembly language of the machine
On top of low level PEM, one can build an hierarchical (layered) software model
  • At the top of this layered software model, one finds objects or Problem Solving Environments (PSE's)
  • At an intermediate level there is Parallel C C++ or Fortran
One can program at each layer of the software and augment it by "escaping" to a lower level to improve performance
  • Directives (HPF assertions) and explicit insertion of lower level code (HPF extrinsics) are possible

Foil 37 Initial Thoughts II

One needs distributed and shared memory constructs in the PEM
One should look at extending HPF directives to refer to memory hierarchy
It is interesting to look at adding directives to high level software systems such as those based on objects
One needs (performance) predictability in lowest level PEM
  • User control must be possible for any significant caches
  • Note that as one goes to higher layers in the software model, usability increases and predictability decreases
One needs layered software tools to match layered execution software
  • Performance Monitoring
  • Load Balancing -- this should be under user control -- i.e. in the runtime and not the O/S
  • Debugging
It is possible that support of existing software (teraApps) may not be an emphasis

Foil 38 The MPI Program Execution Model
(Diagram in image.)
MPI represents the structure of machines (such as the original Caltech Hypercube) with just two levels of memory: on-processor and off-processor
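
In this two-level view everything a process can touch directly sits in its own memory, and any other data must be moved by an explicit message. A minimal C illustration of the model (the ring exchange is an invented example, not from the foil):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double mine = (double)rank;  /* level 1: data in this process's memory */
        double theirs;               /* will hold one word of remote memory    */
        int right = (rank + 1) % size, left = (rank + size - 1) % size;

        /* level 2: the only way to see another process's memory is a message */
        MPI_Sendrecv(&mine,   1, MPI_DOUBLE, right, 0,
                     &theirs, 1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("rank %d received %.0f from rank %d\n", rank, theirs, left);
        MPI_Finalize();
        return 0;
    }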

Foil 39 The PetaSoft Program Execution Model

This addresses memory hierarchy intra-processor as well as inter-processor data movement
(Diagram in image: 5 memory hierarchy levels in each processor, including Level 1 and Level 2 caches, with data movement between levels and between processors.)

Foil 40 Findings 1) and 2) -- Memory Hierarchy

1) Deep memory hierarchies present new challenges to the high-performance implementation of programs
  • Latency
  • Bandwidth
  • Capacity
2) There are two dimensions of memory hierarchy management
  • Geometric or global structure
  • Local (cache) hierarchies seen from a thread- or processor-centric view

Foil 41 Findings 3) and 4) -- Using Memory Hierarchy

3) One needs a machine "mode" which supports a predictable and controllable memory system, leading to communication and computation with the same characteristics
  • Allow compiler optimization
  • Allow programmer control and optimization
  • For instance, high performance would often need full program control of caches
4) One needs a low-level software layer which provides direct control of the machine (memory hierarchy etc.) by a user program
  • This is for initial users and program tuning

Foil 42 Findings 5) and 6) -- Layered Software

5) One needs a layered (hierarchical) software model which supports the efficient use of multiple levels of abstraction in a single program.
  • Higher levels of the programming model hide extraneous complexity
  • The highest layers are application-dependent Problem Solving Environments and the lowest levels are machine-dependent
  • Lower levels can be accessed for additional performance
  • e.g. HPF extrinsics, gcc asm, MATLAB Fortran routines, native classes in Java (see the sketch after this list)
6) One needs a set of software tools which match the layered software (programming model)
  • Debuggers, performance and load-balancing tools
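
A minimal C sketch of the "escape" idea: a portable high-level routine with a hook that drops into a machine-specific lower layer when one exists. The name vendor_daxpy and the HAVE_VENDOR_DAXPY guard are hypothetical, invented for this illustration:

    /* High-level, portable layer: y = a*x + y */
    void daxpy_portable(int n, double a, const double *x, double *y) {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }

    /* Escape hatch: if a machine-specific lower layer exists, use it.
       HAVE_VENDOR_DAXPY and vendor_daxpy() are hypothetical names.     */
    #ifdef HAVE_VENDOR_DAXPY
    void vendor_daxpy(int n, double a, const double *x, double *y);
    #define daxpy(n, a, x, y) vendor_daxpy((n), (a), (x), (y))
    #else
    #define daxpy(n, a, x, y) daxpy_portable((n), (a), (x), (y))
    #endif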

Foil 43 The Layered Software Model

This is not really a simple stack but a set of complex relations between layers with many interfaces and modules
Interfaces are critical (for composition across layers)
  • Enable control and performance for application scientists
  • Decouple CS system issues and allow exploration and innovation
(Diagram in image: higher-level abstractions, nearer to the application domain, enable the next 10000 users; increasing machine detail, control and management serves the first 100 pioneer users.)

Foil 44 Some Examples of a Layered Software System

From high level to low level:
  • Application Specific Problem Solving Environment
  • Coarse Grain Coordination Layer e.g. AVS
  • Massively Parallel Modules -- such as DAGH, HPF, F77, C, HPC++
  • Fortran or C plus generic message passing (get, put) and generic memory hierarchy and locality control
  • Assembly Language plus specific (to architecture) data movement, shared memory and cache control
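
Purely as an illustration of the "generic message passing (get, put)" layer above, such an interface might be declared as follows; the names hll_get, hll_put and hll_region are hypothetical stand-ins (MPI-2 one-sided MPI_Get/MPI_Put are a real analogue):

    #include <stddef.h>

    /* Hypothetical handle naming a place in the memory hierarchy:
       another processor's memory, a deeper local level, disk, etc.   */
    typedef struct hll_region hll_region;

    /* Copy nbytes from a (possibly remote or deeper) region into local
       memory, and back again.  Nonblocking variants would return a
       handle to wait on, which is where latency tolerance comes in.   */
    int hll_get(void *local_dst, const hll_region *src,
                size_t offset, size_t nbytes);
    int hll_put(const void *local_src, hll_region *dst,
                size_t offset, size_t nbytes);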

Foil 45 Finding 7) Testbeds

7) One needs hardware systems which can be used by system software developers to support software prototyping, using emulation and simulation of proposed petacomputers. This will evaluate the scaling of the software and reduce risk
  • These systems must be available early so that working software is delivered at the same time as the deployed hardware
  • This prototyping machine must allow one to change all parts of the operating system, and so it will be hard to share the machine during prototyping

Foil 46 Findings 8) and 9) Applications

8) One needs a benchmark set of applications which can be used to evaluate the candidate petacomputer architectures and their software
  • These applications should capture key features (size, adaptivity etc.) of proposed petaApplications
9) One needs to analyse carefully both the benchmark and other petaApplications to derive requirements on the performance and capacity of essential subsystems (I/O, communication, computation including interlayer issues in the hierarchy) of a petacomputer

Foil 47 Findings 10) to 14) General Points

10) The PetaComputer should be designed by a cross-disciplinary team including software, application and hardware expertise
11) For the initial "pioneer" users, portability and tool/systems software robustness will be less important concerns than performance
12) The broad set of users may require efficient support of current (by then legacy) programming interfaces such as MPI, HPF and HPC++
13) The petaComputer offers an opportunity to explore new software models unconstrained by legacy interfaces and codes
14) Fault resilience is essential in a large-scale petaComputer

Northeast Parallel Architectures Center, Syracuse University, npac@npac.syr.edu
