Given by Geoffrey C. Fox at PAWS (Mandalay Beach) and PetaSoft (Bodega Bay) on April 23 and June 17-19, 1996. Foils prepared August 4, 1996
Outside Index
Summary of Material
Summary of the Application Working Group headed by Fox at the April 1996 Mandalay Beach PAWS Meeting
|
Summary of the PetaSoft Working Group headed by Fox and Chien from the June 1996 Bodega Bay Meeting
|
Based on eight presentations, we could make a clear assessment only on GRAPE. |
None of the designs is yet defined in sufficient quantitative detail to verify suitability on algorithms and applications of interest. |
Due to the greatly different applications proposed by the various designs, it is difficult to compare the systems. |
There are some omissions in the proposed point designs -- some other possible designs are worth studying. |
Aspects of the rapid march to Pflop/s systems need to be decoupled from long-term R&D in architectures, software and high-level programming models. |
Designs need to expose the low-level execution model to the programmer. |
Access to memory is the key design consideration for Pflop/s systems. |
Enhanced message passing facilities are needed, including support for latency management, visualization and performance tools.
|
Research is needed in novel algorithms that explore a wider range of the latency/bandwidth/memory space. |
Define or speculate on novel applications that may be enabled by Pflops technology. |
Engage several broader application communities to identify and quantify the merits of the various Pflop/s designs. |
Need to develop a "common" set of applications with which to drive and evaluate various hardware and software approaches
|
Superconducting distributed memory
|
Distributed shared memory
|
A more aggressive PIM design
|
Latency tolerant algorithms will be essential
|
Hardware mechanisms for hiding and/or tolerating latency. |
Language facilities (such as programmer-directed prefetching) for managing latency; see the sketch after this list. |
Need a "hierarchy little language" to describe latency structure |
Software tools, such as visualization facilities, for measuring and managing latency. |
Performance studies of distributed SMPs to study latency issues. |
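As an illustration of programmer-directed prefetching, here is a minimal Fortran sketch; the prefetch() call and the prefetch distance PFDIST are hypothetical (there is no such standard intrinsic), and the irregular access through idx is chosen to show why the programmer, not the compiler, must supply the addresses.
do i = 1, n
   if (i + PFDIST <= n) call prefetch(x(idx(i+PFDIST)))   ! hypothetical call: request the operand PFDIST iterations ahead
   y(i) = y(i) + a * x(idx(i))                            ! irregular gather whose memory latency is now overlapped with computation
enddo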
On some original MPPs (e.g. hypercubes) one worried about the nearest-neighbor structure of algorithms, so that the geometric structure of computers and problems could be matched. |
This became unnecessary, and all that matters on most machines today is data locality
|
The year-2005 machines seem different: machine geometry again appears relevant, especially in PIM designs, as physical transit time becomes important when clock speeds increase. |
Today we see the standard trade-off between managing a (shared) memory hierarchy and managing distributed memory
|
Most people believe that it is easier to manage hierarchy (in an "automatic" fashion as in a compiler) than it is to manage distributed memory |
PIM gives us a classic distributed memory system as the base design, whereas most other proposals (including the superconducting design) give a shared memory as the base design, with hierarchy as the key issue.
|
It is difficult to compare the various point designs, because the application targets of each individual proposal are so different.
|
Research is thus needed to analyze the suitability of these systems for a handful of common grand challenge problems. |
Result: a set of quantitative requirements for various classes of systems. |
We made a start on this with a set of four "Petaflop kernels" |
Observation on Parallelism |
Consider what level of parallelism will be required for the proposed superconducting system: |
It will have 1024 processors, each with a clock frequency that is some 4000 times faster than the operational speed of DRAM main memory. |
For main memory intensive applications, at least 4000 threads, such as a 4000-long vector of data, must be streaming through each CPU. |
This means that the minimum level of parallelism required is 1024 x 4000 = 4 million. |
Conclusion: One of the key advantages of a superconducting system will not be met unless large amounts of superconducting memory is included in the system, enough so that the fetches to main memory are relatively infrequent. |
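Written as a formula (simply restating the arithmetic above), the minimum concurrency is the processor count times the CPU-to-DRAM speed ratio:
minimum parallelism = N_processors x (CPU clock / DRAM speed) = 1024 x 4000, or about 4 x 10^6 operations in flight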
1. What is the Estimated cost of your system -- use the SIA estimate that semiconductor hardware will be 25 times cheaper per transistor in the 2006 time frame than it is in 1996. You may take $50 per Mbyte for 1996 memory and $5 per Mflop/s for 1996 sustained RISC processor performance (which roughly scales with transistor count), or else otherwise justify your estimate; a worked illustration of this scaling follows this list of questions. |
2. What are your Programming and Execution models
|
3. What are your Hardware specifications -- peak performance rates of base processors, sizes and structures of the various levels of the memory hierarchy (including caches), latency and bandwidth of communication links between each level, etc. |
4. What are your Synchronization mechanisms; hardware fault tolerance facilities. |
5. What are your Mass storage, I/O and visualization facilities. |
6. Comment on support for extended precision (i.e., 128-bit and/or higher precision integer and floating-point arithmetic) -- hardware or software. |
7. Which of the ten key federal grand challenge applications will the proposed system address (aeronautical design, climate modeling, weather forecasting, electromagnetic cross-sections, nuclear stockpile stewardship, astrophysical modeling, drug design, material science, cryptology, molecular biology).
|
8. Discuss your reliance on novel hardware and/or software technology, to be developed by the commercial sector.
|
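As an illustration of the cost scaling in question 1 (the 30 Tbyte memory size is an assumed figure for the example only, not part of the question): dividing the 1996 prices by 25 gives roughly $2 per Mbyte of memory and $0.20 per sustained Mflop/s in 2006. A system with 1 Pflop/s sustained (10^9 Mflop/s) and 30 Tbytes (3 x 10^7 Mbytes) of memory would then be estimated at about 10^9 x $0.20 + 3 x 10^7 x $2, i.e. roughly $200M + $60M = $260M.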
Quantitatively assess the performance of the proposed system on the following five computational problems, based on design parameters and execution model. Justify the analysis. |
a) Regular unit stride SAXPY Loop |
b) Large Stride Vector Fetch and Store |
c) Irregular Gather and Scatter |
d) Nearest Neighbor (local) Communication Algorithm |
e) Representation and Processing of Tree Structured Data |
In the first three kernels (a) (b) (c), n = 10^8; arrays are presumed appropriately dimensioned. |
a). SAXPY loop |
do i = 1, n
   y(i) = a * x(i) + y(i)   ! the standard SAXPY body
enddo |
b). Large-stride vector fetch and store |
do i = 1, n
   y((i-1)*lstride + 1) = x((i-1)*lstride + 1)   ! representative body (assumed): both the fetch and the store use a large stride lstride
enddo |
! This loop initializes the array idx with |
! pseudo-random numbers between 1 and n=10^8. |
do i = 1, n
   call random_number(r)                    ! r: a real scratch variable in [0,1)
   idx(i) = min(n, 1 + int(r * real(n)))    ! representative initialization (assumed form): pseudo-random index in [1, n]
enddo |
! Indexed loop: |
do i = 1, n
   y(i) = x(idx(i))   ! representative indexed gather; the corresponding scatter would be x(idx(i)) = y(i)
enddo |
This is a simple 5-deep nested loop over three spatial and two component indices, with nearest-neighbor spatial structure and "full" matrices for the components |
This problem is defined with two sets of Parameters |
1)Large Grid but not many components: |
nc = 5; nx = 1000; ny = 1000; nz = 1000 |
2)Smaller Grid but more components as if many species: Repeat the above calculation with the parameters |
nc = 150; nx = 100; ny = 100; nz = 100 |
do k = 1, nz |
do j = 1, ny |
do i = 1, nx |
do m = 1, nc |
do mp = 1, nc |
a(i,j,k,m) = u(mp,m) * (b(i+1,j,k,mp) + b(i-1,j,k,mp) + & |
b(i,j+1,k,mp) + b(i,j-1,k,mp) + b(i,j,k+1,mp) + & |
b(i,j,k-1,mp)) + a(i,j,k,m) |
enddo |
enddo |
enddo |
enddo |
enddo |
Consider a tree structure, such as |
        o
       / \
      o   o
     / \
    o   o
         \
          o
except with 10^4 nodes and an arbitrary structure, with one random integer at each node.
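A minimal Fortran sketch of one way to hold and process such a tree follows; the first-child/next-sibling representation, the explicit-stack traversal and the summing operation are all assumptions made for illustration, since the kernel prescribes neither a representation nor an operation (the tree is assumed to have been built already).
! Representation: first-child / next-sibling links allow arbitrary fan-out; node 1 is the root.
integer, parameter :: nnode = 10**4
integer :: val(nnode)                    ! the random integer stored at each node
integer :: child(nnode), sibling(nnode)  ! 0 means no child / no further sibling
integer :: stack(nnode), top, node, total
! Processing: depth-first traversal with an explicit stack, accumulating the node values.
total = 0
top = 1
stack(1) = 1
do while (top > 0)
   node = stack(top)
   top = top - 1
   total = total + val(node)
   if (sibling(node) > 0) then
      top = top + 1
      stack(top) = sibling(node)
   end if
   if (child(node) > 0) then
      top = top + 1
      stack(top) = child(node)
   end if
enddo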
|
The joint meeting of Application and Software Groups listed issues
|
Two classes of Problem
|
Need to prefetch irregular data |
Move regular and irregular data between levels of memory |
The Compiler User Interface needs hints and directives |
These need a "Hierarchy Little Language" to express memory hierarchy at all levels
|
Need (to investigate) Threads and Parallel Compilers for latency tolerance |
Need Latency Tolerant BLAS and higher level Capabilities
|
Performance should be monitored (with no software overhead) in hardware
|
Resource Management in presence of complex memory hierarchy |
Need to study separately the issues for
|
As always in year 2005 need latency-tolerant Operating Systems! |
Unclear message on need for Protection |
Need support for multithreading and sophisticated scheduling |
Scaling of Concurrent I/O ?
|
Message Passing
|
Need to be able to restart separate parts of the (large) hardware independently |
Need to support performance evaluation with different levels of memory hierarchy exposed |
Need Storage management and garbage collection
|
Need good job and resource management tools |
Checkpoint/restart only needed at natural synchronization points |
June 17-18 1996 |
Geoffrey Fox, Chair |
Andrew Chien, Vice-Chair |
Define a "clean" model for machine architecture
|
Define a low level "Program Execution Model" (PEM) which allows one to describe movement of information and computation in the machine
|
On top of low level PEM, one can build an hierarchical (layered) software model
|
One can program at each layer of the software and augment it by "escaping" to a lower level to improve performance
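As a small illustration of this layering (a sketch only: the routine names and the loop structure are invented; EXTRINSIC(HPF_LOCAL) is the HPF mechanism for dropping from the data-parallel layer to per-processor code):
real :: x(100000)
!HPF$ DISTRIBUTE x(BLOCK)
call relax_in_hpf(x)             ! program at the high (HPF) layer for most of the code
call fast_boundary_exchange(x)   ! "escape" to a lower layer for the performance-critical part

extrinsic(hpf_local) subroutine fast_boundary_exchange(x)
   real, dimension(:) :: x       ! inside, each processor sees only its local section
   ! ... explicit message passing and cache-aware code would go here ...
end subroutine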
|
MPI represents the structure of machines (such as the original Caltech Hypercube) with just two levels of memory |
The PEM, in contrast, addresses the intra-processor memory hierarchy as well as inter-processor data movement |
[Figure: 5 memory hierarchy levels in each processor, including the Level 1 and Level 2 caches, with data movement between the levels] |
One needs distributed and shared memory constructs in the PEM |
One should look at extending HPF directives to refer to memory hierarchy; a hypothetical sketch follows this list |
It is interesting to look at adding directives to high level software systems such as those based on objects |
One needs (performance) predictability in lowest level PEM
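As a hypothetical sketch of such a directive extension (the PLACE/LEVEL syntax below is invented for illustration and is not part of HPF; only the DISTRIBUTE line is real HPF):
real :: a(1000, 1000)
!HPF$ DISTRIBUTE a(BLOCK, BLOCK)     ! standard HPF: placement across processors (the inter-processor level)
!HPF$ PLACE a(:, 1:64) IN LEVEL(2)   ! invented extension: ask that this working set be kept in Level-2 (cache-sized) memory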
|
One needs layered software tools to match layered execution software
|
It is possible that support of existing software (teraApps) may not be the emphasis
(un)reliability of software which cannot have the extensive testing of a system deployed on many machines |
(un)portability of software which is targeted at these special machines |
Need for resilience to faults which are inevitable in such large machines |
Hardware implications of our software and machine models |
Software should be delivered at the same time as hardware! |
I/O |
Concurrency and Scaling |
June 19 1996 |
Geoffrey Fox, Chair |
Andrew Chien, Vice-Chair |
William Carlson, CCS, wwc@super.org |
Andrew Chien, UIUC, achien@cs.uiuc.edu |
Geoffrey Fox, Syracuse Univ, gcf@npac.syr.edu |
G.R. Gao, Delaware Univ., gao@cs.mcgill.ca |
Edwin Sha, Notre Dame, esha@cse.nd.edu |
Lennart Johnsson, johnsson@cs.uh.edu |
Carl Kesselman, Caltech, carl@compbio.caltech.edu |
Piyush Mehrotra, ICASE, pm@icase.edu |
David Padua, UIUC, padua@uiuc.edu |
Gail Pieper, pieper@mcs.anl.gov |
John Salmon, Caltech, johns@cacr.caltech.edu |
Vince Schuster, PGI, vinces@pgroup.com |
1)Deep Memory Hierarchies present New Challenges to High performance Implementation of programs
|
2)There are two dimensions of memory hierarchy management
|
3)One needs a machine "mode" which supports a predictable and controllable memory system, leading to communication and computation with the same characteristics
|
4)One needs a low level software layer which provides direct control of the machine (memory hierarchy etc.) by a user program
|
5)One needs a layered (hierarchical) software model which supports an efficient use of multiple levels of abstraction in a single program.
|
6)One needs a set of software tools which match the layered software (programming model)
|
This is not really a simple stack but a set of complex relations between layers with many interfaces and modules |
Interfaces are critical (for composition across layers)
|
The five layers of the software stack, from highest to lowest level: |
Application Specific Problem Solving Environment |
Coarse Grain Coordination Layer e.g. AVS |
Massively Parallel Modules -- such as DAGH HPF F77 C HPC++ |
Fortran or C plus generic message passing (get,put) and generic memory hierarchy and locality control |
Assembly Language plus specific (to architecture) data movement, shared memory and cache control |
The higher layers offer abstractions nearer to the application domain and are meant to enable the next 10000 users; the lower layers expose increasing machine detail, control and management and serve the first 100 pioneer users. |
7)One needs hardware systems which can be used by system software developers to support software prototyping using emulation and simulation of proposed petacomputers. This will allow the scaling of the software to be evaluated and will reduce risk
|
8)One needs a benchmark set of applications which can be used to evaluate the candidate petacomputer architectures and their software
|
9)One needs to analyse carefully both the benchmark and other petaApplications to derive requirements on the performance and capacity of essential subsystems (I/O, communication, computation, including inter-layer issues in the hierarchy) of a petacomputer |
10)The PetaComputer should be designed by a cross-disciplinary team including software, application and hardware expertise |
11)For the initial "pioneer" users, portability and tool/systems software robustness will be less important concerns than performance |
12)The broad set of users may require efficient support of current (by then legacy) programming interfaces such as MPI, HPF and HPC++ |
13)The petaComputer offers an opportunity to explore new software models unconstrained by legacy interfaces and codes |
14)Fault Resilience is essential in a large scale petaComputer |
1)Explore issues in design of petaComputer machine models which will support the controllable hierarchical memory systems in a range of important architectures
|
2)Explore techniques for control of memory hierarchy for petaComputer architectures
|
3)Explore issues in designing layered software architectures -- particularly efficient mapping and efficient interfaces to lower levels
|
4)Establish and use testbeds which support study "at scale" of system software with realistic applications |
5)Explore the opportunity to design new software models unconstrained by legacy interfaces |
6)Establish model (benchmark) applications for petaflop machines |
Global Address Space
|
Cheap Synchronization with, for instance, Tag Bits |
Exposed Mechanisms for keeping the processor utilized during long latency operations
|
To be continued |