Given by Geoffrey Fox during a trip to China, July 12-28, 1996. Foils prepared July 6, 1996
Abstract
http://www.npac.syr.edu/users/gcf/hpcc96status/index.html |
Presented during Trip to China, July 12-28, 1996 |
Geoffrey Fox |
NPAC |
Syracuse University |
111 College Place |
Syracuse NY 13244-4100 |
We describe the structure of the seven talks making up this review of HPCC, from today's systems to the Web and to Petaflop performance in the future |
Here we describe the current status, with HPCC in some sense both a failure and a great success |
This requires looking at hardware, software, and the critical lack of commercial adoption of this technology |
We discuss COTS and trickle-up and trickle-down technology strategies |
We describe education and interdisciplinary computational science in both the simulation and information arenas |
There are seven talks in this series: |
HPCC Status -- this talk -- Overall Technical and Political Status |
HPCC Today I -- MPP Hardware Architectures and Machines |
HPCC Today II -- Software |
HPCC Today III -- Applications -- Grand Challenges Industry |
HPCC Tomorrow I -- Problem Solving Environments |
HPCC Tomorrow II -- Petaflop (10^15 Operations per second) in the year 2007? |
HPCC Tomorrow III -- The use of Web Technology in HPCC |
Parallel Computing Works! |
Technology well understood for Science and Engineering |
Supercomputing market is small (a few percent at best) and probably decreasing in size |
No silver programming bullet -- I doubt that a new language will revolutionize parallel programming and make it much easier |
Social forces are tending to hinder adoption of parallel computing, as most applications are in areas where large-scale computing is already common |
ATM, ISDN, wireless and satellite are advancing rapidly in the commercial arena, which is adopting research rapidly |
Social forces (deregulation in the U.S.A.) are tending to accelerate adoption of digital communication technologies |
Not clear how to make money on the Web (Internet), but there is growing interest/acceptance by the general public |
Integration of Communities and Opportunities |
Technology Opportunities in the Integration of High Performance Computing and Communication Systems |
New Business opportunities linking Enterprise Information Systems to Community networks to current cable/network TV journalism |
New educational needs at the interface of computer science and communications/information applications |
Major implications for education -- the Virtual University |
Performance of both communication networks and computers will increase by a factor of 1000 during the 1990s |
Competitive advantage to industries that can use either or both High Performance Computers and Communication Networks. (United States clearly ahead of Japan and Europe in these technologies.) |
ATM networks have rapidly transitioned from research Gigabit networks to commercial deployment |
Computer hardware trends imply that all computers (PC's ---> Supercomputers) will be parallel by the year 2000 |
Software is the challenge and could prevent/delay the hardware trend that suggests parallelism will be a mainline computer architecture |
Parallel Computing Works in nearly all scientific and engineering applications |
As described in the book Parallel Computing Works! (Fox, Messina, Williams) |
The necessary algorithms are in general understood in most cases |
The implementation of -- especially adaptive irregular -- algorithms is not easy because: |
The software tools are immature and do not usually offer direct help for, say: |
There are several different approaches, and it is not clear what will "win" and what will actually "work" when |
Need abstractions of the "hard" problem components and toolkits to tackle them |
Data Parallel and Message Parallel |
These are Message Parallel and Data Parallel Resources |
There are trade-offs in ease of programming (not the same for each user!), portability, maturity of software, and generality of problem class |
Message Parallel is the most mature, somewhat less portable in principle but not necessarily in practice, tackles all problems, and some consider it painful to program (see the sketch below) |
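To make the contrast concrete, here is a minimal message parallel sketch in C with MPI (an illustration added for this writeup, not from the original foils): the programmer explicitly places data on each node and moves boundary values with send/receive calls, whereas a data parallel (HPF-style) version would express the same update as a single array statement and leave the communication to the compiler.

```c
/* Minimal sketch (not from the foils): message parallel style in C + MPI.
   Each process owns a block of a 1D array and explicitly exchanges one
   boundary ("halo") value with its right neighbour. */
#include <mpi.h>
#include <stdio.h>

#define LOCAL_N 4            /* points owned by each process (illustrative) */

int main(int argc, char **argv)
{
    int rank, size;
    double u[LOCAL_N + 1];   /* last slot holds the halo value from the right */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < LOCAL_N; i++)
        u[i] = rank * LOCAL_N + i;                 /* fill local subdomain */

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    /* Explicit communication: the programmer, not the compiler, moves the data */
    MPI_Sendrecv(&u[0], 1, MPI_DOUBLE, left, 0,
                 &u[LOCAL_N], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (right != MPI_PROC_NULL)
        printf("rank %d received halo value %g from rank %d\n",
               rank, u[LOCAL_N], right);

    MPI_Finalize();
    return 0;
}
```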
The switch from conventional to new types of technology is a phase transition |
It needs headroom (Carver Mead), which is large (a factor of 10?) due to the large new software investment |
Machines such as the nCUBE-1 and CM-2 were comparable in cost performance to conventional supercomputers |
Do the Cray T3D, Intel Paragon, CM-5, DECmpp (Maspar MP-2), IBM SP-2, and nCUBE-3 have enough headroom to take over from traditional computers? |
Originally $2.9 billion over 5 years starting in 1992 |
This drove the race to Teraflop performance and is now OVER! |
The Grand Challenges |
For "Convential" MPP/Distributed Shared Memory Architecture |
Now(1996) Peak is 0.1 to 0.2 Teraflops in Production Centers
|
In 1999, one will see production 1 Teraflop systems |
In 2003, one will see production 10 Teraflop Systems |
In 2007, one will see production 50-100 Teraflop Systems |
Memory is Roughly 0.25 to 1 Terabyte per 1 Teraflop |
If you are lucky/work hard: Realized performance is 30% of Peak |
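A rough worked example of these rules of thumb (the machine size is illustrative, not a quoted measurement):

```latex
% Sizing a hypothetical 1996 production system from the rules of thumb above
\[
  \text{Peak} \approx 0.2\ \text{Tflops}, \qquad
  \text{Sustained} \approx 0.3 \times 0.2 \approx 0.06\ \text{Tflops}
\]
\[
  \text{Memory} \approx (0.25\text{--}1)\ \text{TB per Tflop of peak}
  \;\Rightarrow\; \text{roughly } 0.05\text{--}0.2\ \text{TB for this machine}
\]
```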
Most of the activities the HPCC Initiative started are ongoing! |
It achieved its goal of Teraflop performance -- see the Intel P6-based machine at Sandia |
But it failed to develop a viable commercial base |
And although the hardware peak performs at the advertised rate, the software environment is poor |
Academic activities -- the NSF Supercomputer centers -- are very healthy, as it is much easier to put such codes on MPPs since they are short in lifetime and in lines of code |
Next initiatives -- based on the PetaFlop goal -- will include a federal development as well as a research component, as one cannot assume that "brilliant research" will be picked up by industry |
Everybody now believes in COTS -- Commercial Off-The-Shelf technology -- one must use commercial building blocks for any specialized system, whether it be a DoD weapons program or a high-end supercomputer |
COTS for hardware can be applied to a greater or lesser extent |
COTS for software is less common but I expect it to become much more common |
Currently MPP's have COTS processors and specialized networks, but this could reverse |
Thus one estimates that 250,000 transistors (excluding on-chip cache) is optimal for performance per square mm of silicon |
Again simplicity is optimal, but this requires parallelism |
A contrary trend is that memory dominates the use of silicon, and so performance per square mm of silicon is often not relevant |
Performance data from uP vendors |
Transistor count excludes on-chip caches |
Performance normalized by clock rate |
Conclusion: Simplest is best! (250K Transistor CPU) |
[Plots: Normalized SPECINTs and Normalized SPECFLTs versus Millions of Transistors (CPU)] |
This is both the Grand Challenges augmented by the National Challenges, but also |
Building HPCC technologies on a broad, not niche, base, starting at the bottom, not the top, of the computing pyramid |
Each of the three components (network connections, clients, servers) has a capital value of order $10 to $100 billion |
Tightly coupled MPP's (SP2, Paragon, CM5, etc.) were distributed memory, but at least at the low end they are becoming hardware-assisted shared memory |
Note this is an example of COTS at work -- SGI/Sun/.. symmetric multiprocessors (the Power Challenge from SGI) are attractive as the bus will support up to 16 processors in an elegant shared memory software world |
Clustering such SGI Power Challenge-like systems produces a powerful but difficult-to-program (as both distributed and shared memory) heterogeneous system |
Meanwhile Tera Computer will offer true uniform memory access shared memory, using ingenious multithreaded software/hardware to hide latency |
Trend I -- Hardware Assisted Tightly Coupled Shared Memory MPP's are replacing pure distributed memory systems |
Trend II -- The World Wide Web and the increasing power of individual workstations are making geographically distributed heterogeneous distributed memory systems more attractive |
Trend III -- To confuse the issue, the technology trends of the next ten years suggest yet different architecture(s) such as PIM |
With these conflicting technology/architecture trends one had better use scalable portable software, BUT one must address the latency agenda, which isn't clearly portable! |
Functional parallelism is supported by the sequential compiler intra-chip in modern superscalar processors at very small grain size |
Multithreading as in Java supports object or small grain size functional parallelism within a Java Applet -- this can be implemented with message passing, not shared memory (see the thread sketch below) |
Data parallelism is the source of most scaling parallelism and is needed for more than a few-way speedup |
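The multithreading point can be sketched in C with POSIX threads (an analogy to the Java case on the foil, added here for illustration): two different functions of one problem run concurrently, giving functional parallelism with a small, fixed degree of concurrency, unlike data parallelism, which scales with the size of the data set.

```c
/* Sketch in C with POSIX threads (an analogy to the foil's Java multithreading
   point): two *different functions* of one problem run concurrently --
   functional parallelism with a fixed degree of concurrency (here 2). */
#include <pthread.h>
#include <stdio.h>

static void *update_model(void *arg)   { (void)arg; puts("model update step");   return NULL; }
static void *render_display(void *arg) { (void)arg; puts("display render step"); return NULL; }

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, update_model, NULL);    /* task 1 */
    pthread_create(&t2, NULL, render_display, NULL);  /* task 2 */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```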
The Good News -- in 1993, NPAC was selected to partner with two Grand Challenge groups as part of their "computer science" support |
Both applications were rather suitable for HPF, as they use regular grids |
The Bad News -- both application groups have abandoned HPF, as they couldn't wait for working compilers with the appropriate features |
Numerical Relativity needed an adaptive mesh, and this is supplied by DAGH from Texas, which is like HPF but simpler in overall capability, yet can handle adaptive meshes |
In Data Assimilation, a NASA Goddard application, they are recoding in Fortran90 plus MPI as they MUST meet specific performance goals within a year |
MetaProblems exhibit large grain size functional parallelism, as in |
Coordination or Integration Software addresses this |
See SCCS 757 (available from the online NPAC reports), which discusses applications and software approaches for all types of parallelism |
The Java Applet model is a natural approach, which can be illustrated by the new VRML 2.0, which will build on |
Architectures such as PIM emphasize the opportunities if we could develop software models/compilers which could effectively use substantial parallelism |
Current party-line microprocessors assume that substantial parallelism cannot easily be extracted from programs designed for single chips |
One area we are exploring is how to extract parallelism from, or build it into, Java |
A major new direction in Computer Science is "Problem Solving Environments" -- domain-specific systems at a "higher level" than compilers, built as a "toolkit" of components such as: |
Computation joins theory and experiment to form the three complementary approaches to the study of science and engineering |
Current industries such as Media and Telecommunications, which have been dominated by analog technologies, will need to adjust to the growing use of digital (computer) technologies |
Need for new educational approaches such as Computational Science, centered on the interdisciplinary border between computer science and applications, with both a |
Computational Science is an interdisciplinary field that integrates computer science and applied mathematics with a wide variety of application areas that use significant computation to solve their problems |
Includes the study of computational techniques |
Includes the study of new algorithms, languages and models in computer science and applied mathematics required by the use of high performance computing and communications in any (?) important application |
Includes computation of complex systems using physical analogies such as neural networks and genetic optimization. |
The fundamental principles behind the use of concurrent computers are identical to those used in society - in fact they are partly why society exists. |
If a problem is too large for one person, one does not hire a SUPERman, but rather puts together a team of ordinary people... |
cf. the construction of Hadrian's Wall |
Domain Decomposition is Key to Parallelism |
Need "Large" Subdomains l >> l overlap |
AMDAHL"s LAW or |
Too many cooks spoil the broth |
Says that |
Speedup S is small if efficiency e small |
or for Hadrian's wall |
equivalently S is small if length l small |
But this is irrelevant as we do not need parallel processing unless problem big! |
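A compact restatement of this argument in formulas (a reconstruction of the Hadrian's Wall reasoning; the constant c and the exact form are illustrative):

```latex
% N builders each get a subdomain of length l/N; only the overlap region
% l_overlap at its edges generates communication/coordination overhead.
\[
  e \;\approx\; \frac{1}{\,1 + c\,\dfrac{l_{\mathrm{overlap}}}{l/N}\,},
  \qquad
  S \;=\; e\,N .
\]
% S is small exactly when l/N is not much larger than l_overlap, i.e. when
% the total wall length l is small for the given N -- the case where
% parallel processing is not needed anyway.
```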
"Pipelining" or decomposition by horizontal section is:
|
Hadrian's Wall is one dimensional |
Humans represent a flexible processor node that can be arranged in different ways for different problems |
The lesson for computing is: |
The original MIMD machines used a hypercube topology. The hypercube includes several topologies, including all meshes. It is a flexible concurrent computer that can tackle a broad range of problems. Current machines use a different interconnect structure from the hypercube but preserve this capability. |
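One way to see why the hypercube "includes" meshes is the Gray code mapping; the short C sketch below (an illustration, not from the foils) maps ring positions to hypercube nodes so that neighbouring positions differ in exactly one bit and are therefore directly connected.

```c
/* Illustrative sketch (not from the foils): the binary reflected Gray code
   g(i) = i ^ (i >> 1) embeds a ring (1D mesh) in a hypercube -- consecutive
   ring positions map to node numbers differing in exactly one bit. */
#include <stdio.h>

static unsigned gray(unsigned i) { return i ^ (i >> 1); }

int main(void)
{
    int d = 3;               /* a 3-dimensional hypercube has 2^3 = 8 nodes */
    int n = 1 << d;
    for (int i = 0; i < n; i++)
        printf("ring position %d -> hypercube node %u\n", i, gray(i));
    return 0;
}
```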
Comparing Computer and Hadrian's Wall Cases |
At the finest resolution, a collection of neurons sending and receiving messages by axons and dendrites |
At a coarser resolution |
Society is a collection of brains sending and receiving messages by sight and sound |
An Ant Hill is a collection of ants (smaller brains) sending and receiving messages by chemical signals |
Lesson: All Nature's Computers Use Message Passing |
With several different Architectures |
Problems are large -- use domain decomposition; overheads are edge effects |
The topology of the processor matches that of the domain -- a processor with a rich, flexible node/topology matches most domains |
Regular homogeneous problems are easiest, but irregular or inhomogeneous problems can also be handled |
Can use local and global parallelism |
Can handle concurrent calculation and I/O |
Nature always uses message passing, as do parallel computers (at the lowest level) |
Data Parallelism -- the universal form of scaling parallelism |
Functional Parallelism -- important but typically modest speedup; critical in multidisciplinary applications |
On any machine architecture |
Simple, but general and extensible to many more nodes, is domain decomposition |
All successful concurrent machines with |
Have obtained parallelism from "Data Parallelism" or "Domain Decomposition" |
A problem is an algorithm applied to a data set (decomposition of the data set is sketched below) |
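A minimal sketch of the bookkeeping behind domain decomposition (an assumed example with made-up sizes, not from the foils): the global data set is split into contiguous subdomains, one per node, and the same algorithm is applied to each piece.

```c
/* Illustrative sketch (not from the foils): decomposing a global domain of
   n_global points across p nodes.  Each node gets a contiguous subdomain;
   the first (n_global % p) nodes take one extra point so the load balances. */
#include <stdio.h>

static void subdomain(int n_global, int p, int rank, int *start, int *count)
{
    int base  = n_global / p;     /* points every node gets                   */
    int extra = n_global % p;     /* leftovers spread over the first nodes    */
    *count = base + (rank < extra ? 1 : 0);
    *start = rank * base + (rank < extra ? rank : extra);
}

int main(void)
{
    int n_global = 1000, p = 8;   /* e.g. 1000 wall segments, 8 masons */
    for (int rank = 0; rank < p; rank++) {
        int start, count;
        subdomain(n_global, p, rank, &start, &count);
        printf("node %d owns points [%d, %d)\n", rank, start, start + count);
    }
    return 0;
}
```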
The three architectures considered here differ as follows: |
2 Different types of Mappings in Physical Spaces |
Both are static |
Different types of Mappings -- a very dynamic case without any underlying Physical Space |
c) Computer Chess with a dynamic game tree decomposed onto 4 nodes |