Given by Geoffrey Fox at Trip to China on July 12-28, 1996. Foils prepared July 6, 1996
Abstract * Foil Index for this file
We describe the basic technology driver -- the CMOS Juggernaut -- and some new approaches that could be important 10-20 years from now |
We describe, from an elementary point of view, the basics of parallel (MPP) architectures |
We discuss the current situation for tightly coupled systems -- convergence to distributed shared memory |
We discuss clusters of PC's/workstations -- MetaComputing |
This table of Contents
Abstract
http://www.npac.syr.edu/users/gcf/hpcc96hardware/index.html |
Presented during Trip to China, July 12-28, 1996 |
Geoffrey Fox |
NPAC |
Syracuse University |
111 College Place |
Syracuse NY 13244-4100 |
We describe the basic technology driver -- the CMOS Juggernaut -- and some new approaches that could be important 10-20 years from now |
We describe, from an elementary point of view, the basics of parallel (MPP) architectures |
We discuss the current situation for tightly coupled systems -- convergence to distributed shared memory |
We discuss clusters of PC's/workstations -- MetaComputing |
The future sees:
|
We will discuss the first two here, the next three in "Petaflop futures" and I leave the last two to a future generation! |
RAM density increases by about a factor of 50 in 8 years |
Supercomputers in 1992 have memory sizes around 32 gigabytes (giga = 10^9) |
Supercomputers in year 2000 should have memory sizes around 1.5 terabytes (tera = 10^12) |
Computer Performance is increasing faster than RAM density |
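As a rough consistency check on the memory estimates above, here is a minimal sketch (in C, not part of the original foils) that applies the factor-of-50-per-8-years rule to the 1992 baseline of 32 gigabytes; the year 2000 figure comes out near 1.6 terabytes, in line with the 1.5 terabyte estimate |

    /* Sketch only: project supercomputer memory sizes from the
       factor-of-50-every-8-years rule quoted above.  The 1992 baseline
       of 32 GB is taken from the foil; nothing else is measured data. */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        const double base_year  = 1992.0;
        const double base_bytes = 32e9;      /* ~32 gigabytes in 1992 */
        const double factor     = 50.0;      /* growth per 8 years    */
        double years[] = {1992.0, 2000.0, 2008.0};

        for (int i = 0; i < 3; i++) {
            double growth = pow(factor, (years[i] - base_year) / 8.0);
            printf("%4.0f: ~%.0f gigabytes\n", years[i],
                   base_bytes * growth / 1e9);
        }
        return 0;   /* prints ~32 GB, ~1600 GB (1.6 TB) and ~80000 GB */
    }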
See Chapter 5 of Petaflops Report -- July 95 |
Two critical Issues are: |
Memory Structure
|
and Heterogeneous mixtures |
Control and synchronization |
SIMD -- lockstep synchronization |
MIMD -- synchronization can take several forms
|
Coarse-grain: Task is broken into a handful of pieces, each executed by a powerful processor. Pieces and processors may be heterogeneous. Computation/communication ratio very high -- typical of networked MetaComputing |
Medium-grain: Tens to a few thousand pieces, typically executed by microprocessors. Processors typically run the same code (SPMD style -- see the sketch after these notes). Computation/communication ratio often hundreds or more. Typical of MIMD parallel systems such as the SP2, CM5, Paragon and T3D |
Fine-grain: Thousands to perhaps millions of small pieces, executed by very small, simple processors (several per chip) or through pipelines. Processors typically have instructions broadcast to them. Computation/communication ratio often near unity. Typical of SIMD but seen in a few MIMD systems such as Dally's J Machine or the commercial Myrinet (Seitz) |
Note that a machine of one type can be used on algorithms of the same or finer granularity |
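To make the medium-grain SPMD style concrete, here is a minimal MPI sketch (not taken from any of the systems named above; the problem size and the reduction are invented for illustration) in which every processor runs the same code on its own slice of data and communicates only occasionally |

    /* SPMD (medium-grain) sketch: the same program runs on every
       processor, each working on its own N_LOCAL elements, with one
       collective communication at the end.  N_LOCAL is an arbitrary
       illustrative size. */
    #include <mpi.h>
    #include <stdio.h>

    #define N_LOCAL 100000

    int main(int argc, char **argv)
    {
        int rank, nproc;
        double local = 0.0, global = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        /* Lots of local computation ... */
        for (int i = 0; i < N_LOCAL; i++)
            local += (double)(rank * N_LOCAL + i);

        /* ... and only occasional communication, keeping the
           computation/communication ratio high. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum over %d processors = %g\n", nproc, global);

        MPI_Finalize();
        return 0;
    }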
Shared (Global): There is a global memory space, accessible by all processors. Processors may also have some local memory. Algorithms may use global data structures efficiently. However "distributed memory" algorithms may still be important as memory is NUMA (Nonuniform access times) |
Distributed (Local, Message-Passing): All memory is associated with processors. To retrieve information from another processor's memory, a message must be sent there (see the sketch after this list). Algorithms should use distributed data structures. |
Uniform: All processors take the same time to reach all memory locations. |
Nonuniform (NUMA): Memory access is not uniform, so a given processor takes a different time to fetch data from each memory bank. This is natural for distributed memory machines but is also true in most modern shared memory machines
|
Most NUMA machines these days have two memory access times
|
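As a concrete illustration of the distributed (message-passing) case above, here is a hedged MPI sketch (the value and the message tag are arbitrary) in which processor 1 obtains a datum that lives in processor 0's memory by exchanging a message |

    /* Distributed-memory sketch: processor 1 cannot read processor 0's
       memory directly; the owner must send the datum in a message.
       The value 3.14159 and tag 99 are purely illustrative. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        double value = 0.0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                 /* owner of the data */
            value = 3.14159;
            MPI_Send(&value, 1, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
        } else if (rank == 1) {          /* needs the remote datum */
            MPI_Recv(&value, 1, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD, &status);
            printf("processor 1 received %f from processor 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }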
Classes of networks include: |
Bus: All processors (and memory) connected to a common bus or busses.
|
Switching Network: Processors (and memory) connected to routing switches, as in the telephone system.
|
Example topologies: two-dimensional grid, binary tree, complete interconnect and 4D hypercube (hop counts for two of these are sketched below). |
Communication (operating system) software ensures that the system appears fully connected even if the physical connections are only partial |
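For two of the topologies just listed, the hop count between nodes is easy to compute; the following sketch (node numbering conventions are assumed, not taken from the foils) covers the hypercube and grid cases and feeds naturally into the latency model below |

    /* Toy hop-count calculations for two of the topologies above. */
    #include <stdio.h>

    /* Hypercube: nodes carry d-bit labels; hops = Hamming distance. */
    int hypercube_hops(unsigned a, unsigned b)
    {
        unsigned diff = a ^ b;
        int hops = 0;
        while (diff) { hops += diff & 1u; diff >>= 1; }
        return hops;
    }

    /* 2D grid: hops = Manhattan distance between (x,y) coordinates. */
    int grid_hops(int x1, int y1, int x2, int y2)
    {
        int dx = x1 > x2 ? x1 - x2 : x2 - x1;
        int dy = y1 > y2 ? y1 - y2 : y2 - y1;
        return dx + dy;
    }

    int main(void)
    {
        printf("4D hypercube, node 0 to node 15: %d hops\n",
               hypercube_hops(0u, 15u));                  /* 4 */
        printf("8x8 grid, (0,0) to (7,7): %d hops\n",
               grid_hops(0, 0, 7, 7));                    /* 14 */
        return 0;
    }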
Transmission Time for message of n bytes: |
T(n) = T0 + T1 * n, where |
T0 is the latency; it contains a term proportional to the number of hops, plus a term representing the interrupt processing time at the beginning and end of the communication and the time for the communication network and processor to synchronize |
T0 = Ts + Td * (number of hops) |
T1 is the inverse bandwidth -- it can be made small if the pipe (channel) is large |
In practice Ts and T1 are most important and Td is unimportant |
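Putting the model together, here is a minimal sketch of T(n) = T0 + T1*n with T0 = Ts + Td*(number of hops); all the numerical constants are invented placeholders, not measurements from any machine |

    /* Transmission-time model sketch: T(n) = T0 + T1*n, T0 = Ts + Td*hops.
       The values of Ts, Td and T1 below are made-up placeholders. */
    #include <stdio.h>

    double transmission_time_us(double n_bytes, int hops)
    {
        const double Ts = 50.0;    /* startup/interrupt latency (us, assumed)  */
        const double Td = 0.5;     /* per-hop delay (us, assumed)              */
        const double T1 = 0.01;    /* inverse bandwidth, us per byte (assumed) */

        double T0 = Ts + Td * hops;
        return T0 + T1 * n_bytes;
    }

    int main(void)
    {
        /* For large messages the T1*n term dominates and Td*hops is
           negligible, matching the remark that Ts and T1 matter most. */
        printf("1 KB over 4 hops: %.1f us\n", transmission_time_us(1024.0, 4));
        printf("1 MB over 4 hops: %.1f us\n", transmission_time_us(1048576.0, 4));
        return 0;
    }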
Dongarra and Dunigan: Message-Passing Performance of Various Computers, August 1995 |
Square blocks indicate shared memory copy performance |
Dongarra and Dunigan: Message-Passing Performance of Various Computers, August 1995 |
Today we see the following "CMOS Juggernaut" Architectures |
SIMD: No commercial or academic acceptance except for special purpose military (signal processing) and commercial (database indexing) applications |
Special Purpose: Such as the GRAPE N-body machine, which achieves a teraflop today and a petaflop in a few years -- it requires only small memory and small CPU's |
MIMD Distributed Memory:
|
Shared Memory
|
Expected Architectures of Future will be:
|
Essentially all problems run efficiently on a distributed memory machine BUT |
Software is easier to develop on a shared memory machine |
Some Shared Memory Issues:
|
Scalable Network (ATM) and Platforms (PC's running Windows 95) |
MetaComputing Built from PC's and ATM as commodity parts (COTS) |
The Computing World from Smart Card to Enterprise Server |
Vast numbers of under-utilised workstations available for use. |
Huge numbers of unused processor cycles and resources that could be put to good use in a wide variety of application areas. |
Reluctance to buy supercomputers due to their cost and short life span. |
Distributed compute resources fit better into today's funding model. |
Parallel Computing |
Communication - high bandwidth and low latency. |
Low flexibility in messages (point-to-point). |
Distributed Computing |
Communication can be high or low bandwidth. |
Latency typically high -- messages can be very flexible, supporting fault tolerance, sophisticated routing, etc. |
Why use Distributed Computing Techniques ? |
Expense of buying, maintaining and using traditional MPP systems. |
Rapid increase in commodity processor performance. |
Commodity networking technology (ATM/FCS/SCI) delivering greater than 200 Mbps at present, with Gbps performance expected in the very near future. |
The pervasive nature of workstations in academia and industry. |
Price/Performance of using existing hardware/software. |
Comms1 - From ParkBench Suite |
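Comms1 is a ping-pong style exchange between two nodes; a minimal sketch in the same spirit (this is not the actual ParkBench code, and the message sizes and repetition count are arbitrary) looks like the following |

    /* Ping-pong sketch in the spirit of Comms1: measure round-trip time
       for messages of increasing size between processors 0 and 1.
       Not the actual ParkBench code; sizes and REPS are arbitrary. */
    #include <mpi.h>
    #include <stdio.h>

    #define REPS 100

    static char buf[1 << 20];            /* up to 1 MB messages */

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int n = 1; n <= (1 << 20); n *= 16) {
            double t0 = MPI_Wtime();
            for (int r = 0; r < REPS; r++) {
                if (rank == 0) {
                    MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
                } else if (rank == 1) {
                    MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
                    MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            if (rank == 0)               /* one-way time per message */
                printf("%8d bytes: %.1f us\n", n,
                       0.5e6 * (MPI_Wtime() - t0) / REPS);
        }

        MPI_Finalize();
        return 0;
    }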
High Initial and Maintenance Costs
|
Applications Development
|