Given by Peter Kogge (Notre Dame) at PAWS 96, Mandalay Beach, April 21-26 1996. Foils prepared June 1996
Summary of Material
This was part of a set of PAWS 96 (Mandalay Beach) presentations |
Kogge and collaborators describe PIM as an emerging architecture in which logic and memory are combined on the same chip, which naturally increases memory bandwidth |
Conventional architectures tend to waste transistors, measured in terms of silicon used per unit operation |
Both existing designs and projections to the PetaFlop timescale (2007) are given |
Dr. Peter M. Kogge |
McCourtney Prof. of Computer Science & Engr. |
IEEE Fellow, IBM Fellow |
Dr. Jay B. Brockman, Assistant Prof. |
Dept. of Computer Science & Engr. |
University of Notre Dame |
219-631-6763 |
kogge@cse.nd.edu |
Project Origins: IBM FSD EXECUBE chip |
Current project funding includes: |
As technology expands, so does the Gap |
"Bridging the Gap:" significant cost implications |
PIM changes the game's ground rules |
Today's High Performance Systems |
PIM: Combining memory & logic |
Emerging single part type PIM designs |
Future projections |
Ongoing projects |
Lessons learned |
New Design Space Exploration |
At Best, First-Order Interactions Identified |
Software Tools, in particular, need detail |
Analysis only as good as projections |
HPCC Program: Conceived in the late 1980s |
Key Goal: Tera(fl)op by the turn of the century |
Major impediment: 1 TF = BIG MACHINE! |
Final component: Embedded HPCC |
We have declared Success! |
Next Goal: Petaflops! |
Feb. '94: Pasadena Workshop on Enabling Technologies for Petaflops Computing Systems |
March '95: Petaflops Workshop at Frontiers '95 |
Aug. '95: Bodega Bay Workshop on Applications |
PETA online: http://cesdis.gsfc.nasa.gov/petaflops/peta.html |
Jan. '96: NSF Call for 100 TF "Point Designs" |
April '96: Oxnard Petaflops Architecture Workshop (PAWS) on Architectures |
June '96: Bodega Bay Petaflops Workshop on System Software |
Oct. '96: Workshop at Frontiers '96 |
Goal: Systems, Applications, & Software for PetaFlop (10^15 flops) machines |
Applications: Significant number |
Technology: Several, inc. CMOS |
Architecture: At least 3 |
Software:"Like your crazy Uncle Fred who no one wants to talk about" |
Huge Problem Size |
Real Time Computations |
Indirect Needs for Petaflops Technology |
Today: |
Future: |
NOT COUNTING MEMORY! |
"Glows in the Dark" |
Wasted silicon |
Wasted bandwidth |
Wasted contacts |
Wasted power |
Unnecessary complexity |
[Diagram: memory hierarchy from CPU through cache, secondary cache, and bus interface to the memory subsystem, with bandwidth loss at each level]
With modern DRAMs: |
[Diagram: multiple internal banks, each with a row buffer 256-4096 bits wide, feeding a multiplexor that selects only 1-9 bits per access: maybe 30-50 MB/s off-chip]
Nibble, Page, Fast Page Mode |
Video RAMs |
Pipelined Extended Data Out RAMs |
Dual Bank Synchronous RAMs |
Block Transfer RAMBUS |
We are adding logic to speed up bandwidth, |
BUT STILL LIMITED BY TAKING DATA OFF CHIP! |
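To put numbers on the mismatch, a back-of-the-envelope sketch in C; the 4096-bit row and 100 ns row-cycle time are assumed, illustrative figures drawn from the ranges above:

    /* Rough DRAM bandwidth comparison: full row width vs. the narrow
     * external interface.  The 4096-bit row and 100 ns row cycle are
     * assumed figures; 50 MB/s is the upper end of the range above. */
    #include <stdio.h>

    int main(void) {
        double row_bits      = 4096.0;   /* bits latched per row access */
        double cycle_ns      = 100.0;    /* assumed row cycle time      */
        double internal_MBps = (row_bits / 8.0) / (cycle_ns * 1e-9) / 1e6;
        double external_MBps = 50.0;     /* typical off-chip rate       */

        printf("internal row bandwidth: %.0f MB/s\n", internal_MBps);
        printf("external interface:     %.0f MB/s\n", external_MBps);
        printf("ratio thrown away:      %.0fx\n",
               internal_MBps / external_MBps);
        return 0;
    }

Under these assumptions, roughly 100x of the bandwidth the row buffer delivers never makes it off the chip.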
Place processing logic on the memory chip |
Use logic to consume memory bandwidth directly (see the sketch below) |
Utilize newly "liberated" chip contacts |
Ample bandwidth changes design philosophy |
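A minimal conceptual sketch of "consuming bandwidth directly": logic placed beside the row buffer processes an entire row per access (here a population count), so no raw bits need to leave the chip. The row size and the operation are illustrative assumptions, not a specific chip's design:

    /* Conceptual sketch: on-chip logic consumes the whole 4096-bit
     * row buffer per row cycle instead of multiplexing 1-9 bits
     * off-chip.  Here the in-memory operation is a population count. */
    #include <stdint.h>
    #include <stdio.h>

    #define ROW_WORDS 64                   /* 64 x 64 bits = 4096-bit row */

    static uint64_t row_buffer[ROW_WORDS]; /* one row access fills this  */

    /* count the set bits across the entire row */
    static unsigned popcount_row(void) {
        unsigned total = 0;
        for (int i = 0; i < ROW_WORDS; i++) {
            uint64_t w = row_buffer[i];
            while (w) { total += (unsigned)(w & 1); w >>= 1; }
        }
        return total;                      /* only this word leaves chip */
    }

    int main(void) {
        row_buffer[0] = 0xFFULL;           /* toy data: 8 bits set */
        printf("bits set in row: %u\n", popcount_row());
        return 0;
    }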
"Big Science" Supercomputing
|
"Point design" Accelerators |
Industrial Supercomputing |
Commodity PCs & Workstations |
Consumer Applications |
[Chart: market segments plotted as Architectural Design Complexity vs. Available Development Resources]
"Across-the-Board" |
need for cheaper, |
denser, lower power |
Storage |
Chip         First Silicon  Peak           Storage   MB/Perf.            Organization
EXECUBE      1993           50 Mips        0.5 MB    0.01 MB/Mip         16-bit SIMD/MIMD CMOS
AD SHARC     1994           120 Mflops     0.5 MB    0.005 MB/MF         Single CPU and memory
TI MVP       1994           2000 Mops      0.05 MB   0.000025 MB/Mop     1 CPU, 4 DSPs
MIT MAP      1996           800 Mflops     0.128 MB  0.00016 MB/MF       4 superscalar CPUs
Terasys PIM  1993           625 M bit-ops  0.016 MB  0.000026 MB/bit-op  1024 16-bit ALUs
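The MB/Perf. column above is simply storage divided by peak rate; a small C check reproduces it (AD SHARC's 0.00417 rounds to the 0.005 shown):

    /* Recompute the storage-to-performance column of the table above. */
    #include <stdio.h>

    int main(void) {
        struct { const char *chip; double mb; double peak; const char *unit; } t[] = {
            { "EXECUBE",     0.5,     50.0, "MB/Mip"    },
            { "AD SHARC",    0.5,    120.0, "MB/MF"     },
            { "TI MVP",      0.05,  2000.0, "MB/Mop"    },
            { "MIT MAP",     0.128,  800.0, "MB/MF"     },
            { "Terasys PIM", 0.016,  625.0, "MB/bit-op" },
        };
        for (int i = 0; i < 5; i++)
            printf("%-12s %.6f %s\n", t[i].chip, t[i].mb / t[i].peak, t[i].unit);
        return 0;
    }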
Conventional GP Computation Wisdom: |
May be somewhat reduced at high end |
Graphics & other embedded applications |
IBM: 4 Mbit (1991), 16 Mbit (1994) |
Toshiba: 8 Mbit (1994) |
Mitsubishi: 10 Mbit (1994) |
Hitachi: 4 Mbit (1995) |
Samsung: 4 Mbit (1996) |
NEC: 8 Mbit (1996) |
4 Mbit DRAM + 100K Gate base: 5V, 2.7W |
Single part type: NO GLUE! |
SIMD & MIMD - In Any Combination |
Huge increase in BW, pin, and silicon utilization |
[Diagram: EXECUBE node: a 16-bit CPU (ALU, instruction register, register array) plus two 32Kx9 DRAM macros, a SIMD/MIMD mode select, the SIMD broadcast bus with decode logic, DMA logic, and DMA channel and link control]
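A conceptual sketch, not the actual EXECUBE microarchitecture: each node could carry a mode bit that selects between instructions arriving on the SIMD broadcast bus and instructions fetched from its own on-chip DRAM, giving the "SIMD & MIMD in any combination" behavior described above:

    /* Conceptual sketch only -- the names and structure are assumed.
     * A node in SIMD mode executes the broadcast instruction; a node
     * in MIMD mode fetches from its local program counter.  Nodes can
     * flip modes independently. */
    #include <stdio.h>

    typedef enum { SIMD, MIMD } pim_mode_t;

    typedef struct {
        pim_mode_t mode;
        unsigned   pc;               /* local program counter (MIMD) */
    } node_t;

    static unsigned next_instruction(node_t *n, unsigned broadcast_insn) {
        if (n->mode == SIMD)
            return broadcast_insn;   /* all SIMD nodes see the same op */
        return n->pc++;              /* MIMD: fetch from local memory  */
    }

    int main(void) {
        node_t a = { SIMD, 0 }, b = { MIMD, 100 };
        unsigned bcast = 42;         /* op on the SIMD broadcast bus   */
        printf("node a executes %u\n", next_instruction(&a, bcast));
        printf("node b executes %u\n", next_instruction(&b, bcast));
        return 0;
    }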
Absolute Need: DENSEST POSSIBLE Memory |
Single Part Type => True Scalable Systems |
Bandwidth "Next Door" => Simpler CPU |
Integrated, Fast I/O => Simpler Apps |
Mixed SIMD/MIMD => Simple Parallelization |
Next Time: |
Combined Memory Bus/Chip Control |
Chip to Chip Links to support scalable MPPs |
External system links: |
Huge bandwidths available to processing logic |
Tremendous internode bandwidths |
Huge bandwidths available at chip periphery |
2D tiling prevents wires "over memory" |
Opportunity for "mix and match" |
Today's "conventional wisdom:" |
Does that make sense in a PIM environment? |
Answer: No! Better choice: design for: |
Performance data from uP vendors |
Transistor count excludes on-chip caches |
Performance normalized by clock rate |
Conclusion: Simplest is best! (250K Transistor CPU) |
[Charts: Normalized SPECINTs and SPECFLTs vs. Millions of Transistors (CPU); projections of MB per cm2, MF per cm2, MB/MF ratios, Processors per cm2, and Processors needed for a Teraflop]
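The arithmetic behind a "processors needed for a Teraflop" curve is simple division; in this C sketch the 100 MF per processor figure is an assumed placeholder, not a number from the talk:

    /* Illustrative teraflop sizing.  per_cpu_MF is an ASSUMED value. */
    #include <stdio.h>

    int main(void) {
        double target_MF  = 1.0e6;   /* 1 Teraflop = 10^6 MF           */
        double per_cpu_MF = 100.0;   /* ASSUMED per-processor peak     */
        double mb_per_mf  = 1.0;     /* a 1 MB/MF design point         */

        printf("processors needed: %.0f\n", target_MF / per_cpu_MF);
        printf("total memory at %.0f MB/MF: %.0f GB\n",
               mb_per_mf, target_MF * mb_per_mf / 1024.0);
        return 0;
    }

At 1 MB/MF a teraflop machine carries roughly a terabyte of memory, regardless of how the processors are sliced.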
By 2010: Same logic = |
* 1/50 to 1/100th the power |
* 100X the performance per Watt |
1 MB/MF |
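A one-line check of that claim, assuming the logic's performance is held constant while its power drops (the quoted 100X is the 1/100th-power end of the range):

    \frac{\mathrm{perf}/W_{2010}}{\mathrm{perf}/W_{\mathrm{today}}}
      = \frac{W_{\mathrm{today}}}{W_{2010}} = 50 \text{ to } 100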
We MUST Look at Lower Power Designs! |
Goal: create 3D cubes of silicon |
Process: |
Stacks of 70 or more have been demonstrated |
Ideal for PIMs |
Problems Today: >2 side wiring, Power |
Power, Power, Power! |
Core "conventional" CPU selection:
|
Optimal PIM ISA & Organization
|
Optimized PIM memory macro
|
Selection of I/O Protocols |
Integrated CAD support |
Find inherently low MB/MF algorithms |
Express as "In the Memory" operations (see the sketch after this list) |
Utilize huge degrees of parallelism |
Utilize rich mix of parallel styles |
Utilize very large inter node bandwidths |
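A minimal sketch of the "In the Memory" idea, with illustrative names rather than a real API: the host ships a small request to every node, each node scans its own local DRAM, and only the (small) result crosses the interconnect; the raw data never moves:

    /* Conceptual sketch: a node-local search expressed as an
     * in-memory operation.  LOCAL_WORDS and in_memory_count are
     * illustrative names, not an existing interface. */
    #include <stddef.h>
    #include <stdio.h>

    #define LOCAL_WORDS 4096               /* this node's slice of memory */
    static unsigned local_mem[LOCAL_WORDS];

    /* runs on each PIM node: count matches in local memory */
    static size_t in_memory_count(unsigned key) {
        size_t hits = 0;
        for (size_t i = 0; i < LOCAL_WORDS; i++)
            if (local_mem[i] == key)
                hits++;
        return hits;                       /* only this word leaves chip */
    }

    int main(void) {
        local_mem[7] = local_mem[99] = 42; /* toy data */
        printf("node-local hits for 42: %zu\n", in_memory_count(42));
        return 0;
    }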
Local node code for "In the Memory" operations |
Node run time kernel |
Host control |
Graduate Student Projects: |
NASA: Future PIM design space for petaflops |
NSF: Inherently low power ISAs |
NEC: PIM-based Image database search |
ARPA: PIM Foundries & DA Tools |
NSF: Point Designs for 100 TFlops |
NASA: Rad Hard PIM for Spacecraft |
PIM: Potential Breakthrough |
BUT: "Breaking the Rules & Changing the Game" |
To make PIM mainstream: |