Given by Geoffrey C. Fox in the Basic Simulation Track for Computational Science, CPS615, Fall Semester 97. Foils prepared 25 August 1997
Summary of Material
This introduces the course, which covers the essential programming and algorithmic skills needed for large scale (serious) computing, also known as simulation |
We start with an overview of the course itself and describe how it compares to other computational science offerings |
In the introductory material, we describe some large simulations of interest (the so-called Grand Challenges) |
An overview of technology trends driving the computer industry (in web and simulation areas!) |
An overview of the computational science education program here and elsewhere |
Parallel Computing in Society and why it works in computers! |
http://www.npac.syr.edu/users/gcf/cps615intro97 |
Geoffrey Fox |
Syracuse University NPAC |
111 College Place, Syracuse NY 13244-4100 |
315-443-2163 |
Practical use of leading-edge computer science technologies to address "real" applications |
CPS615/713 (Simulation Track): Large Scale Parallel Computing -- low latency, closely coupled components |
Technologies: Parallel Algorithms, Fortran90, HPF, MPI, Interconnection Networks |
CPS606/616/640/714 (Information Track): World Wide Distributed Computing -- loosely coupled components |
Technologies: Transactions, Security, Compression, PERL, JavaScript, Multimedia, Wide Area Networks, Java, VRML, Collaboration, Integration (middleware), CORBA, Databases |
These ideas come from Xiaoming Li, Peking University |
CPS615 |
Instructor: Geoffrey Fox -- gcf@npac.syr.edu, 315-443-2163, Room 3-131 CST |
Backup: Nancy McCracken -- njm@npac.syr.edu, 315-443-4687, Room 3-234 CST |
NPAC administrative support: Nora Downey-Easter -- nora@npac.syr.edu, 315-443-1722, Room 3-206 CST |
All of the above can be reached at cps615ad@npac.syr.edu |
Students will be on alias cps615@npac.syr.edu |
Homepage is: http://www.npac.syr.edu/projects/cps615fall97 |
There is a slightly out-of-date paper, "Basic Issues and Current Status of Parallel Computing" by Fox (SCCS-736 in the NPAC technical reports) |
Grading is based on approximately 7 homework sets, each due on the Thursday of the week after the day (a Tuesday or Thursday) it is given out |
There will be two projects -- one will start after message passing (MPI) is discussed, the other about 60% of the way through the course |
The total grade is 50% homework and 50% the two projects combined |
All homework will be handled through the web; indeed, all computer access will be through VPL, the Virtual Programming Laboratory, which gives access to compilers, Java visualization, etc. through the web |
Status of High Performance Computing and Communications (HPCC) nationally |
What is Computational Science nationally (and at Syracuse) |
Technology driving forces (until around 2010) |
Elementary discussion of Parallel Computing in Society and why it must obviously work in simulation! |
Sequential and Parallel Computer Architectures |
Comparison of Architecture of World Wide Web and Parallel Systems |
Simple base example -- Laplace's Equation (see the Jacobi sketch after this outline) |
Then we discuss software and algorithms (mini-applications) intermixed |
Programming Models -- SPMD and Message Passing (MPI), Software Integration middleware (Java, WebFlow), Data Parallel (HPF, Fortran90) |
Visualization -- Java Applets |
Other tools: Collaboration, Performance Measurement |
Large Scale Simulations in Engineering |
Large Scale Academic Simulations (Physics, Chemistry, Biology) |
"Other Critical Real World Applications" |
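As a preview of the Laplace base example mentioned above, here is a minimal sketch of Jacobi iteration on a regular grid. It is purely illustrative -- pure Python, with grid size and iteration count chosen arbitrarily; the course itself develops this example with Fortran90, HPF and MPI |

    # Jacobi iteration for Laplace's equation on an N x N grid:
    # interior points relax toward the average of their four neighbors.
    N = 16
    u = [[0.0] * N for _ in range(N)]
    for j in range(N):
        u[0][j] = 1.0            # fix one boundary edge at potential 1.0

    for step in range(200):      # arbitrary iteration count
        new = [row[:] for row in u]
        for i in range(1, N - 1):
            for j in range(1, N - 1):
                new[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                    u[i][j-1] + u[i][j+1])
        u = new

    print(u[N // 2][N // 2])     # potential at the center of the grid

Each interior point depends only on its neighbors' previous values, which is why this update parallelizes naturally over subdomains of the grid |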
f is the feature size -- the width of the basic lines on a chip; it is now 0.25 microns (250 nm) and will decrease inexorably to 100 nm in the 2005-2010 time period |
Roughly, the number of transistors is proportional to (1/f)^2 |
The speed of the chip is proportional to (1/f) |
The overall figure of merit is therefore (1/f)^3 |
Die size is also increasing like (1/f), and this enhances the effect! |
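A quick arithmetic check of these scaling laws, using the 250 nm and 100 nm feature sizes quoted in these foils (a sketch; the exponents are exactly those stated above) |

    r = 250.0 / 100.0                        # linear shrink factor: 2.5x
    print("transistors:     %.2fx" % r**2)   # ~6.25x more transistors
    print("clock speed:     %.2fx" % r)      # ~2.5x faster
    print("figure of merit: %.2fx" % r**3)   # ~15.6x overall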
These foils come from a summary by David Bailey |
First of all, there seems to be general agreement that Moore's Law will not stop anytime soon. |
The current state of the art in the semiconductor industry is 250 nm, as recently announced by Intel among others. Researchers are confident that the current approach can be used for at least another three generations. |
In the years ahead we may even see some manufacturers skip a generation, proceeding directly to significantly smaller feature sizes. This means that the 100 nm technology wall will be reached earlier than previously anticipated. |
Below about 100 nm feature sizes, progress will be more difficult, for various well-known reasons, among them the lack of any substance that can be used as a lens for photolithography at such wavelengths. |
Nonetheless, there are a number of "tricks" in the works, including the use of diffraction gratings, parallel exposures, and others. |
Groups working on these "ultimate" silicon techniques include Frances Houle's group at IBM Almaden and Jeff Baker's group at UC Berkeley. |
Also helping will be some improvements such as better materials and increased wafer sizes. The consensus is that these techniques will be good for another two generations, to about 50 nm. Thus Moore's Law should continue until about 2010 or 2012. |
Below about 50 nm feature sizes, it appears that a completely new device fabrication technology is needed. |
Progress in electron beams and X-rays, which were the leading candidates of a few years ago, has stalled. |
The researchers I spoke with nonetheless agree that future devices will be electronic for the foreseeable future. |
As Stan Williams of HP points out, electrons are the ideal basis for device technology because they have basically zero size and mass, can travel at large fractions of the speed of light and interact strongly with matter. |
In contrast, DNA computing and the like are still curiosities with no clear path to practical high-speed computing. |
One exciting development, which has emerged just in the past year or two, is nanotubes, i.e. cylindrical buckyballs. |
It was recently discovered that a nanotube can be "tuned" from insulator to semiconductor to conductor just by changing the pitch of the helical structure of carbon atoms. |
Kinks introduced in a tube can be used to form a conductor-semiconductor junction. |
More than one person with whom Bailey talked emphasized that future research should not focus exclusively on fabrication technology. |
An equally important question is what is worth fabricating at these small feature sizes. This is because conventional transistor-based logic appears increasingly troublesome below about 50 nm. |
Thus a new paradigm in logic design is as much needed as a new paradigm in fabrication technology. |
Several at the meeting were aware of Likharev's rapid single flux quantum (RSFQ) superconducting logic. They agree that it is a very interesting development, and further that it is fairly certain to work well. |
However, they feel that the requirement for liquid helium temperatures is a severe impediment to widespread utilization, and without commodity production it is unlikely to be economic. |
According to Stan Williams, Likharev recently came to Hewlett-Packard. |
He offered HP all of his patents, only asking that they develop the technology. |
But HP management still declined. Williams quipped that if some company decided to produce RSFQ technology, somehow assembling the $100 million or so required in start-up capital, then the day after it produced its first devices the technology would be pronounced a great success, but on the second day the company would fold financially. |
One interesting possibility for a future logic system is "quantum dot cellular automata", abbreviated QCA. |
Wolfgang Porod of the University of Notre Dame gave a talk on these devices at the Ames workshop. His group has proposed configurations of cells, each of which is a collection of five electron sites arranged in an X, with two free electrons within the cell. |
According to their studies, configurations of these cells can propagate signals, invert signals, form majority vote logic, and thus can be formed to produce practical devices. |
The devices are semi-classical, semi-quantum, in that interactions within a cell can involve quantum tunneling, but interactions between cells involve only classical electromagnetic effects. |
Like RSFQ, initial demonstration devices of QCA logic will need to be super-cooled, assuming fabrication feature sizes in the tens of nanometers. |
But unlike RSFQ (from what we know of RSFQ), the smaller the fabrication dimension, the higher the tolerable temperature, according to thermodynamic analyses. |
Indeed, if one could eventually fabricate QCA devices in the realm of 2-5 nanometers, then they could operate at room temperature! |
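To make "majority vote logic" concrete: a majority gate outputs whichever value appears on at least two of its three inputs, and pinning one input to 0 or 1 turns it into AND or OR. The sketch below is our own illustration of that standard construction, not code from Porod's group |

    def majority(a, b, c):
        # 1 if and only if at least two of the three inputs are 1
        return (a & b) | (b & c) | (a & c)

    def and_gate(a, b):
        return majority(a, b, 0)   # one input pinned to 0 gives AND

    def or_gate(a, b):
        return majority(a, b, 1)   # one input pinned to 1 gives OR

    # together with an inverter, majority gates form a complete logic family
    assert and_gate(1, 1) == 1 and and_gate(1, 0) == 0
    assert or_gate(0, 0) == 0 and or_gate(1, 0) == 1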
HPCC is a maturing field with many organizations installing large scale systems |
These include NSF (academic computations), DoE (Department of Energy) and DoD (Department of Defense) |
There are new applications with new algorithmic challenges |
However these are not "qualitatively" new concepts |
Software ideas are "sound", but note that simulation is only a few percent of the total computer market |
Transistors are getting cheaper and cheaper, and it only takes some 0.5 million transistors to make a very high quality CPU |
Already we build chips with a factor of ten more transistors than this, which are used for "automatic" instruction-level parallelism |
However, getting much more speedup than this requires use of "outer loop" or data parallelism |
Actually, memory bandwidth is an essential problem in any computer, as doing more computations per second requires accessing more memory cells per second! |
Data locality implies the CPU finds the information it needs in cache, which stores the most recently accessed information |
This means one reuses a given memory reference in many nearby computations, e.g. |
A1 = B*C
A2 = B*D + B*B
.... reuses B
[Figure: the memory hierarchy -- L3 cache, main memory, disk -- with memory capacity increasing and memory speed decreasing as one moves away from the processor; there is a factor of 100 difference between processor and main memory speed]
For both parallel and sequential computers, the cost is accessing remote memories with some form of "communication" |
Data locality addresses this cost in both cases |
The differences are quantitative: the size of the effect, and what is done by the user versus what is done automatically |
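One way to glimpse the quantitative effect (a sketch; pure Python only hints at it, since each row here is a separate list -- in C or Fortran the gap between cache-friendly and cache-hostile traversal is far larger) |

    import time

    N = 1000
    a = [[1.0] * N for _ in range(N)]

    def traverse(row_major):
        s, t0 = 0.0, time.time()
        for i in range(N):
            for j in range(N):
                # row-major touches consecutive elements of one row;
                # column-major jumps between rows on every access
                s += a[i][j] if row_major else a[j][i]
        return time.time() - t0

    print("row order:   ", traverse(True))
    print("column order:", traverse(False))   # typically slower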
Now we go through a set of semi-realistic examples of parallel computing and show they use various forms of data parallelism |
Seismic Wave Simulation: Regular mesh of grid points |
Cosmology (Universe) Simulation: Irregular collection of stars or galaxies |
Computer Chess: "Data" is now parts of a game tree, with the major complication of pruning algorithms |
Defending the Nation: Tracking multiple missiles achieves parallelism from the set of missiles |
Finite Element (NASTRAN) structures problem: Irregular collection of nodal (grid) points clustered near a crack |
http://www.npac.syr.edu/users/gcf/cps615arch97 |
Geoffrey Fox |
Syracuse University NPAC |
111 College Place, Syracuse NY 13244-4100 |
315-443-2163 |
We describe the simple nature of parallel machines like the nCUBE2 |
Technologies include conventional, superconducting, and quantum systems, but the first dominates |
Sequential architectures are dominated by memory (cache) issues |
Vector, SIMD, and MIMD are classic architectures |
There is a growing interest in distributed metacomputers |
Special purpose machines are typically not so successful |
The details of networks are not as important as they used to be! |
Familiar from such everyday activities as getting food in a cafeteria, where one processes one person per "clock cycle"; the clock cycle here is the maximum time anybody takes at a single "stage", a stage being one component of the meal (salad, entrée, etc.) |
Note any one person takes about 5 clock cycles in this pipeline, but the pipeline processes one person per clock cycle |
The pipeline has a problem if there is a "stall" -- here, that one of the people wants an entrée which must be fetched from the kitchen. This delays everybody! |
In the computer case, a stall is typically caused by data not being ready for a particular instruction |
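A toy model of this pipeline (our own sketch, assuming 5 stages): the first person takes the full pipeline depth, each later person finishes one clock behind the person ahead, and a stall anywhere delays everyone behind it |

    def total_clocks(people, stages=5, stall_cycles=0):
        # ideal pipeline: 'stages' clocks for the first person,
        # then one completion per clock; a stall adds its cycles
        # to the finish time of everyone behind it
        return stages + (people - 1) + stall_cycles

    print(total_clocks(100))                   # 104 clocks: ~1 person/clock
    print(total_clocks(100, stall_cycles=20))  # one kitchen fetch delays all

Throughput approaches one person (or instruction) per clock only when stalls are rare |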
Taken from David Patterson CS252 Berkeley Fall 1996 |
3 Functional Units: FP Adder, FP Multiplier, FP Divider |
8 Kinds of Stages in FP Units |
Stage  Functional Unit  Description
A      FP Adder         Mantissa add stage
D      FP Divider       Divide pipeline stage
E      FP Multiplier    Exception test stage
M      FP Multiplier    First stage of multiplier
N      FP Multiplier    Second stage of multiplier
R      FP Adder         Rounding stage
S      FP Adder         Operand shift stage
U      --               Unpack floating point numbers
Several different pipelines with different lengths! |
Add, Subtract -- 4 clocks: U S+A A+R R+S |
Multiply -- 8 clocks: U E+M M M M N+A R |
Divide -- 38 clocks: U A R D(28) D+A D+R D+R D+A D+R A R |
Square Root -- 110 clocks: U E (A+R)(108) A R |
Negate -- 2 clocks: U S |
Absolute Value -- 2 clocks: U S |
Floating Point Compare -- 3 clocks: U A R |
Suppose two processors cache the same variable stored in memory of one of the processors |
One must ensure cache coherence so that when one cache value changes, all do! |
[Figure: multiple processors linked by an interconnection network, each holding a cached value of the same shared variable]
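A toy write-invalidate sketch of the coherence requirement (our own illustration, not a description of any real protocol): without the invalidation step, the second processor would keep reading the stale value 0 |

    memory = {"x": 0}          # the shared variable lives in one memory

    class Cache:
        def __init__(self):
            self.data = {}
        def read(self, key):
            if key not in self.data:       # miss: fetch from memory
                self.data[key] = memory[key]
            return self.data[key]
        def write(self, key, value, others):
            memory[key] = value
            self.data[key] = value
            for c in others:               # invalidate all other copies
                c.data.pop(key, None)

    p0, p1 = Cache(), Cache()
    p0.read("x"); p1.read("x")             # both now cache x = 0
    p0.write("x", 42, others=[p1])         # coherence: p1's copy is dropped
    print(p1.read("x"))                    # re-fetches and prints 42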
Machines like the SGI Origin 2000 have a distributed shared memory with a so-called directory implementation of cache coherence (pioneered in the DASH project at Stanford) |
Machines like the SGI Cray T3E are distributed memory but do have fast get and put, so as to be able to access single variables stored on remote memories |
The Origin 2000 approach does not scale as well as the Cray T3E, and large Origin 2000 systems must use message passing to link 128-node coherent memory subsystems |
The Cray T3E offers a uniform interface (though not as good for a small number of nodes) |
Message passing / distributed memory is the natural web model |
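A minimal message-passing sketch using Python's standard multiprocessing module (our choice for a self-contained example; the course itself uses MPI): the two processes share no variables and cooperate only through explicit sends and receives |

    from multiprocessing import Process, Pipe

    def worker(conn):
        x = conn.recv()        # explicit receive: no shared memory
        conn.send(x * x)       # explicit send of the result
        conn.close()

    if __name__ == "__main__":
        parent_end, child_end = Pipe()
        p = Process(target=worker, args=(child_end,))
        p.start()
        parent_end.send(7)
        print(parent_end.recv())   # prints 49
        p.join()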