Full HTML for

Basic foilset Master Set of Foils for 1997 Session of CPS615

Given by Geoffrey C. Fox at Basic Simulation Track for Computational Science CPS615 on Fall Semester 97. Foils prepared 25 August 1997
Outside Index Summary of Material


This introduces the course which covers the essential programming and algorithmic skills needed for large scale (serious) computing (aka simulation)
We start with an overview of course itself and describe how it compares to other computational science offerings
  • We point out new modules on web based computing
In Introductory material, we describe some large simulations of interest (the so called Grand Challenges)
An Overview of Technology trends driving the computer industry (in web and simulation areas!)
Overview of computational science education program here and elsewhere
Parallel Computing in Society and why it works in computers!

Table of Contents for full HTML of Master Set of Foils for 1997 Session of CPS615

Denote Foils where Image Critical
Denote Foils where HTML is sufficient

1 Base Course in Simulation Track of Computational Science CPS615 Fall Semester 97
2 Abstract of CPS615 Introductory Lecture
3 What is Computational Science ?
4 Internetics is Another Way of Thinking
5 Basic CPS615 Contact Points
6 Course Organization
7 Material Covered in this Course
8 Structure of CPS615 - II
9 Motivating Applications
10 Effect of Feature Size on Performance
11 Ames Summer 97 Workshop on Device Technology -- Moore's Law - I
12 Ames Summer 97 Workshop on Device Technology -- Moore's Law - II
13 Ames Summer 97 Workshop on Device Technology -- Alternate Technologies I
14 Ames Summer 97 Workshop on Device Technology -- Alternate Technologies II
15 Ames Summer 97 Workshop on Device Technology -- New Logic Concepts
16 Ames Summer 97 Workshop on Device Technology -- RSFQ
17 Ames Summer 97 Workshop on Device Technology -- QCA -I
18 Ames Summer 97 Workshop on Device Technology -- QCA -II
19 Some Comments on Simulation and HPCC
20 Parallel Computing Rationale
21 Sequential Memory Structure
22 Parallel Computer Memory Structure
23 Back from the Analogy to Parallel Computers!
24 Introduction to Computer Architectures CPS615 Fall Semester 97
25 Abstract of CPS615 Architecture Discussion
26 What is a Pipeline -- Cafeteria Analogy?
27 Example of MIPS R4000 Floating Point
28 MIPS R4000 Floating Point Stages
29 Cache Coherent or Not?
30 Choices in Cache Coherence

Outside Index Summary of Material



HTML version of Basic Foils prepared 25 August 1997

Foil 1 Base Course in Simulation Track of Computational Science CPS615 Fall Semester 97

From Master Set of Foils for 1997 Session of CPS615 Basic Simulation Track for Computational Science CPS615 -- Fall Semester 97. *
Full HTML Index
http://www.npac.syr.edu/users/gcf/cps615intro97
Geoffrey Fox
Syracuse University NPAC
111 College Place Syracuse NY 13244 4100
3154432163

HTML version of Basic Foils prepared 25 August 1997

Foil 2 Abstract of CPS615 Introductory Lecture

From Master Set of Foils for 1997 Session of CPS615 Basic Simulation Track for Computational Science CPS615 -- Fall Semester 97. *
Full HTML Index
This introduces the course which covers the essential programming and algorithmic skills needed for large scale (serious) computing (aka simulation)
We start with an overview of course itself and describe how it compares to other computational science offerings
  • We point out new modules on web based computing
In Introductory material, we describe some large simulations of interest (the so called Grand Challenges)
An Overview of Technology trends driving the computer industry (in web and simulation areas!)
Overview of computational science education program here and elsewhere
Parallel Computing in Society and why it works in computers!

HTML version of Basic Foils prepared 25 August 1997

Foil 3 What is Computational Science ?

From Master Set of Foils for 1997 Session of CPS615 Basic Simulation Track for Computational Science CPS615 -- Fall Semester 97. *
Full HTML Index
Practical use of leading edge computer science technologies to address "real" applications
CPS615/713: Simulation Track
CPS606/616/640/714: Information Track
Large Scale Parallel Computing Low latency Closely Coupled Components
World Wide Distributed Computing Loosely Coupled Components
Parallel Algorithms Fortran90
HPF MPI Interconnection Networks
Transactions
Security
Compression
PERL JavaScript Multimedia
Wide Area Networks
Java
VRML Collaboration Integration (middleware) CORBA Databases

HTML version of Basic Foils prepared 25 August 1997

Foil 4 Internetics is Another Way of Thinking

From Master Set of Foils for 1997 Session of CPS615 Basic Simulation Track for Computational Science CPS615 -- Fall Semester 97. *
Full HTML Index
These ideas come from Xiaoming Li, Peking University
CPS615

HTML version of Basic Foils prepared 25 August 1997

Foil 5 Basic CPS615 Contact Points

From Master Set of Foils for 1997 Session of CPS615 Basic Simulation Track for Computational Science CPS615 -- Fall Semester 97. *
Full HTML Index
Instructor: Geoffrey Fox -- gcf@npac.syr.edu, 3154432163 and Room 3-131 CST
Backup: Nancy McCracken -- njm@npac.syr.edu, 3154434687 and Room 3-234 CST
NPAC administrative support: Nora Downey-Easter -- nora@npac.syr.edu, 3154431722 and room 3-206 CST
The above can be reached at cps615ad@npac.syr.edu
Students will be on alias cps615@npac.syr.edu
Homepage is: http://www.npac.syr.edu/projects/cps615fall97
There is a slightly out of date paper "Basic Issues and Current Status of Parallel Computing" by Fox (SCCS736 on NPAC technical reports)

HTML version of Basic Foils prepared 25 August 1997

Foil 6 Course Organization

From Master Set of Foils for 1997 Session of CPS615 Basic Simulation Track for Computational Science CPS615 -- Fall Semester 97. *
Full HTML Index
Graded on the basis of approximately 7 Homework sets which will be due Thursday of the week following day (Tuesday or Thursday given out)
There will be two projects -- one will start after message passing (MPI) discussed, the other about 60% through the course
Total grade is 50% homework, 50% the sum of two projects
All homework will be handled through the web and indeed all computer access will be through the VPL or Virtual Programming Laboratory which gives access to compilers, Java visualization etc. through the web

HTML version of Basic Foils prepared 25 August 1997

Foil 7 Material Covered in this Course

From Master Set of Foils for 1997 Session of CPS615 Basic Simulation Track for Computational Science CPS615 -- Fall Semester 97. *
Full HTML Index
Status of High Performance Computing and Computation HPCC nationally
  • Including web linked simulation examples
What is Computational Science Nationally (and at Syracuse)
Technology driving forces (until around 2010)
  • Moore's law and exponentially increasing transistors
Elementary discussion of Parallel Computing in Society and why it must obviously work in simulation!
Sequential and Parallel Computer Architectures
Comparison of Architecture of World Wide Web and Parallel Systems

HTML version of Basic Foils prepared 25 August 1997

Foil 8 Structure of CPS615 - II

From Master Set of Foils for 1997 Session of CPS615 Basic Simulation Track for Computational Science CPS615 -- Fall Semester 97. *
Full HTML Index
Simple base Example -- Laplace's Equation
  • Illustrate parallel computing
  • Illustrate use of Web
Then we discuss software and algorithms (mini-applications) intermixed
Programming Models -- SPMD and Message Passing (MPI), Software Integration middleware (Java, WebFlow), Data Parallel (HPF, Fortran90)
Visualization -- Java Applets
Other tools: Collaboration , Performance Measurement

HTML version of Basic Foils prepared 25 August 1997

Foil 9 Motivating Applications

From Master Set of Foils for 1997 Session of CPS615 Basic Simulation Track for Computational Science CPS615 -- Fall Semester 97. *
Full HTML Index
Large Scale Simulations in Engineering
  • Model airflow around an aircraft
  • Study environmental issues -- flow of contaminants
  • Forecast weather
  • Oil Industry: Reservoir Simulation and analysis of Seismic data
Large Scale Academic Simulations (Physics, Chemistry, Biology)
  • Study of Evolution of Universe
  • Study of fundamental particles: Quarks and Gluons
  • Study of protein folding
  • Study of catalysts
"Other Critical Real World Applications"
  • Run optimization and classification algorithms in datamining of Enterprise Information Systems
  • Model Financial Instruments

HTML version of Basic Foils prepared 25 August 1997

Foil 10 Effect of Feature Size on Performance

From Master Set of Foils for 1997 Session of CPS615 Basic Simulation Track for Computational Science CPS615 -- Fall Semester 97. *
Full HTML Index
f is the size of basic lines on chip and is now 0.25nm and will decrease inexorably to 100 nm in 2005-2010 time period
  • 1992 Digital alpha chip was f= 750 nm. feature size and 1.7 x 106 transistors
  • 1995 Sun UltraSparc was f= 500 nm. and 5.2 X 106 transistors
Roughly number of Transistors is proportional to (1/f)2
Speed of chip is proportional to (1/f)
  • This is no longer true as "wires" get so thin that propagation speed decreases and so clock speed levels off around 1 gigaherz
Figure of Merit is (1/f)3
Die size is also increasing like (1/f) and this enhances effect!

HTML version of Basic Foils prepared 25 August 1997

Foil 11 Ames Summer 97 Workshop on Device Technology -- Moore's Law - I

From Master Set of Foils for 1997 Session of CPS615 Basic Simulation Track for Computational Science CPS615 -- Fall Semester 97. *
Full HTML Index
These foils come from summary by David Bailey
First of all, there seems to be general agreement that Moore's Law will not stop anytime soon.
The current state of the art in the semiconductor industry is 250 nm, as recently announced by Intel among others. Researchers are confident that current approaches can be used for at least another three generations with their current approach.
In the years ahead we may even see some manufacturers skip a generation, proceeding directly to significantly smaller feature sizes. This means that the 100 nm technology wall will be reached earlier than previously anticipated.

HTML version of Basic Foils prepared 25 August 1997

Foil 12 Ames Summer 97 Workshop on Device Technology -- Moore's Law - II

From Master Set of Foils for 1997 Session of CPS615 Basic Simulation Track for Computational Science CPS615 -- Fall Semester 97. *
Full HTML Index
Below about 100 nm feature sizes, progress will be more difficult, for various well known reasons, among them the lack of any substance that can be used as a lens for photolithography at such wavelengths.
Nonetheless, there are a number of "tricks" in the works, including the usage of diffraction gratings, parallel exposures, and others.
Groups working on these "ultimate" silicon techniques include Frances Houle's group at IBM Almaden and Jeff Baker's group at UC Berkeley.
Also helping will be some improvements such as better materials and increased wafer sizes. The consensus is that these techniques will be good for another two generations, to about 50 nm. Thus Moore's Law should continue until about 2010 or 2012.

HTML version of Basic Foils prepared 25 August 1997

Foil 13 Ames Summer 97 Workshop on Device Technology -- Alternate Technologies I

From Master Set of Foils for 1997 Session of CPS615 Basic Simulation Track for Computational Science CPS615 -- Fall Semester 97. *
Full HTML Index
Below about 50 nm feature sizes, it appears that a completely new device fabrication technology is needed.
Progress in electron beams and X-rays, which were the leading candidates of a few years ago, has stalled.
  • No one yet has any good idea how to scale e-beam lithography to economical production, and X-rays seem to destroy everything in their path, including the masks and devices. If something does eventually materialize along these lines, it won't be easy.
The researchers I spoke with nonetheless agree that future devices will be electronic for the foreseeable future.
As Stan Williams of HP points out, electrons are the ideal basis for device technology because they have basically zero size and mass, can travel at large fractions of the speed of light and interact strongly with matter.

HTML version of Basic Foils prepared 25 August 1997

Foil 14 Ames Summer 97 Workshop on Device Technology -- Alternate Technologies II

From Master Set of Foils for 1997 Session of CPS615 Basic Simulation Track for Computational Science CPS615 -- Fall Semester 97. *
Full HTML Index
In contrast, DNA computing and the like are still curiosities with no clear path to practical high-speed computing.
One exciting development, which has emerged just in the past year or two, is nanotubes, i.e. cylindrical buckyballs.
It was recently discovered that a nanotube can be "tuned" from insulator to semiconductor to conductor just by changing the pitch of the helical structure of carbon atoms.
Kinks introduced in a tube can be used to form a conductor-semiconductor junction.
  • Molecular modeling is being used to further explore some of the possibilities.

HTML version of Basic Foils prepared 25 August 1997

Foil 15 Ames Summer 97 Workshop on Device Technology -- New Logic Concepts

From Master Set of Foils for 1997 Session of CPS615 Basic Simulation Track for Computational Science CPS615 -- Fall Semester 97. *
Full HTML Index
More than one person with whom Bailey talked emphasized that future research should not focus exclusively on fabrication technology.
An equally important question is what is worth fabricating at these small feature sizes. This is because conventional transistor-based logic appears increasingly troublesome below about 50 nm.
Thus a new paradigm in logic design is as much needed as a new paradigm in fabrication technology.

HTML version of Basic Foils prepared 25 August 1997

Foil 16 Ames Summer 97 Workshop on Device Technology -- RSFQ

From Master Set of Foils for 1997 Session of CPS615 Basic Simulation Track for Computational Science CPS615 -- Fall Semester 97. *
Full HTML Index
Several at the meeting were aware of Likharev's rapid single flux quantum (RSFQ) superconducting logic. They agree that it is a very interesting development, and further that it is fairly certain to work well.
However, they feel that the requirement for liquid helium temperatures is a severe impediment to widespread utilization, and without commodity production it is unlikely to be economic.
  • See later for details and other difficulties!
According to Stan Williams, recently Likarev came to Hewlett Packard.
He offered HP all of his patents, only asking that they develop the technology.
But HP management still declined. Williams quipped that if some company decided to produce RSFQ technology, somehow assembling the $100 million or so required in start-up capital, the day after they produce their first devices it would pronounced a great success, but on the second day the company would fold financially.

HTML version of Basic Foils prepared 25 August 1997

Foil 17 Ames Summer 97 Workshop on Device Technology -- QCA -I

From Master Set of Foils for 1997 Session of CPS615 Basic Simulation Track for Computational Science CPS615 -- Fall Semester 97. *
Full HTML Index
One interesting possibility for a future logic system is "quantum dot cellular automata", abbreviated QCA.
Wolfgang Porod of the University of Notre Dame gave a talk on these devices at Ames workshop. His group has proposed configurations of cells, each of which is a collection of five electron sites arranged in an X, with two free electrons within the cell.
According to their studies, configurations of these cells can propagate signals, invert signals, form majority vote logic, and thus can be formed to produce practical devices.
The devices are semi-classical, semi-quantum, in that interactions within a cell can involve quantum tunneling, but interactions between cells involve only classical electromagnetic effects.

HTML version of Basic Foils prepared 25 August 1997

Foil 18 Ames Summer 97 Workshop on Device Technology -- QCA -II

From Master Set of Foils for 1997 Session of CPS615 Basic Simulation Track for Computational Science CPS615 -- Fall Semester 97. *
Full HTML Index
Like RSFQ, initial demonstration devices of QCA logic will need to be super-cooled, assuming fabrication feature sizes in the tens of nanometers.
But unlike RSFQ (from what we know of RSFQ), the smaller the fabrication dimension, the higher the tolerable temperature, according to thermodynamic analyses.
Indeed, if one could eventually fabricate QCA devices in the realm of 2-5 nanometers, then they could operate at room temperature!

HTML version of Basic Foils prepared 25 August 1997

Foil 19 Some Comments on Simulation and HPCC

From Master Set of Foils for 1997 Session of CPS615 Basic Simulation Track for Computational Science CPS615 -- Fall Semester 97. *
Full HTML Index
HPCC is a maturing field with many organizations installing large scale systems
These include NSF (academic computations), DoE (Dept of Energy) and DoD (Defense)
There are new applications with new algorithmic challenges
  • Our work on Black Holes has novel adaptive meshes
  • On earthquake simulation, new "fast multipole" approaches to a problem not tackled this way before
  • On financial modeling, new Monte Carlo methods for complex options
However these are not "qualitatively" new concepts
Software ideas are "sound" but note simulation is only a few percent of total computer market
  • Use where possible software from other sources (such as web)which can spend more money on each feature!

HTML version of Basic Foils prepared 25 August 1997

Foil 20 Parallel Computing Rationale

From Master Set of Foils for 1997 Session of CPS615 Basic Simulation Track for Computational Science CPS615 -- Fall Semester 97. *
Full HTML Index
Transistors are getting cheaper and cheaper and it only takes some 0.5 million transistors to make a very high quality CPU
  • Essentially impossible to increase clock speed and so must exploit increasing transistor density in figure of merit (1/f)2-4
Already we build chips with some factor of ten more transistors than this and this is used for "automatic" instruction level parallelism.
  • This corresponds to parallelism in "innermost loops"
However getting much more speedup than this requires use of "outer loop" or data parallelism.
Actually memory bandwidth is an essential problem in any computer as doing more computations per second requires accessing more memory cells per second!
  • Harder for sequential than parallel computers
  • Data locality is unifying concept!

HTML version of Basic Foils prepared 25 August 1997

Foil 21 Sequential Memory Structure

From Master Set of Foils for 1997 Session of CPS615 Basic Simulation Track for Computational Science CPS615 -- Fall Semester 97. *
Full HTML Index
Data locality implies CPU finds information it needs in cache which stores most recently accessed information
This means one reuses a given memory reference in many nearby computations e.g.
A1 = B*C
A2 = B*D + B*B
.... Reuses B
L3 Cache
Main
Memory
Disk
Increasing Memory
Capacity Decreasing
Memory Speed (factor of 100 difference between processor
and main memory
speed)

HTML version of Basic Foils prepared 25 August 1997

Foil 22 Parallel Computer Memory Structure

From Master Set of Foils for 1997 Session of CPS615 Basic Simulation Track for Computational Science CPS615 -- Fall Semester 97. *
Full HTML Index
For both parallel and sequential computers, cost is accessing remote memories with some form of "communication"
Data locality addresses in both cases
Differences are quantitative size of effect and what is done by user and what automatically

HTML version of Basic Foils prepared 25 August 1997

Foil 23 Back from the Analogy to Parallel Computers!

From Master Set of Foils for 1997 Session of CPS615 Basic Simulation Track for Computational Science CPS615 -- Fall Semester 97. *
Full HTML Index
Now we go through a set of semi-realistic examples of parallel computing and show they use various forms of data parallelism
Seismic Wave Simulation: Regular mesh of grid points
Cosmology (Universe) Simulation: Irregular collection of stars or galaxies
Computer Chess: "Data" is now parts of a game tree with major complication of pruning algorithms
  • "Deep Blue" uses this global parallelism combined with optimal locally parallel evaluation of "score" of a position using special hardware
Defending the Nation: Tracking multiple missiles achieves parallelism from set of missiles
Finite Element (NASTRAN) structures problem: Irregular collection of nodal (grid) points clustered near a crack

HTML version of Basic Foils prepared 25 August 1997

Foil 24 Introduction to Computer Architectures CPS615 Fall Semester 97

From Master Set of Foils for 1997 Session of CPS615 Basic Simulation Track for Computational Science CPS615 -- Fall Semester 97. *
Full HTML Index
http://www.npac.syr.edu/users/gcf/cps615arch97
Geoffrey Fox
Syracuse University NPAC
111 College Place Syracuse NY 13244 4100
3154432163

HTML version of Basic Foils prepared 25 August 1997

Foil 25 Abstract of CPS615 Architecture Discussion

From Master Set of Foils for 1997 Session of CPS615 Basic Simulation Track for Computational Science CPS615 -- Fall Semester 97. *
Full HTML Index
We describe the simple nature of parallel machines like the nCUBE2
Technologies include conventional, superconducting and Quantum systems but the former dominates
Sequential architectures are dominated by memory (cache) issues
Vector, SIMD MIMD are classic architectures
There is a growing interest in distributed metacomputers
Special purpose machines are typically not so successful
The details of networks is not as important as it used to be!

HTML version of Basic Foils prepared 25 August 1997

Foil 26 What is a Pipeline -- Cafeteria Analogy?

From Master Set of Foils for 1997 Session of CPS615 Basic Simulation Track for Computational Science CPS615 -- Fall Semester 97. *
Full HTML Index
Familiar from such everyday activities as getting food in cafeteria where one processes one person per "clock cycle" where
clock cycle here is maximum time anybody takes at a single "stage" where stage is here component of meal (salad, entrée etc.)
Note any one person takes about 5 clock cycles in this pipeline but the pipeline processes one person per clock cycle
Pipeline has problem if there is a "stall" -- here that one of the people wants an entrée which needs to be fetched from kitchen. This delays everybody!
In computer case, stall is caused typically by data not being ready for a particular instruction

HTML version of Basic Foils prepared 25 August 1997

Foil 27 Example of MIPS R4000 Floating Point

From Master Set of Foils for 1997 Session of CPS615 Basic Simulation Track for Computational Science CPS615 -- Fall Semester 97. *
Full HTML Index
Taken from David Patterson CS252 Berkeley Fall 1996
3 Functional Units FP Adder, FP Multiplier, FP Divider
8 Kinds of Stages in FP Units
Stage Functional Unit Description
A FP Adder Mantissa ADD stage
D FP Divider Divide Pipeline stage
E FP Multiplier Exception Test stage
M FP Multiplier First stage of multiplier
N FP Multiplier Second stage of multiplier
R FP Adder Rounding stage
S FP Adder Operand Shift stage
U Unpack Floating point numbers

HTML version of Basic Foils prepared 25 August 1997

Foil 28 MIPS R4000 Floating Point Stages

From Master Set of Foils for 1997 Session of CPS615 Basic Simulation Track for Computational Science CPS615 -- Fall Semester 97. *
Full HTML Index
Several different pipelines with different lengths!
Add,Subtract- 4 clocks:U S+A A+R R+S
Multiply - 8 clocks: U E+M M M M N+A R
Divide- 38 clocks: U A R D(28) D+A D+R D+R D+A D+R A R
Square Root- 110 clocks: U E (A+R)(108) A R
Negate- 2 clocks: U S
Absolute Value- 2 clocks: U S
Floating Point Compare- 3 clocks:U A R

HTML version of Basic Foils prepared 25 August 1997

Foil 29 Cache Coherent or Not?

From Master Set of Foils for 1997 Session of CPS615 Basic Simulation Track for Computational Science CPS615 -- Fall Semester 97. *
Full HTML Index
Suppose two processors cache the same variable stored in memory of one of the processors
One must ensure cache coherence so that when one cache value changes, all do!
Interconnection Network
....
....
Interconnection Network
Cached Value of same shared variable

HTML version of Basic Foils prepared 25 August 1997

Foil 30 Choices in Cache Coherence

From Master Set of Foils for 1997 Session of CPS615 Basic Simulation Track for Computational Science CPS615 -- Fall Semester 97. *
Full HTML Index
Machines like the SGI Origin 2000 have a distributed shared memory with a so called directory implementation (pioneered in DASH project at Stanford) of cache coherence
Machines like SGI Cray T3E are distributed memory but do have fast get and put so as to be able to access single variables stored on remote memories
  • This gives performance advantage of shared memory without the programming advantage but complex hardware of Origin 2000
Origin 2000 approach does not scale as well as Cray T3E and large Origin 2000 systems must use message passing to link 128 node coherent memory subsystems
Cray T3E offers a uniform (but not as good for small number of nodes) interface
Message passing / distributed memory is natural web model

© Northeast Parallel Architectures Center, Syracuse University, npac@npac.syr.edu

If you have any comments about this server, send e-mail to webmaster@npac.syr.edu.

Page produced by wwwfoil on Mon Sep 22 1997