CPS615 -- Base Course for the Simulation Track of Computational Science
Fall Semester 1996 -- Introduction to Driving Technology and HPCC Current Status and Futures
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100

Abstract of The Current Status and Futures of HPCC
Overview of Course Itself! -- and then introductory material on basic curricula
Overview of National Program -- The Grand Challenges
Overview of Technology Trends leading to petaflop performance in year 2007 (hopefully)
Overview of Syracuse and National programs in computational science
Parallel Computing in Society
Why Parallel Computing works
Simple Overview of Computer Architectures SIMD MIMD Distributed (shared memory) Systems ... PIM ... Quantum Computing
General Discussion of Message Passing and Data Parallel Programming Paradigms and a comparison of languages

Basic Course CPS615 Contact Points
Instructor: Geoffrey Fox gcf@npac.syr.edu 3154432163 Room 3-131 CST
Backup: Nancy McCracken njm@npac.syr.edu 3154434687 Room 3-234 CST
NPAC Administrative support: Nora Downey-Easter nora@npac.syr.edu 3154431722 Room 3-206 CST
CPS615 Powers that be above can be reached at cps615ad@npac.syr.edu
CPS615 Students can be reached by mailing cps615@npac.syr.edu
Homepage will be: http://www.npac.syr.edu/projects/cps615fall96
See my paper SCCS 736 as an overview of HPCC status

Course Organization
Graded on the basis of approximately 8 homeworks, each due the Thursday of the week following the day it is given out (Tuesday or Thursday)
Plus one modest-sized project at the end of class -- must involve "real" running parallel code!
No finals or written exams
All material will be placed on the World Wide Web (WWW)
Preference given to work returned on the Web

Basic Structure of Complete CPS615 Base Course on Computational Science Simulation Track -- I
Overview of National Scene -- Why is High Performance Computing Important
Grand Challenges
What is Computational Science -- The Program at Syracuse
Basic Technology Situation -- Increasing density of transistors on a chip
Trends to year 2007 using Moore's Law (see UVC Video)
Elementary Discussion of Parallel Computing including use in society
Why does parallel computing always "work" in principle
Computer Architecture -- Parallel and Sequential
Network Interconnections, SIMD v. MIMD, Distributed Shared Memory, vectorization contrasted with parallelism
Basic Structure of Complete CPS615 Base Course on Computational Science Simulation Track -- II
Simple base example -- Laplace's Equation
How does parallel computing work
This is followed by two sections -- software technologies and applications -- which are interspersed with each other and with "algorithm" modules
Programming Models -- Message Passing and Data Parallel Computing -- MPI and HPF (Fortran 90)
Some remarks on parallel compilers
Remarks on use of parallel Java
Some real applications analysed in detail: Chemistry, CFD, Earthquake prediction, Statistical Physics

Basic Structure of Complete CPS615 Base Course on Computational Science Simulation Track -- III
This introduction is followed by a set of "vignettes" discussing problem classes which illustrate parallel programming and parallel algorithms
Ordinary Differential Equations
N body Problem by both the O(N^2) and "fast multipole" O(N) methods
Numerical Integration including adaptive methods
Floating Point Arithmetic
Monte Carlo Methods including Random Numbers
Full Matrix Algebra as in Computational Electromagnetism and Computational Chemistry
Partial Differential Equations implemented as sparse matrix problems (as in Computational Fluid Dynamics)
Iterative Algorithms from Gauss Seidel to Conjugate Gradient
Finite Element Methods

Three Major Markets -- Logic, ASIC, DRAM
Overall Roadmap Technology Characteristics from SIA (Semiconductor Industry Association) Report 1994
L=Logic, D=DRAM, A=ASIC, mP = microprocessor
Chip and Package Characteristics
Overall Roadmap Technology Characteristics from SIA (Semiconductor Industry Association) Report 1994
Fabrication Characteristics
Overall Roadmap Technology Characteristics from SIA (Semiconductor Industry Association) Report 1994
Electrical Design and Test Metrics
Overall Roadmap Technology Characteristics from SIA (Semiconductor Industry Association) Report 1994

Technologies for High Performance Computers
We can choose technology and architecture separately in designing our high performance system
Technology is like choosing ants, people or tanks as the basic units in our society analogy -- or, less frivolously, neurons or brains
In the HPCC arena, we can distinguish current technologies:
COTS (Commercial off-the-shelf) Microprocessors
Custom node computer architectures
More generally these are all CMOS technologies
Near term technology choices include Gallium Arsenide or Superconducting materials as opposed to Silicon
These are faster by a factor of 2 (GaAs) to 300 (Superconducting)
Further term technology choices include DNA (Chemical) or Quantum technologies
It will cost $40 Billion for the next industry investment in CMOS plants, and this huge investment makes it hard for new technologies to "break in"
Architectures for High Performance Computers - I
Architecture is equivalent to organization or design in the society analogy
Different models for society (Capitalism etc.) or different types of groupings in a given society
Businesses or Armies are more precisely controlled/organized than a crowd at the State Fair
We will generalize this to formal (army) and informal (crowds) organizations
We can distinguish formal and informal parallel computers
Informal parallel computers are typically "metacomputers", i.e. a bunch of computers sitting on a department network

Architectures for High Performance Computers - II
Metacomputers are a very important trend which uses similar software and algorithms to conventional "MPP's" but typically has less optimized parameters
In particular network latency is higher and bandwidth is lower for an informal HPC
Latency is the time for a zero length communication -- the start up time
Formal high performance computers are the classic (basic) object of study and are "closely coupled" specially designed collections of compute nodes which have (in principle) been carefully optimized and balanced in the areas of:
Processor (computer) nodes
Communication (internal) Network
Linkage of Memory and Processors
I/O (external network) capabilities
Overall Control or Synchronization Structure

There is no Best Machine!
In society, we see a rich set of technologies and architectures:
Ant Hills
Brains as a bunch of neurons
Cities as an informal bunch of people
Armies as formal collections of people
With several different communication mechanisms with different trade-offs:
One can walk -- low latency, low bandwidth
Go by car -- high latency (especially if you can't park), reasonable bandwidth
Go by air -- higher latency and bandwidth than car
Phone -- high speed at long distance but can only communicate modest material (low capacity)

Quantum Computing - I
Quantum-Mechanical Computers by Seth Lloyd, Scientific American, Oct 95
Chapter 6 of The Feynman Lectures on Computation edited by Tony Hey and Robin Allen, Addison-Wesley, 1996
Quantum Computing: Dream or Nightmare? Haroche and Raimond, Physics Today, August 96, page 51
Basically any physical system can "compute", as one "just" needs a system that gives answers that depend on inputs, and all physical systems have this property
Thus one can build "superconducting", "DNA" or "Quantum" computers exploiting respectively superconducting, molecular or quantum mechanical rules

Quantum Computing - II
For a "new technology" computer to be useful, one needs to be able to conveniently prepare inputs, conveniently program, reliably produce answers (quicker than other techniques), and conveniently read out answers
Conventional computers are built around bit (taking values 0 or 1) manipulation
One can build arbitrarily complex arithmetic if one has some way of implementing NOT and AND
Quantum systems naturally represent bits:
A spin (of say an electron or proton) is either up or down
A hydrogen atom is either in the lowest or (first) excited state etc.

Quantum Computing - III
Interactions between quantum systems can cause "spin-flips" or state transitions and so implement arithmetic
Incident photons can "read" the state of a system and so give I/O capabilities
Quantum "bits" called qubits have another property, as one has not only state |0> and state |1> but also coherent states such as .7071*(|0> + |1>) which are equally in either state
Lloyd describes how such coherent states provide new types of computing capabilities
Natural random numbers, as measuring the state of a qubit gives the answer 0 or 1 randomly with equal probability
As Feynman suggests, qubit based computers are natural for large scale simulation of quantum physical systems -- this is "just" analog computing
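To make the last point about measurement concrete, here is a small classical Fortran 90 sketch (purely illustrative and written for this summary -- it only mimics the statistics of measurement and is in no sense a quantum computer): "measuring" a qubit prepared in the coherent state .7071*(|0> + |1>) returns 0 or 1 with probability given by the squared amplitude, i.e. one half each.

  program measure_qubit
    ! Toy classical illustration: count outcomes of repeated "measurements"
    ! of a qubit prepared in the coherent state 0.7071*(|0> + |1>).
    implicit none
    integer, parameter :: ntrials = 100000
    real :: amplitude0, p0, r
    integer :: i, count0
    amplitude0 = 0.7071        ! amplitude of |0> in the coherent state
    p0 = amplitude0**2         ! probability of reading 0 is the squared amplitude
    count0 = 0
    do i = 1, ntrials
       call random_number(r)   ! uniform random number in [0,1)
       if (r < p0) count0 = count0 + 1
    end do
    print *, 'fraction of 0 outcomes =', real(count0)/real(ntrials)
  end program measure_qubit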
Superconducting Technology -- Past
Superconductors produce wonderful "wires" which transmit picosecond (10^-12 seconds) pulses at near the speed of light
Superconducting transmission is lower power and faster than diffusive electron transmission in CMOS
At about 0.35 micron chip feature size, CMOS transmission time changes from domination by transmission (distance) issues to resistive (diffusive) effects
Niobium used in constructing such superconducting circuits can be processed by fabrication techniques similar to CMOS
Josephson Junctions allow picosecond performance switches
BUT IBM (1969-1983) and Japan (MITI 1981-90) terminated major efforts in this area

Superconducting Technology -- Present
New ideas have resurrected this concept using the RSFQ -- Rapid Single Flux Quantum -- approach
This naturally gives a bit which is 0 or 1 (or in fact n units!)
This gives interesting circuits of similar structure to CMOS systems but with a clock speed of order 100-300 GHz -- a factor of 100 better than CMOS, which will asymptote at around 1 GHz (= one nanosecond cycle time)

Superconducting Technology -- Problems
At least two major problems:
The semiconductor industry will invest some $40B in CMOS "plants" and infrastructure
Currently perhaps $100M a year is going into the superconducting circuit area! How do we "bootstrap" a superconducting industry?
Cannot build memory to match CPU speed, and current designs have superconducting CPU's (with perhaps 256 Kbytes of superconducting memory per processor) but conventional CMOS memory
So compared with current computers one has a thousand times faster CPU, a factor of four smaller cache running at CPU speed, and the same speed basic memory as now
Can such machines perform well -- do we need new algorithms? Can one design new superconducting memories?
Superconducting technology also has a bad "name" due to the IBM termination!
Architecture Classes of High Performance Computers
Sequential or von Neumann Architecture
Vector (Super)computers
Parallel Computers with various architectures classified by Flynn's methodology (this is incomplete as it only discusses control or synchronization structure): SISD, MISD, MIMD, SIMD
Metacomputers

von Neumann Architecture in a Nutshell
Instructions and data are stored in the same memory, for which there is a single link (the von Neumann Bottleneck) to the CPU which decodes and executes instructions
The CPU can have multiple functional units
Memory access can be enhanced by use of caches made from faster memory to allow greater bandwidth and lower latency

Illustration of Importance of Cache
Fig 1.14 of Aspects of Computational Science, Editor Aad van der Steen, published by NCF

Vector Supercomputers in a Nutshell - I
This design enhances performance by noting that many applications calculate "vector-like" operations
Such as c(i)=a(i)+b(i) for i=1...N and N quite large
This allows one to address two performance problems:
Latency in accessing memory (e.g. it could take 10-20 clock cycles between requesting a particular memory location and delivery of the result to the CPU)
A complex operation, e.g. a floating point operation, can take a few machine cycles to complete
They are typified by the Cray 1, XMP, YMP, C-90, CDC Cyber 205, ETA-10 and Japanese Supercomputers from NEC, Fujitsu and Hitachi

Vector Supercomputing in a picture
A pipeline for vector addition looks like:
From Aspects of Computational Science -- Editor Aad van der Steen, published by NCF

Vector Supercomputers in a Nutshell - II
Vector machines pipeline data through the CPU
They are not so popular/relevant as in the past as:
Improved CPU architecture needs fewer cycles than before for each (complex) operation (e.g. 4 now, not ~100 as in the past)
The 8 MHz 8087 of the Cosmic Cube took 160 to 400 clock cycles to do a full floating point operation in 1983
Applications need more flexible pipelines which allow different operations to be executed on consecutive operands as they stream through the CPU
Modern RISC processors (superscalar) can support such complex pipelines as they have far more logic than the CPU's of the past
In fact the excellence of, say, the Cray C-90 is due to its very good memory architecture allowing one to get enough operands to sustain the pipeline.
Most workstation class machines have "good" CPU's but can never get enough data from memory to sustain good performance except for a few cache intensive applications
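To make the vector discussion concrete, here is a minimal Fortran 90 sketch (array size and data are arbitrary) of the c(i)=a(i)+b(i) operation, written both as the explicit loop that a vector unit pipelines and in Fortran 90 array syntax -- the data parallel form that HPF builds on later in the course.

  program vector_add
    implicit none
    integer, parameter :: n = 100000
    real, dimension(n) :: a, b, c
    integer :: i
    call random_number(a)
    call random_number(b)
    ! Explicit loop form: a vector (or superscalar) pipeline streams the
    ! consecutive operands a(i), b(i) through the floating point adder.
    do i = 1, n
       c(i) = a(i) + b(i)
    end do
    ! Equivalent Fortran 90 array syntax: the data parallelism is explicit.
    c = a + b
    print *, 'c(1) =', c(1)
  end program vector_add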
Flynn's Classification of HPC Systems
Very High Speed Computing Systems, Proc. of the IEEE 54(12), 1901-1909 (1966) and Some Computer Organizations and their Effectiveness, IEEE Trans. on Computers C-21, 948-960 (1972) -- both papers by M.J. Flynn
SISD -- Single Instruction stream, Single Data Stream -- i.e. von Neumann Architecture
MISD -- Multiple Instruction stream, Single Data Stream -- Not interesting
SIMD -- Single Instruction stream, Multiple Data Stream
MIMD -- Multiple Instruction stream and Multiple Data Stream -- the dominant parallel system, with a ~one to ~one match of instruction and data streams.

Parallel Computer Architecture -- Memory Structure
Memory Structure of Parallel Machines: Distributed, Shared, Cached, and Heterogeneous mixtures
Shared (Global): There is a global memory space, accessible by all processors. Processors may also have some local memory. Algorithms may use global data structures efficiently. However "distributed memory" algorithms may still be important as memory is NUMA (Nonuniform access times)
Distributed (Local, Message-Passing): All memory is associated with processors. To retrieve information from another processor's memory a message must be sent there. Algorithms should use distributed data structures.

Comparison of Memory Access Strategies
Memory can be accessed directly (analogous to a phone call) as in the red lines below, or indirectly by message passing (green line below)
We show two processors in a MIMD machine for distributed (left) or shared (right) memory architectures

Types of Parallel Memory Architectures -- Physical Characteristics
Uniform: All processors take the same time to reach all memory locations.
Nonuniform (NUMA): Memory access is not uniform, so that it takes a different time for a given processor to get data from each memory bank. This is natural for distributed memory machines but also true in most modern shared memory machines
DASH (Hennessy at Stanford) is the best known example of such a virtual shared memory machine, which is logically shared but physically distributed. ALEWIFE from MIT is a similar project
TERA (from Burton Smith) is a uniform memory access and logically shared memory machine
Most NUMA machines these days have two memory access times: local memory (divided into registers, caches etc.) and nonlocal memory, with little or no difference in access time for different nonlocal memories
This simple two level memory access model gets more complicated in proposed 10 year out "petaflop" designs

Diagrams of Shared and Distributed Memories

Parallel Computer Architecture -- Control Structure
SIMD -- lockstep synchronization; each processor executes the same instruction stream
MIMD -- each processor executes independent instruction streams
MIMD synchronization can take several forms:
Simplest: program controlled message passing
"Flags" (barriers, semaphores) in memory -- a typical shared memory construct, as in the locks seen in Java Threads
Special hardware -- as in caches and their coherency (coordination between nodes)

Some Major Hardware Architectures - MIMD
MIMD Distributed Memory: this is now best illustrated by a collection of computers on a network (i.e. a metacomputer)
MIMD with logically shared memory but usually physically distributed. The latter is sometimes called distributed shared memory.
In the near future, ALL formal (closely coupled) MPP's will be distributed shared memory
Note all computers (e.g. the current MIMD distributed memory IBM SP2) allow any node to get at any memory, but this is done indirectly -- you send a message
In future "closely-coupled" machines, there will be built in hardware supporting the function that any node can directly address all memory of the system
This distributed shared memory architecture is currently of great interest to (and a major challenge for) parallel compilers
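A minimal sketch of this indirect access on a distributed memory machine, using the Fortran binding of MPI (covered later in the course); the value, message tag and node numbers are arbitrary. Node 1 cannot read node 0's memory directly -- it obtains the value by an explicit send and receive.

  program remote_read
    implicit none
    include 'mpif.h'
    integer :: rank, ierr, status(MPI_STATUS_SIZE)
    real :: x
    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    if (rank == 0) then
       x = 3.14                 ! this datum lives in node 0's memory
       call MPI_SEND(x, 1, MPI_REAL, 1, 99, MPI_COMM_WORLD, ierr)
    else if (rank == 1) then
       call MPI_RECV(x, 1, MPI_REAL, 0, 99, MPI_COMM_WORLD, status, ierr)
       print *, 'node 1 received', x, 'from node 0'
    end if
    call MPI_FINALIZE(ierr)
  end program remote_read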
MIMD Distributed Memory Architecture
A special case of this is a network of workstations (NOW's) or personal computers (a metacomputer)
Issues include:
Node -- CPU, Memory
Network -- Bandwidth, Memory
Hardware Enhanced Access to distributed Memory

Some Major Hardware Architectures - SIMD
SIMD -- Single Instruction Multiple Data -- can have logically distributed or shared memory
Examples are the CM-1,2 from Thinking Machines, and the AMT DAP and Maspar which are currently focussed entirely on accelerating parts of database indexing
This architecture is of decreasing interest as it has reduced functionality without a significant cost advantage compared to MIMD machines
The cost of synchronization in MIMD machines is not high!
The main interest of SIMD is flexible bit arithmetic, as the processors are "small"; but as transistor densities get higher this also becomes less interesting, as full function 64 bit CPU's use only a small fraction of the silicon of a modern computer

SIMD (Single Instruction Multiple Data) Architecture
CM2 -- 64K processors with 1 bit arithmetic -- hypercube network, broadcast network, can also combine, "global or" network
Maspar, DECmpp -- 16K processors with 4 bit (MP-1) or 32 bit (MP-2) arithmetic, fast two-dimensional mesh and slower general switch for communication

Some Major Hardware Architectures - Mixed
Also have heterogeneous compound architectures (metacomputers) gotten by arbitrary combinations of MIMD or SIMD, Sequential or Parallel machines.
Metacomputers can vary from full collections of several hundred PC's/Settop boxes on the (future) World Wide Web to a CRAY C-90 connected to a CRAY T3D
This is a critical future architecture which is intrinsically distributed memory, as multi-vendor heterogeneity implies that one cannot have special hardware enhanced shared memory
Note that this can be a MIMD collection of SIMD machines if one has a set of Maspar's on a network
One can think of the human brain as a SIMD machine, and then a group of people is such a MIMD collection of SIMD processors

Some MetaComputer Systems
Cluster of workstations or PC's
Heterogeneous MetaComputer System

Comments on Special Purpose Devices
One example is an Associative memory -- SIMD or MIMD or content addressable memories
This is an example of a special purpose "signal" processing machine which can in fact be built from "conventional" SIMD or "MIMD" architectures
This type of machine is not so popular as most applications are not dominated by computations for which good special purpose devices can be designed
If only 10% of a problem is say "track-finding" or some other special purpose processing, then who cares if you reduce that 10% by a factor of 100? You have only sped up the system by a factor of 1.1, not by 100!

The GRAPE N-Body Machine
N body problems (e.g. Newton's laws for one million stars in a globular cluster) can have successful special purpose devices
See the GRAPE (GRAvity PipE) machine (Sugimoto et al. Nature 345, page 90, 1990)
The essential reason is that such problems need much less memory per floating point unit than most problems:
Globular Cluster: 10^6 computations per datum stored
Finite Element Iteration: A few computations per datum stored
The rule of thumb is that one needs one gigabyte of memory per gigaflop of computation in general problems, and this general design puts most cost into memory, not into the CPU.
Note GRAPE uses EXACTLY the same parallel algorithm that one finds in the books (e.g. Solving Problems on Concurrent Processors) for N-body problems on classic distributed memory MIMD machines
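For reference, here is a sketch of the O(N^2) force kernel that GRAPE pipelines in hardware (Fortran 90; the gravitational constant is set to 1 and the softening parameter eps, which avoids division by zero, is illustrative). Every particle accumulates the acceleration due to every other particle, which is why the method needs so little memory per floating point operation.

  subroutine accelerations(n, x, y, z, mass, ax, ay, az)
    implicit none
    integer, intent(in) :: n
    real, intent(in)  :: x(n), y(n), z(n), mass(n)
    real, intent(out) :: ax(n), ay(n), az(n)
    real, parameter :: eps = 1.0e-4   ! softening to avoid 1/0 (illustrative)
    real :: dx, dy, dz, r2, f
    integer :: i, j
    ax = 0.0; ay = 0.0; az = 0.0
    do i = 1, n
       do j = 1, n
          if (j /= i) then
             dx = x(j) - x(i); dy = y(j) - y(i); dz = z(j) - z(i)
             r2 = dx*dx + dy*dy + dz*dz + eps*eps
             f  = mass(j) / (r2 * sqrt(r2))      ! G set to 1
             ax(i) = ax(i) + f*dx
             ay(i) = ay(i) + f*dy
             az(i) = az(i) + f*dz
          end if
       end do
    end do
  end subroutine accelerations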
Why isn't GRAPE a Perfect Solution?
GRAPE will execute the classic O(N^2) (parallel) N body algorithm BUT this is not the algorithm used in most such computations
Rather there is the O(N) or O(N log N) so called "fast multipole" algorithm which uses a hierarchical approach
On one million stars, fast multipole is a factor of 100-1000 faster than the GRAPE algorithm
Fast multipole works in most but not all N-body problems (in globular clusters, extreme heterogeneity makes the direct O(N^2) method most attractive)
So special purpose devices cannot usually take advantage of new nifty algorithms!

Granularity of Parallel Components - I
Coarse-grain: Task is broken into a handful of pieces, each executed by powerful processors. Pieces and processors may be heterogeneous. Computation/communication ratio very high -- Typical of Networked Metacomputing
Medium-grain: Tens to a few thousands of pieces, typically executed by microprocessors. Processors typically run the same code (SPMD Style). Computation/communication ratio often hundreds or more. Typical of MIMD Parallel Systems such as SP2, CM5, Paragon, T3D

Granularity of Parallel Components - II
Fine-grain: Thousands to perhaps millions of small pieces, executed by very small, simple processors (several per chip) or through pipelines. Processors often have instructions broadcast to them. Computation/communication ratio often near unity. Typical of SIMD but seen in a few MIMD systems such as Kogge's Execube, Dally's J Machine or the commercial Myrinet (Seitz)
This is going to be very important in future petaflop architectures as the dense chips of year 2003 onwards favor this Processor in Memory Architecture
So many "transistors" in future chips that the "small processors" of the "future" will be similar to today's high end microprocessors
As chips get denser, it is not realistic to put processors and memories on separate chips as granularities become too big
Note that a machine of given granularity can be used on algorithms of the same or finer granularity

Classes of Communication Networks
The last major architectural feature of a parallel machine is the network, or design of hardware/software connecting processors and memories together.
Bus: All processors (and memory) connected to a common bus or busses. Memory access fairly uniform, but not very scalable due to contention
Bus machines can be NUMA if memory consists of directly accessed local memory as well as memory banks accessed by the Bus. The Bus accessed memories can be local memories on other processors
Switching Network: Processors (and memory) connected to routing switches like in the telephone system. Switches might have queues and "combining logic", which improve functionality but increase latency. Switch settings may be determined by message headers or preset by a controller. Connections can be packet-switched (messages no longer than some fixed size) or circuit-switched (the connection remains as long as needed). Usually NUMA, blocking, often scalable and upgradable
Switch and Bus based Architectures (figures: Switch, Bus)

Examples of Interconnection Topologies
Two dimensional grid, Binary tree, complete interconnect and 4D Hypercube.
Communication (operating system) software ensures that the system appears fully connected even if the physical connections are partial

Useful Concepts in Communication Systems
Useful terms include:
Scalability: Can the network be extended to very large systems? Related to wire length (synchronization and driving problems) and degree (pinout)
Fault Tolerance: How easily can the system bypass a faulty processor, memory, switch, or link? How much of the system is lost by a fault?
Blocking: Some communication requests may not get through, due to conflicts caused by other requests.
Nonblocking: All communication requests succeed. Sometimes this just applies as long as no two requests are for the same memory cell or processor.
Latency (delay): Maximal time for a nonblocked request to be transmitted.
Bandwidth: Maximal total rate (MB/sec) of system communication, or subsystem-to-subsystem communication. Sometimes determined by cutsets, which cut all communication between subsystems. Often useful in providing lower bounds on the time needed for a task.
Worm Hole Routing: Intermediate switch nodes do not wait for the full message but allow it to pass through in small packets

Communication Performance of Some MPP's
From Aspects of Computational Science, Editor Aad van der Steen, published by NCF

System                  Communication Speed        Computation Speed
                        (Mbytes/sec per link)      (Mflops/sec per node)
IBM SP2                         40                        267
Intel iPSC860                    2.8                       60
Intel Paragon                  200                         75
Kendall Square KSR-1            17.1                       40
Meiko CS-2                     100                        200
Parsytec GC                     20                         25
TMC CM-5                        20                        128
Cray T3D                       150                        300

Implication of Hardware Performance
tcomm = 4 or 8 / (Speed in Mbytes/sec), as there are 4 or 8 bytes in a floating point word
tfloat = 1 / (Speed in Mflops/sec)
Thus tcomm / tfloat is just 4 X Computation Speed divided by Communication Speed
tcomm / tfloat is 26.7, 85, 1.5, 9.35, 8, 5, 25.6, 8 for the machines SP2, iPSC860, Paragon, KSR-1, Meiko CS-2, Parsytec GC, TMC CM-5, and Cray T3D respectively
Latency makes the situation worse for small messages, and tcomm/tfloat doubles for the 64 bit arithmetic natural on large problems!
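The tcomm/tfloat values quoted above are simple arithmetic; the following Fortran 90 fragment reproduces them from the table, assuming 32 bit (4 byte) floating point words.

  program comm_to_float_ratio
    implicit none
    real, dimension(8) :: comm_speed, comp_speed, ratio
    ! Mbytes/sec per link and Mflops/sec per node, in the order
    ! SP2, iPSC860, Paragon, KSR-1, Meiko CS-2, Parsytec GC, TMC CM-5, Cray T3D
    comm_speed = (/  40.0,  2.8, 200.0, 17.1, 100.0, 20.0,  20.0, 150.0 /)
    comp_speed = (/ 267.0, 60.0,  75.0, 40.0, 200.0, 25.0, 128.0, 300.0 /)
    ratio = 4.0 * comp_speed / comm_speed   ! tcomm/tfloat, elementwise
    print *, 'tcomm/tfloat:', ratio
  end program comm_to_float_ratio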
Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of September 5 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100
Abstract of Sept 5 1996 CPS615 Lecture
This starts by considering the analytic form for communication overhead and illustrates its stencil dependence in simple local cases -- stressing the relevance of grain size (a small numerical sketch of this analytic form is given after these lecture abstracts)
The implication for scaling and generalizing from the Laplace example is covered
We covered scaled speedup (fixed grain size) as well as fixed problem size
We noted some useful material was missing and this was continued in the next lecture (Sept 10, 96)
The lecture starts coverage of computer architecture, covering base technologies, with CMOS (covered in an earlier lecture) contrasted to Quantum and Superconducting technology

Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of September 10 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100
Abstract of Sept 10 1996 CPS615 Lecture
This starts by filling in details of communication overhead in parallel processing for the case where the "range" of interaction is large
We show two old examples from Caltech illustrating the correctness of the analytic form
We return to the discussion of computer architectures describing Vector Supercomputers
General relevance of data locality and pipelining
Flynn's classification (MIMD, SIMD etc.)
Memory Structures
Initial issues in the MIMD and SIMD discussion

Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of September 12 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100
Abstract of Sept 12 1996 CPS615 Lecture
This continues the computer architecture discussion with MIMD and SIMD with distributed shared memory
MetaComputers
Special Purpose Architectures
Granularity, with technological changes forcing larger grain sizes
Overview of Communication Networks with switches versus topologies versus buses
Typical values in today's machines

Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of September 24 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100
Abstract of Sept 24 1996 CPS615 Lecture
This continues the discussion of Fortran 90 with a set of overview remarks on each of the key new capabilities of this language
We also comment on the value of Fortran90/HPF in a world that will switch to Java
We digress to discuss a general theory of problem architectures, as this explains such tradeoffs as HPCC v Software Engineering and HPF versus MPI, and the types of applications each software model is designed to address
(Note server errors at the start which confuse the audio)

Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of September 26 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100
Abstract of Sept 26 1996 CPS615 Lecture
This quickly completes the discussion of problem architecture, but rather than continuing the qualitative discussion of HPF applications in the notes, it jumps to a discussion of the HPF language describing:
Basic approach to parallelism with the "owner-computes" rule
Types of new constructs, with TEMPLATE, ALIGN and PROCESSORS described
The lecture started with a description of the Web based Programming Laboratory developed by Kivanc Dincer
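A minimal sketch of the analytic overhead form referred to in the Sept 5 and Sept 10 abstracts, for a two dimensional stencil problem such as Laplace's equation. With grain size n (grid points stored per processor), each processor updates n points but communicates only the roughly 4*sqrt(n) edge points, so the fractional communication overhead falls like (tcomm/tfloat)/sqrt(n); the speedup on p processors is then roughly p/(1 + overhead). The constant in front depends on the stencil and on the floating point work per point, so the numbers below are illustrative only.

  program scaling_sketch
    implicit none
    real :: tcomm_over_tfloat, n, p, overhead, speedup
    tcomm_over_tfloat = 25.0    ! hardware ratio, e.g. from the table given earlier
    p = 256.0                   ! number of processors
    n = 10000.0                 ! grain size: grid points per processor
    overhead = 4.0 * tcomm_over_tfloat / sqrt(n)   ! illustrative stencil constant
    speedup  = p / (1.0 + overhead)
    print *, 'fractional overhead =', overhead, '  speedup =', speedup
  end program scaling_sketch

Note that at fixed grain size n the overhead is independent of p, which is the scaled speedup observation made in the Sept 5 lecture.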
Embarrassingly Parallel Problem Class
Loosely Synchronous, Synchronous or Asynchronous classify problems by their "control" or "synchronization" structure
However there is an important class of problems where this does not matter, as the synchronization overhead -- even if in the difficult asynchronous case -- is irrelevant
This is when the overhead is small or zero. These are "embarrassingly parallel problems" where each component of the decomposed problem is essentially independent
Examples are:
Financial Modelling where each component is calculating some expected value for a particular (possibly Monte Carlo) set of assumptions about the future
OLTP (Online Transaction Processing) where each component is a separate check of a credit card transaction against the account data
Graphics rendering where each component is the calculation of the color of a particular pixel.

CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- HPCC Software Technologies HPF and MPI
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100
Abstract of CPS615 HPCC Software Technologies
We go through the 2D Laplace's Equation with both HPF and MPI for Simple Jacobi Iteration
HPF and Fortran90 are reviewed, followed by MPI
We also discuss the structure of problems, as these determine why and when certain software approaches are appropriate

Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of October 1 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100
Abstract of Oct 1 1996 CPS615 Lecture
This continues the discussion of HPF in the area of distribution and ALIGN statements.
The discussion of ALIGN should be improved, as the audio makes dubious statements about "broadcasting" information.
The distribution discussion includes a reasonable description of block and cyclic distributions and when you should use them.
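As a concrete, deliberately minimal illustration of the HPF approach to the 2D Laplace example (grid size, processor arrangement, boundary values and iteration count below are arbitrary, and convergence testing is omitted): the Jacobi update is written in Fortran 90 array syntax, and the HPF directives assert a BLOCK distribution; the compiler applies the owner-computes rule and generates the edge communication needed by the 5-point stencil.

  program jacobi_laplace
    implicit none
    integer, parameter :: n = 256, niter = 100
    real, dimension(n,n) :: u, unew
  !HPF$ PROCESSORS procs(4,4)
  !HPF$ DISTRIBUTE u(BLOCK,BLOCK) ONTO procs
  !HPF$ ALIGN unew(i,j) WITH u(i,j)
    integer :: k
    u = 0.0
    u(1,:) = 1.0                ! a fixed boundary value on one edge (illustrative)
    do k = 1, niter
       ! each processor updates the interior points it owns (owner-computes)
       unew(2:n-1,2:n-1) = 0.25 * ( u(1:n-2,2:n-1) + u(3:n,2:n-1) &
                                  + u(2:n-1,1:n-2) + u(2:n-1,3:n) )
       u(2:n-1,2:n-1) = unew(2:n-1,2:n-1)
    end do
    print *, 'u(n/2,n/2) =', u(n/2,n/2)
  end program jacobi_laplace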
Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of October 3 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100

Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of October 10 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100

Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of October 15 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100

Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of October 22 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100

Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of October 24 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100

Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of October 31 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100

Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of November 7 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100

Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of November 8 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100

Abstract of Oct 24 1996 CPS615 Lecture
This covers two topics: Monte Carlo Integration for large scale problems, using experimental and theoretical high energy physics as an example
This includes accept-reject methods, uniform weighting and parallel algorithms
Then we complete the HPF discussion with the embarrassingly parallel DO INDEPENDENT discussed in the Monte Carlo case (a small sketch is given after these abstracts)
And HPF2 changes

Abstract of Oct 22 1996 CPS615 Lecture
This starts by finishing the simple overview of statistics
Covering Gaussian Random Numbers, Numerical Generation of Random Numbers both sequentially and in parallel
Then we describe the central limit theorem which underlies the Monte Carlo method
Then it returns to Numerical Integration with the first part of the discussion of Monte Carlo Integration

Abstract of Oct 15 1996 CPS615 Lecture
This finishes the last part of the N body and ODE discussion, focussing on the pipeline data parallel algorithm
Note several foils were changed after the presentation and so the discussion is a little disconnected from the foils at times
We start Numerical Integration with a basic discussion of Newton-Cotes formulae (including the Trapezoidal and Simpson's rules)
We illustrate them pictorially

Abstract of Nov 7 1996 CPS615 Lecture
This completes the MPI general discussion with the basic message passing, collective communication and some advanced features
It then returns to the Laplace Example foilset to show how MPI can be used here
We have previously used this for HPF and performance analysis

Abstract of Nov 8 1996 CPS615 Lecture
This starts the basic module on Partial Differential Solvers with:
Introduction to Classification of Equations
Basic Discretization
Derivation of Sparse Matrix Formulation
Analogies of Iteration with Artificial Time
Start of Explicit Matrix Formulation for Simple Cases
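A minimal sketch of the embarrassingly parallel Monte Carlo idea mentioned in the Oct 24 abstract; the example (estimating pi by counting random points falling inside the quarter circle) and all sizes are illustrative. Each sample is independent of every other, which is exactly what the HPF INDEPENDENT directive asserts; HPF 2.0 additionally provides a REDUCTION clause so the count could be accumulated directly inside the loop.

  program monte_carlo_pi
    implicit none
    integer, parameter :: nsamples = 100000
    real, dimension(nsamples) :: x, y
    integer, dimension(nsamples) :: inside
    integer :: i, hits
    call random_number(x)
    call random_number(y)
  !HPF$ INDEPENDENT
    do i = 1, nsamples          ! every iteration touches only its own element
       if (x(i)*x(i) + y(i)*y(i) <= 1.0) then
          inside(i) = 1
       else
          inside(i) = 0
       end if
    end do
    hits = sum(inside)          ! a reduction, done here outside the loop
    print *, 'pi estimate =', 4.0 * real(hits) / real(nsamples)
  end program monte_carlo_pi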
Abstract of Oct 3 1996 CPS615 Lecture
Tom Haupt on Fortran 90 / HPF

Abstract of Oct 10 1996 CPS615 Lecture
This discusses the solution of systems of ordinary differential equations (ODE) in the context of N squared particle dynamics problems
We start with motivation, with a brief discussion of relevant problems
We go through the basic theory of ODE's including Euler, Runge-Kutta, Predictor-Corrector and Multistep Methods
We begin the discussion of solving N body problems using the classic Connection Machine elegant but inefficient algorithm
Note -- Some foils were expanded to two after the talk and the second one is given without audio in the script

Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of November 14 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100

Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of November 26 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100

Delivered Lectures for CPS615 -- Base Course for the Simulation Track of Computational Science Fall Semester 1996 -- Lecture of December 5 - 1996
Geoffrey Fox NPAC Room 3-131 CST 111 College Place Syracuse NY 13244-4100

Abstract of Nov 14 1996 CPS615 Lecture
This started with a description of the current Web set-up of CPS615 and other foilsets
Then we started the foilset describing Physical Simulations and the various approaches -- Continuum Physics, Monte Carlo, Quantum Dynamics, and Computational Fluid Dynamics
For CFD we do enough to discuss why viscosity and high Reynolds numbers are critical in air and similar media
We discuss the computation and communication needs of CFD compared to the Laplace equation

Abstract of Nov 26 1996 CPS615 Lecture
This covers essentially all of the finite element method and its solution using the conjugate gradient method
Only the simple 2D Laplace equation using triangular nodes is discussed
We stress the variational method as an optimization method and use this analogy to motivate conjugate gradient as an improved steepest descent approach
We discuss parallel computing issues for both finite element and conjugate gradient

Abstract of Dec 5 1996 CPS615 Lecture
This lecture covers two distinct areas.
Firstly a short discussion of Linear Programming -- what type of problems it's used for, what the equations look like, and basic issues in the difficult use of parallel processing
Then we give an abbreviated discussion of Full Matrix algorithms covering:
The types of applications that use them
Matrix Multiplication including Cannon's algorithm in detail
Use of MPI primitives including communicator groups
Performance Analysis
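As a pointer into the full matrix discussion of the Dec 5 abstract, here is the sequential matrix multiplication kernel in Fortran 90 (sizes arbitrary, result checked against the MATMUL intrinsic). Cannon's algorithm performs the same arithmetic on a two dimensional mesh of processors: each processor holds one block of A, B and C, and the A and B blocks are circulated around the mesh so that every processor eventually sees the blocks it needs -- only the data movement changes, not the kernel below.

  program matmul_sketch
    implicit none
    integer, parameter :: n = 200
    real, dimension(n,n) :: a, b, c
    integer :: i, j, k
    call random_number(a)
    call random_number(b)
    c = 0.0
    do j = 1, n                 ! loop order chosen for Fortran's column-major storage
       do k = 1, n
          do i = 1, n
             c(i,j) = c(i,j) + a(i,k) * b(k,j)
          end do
       end do
    end do
    print *, 'max difference from MATMUL =', maxval(abs(c - matmul(a,b)))
  end program matmul_sketch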