Fox Presentation Spring 96
A Short Overview of HPCC
From Gigaflops to Petaflops; from Tightly Coupled MPPs to the World Wide Web
http://www.npac.syr.edu/users/gcf/cornellhpcc96/index.html
CRPC MRA ReTooling Project, Cornell, May 7, 1996
Geoffrey Fox, NPAC, Syracuse University, 111 College Place, Syracuse NY 13244-4100

Abstract of a Short Overview of HPCC
See SCCS-736 for an overview of parallel computing and SCCS-750 for distributed computing.
We discuss current and near-future architectures as well as the quite different trends expected ten years from now.
The COTS philosophy (see SCCS-758 and SCCS-732 for Web software) dominates both hardware and software, as success demands that niche applications leverage bigger fields.
Latency tolerance will be an essential feature of future algorithms and software.
Data parallelism is essential for success on large machines, but current compilers are struggling.
Coordination or integration software is thriving.

Some Important Trends -- COTS is King!
Everybody now believes in COTS -- commercial off-the-shelf technology -- one must use commercial building blocks for any specialized system, whether it be a DoD weapons program or a high-end supercomputer. These are both niche markets!
COTS for hardware can be applied to a greater or lesser extent.
Gordon Bell's SNAP system says we will only have ATM networks of PCs running Windows NT.
SGI, HP and IBM will take commodity processor nodes but link them with custom switches (with different versions of distributed shared memory support).
COTS for software is less common, but I expect it to become much more common.
HPF producing HTTP rather than MPI, with Java visualization, is an example of software COTS.

Comments on COTS for Hardware
Currently MPPs have COTS processors and specialized networks, but this could reverse.
Pervasive ATM will indeed lead to COTS networks. BUT:
Current microprocessors are roughly near optimal in terms of megaflops per unit area of silicon. BUT:
As explicit parallelism is shunned by modern microprocessors, silicon is used for wasteful speculative execution, with the expectation that future systems will move to 8-way functional parallelism.
Thus one estimates that 250,000 transistors (excluding on-chip cache) is optimal for performance per square mm of silicon; a modern microprocessor is around ten times this size.
Again, simplicity is optimal, but this requires parallelism.
A contrary trend is that memory dominates the use of silicon, and so performance per square mm of silicon is often not relevant.

Supercomputer Architectures in Years 2005-2010 -- I
Conventional (Distributed Shared Memory) Silicon
Clock speed 1 GHz.
4 eight-way parallel complex RISC nodes per chip.
4000 processing chips give over 100 tera(fl)ops (4 x 8 x 4000 at 1 GHz is about 128 Teraflops).
8000 2-Gigabyte DRAM chips give 16 Terabytes of memory.
Note memory per flop is much less than one to one.
Natural scaling says time steps decrease at the same rate as spatial intervals, and so the memory needed goes like (FLOPS in Gigaflops)**0.75 if one Gigaflop requires one Gigabyte of memory (or is it one Teraflop that needs one Terabyte?).
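Spelling out this scaling argument (a sketch of the standard reasoning, assuming an explicit three-dimensional time-stepped simulation run for a fixed wall-clock time):

```latex
% Sketch of the natural-scaling argument for memory versus flops.
% Assume a 3D explicit code with N grid points in each dimension.
\[
\text{Memory} \propto N^{3}, \qquad
\Delta t \propto \Delta x \;\Rightarrow\; \text{time steps} \propto N, \qquad
\text{Flops} \propto N^{3}\times N = N^{4}.
\]
\[
\Rightarrow\; \text{Memory} \propto \text{Flops}^{3/4}.
\]
% With the normalization that 1 Gigaflop needs 1 Gigabyte:
\[
1\ \text{Teraflop} \;\rightarrow\; (10^{3})^{3/4}\ \text{Gigabytes} \approx 180\ \text{Gigabytes},
\]
% which is why memory per flop on these designs is well below one to one.
```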
Supercomputer Architectures in Years 2005-2010 -- II
Superconducting technology is promising, but can it compete with the silicon juggernaut?
One should be able to build a simple 200 GHz superconducting CPU with modest superconducting caches (around 32 Kilobytes).
It must use the same DRAM technology as a silicon CPU?
So there is a tremendous challenge to build latency-tolerant algorithms (as there is over a factor of 100 difference between CPU and memory speed), but the advantage is that a factor of 30-100 less parallelism is needed.

Supercomputer Architectures in Years 2005-2010 -- III
Processor-in-Memory (PIM) architecture is the follow-on to the J-Machine (MIT), Execube (IBM -- Peter Kogge) and Mosaic (Seitz).
It is more interesting in 2007, as the processors become "real" and have a nontrivial amount of memory.
One naturally fetches a complete row (or column) of memory at each access -- perhaps 1024 bits.
In the year 2007 one could take each two-Gigabyte memory chip and alternatively build it as a mosaic of one Gigabyte of memory plus 1000 simple 250,000-transistor CPUs, each running at 1 Gigaflop and each with one Megabyte of on-chip memory.
12000 such chips (the same amount of silicon as in the first design, but perhaps more power) give 12 Terabytes of memory and 12 Petaflops of performance.
This design "extrapolates" specialized DSPs, the GRAPE (a specialized teraflop N-body machine), etc. to a "somewhat specialized" system with a general CPU but a special memory-poor architecture with a particular 2/3D layout.

Comparison of Supercomputer Architectures
Fixing 10-20 Terabytes of memory, we can get:
A 16000-way parallel natural evolution of today's machines, with various architectures from distributed shared memory to a clustered hierarchy; peak performance is 150 Teraflops, with memory systems like today's but worse, with more levels of cache.
A 5000-way parallel superconducting system with 1 Petaflop performance but a terrible imbalance between CPU and memory speeds.
A 12-million-way parallel PIM system with 12 Petaflops performance and a "distributed memory architecture", as off-chip access will have serious penalties.
There are many hybrid and intermediate choices -- these are extreme examples of "pure" architectures.

Algorithm and Software Challenges -- The Latency Agenda!
Current tightly coupled MPPs offer more or less uniform access to off-processor memory, albeit with serious degradation relative to local memory.
Future systems will return to the situation of the 80's, where both data locality and nearest-neighbor access will be essential for good performance.
There is a tremendous reward for latency-tolerant algorithms and for software support for them.
Note we need exactly the same characteristics in metacomputing today.
The World Wide Web is good practice for tomorrow's Petaflop machine!
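As an illustration of the latency agenda, here is a minimal, hypothetical Java sketch (not any particular NPAC code) of a latency-tolerant pattern: issue the slow "remote" boundary fetch first, overlap it with useful local computation, and synchronize only when the remote data is actually needed. The fetchBoundary method and its 50 ms delay are invented stand-ins for a high-latency network or memory access.

```java
// A latency-tolerance sketch: overlap a slow "remote" boundary fetch with
// local interior computation, synchronizing only when the data is needed.
public class LatencyHiding {

    // Stand-in for a high-latency access (network hop, deep memory hierarchy).
    static double[] fetchBoundary() throws InterruptedException {
        Thread.sleep(50);                     // pretend latency
        return new double[] { 1.0, 1.0 };     // two ghost values
    }

    public static void main(String[] args) throws Exception {
        double[] interior = new double[1_000_000];
        final double[][] boundary = new double[1][];

        // Issue the high-latency fetch asynchronously ...
        Thread fetcher = new Thread(() -> {
            try {
                boundary[0] = fetchBoundary();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        fetcher.start();

        // ... and do useful local work while the fetch is in flight.
        for (int i = 1; i < interior.length - 1; i++) {
            interior[i] = 0.5 * (interior[i] + 1.0);
        }

        fetcher.join();                       // wait only when the data is needed
        interior[0] = boundary[0][0];
        interior[interior.length - 1] = boundary[0][1];
        System.out.println("Interior update overlapped with boundary fetch.");
    }
}
```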
Returning to Today - I
Tightly coupled MPPs (SP2, Paragon, CM5, etc.) were distributed memory, but at least at the low end they are becoming hardware-assisted shared memory.
It is unclear how well compilers will support this in a scaling fashion -- we will see how the SGI/Cray systems based on ideas pioneered at Stanford fare!
Note this is an example of COTS at work -- SGI/Sun/.. symmetric multiprocessors (the Power Challenge from SGI) are attractive, as the bus will support up to 16 processors in an elegant shared-memory software world.
Previously such systems were not powerful enough to be interesting.
Clustering such SGI Power Challenge-like systems produces a powerful but difficult-to-program (as it is both distributed and shared memory) heterogeneous system.
Meanwhile Tera Computer will offer true uniform-memory-access shared memory, using ingenious multithreaded software/hardware to hide latency.
It is unclear if this is competitive in cost/performance with the (scruffy) COTS approach.

Returning to Today - II
Trend I -- Hardware-assisted tightly coupled shared-memory MPPs are replacing pure distributed-memory systems.
Trend II -- The World Wide Web and the increasing power of individual workstations are making geographically distributed heterogeneous distributed-memory systems more attractive.
Trend III -- To confuse the issue, the technology trends of the next ten years suggest yet different architecture(s), such as PIM.
Better use scalable, portable software with such conflicting technology/architecture trends, BUT one must address the latency agenda, which isn't clearly portable!

Software Issues/Choices - I
Functional parallelism is supported by the sequential compiler intra-chip in modern superscalar processors, at very small grain size.
Multithreading, as in Java, supports object or small-grain-size functional parallelism within a Java applet -- this can be implemented with message passing, not shared memory.
Data parallelism is the source of most scaling parallelism and is needed for more than a few-way speedup -- HPF is emerging slowly, as hardware/technology/politics (grand versus national challenges) change faster than one can write good complete compilers.
Unfortunately most languages find it hard to express data parallelism in a natural, automatic way, even though it is "obvious in the problem".
One must use domain decomposition with target-architecture-dependent "objects", and so it is inevitably harder than pure object-based parallelism, where the objects come from the problem!

The Sad Story of HPF and Some Applications
The good news -- in 1993, NPAC was selected to partner with two Grand Challenge groups as part of their "computer science" support; both applications were rather suitable for HPF, being regular grids.
The bad news -- both application groups have abandoned HPF, as they couldn't wait for working compilers with appropriate features. But BOTH have switched to using Fortran90!!
Numerical relativity needed an adaptive mesh, and this is supplied by DAGH from Texas, which is like HPF but simpler in overall capability, though it can handle adaptive meshes. One needs to wait for new features in DAGH to handle, say, conjugate gradient, which is automatic in HPF, and one needs to parallelize sophisticated data types (tensors) in Fortran90.
In data assimilation, a NASA Goddard application, they are recoding in Fortran90 plus MPI, as they MUST meet specific performance goals within a year.

Software Issues/Choices - II
MetaProblems exhibit large-grain-size functional parallelism, as in multidisciplinary optimization.
Coordination or integration software addresses this. See SCCS-757 (available from the online NPAC reports), which discusses applications and software approaches for all types of parallelism.
The Java applet model is a natural approach, which can be illustrated by the new VRML 2.0. This will build on the success of SIMNET/DSI -- distributed simulation for DoD -- with a totally distributed set of applets invoked by handlers (interrupts = active messages), loosely integrated over the World Wide Web: a worldwide object parallelism. For example, pressing a door knob invokes a Java applet, as does a missile hitting a tank.
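To make the coordination idea concrete, here is a minimal, hypothetical Java sketch of integration software for a metaproblem: two large-grain "discipline" modules run concurrently, a handler fires as each result arrives (loosely in the spirit of the applet-plus-handler model above), and the coordinator combines the results. The module names, numbers and objective function are invented for illustration and do not correspond to any actual NPAC system.

```java
import java.util.concurrent.*;

// Coordination sketch for a metaproblem (e.g. multidisciplinary optimization):
// independent large-grain modules run in parallel and are loosely integrated
// through completion handlers.
public class MetaProblemCoordinator {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // Two independent discipline modules (stand-ins for real solvers).
        CompletableFuture<Double> aero = CompletableFuture.supplyAsync(
                () -> 0.42 /* drag coefficient from a flow solver */, pool);
        CompletableFuture<Double> structures = CompletableFuture.supplyAsync(
                () -> 1.70 /* peak stress from a structures solver */, pool);

        // Handlers invoked as each module completes -- event-driven coupling.
        aero.thenAccept(d -> System.out.println("aero module done, drag = " + d));
        structures.thenAccept(s -> System.out.println("structures module done, stress = " + s));

        // The coordinator combines both results to drive an outer optimizer.
        double objective = aero.thenCombine(structures, (d, s) -> d + 0.1 * s).get();
        System.out.println("combined objective = " + objective);

        pool.shutdown();
    }
}
```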
Software Issues/Choices - III
Architectures such as PIM emphasize the opportunities available if we could develop software models/compilers which could effectively use substantial parallelism.
Current party-line microprocessors assume that substantial parallelism cannot be easily extracted from programs designed for single chips.
One area we are exploring is how to extract parallelism from, or build it into, Java; we call this the HPJava project (a thread-based sketch of this style of data parallelism is given at the end of this section).
A major new direction in computer science is "Problem Solving Environments", which are domain-specific systems at a "higher level" than compilers -- a "toolkit" of components such as:
Linear algebra solvers
Elliptic or hyperbolic PDE solvers
Particular (adaptive) data structures, such as those for multigrid
Visualization, etc.

A Summary of Application Working Group Activities at PAWS'96, including Point Group and Software Group Interactions
http://www.npac.syr.edu/users/gcf/petasoft96/index.html
Petasoft Meeting, Bodega Bay, June 17, 1996
Geoffrey Fox, NPAC, Syracuse University, 111 College Place, Syracuse NY 13244-4100

Application Participants Included
Geoffrey Fox -- Syracuse University
David Bailey -- NASA Ames
Sid Karin -- UCSD
Jean-Noel Leboeuf -- Oak Ridge National Laboratory
Peter Lyster -- University of Maryland
Andrea Malagoli -- University of Chicago
Michael McGlaun -- Sandia National Laboratory
John Salmon -- California Institute of Technology
Rich Zippel -- Cornell

Peak Supercomputer Performance
For "Conventional" MPP/Distributed Shared Memory Architectures:
Now (1996), peak is 0.1 to 0.2 Teraflops in production centers.
Note both SGI and IBM are changing architectures: IBM from distributed memory to distributed shared memory; SGI from shared memory to distributed shared memory.
In 1999, one will see production 1 Teraflop systems.
In 2003, one will see production 10 Teraflop systems.
In 2007, one will see production 50-100 Teraflop systems.
Memory is roughly 0.25 to 1 Terabyte per 1 Teraflop.
If you are lucky/work hard, realized performance is 30% of peak.
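Returning to the data-parallelism and HPJava discussion above (as promised there), here is a minimal, hypothetical Java sketch of the block domain decomposition that an HPF-style compiler automates: each thread owns a contiguous block of an array and applies the same local operation to its block. The class, array and update are invented for illustration and are not the actual HPJava design.

```java
// Data parallelism by block domain decomposition: each worker thread owns a
// contiguous block of the array and applies the same operation to it.
public class BlockDecomposition {
    public static void main(String[] args) throws InterruptedException {
        final int n = 1_000_000;
        final int nproc = 4;                      // number of "processors"
        final double[] u = new double[n];
        Thread[] workers = new Thread[nproc];

        for (int p = 0; p < nproc; p++) {
            final int lo = p * n / nproc;         // owned block is [lo, hi)
            final int hi = (p + 1) * n / nproc;
            workers[p] = new Thread(() -> {
                for (int i = lo; i < hi; i++) {
                    u[i] = Math.sin(i * 1.0e-6);  // same operation on every owned element
                }
            });
            workers[p].start();
        }
        for (Thread t : workers) {
            t.join();                             // barrier: wait for all blocks
        }
        System.out.println("u[n/2] = " + u[n / 2]);
    }
}
```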