Computing on the Web
New Approaches to Parallel Processing
Petaop and Exaop Performance
in the Year 2007

January 1, 1997

Geoffrey C. Fox
gcf@npac.syr.edu

Wojtek Furmanski
furm@npac.syr.edu

http://www.npac.syr.edu

Northeast Parallel Architectures Center
111 College Place
Syracuse University
Syracuse, New York 13244-4100
Phone: (315)443-2163 Fax: (315)443-4741

Abstract

We look to the year 2007, when central parallel processors will have Petaop (10^15 operations per second) performance, the deployed national information infrastructure or Web could have a distributed Exaop (10^18 operations per second) capability, and kids will come to university fluent in Java and intolerant of earlier approaches. These trends will offer new opportunities for high-performance simulations with more attractive programming environments.

I. Introduction

We focus on the year 2007 because by then we should have realized many of the dreams of the National Information Infrastructure (NII) and have widely deployed digital multimedia (video) information systems. We will assume (define) the software environment to be the (evolution of current) Web technology and assume (define) the national pervasive ATM/ISDN/ADSL/Satellite/... communication network to be the (evolution of the) internet. The year 2007 is also a convenient date for parallel processing as it is the target date for the current Federal study of the future of the field (http://www.aero.hq.nasa.gov/hpcc/petaflops/peta.html, http://www.npac.syr.edu/users/gcf/hpcc96petaflops). By that time, several interesting new architectures could be competitive with mainstream approaches and generally one can expect performance in the range of 0.1 to 10 Petaops for parallel supercomputers.

Recently, researchers at Sandia National Laboratory announced (http://www.intel.com/pressroom/archive/releases/cn121796.htm) sustained performances of over one Teraop (10^12 operations per second) on a machine of the Department of Energy's ASCI (Accelerated Strategic Computing Initiative) (http://www.llnl.gov/asci/) program. This machine effectively involved linking together some 10,000 Intel PC C.P.U.'s, each capable of over 100 Megaflops performance. Similarly, a fast ethernet network of 16 PC's achieved over one Gigaflops on a complex production astrophysics code ([Taubes:96], http://www.cacr.caltech.edu/research/beowulf/). These events highlight the possible role of commodity CPU's and networks, and hence the Web or Internet, as a computational resource. Today, the Web networks together many more computers than those available in the Sandia resource. Perhaps there are some 100,000 hosts and many millions of client computers on or potentially on the Web. Eventually, we might expect digital set top boxes to be widely deployed, giving worldwide hundreds of millions of PC class devices hooked to a common information network.

In short, the Web offers a loosely connected collection of tens of millions of computers; a central parallel supercomputer (MPP or massively parallel processor) offers a custom network of high-performance C.P.U.'s with a total power that is some 0.1% of that on the Web. Of course, the custom network of the MPP has substantially higher bandwidth and lower latency than any current or future projected worldwide network. This is true even if the MPP is built around commodity PC C.P.U.'s and ATM networks. In this article, we will contrast the Web and MPP as approaches to computation and see how they can be advanced synergistically, with different problems suited for the two approaches to large scale computation (http://www.npac.syr.edu/users/gcf/sc96tutorial, http://www.npac.syr.edu/users/gcf/hpcc96web, http://www.npac.syr.edu/users/gcf/javaforcsefall96).

II. Parallel and Distributed Computing

Currently, the Web has many concurrent processes, but they are largely independent, with different clients sharing a network to access a shared server. The largest parallel computer has many orders of magnitude fewer processes than the Web at any one time - 10,000 for the Sandia machine versus many millions for the Web. What is special about the parallel computer? The answer is, of course, that the processes on a parallel system are closely coordinated to solve a single large problem, and this coordination is technically very hard. In this article, we wish to consider the role of Web hardware and software in addressing such coordinated computing. There is, of course, no clean dividing line between ``coordinated'' and ``uncoordinated'' (or ``typical'' Web) computing. Indeed, if we have a Web-based intranet with a bunch of employees using Web clients to access a central database, then the employees are in some sense coordinated to maximize the success of their company, and indeed the central (or distributed) database would actually use some sophisticated concurrency control and a parallel lock manager to implement this coordination. Alternatively, when a Wall Street brokerage runs a Monte Carlo simulation of your portfolio on their parallel computer, most of the work is independent trials, with (roughly) the only needed coordination being the formation of global averages and the use of common input data.

In discussing computing on the Web, we need to consider separately the hardware (systems), the software, and the application areas. Above, we showed that the Web is today solving sophisticated ``somewhat coordinated'' problems, while large scale parallel computers are sometimes used to implement ``largely uncoordinated'' (technically called embarrassingly parallel) algorithms, such as Monte Carlo simulation. We will discuss classes of applications and their suitability for different hardware and software approaches later in this article.

Turning to hardware contrasts, the full World Wide Web undoubtedly has substantially lower bandwidth and higher latency than a customized parallel computer. Latency is an inevitable consequence of the speed of light on a geographically distributed network. However, an intranet built of ATM or fast ethernet connected PCs or workstations can have performance very competitive with a dedicated parallel machine ([Taubes:96], http://www.cacr.caltech.edu/research/beowulf/, http://cesdis.gsfc.nasa.gov/linux-web/beowulf/beowulf.html, http://www.epcc.ed.ac.uk/epcc-tec/documents/techwatch-harnessing/). Some of the performance deficiencies of the Web today simply reflect the youth of the field. The Web will naturally increase in performance as higher quality, customized software and hardware are produced.

Finally, software is also ambiguous, but in a slightly different way. As we will explain, one can make a very strong case to build much of the systems software for parallel computing in terms of ``Web Technology'' [Fox:95a,95d,96a]. Web servers can replace the daemons often used to control the nodes of a parallel computer; Java is an excellent language for simulation and could challenge Fortran and C++ in this area; Java, VRML, and Web clients can be used to build excellent user interfaces, including visualization. Here, we wish to leverage the wonderful software produced for a large (the largest) market - the Web - to build infrastructure for a relatively small field, parallel computing. The latter can then concentrate on areas where its special expertise has high value. A good example is a parallel compiler for so called ``distributed shared memory,'' which is designed to support a convenient programming environment for the sophisticated coordination allowed by the single shared memory model. Implementing such a compiler in a Web environment brings new capability to this world, and is an example of how the special expertise and needs of a small field add value.

Thus, in studying ``computing with the Web,'' we are merging the concepts of parallel and distributed computing; of specialized and generic network interconnects; of intranets and internets. The underlying theme is the coordinated solution of a single problem on a set of distinct but networked compute resources. As exemplified above, we will find hardware designs, software systems, and application areas with particular customizations and optimal linkages. However, there is great potential in merging ideas. For instance, the Web can learn about coordination (synchronization) and nifty parallel algorithms to marshal the power of its resources to solve major problems. The parallel computing community can inherit an excellent base of distributed computing software infrastructure and use a richer variety of computing platforms.

III. Hardware Prospects

Today, the largest supercomputers (which always involve some sort of parallelism) can deliver approximately 10^12 operations per second - corresponding to the efficient cooperation of some 10,000 high performance personal computers or workstations.

We will follow the so called Federal Petaflops study (http://www.aero.hq.nasa.gov/hpcc/petaflops/peta.html) and look ahead 10 years using the Semiconductor Industry Handbook. A high performance CPU chip in the year 2007 can be expected to have a peak performance of around 32 Giga(fl)ops, achieved with a basic computational unit running at a clock speed of 1 GHz combined with 32-way intrachip parallelism. Computer memory is expected to be built around 2 Gigabyte memory parts (http://www.npac.syr.edu/users/gcf/petastuff/petasoftwp). In Table 1, we summarize possible parallel processors either built conventionally, or from exotic technology such as superconducting material, or from novel architectures such as ``processor in memory,'' where CPU and memory cells are interleaved on the same chip. We contrast these with two Web based parallel systems - a corporate intranet, and the full Web obtained with a PC (or digital set top box) in every home.


Table 1: Supercomputer Architectures in the Year 2007
Central Parallel Processors (Giga = 10^9, Tera = 10^12, Peta = 10^15, Exa = 10^18)


A: Conventional Distributed Shared Memory Silicon Architecture

  1. Clock Speed, 1 GHz
  2. Four eight-way parallel complex C.P.U.'s per processor chip, giving a peak 32 Gigaops performance per chip
  3. 8,000 processing chips giving 0.32 Petaops peak performance
  4. 32,000 2 Gigabyte memory chips, giving 64 Terabytes of memory

B: Processor in Memory (PIM) Design

  1. Consider an array of 64,000 previous generation (1/2 Gigabyte) memory chips
  2. Divide memory real estate equally between C.P.U.'s and memory
  3. Total system has 16 Petaops performance and 16 Terabytes of memory

C: Superconducting Design

  1. 200 GHz superconducting C.P.U. with negligible cache and simple architecture, giving 200 Gigaops performance (this is probably conservative)
  2. Conventional memory subsystem
  3. 10,000 superconducting C.P.U.'s and the same memory as option A, giving 2 Petaops performance and 64 Terabytes of memory

D: Distributed Web Computer Large Corporate IntraNet

  1. 2,000 high performance personal computers
  2. each with 4 chip C.P.U.'s and each chip as in option A
  3. Each PC has 16 two-Gigabyte memory chips
  4. Peak performance 0.32 Petaops and 64 Terabytes of memory - all identical to machine in option A

E: World Wide Web

  1. 200 million digital set top boxes
  2. Each with a single 1/2 Gigabyte memory chip and an 8 Gigaops processor
  3. Peak performance 1,600 Petaops and 100 Petabytes of memory

In the year 2007, a large corporate intranet or a straightforwardly extrapolated supercomputer will have a few hundred Teraops performance. Novel architectures for the supercomputer could lead to an order of magnitude higher performance, while the total Web is at least three orders of magnitude greater in potential performance. Of course, the factor of 1,000 between supercomputer and Web performance represents, to some extent, society's investment. A supercomputer costs between $20M and $100M; the Web represents more or less the total computer/entertainment industry, with an annual dollar volume some thousand times greater.

For a parallel computer to be effective, the individual processors must communicate with each other. The appropriate way of doing this is a much studied area of computer architecture, but we can characterize it in a simple way to decide on the utility of our Web computers. First, note that the needed communication bandwidth is highly application dependent. For the embarrassingly parallel applications discussed earlier it is very low; for a parallel computational fluid dynamics problem, it is much larger. Consider the latter, where we estimate that the needed communication time per word is related to the typical calculation time by

\[
  t_{\mathrm{comm}} \;\approx\; 50 \; t_{\mathrm{calc}} \, \sqrt{\frac{\mathrm{Memory\ per\ node}}{64\ \mathrm{Megabytes}}}
\]
Note, a word is 64 bits and 64 Megabytes is used as a typical memory size today.

The ``50'' is a rule of thumb for the typical number of operations per communicated word, while the memory ratio in the square root captures the fact that as computer memory sizes increase, the needed parallel processing communication bandwidth decreases. This is intuitively reasonable, as communication between entities is an ``edge'' effect; the square root comes from the natural two-dimensional edge over area dependence - we take 2D to reflect the natural geometry of a geographically distributed computer (see the detailed discussion in Chapter 3 of http://www.npac.syr.edu/copywrite/pcw/).

For class D) in Table 1, the corporate IntraNet Web computer, we find a needed bandwidth per PC of about 4 Gigabits/sec. This seems quite realistic for a higher-performance ATM (~OC96) interconnect on that time scale.

The reduced memory and processor power of the assumed Web node in Table 1E) leads to the same needed communication estimate. As this communication performance is much higher than the 1-20 Megabits/sec needed for delivery of digital video, it is not clear that such capacity will be deployed by a hard-nosed commercial industry, although again an ATM connection would suffice.
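
As a rough check of these numbers, the short Java sketch below evaluates the rule of thumb reconstructed above for two Web nodes: a single 32 Gigaops processing chip with an 8 Gigabyte share of memory (our reading of option D) and an 8 Gigaops set top box with 1/2 Gigabyte of memory (option E). The factor of 50 and the 64 Megabyte reference memory come from the discussion above; the per-node parameters are our interpretation of Table 1, so the exact figures should be taken as illustrative.

```java
// Back-of-the-envelope check of the communication rule of thumb above.
// Assumptions (not from the text): each node is characterized by its peak
// ops/sec and its memory; the factor of 50 and the 64 Megabyte reference
// memory are taken from the surrounding discussion.
public class BandwidthEstimate {

    // Allowed communication time per 64-bit word, in seconds.
    static double commTimePerWord(double opsPerSec, double memoryBytes) {
        double calcTime = 1.0 / opsPerSec;                        // time per operation
        double memoryRatio = memoryBytes / (64.0 * 1024 * 1024);  // node memory / 64 MB
        return 50.0 * calcTime * Math.sqrt(memoryRatio);
    }

    static double neededBitsPerSec(double opsPerSec, double memoryBytes) {
        return 64.0 / commTimePerWord(opsPerSec, memoryBytes);    // 64-bit words
    }

    public static void main(String[] args) {
        // Option D, per processing chip: 32 Gigaops, 8 Gigabytes (32 GB / 4 chips)
        double d = neededBitsPerSec(32e9, 8.0 * 1024 * 1024 * 1024);
        // Option E, per set top box: 8 Gigaops, 1/2 Gigabyte
        double e = neededBitsPerSec(8e9, 0.5 * 1024 * 1024 * 1024);
        System.out.println("Option D node needs about " + d / 1e9 + " Gigabits/sec");
        System.out.println("Option E node needs about " + e / 1e9 + " Gigabits/sec");
        // Both come out near 4 Gigabits/sec, consistent with the text.
    }
}
```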

Another interesting characteristic of a parallel machine is the cross section bandwidth, or total communication performance obtained by cutting the machine in two. We can estimate what cross country communication bandwidth the Web needs in order to be a ``real parallel computer''. Using a simple geometrical argument, we get the bandwidth across the heartland as:

\[
  B_{\mathrm{cross}} \;\approx\; \sqrt{2\times 10^{8}\ \mathrm{nodes}} \;\times\; 4\ \mathrm{Gigabits/sec\ per\ node} \;\approx\; 50\ \mathrm{Terabits/sec}
\]
which corresponds to about 10,000 OC96 links across the country.

In summary, Petaop class performance in the year 2007 on ``coordinated'' applications is quite realistic on either intranets or specialized parallel machines. The full Web offers much higher total CPU performance, but would be used for less tightly coupled problems, as the network performance will be insufficient for some applications.

IV. Concurrency in Applications

In understanding the role of the web in large-scale simulations, it is useful to classify the various forms of concurrency in problems into four types ([Fox:96b]).

  1. Data Parallelism
    This is illustrated by natural parallelism over the particles in a molecular dynamics computation; over the grid points in a partial differential equation; over the random points in a Monte Carlo algorithm. In the Web computation of the factors of RSA130 (http://www.npac.syr.edu/factoring), we can consider the parallelism over possible trials in the Sieve algorithm as the ``data'' for data parallelism in this application. Data parallelism tends to be ``massive'' because computations are typically time consuming due to a large amount of data. Thus, data parallelism is parallelism over what is ``large'' in the problem. It is not difficult to find data parallel problems today with parallelism measured in the millions (e.g., a 100 x 100 x 100 grid) and by the year 2007, billion-way data parallelism can be expected.

  2. Functional Parallelism
    Here we are thinking of typical thread parallelism, such as the overlap of computation (say, decompressing an image) and communication (fetching HTML from a server). More generally, problems typically support overlap of I/O (disk, visualization) with computation. We also, of course, can have multiple compute tasks executing concurrently. This form of parallelism is present in most problems; the units are modest grain size (larger than a few instructions scheduled by a compiler, smaller than an application), and typically not massively parallel. Further, such functional parallelism is typically implemented using a shared memory and, indeed, its existence in most problems makes few way parallel shared memory multiprocessors very attractive.

  3. Object Parallelism
    We could mean many things by this, but we have in mind the type of problems solved by discrete event simulators. These are illustrated by military simulations where the objects are ``vehicles,'' ``weapons,'' or ``humans in the loop.'' The well-known SIMNET or DSI (Distributed Simulation Internet) have already illustrated the relevance of distributed (Internet) technology for this problem class (http://www.dmso.mil/). Object descriptions are similar to data parallelism except that the fundamental units of parallelism, namely objects, are quite large, corresponding to a macroscopic description of an application. Thus, a military battle is described in terms of the units of force (tanks, soldiers) with phenomenological interactions, rather than the (unrealistic in this case) fundamental description in terms of atomic particles or finite element nodes. For a typical ``data parallel'' problem, the fundamental units of parallelism (grid points) are typically smaller.

  4. Metaproblems
    This is another form of functional concurrency, but now with large-grain size components. In image processing, one often sets up an analysis system where the pixels are processed by a set of separate filters - each with a different convolution or image understanding algorithm. Software systems such as AVS (http://www.avs.com/) and Khoros (http://www.khoros.unm.edu/) are well-known tools to support such linked modules. So a metaproblem is a set of linked problems (databases, computer programs) where each unit is essentially a complete problem itself. Dataflow (a graph specifying how problems accept data from previous steps and produce data for further processing) is a successful paradigm for metaproblems. In manufacturing, one often sees metaproblems: building a complex system such as an aircraft requires linking airflow, controls, manufacturing process, acoustic, pricing, and structural analysis simulations. It has been estimated that designing a complete aircraft could require some 10,000 separate programs - some complicated ones, such as airflow simulation, were mentioned above, but there are also simpler but critical expert systems to locate inspection ports, and other life-cycle optimization issues ([Fox:95f,96a]).

    Metaproblems have concurrency that is typically quite modest. They differ from the examples in category 2 above in that the units have larger grain size and are more self contained. This translates into different appropriate computer architectures. Modest grain size functional parallelism (category 2) needs low latency and high bandwidth communication - typically implemented with a shared memory. Metaproblems are naturally implemented in a distributed (Web) environment - latency is often unimportant, while needed network bandwidths are more variable. A toy sketch of such a linked pipeline follows this list.
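
To make the dataflow picture concrete, here is a minimal Java sketch of a metaproblem as a chain of coarse grain modules, each accepting the output of the previous one. The Module interface and the module names are hypothetical stand-ins for large codes such as an airflow solver or a structural analysis; they only illustrate the linkage pattern, not any particular system such as AVS, Khoros, or WebFlow.

```java
// A toy metaproblem expressed as a dataflow chain of coarse grain modules.
// The Module interface and the two module classes are hypothetical stand-ins
// for large, self-contained codes linked in the AVS/Khoros/WebFlow spirit.
import java.util.Arrays;

interface Module {
    double[] process(double[] input);   // each unit is a complete problem in itself
}

class FlowSolver implements Module {     // stands in for a large simulation code
    public double[] process(double[] grid) {
        double[] out = grid.clone();
        for (int i = 1; i < out.length - 1; i++)
            out[i] = 0.5 * (grid[i - 1] + grid[i + 1]);   // toy relaxation step
        return out;
    }
}

class StressAnalyzer implements Module { // another self-contained component
    public double[] process(double[] field) {
        double[] out = new double[field.length];
        for (int i = 1; i < field.length; i++)
            out[i] = field[i] - field[i - 1];             // toy gradient
        return out;
    }
}

public class MetaproblemPipeline {
    public static void main(String[] args) {
        Module[] pipeline = { new FlowSolver(), new StressAnalyzer() };
        double[] data = { 0, 1, 4, 9, 16, 25 };
        for (int i = 0; i < pipeline.length; i++)
            data = pipeline[i].process(data);             // output feeds the next module
        System.out.println(Arrays.toString(data));
    }
}
```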

V. Overview of Web and Parallel Computing Software Issues

We are considering possible software models for the various parallel and distributed computers of Section III and the application classes of Section IV. As shown in Figure 1, we can view computing (like many other enterprises) in terms of a pyramid, with widely deployed cheap systems at the bottom of the pyramid and the few high-performance systems at the top. Our previous analysis shows that there is much more computing power in the distributed collection of consumer-oriented products - PCs, videogames, Personal Digital Assistants, Digital Set Top boxes, etc. This dominant dollar investment in consumer products implies that one can expect the bottom of the pyramid to have much better software than the top. Software investment must be roughly proportional to market size, and so we see PCs, workstations, and MPPs (Massively Parallel Processors) offering increasing unit software price and decreasing software quality and functionality. The Web perhaps now offers the best available software, as it is potentially the largest market. When the PC market dominated quality consumer software, it was hard for the parallel processing community to take advantage of it, since PCs offer, of course, a sequential computing model; Web software, however, now targets a very rich distributed computing model. It seems clear to us that we can, and indeed must, build MPP software with a backbone architecture of Web software. As discussed in Section II, we can then view parallel processing as a special case of a distributed model with stringent synchronization constraints. We view this as leading to a set of Compute Webs, which we describe in the following sections.


Figure 1. Integration of Large Scale Computing and Web Technologies

This approach has the added advantage that we can build Compute Webs either by running Web clients or servers with synchronization/compute enhanced Web software, or by using the latter software to provide a very attractive user environment on specialized MPPs whose low latency and high-bandwidth communication enable critical parallel computations.

In the following, we discuss the role of Web hardware and, especially, software for three distinct parts of computation, shown in Figure 2 below.


Figure 2. Three Aspects of Computation: User, Metaproblem Environment, Detailed Computational Kernels

  1. User (client) view - problem specification, visualization, computational steering, data analysis
  2. Metaproblem implemented on a distributed computer
  3. Individual computationally complex components of the metaproblem implemented on high-performance computers, which could in fact be distributed systems themselves (as in option D of Table 1).

We cover these three parts - graphical user interface, dataflow for metaproblems/software integration, and hardcore computation - in the next three sections.

VI. WebWindows and the User View

We abstract current high-performance computing environments into four layers, shown in Figure 3 and detailed below (http://www.npac.syr.edu/users/gcf/petastuff/petasoftwp).


Figure 3: Four Levels of a Scientific Computing Environment

Levels c) and d) include the computationally intense parts of the problem, which can be implemented on appropriate servers. However, levels a) and b), which we discuss in this section, are likely to be executed in the client machine/environment. We describe the current trend in software strategy ([Fox:95d,96c]) as a shift from software built in terms of PC Windows, Macintosh, or UNIX environments to a WebWindows basis, i.e., software built on the interfaces defined by Web servers and Web clients. As shown in Figure 4, this is, of course, a valid approach whether one is writing for a single stand-alone machine or for the entire worldwide network. In this sense, the use of Web technology for user interfaces is trivial - the user interface is not constrained greatly by the difficulties of high-performance computation, as it runs on the ``conventional'' client side and so can naturally use the best client side technologies. Figures 5-9 show examples of this.


Figure 4: WebWindows for one PC or the World

We expect this type of interface development to continue and become the norm. However, we see a particularly important role for Java (and VRML) in terms of level b) of Figure 3. Namely, Java seems an attractive language for building client side data analysis systems. These typically involve both computation and visualization - a linkage in which Java has unique capabilities. Thus, we expect a set of high quality Java applets (or compiled plug-ins) to be developed to support this analysis. Those applets will be used at level a) of Figure 3 by the general user, with the expert modifying the code of the applets (level b)) for customized capability. A good example of Java for scientific visualization is the work at Cornell (http://www.msc.cornell.edu/~houle/cracks) on an applet for teaching fracture mechanics. In Figure 10, we depict the resultant environment, which essentially becomes a Java wrapper for code written in traditional languages and running on sequential, parallel, or distributed computers. This use of Java is likely to grow rapidly, as it requires modest changes to existing software and adds great value without changing the familiar programming paradigm. However, we see it as a natural Web ``seed'' that can grow into the more pervasive use of Java described in Section VIII.


Figure 10: A Java Wrapper for Code Written in Traditional Languages
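
To indicate what such a wrapper might look like, here is a minimal Java sketch in which a native method stands for an existing Fortran or C simulation kernel. The library name (``simkernel'') and the method signature are hypothetical, and the binding to the legacy code (through Java's native method interface) is only suggested, not implemented.

```java
// Sketch of a Java wrapper around an existing simulation kernel written in
// Fortran or C. The native library name ("simkernel") and the method
// signature are hypothetical; in practice they would be bound to the legacy
// code through Java's native method interface.
public class SimulationWrapper {

    static {
        System.loadLibrary("simkernel");   // load the compiled legacy kernel (hypothetical)
    }

    // The computationally intensive step stays in the traditional language.
    private native void relaxationStep(double[] grid, int iterations);

    // Java-side driver: set up data, call the kernel, hand results to an
    // applet or visualization class.
    public double[] run(int size, int iterations) {
        double[] grid = new double[size];
        grid[0] = 1.0;                     // simple boundary condition
        relaxationStep(grid, iterations);  // heavy lifting in Fortran/C
        return grid;                       // ready for client side analysis/plotting
    }
}
```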

VII. WebFlow and Coarse Grain Software Integration

As we have discussed, it is very natural to use Web hardware and software to implement control of metaproblems (http://www.npac.syr.edu/projects/webbasedhpcc). Although in Section V we only described the dataflow model for this, one can, of course, use these ideas for any application with linked components whose relatively large chunks of computation dwarf the latency and bandwidth costs implied by using the Web as a compute engine. In fact, we can include our recently completed RSA130 factoring project (http://www.npac.syr.edu/users/gcf/crpcrsamay96) in this class. This distributed the sieving operations over a diverse range of clients (from an IBM SP2 at NPAC to a 386 laptop in England) under the control of a set of servers. It was implemented as a set of Web server CGI Perl scripts, FAFNER (Figure 11, from Jim Cowie of Cooperating Systems, at http://www.npac.syr.edu/projects/factoring/status.html). These created daemons to control the computation on each client, which returned results to the server that accumulated them for final processing to locate factors. Note that a particularly interesting later computation (a 155 decimal digit or 512 binary digit factorization) would require about a teraop-month of computation (10,000 Pentium Pro PCs running flat out for a month) and would be quite practical as a Web computing project. 512 binary digit numbers are used as the basis of the security of many banking systems, which perhaps fail to realize that modern computing can crack such codes.
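
The FAFNER scripts themselves were Perl CGI, but the underlying pattern - a client fetching a work assignment from a Web server, computing locally, and posting results back - is easy to sketch in Java. In the sketch below the URLs and the one line text ``protocol'' are invented for illustration; they are not the actual FAFNER interfaces.

```java
// Illustration only: the fetch-work / compute / return-results cycle used in
// Web computing projects such as the RSA130 factorization. The URLs and the
// one-line text "protocol" are invented for this sketch; the real FAFNER
// system used Perl CGI scripts on the server side.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class SieveWorker {
    public static void main(String[] args) throws Exception {
        URL taskUrl = new URL("http://example.org/cgi-bin/get-task");     // hypothetical
        URL resultUrl = new URL("http://example.org/cgi-bin/put-result"); // hypothetical

        // 1. Fetch a work assignment (e.g., a range of sieve trials).
        BufferedReader in = new BufferedReader(
                new InputStreamReader(taskUrl.openStream()));
        String task = in.readLine();
        in.close();

        // 2. Do the (here, trivial placeholder) computation locally.
        String result = "done:" + task;

        // 3. Post the result back to the coordinating server.
        HttpURLConnection conn = (HttpURLConnection) resultUrl.openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        OutputStream out = conn.getOutputStream();
        out.write(result.getBytes());
        out.close();
        System.out.println("Server replied: " + conn.getResponseCode());
    }
}
```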

We can extract two types of computing tasks from our factorization experience ([Fox:95a]). The first is the resource management problem - identifying computer resources on the Web, assigning them suitable work, releasing them to users when needed, and so on. A sophisticated Web system, ARMS, for this purpose is being developed by Lifka at the Cornell Theory Center (Figure 12). Well-known distributed computing systems in this area include LSF, DQS, Codine, and Condor (see the review in http://rhse.cs.rice.edu/NHSEreview/96-1.html), and this seems a very natural area for the use of Web systems, including linked databases to store job and machine parameters.

The second task is the actual synchronization of computation within a given problem - resource management, on the other hand, assigns problems to groups of machines and does not get involved with detailed parallel computing algorithm and synchronization issues. Here, we see two general concepts. One is support of the messaging between individual nodes that creates a virtual (parallel) machine out of the World Wide Web.


Figure 13: WebVM and WebFlow

This low level support is called WebVM in Figure 13 and must implement the functionality of parallel messaging systems, such as MPI, in terms of Web technology message systems - either HTTP or direct Java server-server (or server-client) connections. Here, perhaps the most elegant model is based on a mesh of Web servers (http://www.npac.syr.edu/projects/webspace/doc/hpdc5/hpdc5.html), although today's most powerful implementations would use, like FAFNER, a mesh of Web clients controlled by a few servers (http://www.cs.ucsb.edu/~schauser/papers/96-superweb.ps, http://www.javaworld.com/javaworld/jw-01-1997/jw-01-dampp.html). In the spirit of WebWindows (Figure 4), we can expect servers or server-equivalent capability to become available on all Web connected machines. Note that the natural Web model is server-server, and not server-client, and indeed this supports the traditional NII dream of democracy, with everybody capable of either publishing or consuming information.
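
To give a flavor of the raw mechanism such a layer would build on, here is a minimal Java sketch of a point-to-point message exchange between two ``nodes'' over a socket. It is not WebVM, and the port number and message format are chosen only for the example; it simply shows that the send/receive primitives an MPI-like layer needs map directly onto standard Java networking.

```java
// Minimal point-to-point message exchange between two Java "nodes" over a
// socket - a toy stand-in for the MPI-like send/receive that a WebVM layer
// would provide on top of HTTP or direct server-server connections.
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class ToyMessagePassing {
    public static void main(String[] args) throws Exception {
        final int port = 5555;                          // arbitrary port for the example
        final ServerSocket server = new ServerSocket(port);

        // "Node 0": accepts a connection and receives one message.
        Thread receiver = new Thread(new Runnable() {
            public void run() {
                try {
                    Socket s = server.accept();
                    DataInputStream in = new DataInputStream(s.getInputStream());
                    System.out.println("Node 0 received: " + in.readDouble());
                    s.close();
                    server.close();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
        receiver.start();

        // "Node 1": connects and sends one word of data.
        Socket s = new Socket("localhost", port);
        DataOutputStream out = new DataOutputStream(s.getOutputStream());
        out.writeDouble(3.14159);                       // the "message"
        out.flush();
        s.close();
        receiver.join();
    }
}
```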

On top of WebVM, one can build higher level systems, such as the distributed shared memory model described in Section II (called WebHPL in Figure 13), or, more easily, an explicit message passing system such as the dataflow model. WebFlow (Figures 13 and 14) supports a graphical user interface ([Fox:95a,95d], http://www.npac.syr.edu/projects/webspace/doc/webflow/dual96.html, http://www.npac.syr.edu/projects/webspace/doc/sc96/handout/handout.ps) for specifying metaproblem component linkage, and one can naturally design domain specific problem solving environments in this framework.


Figure 14: Server Structure for WebFlow


In the notation of Figure 3, one would support scripted ``little languages'' (designed for each application) at the top level a), which would allow for more flexible and dynamic metaproblem component linkage. Figure 14 shows a suggested Java server implementation using servlets, as supported by JavaSoft's Jeeves server.
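
As an indication of what a server-side compute module written in this style might look like, here is a minimal servlet sketch using the standard servlet classes; the ``computation'' is a trivial placeholder, and the parameter name is invented for the example rather than taken from any actual WebFlow module.

```java
// Minimal sketch of a server-side compute module written as a servlet, of the
// kind a Java server such as Jeeves could host. The "computation" is a
// trivial placeholder; a real WebFlow module would dispatch to simulation
// code and pass results on to the next module in the dataflow graph.
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class ComputeServlet extends HttpServlet {
    public void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        // Read a parameter describing the requested work unit.
        String task = request.getParameter("task");   // e.g., "task=integrate:0:1" (invented)

        // Placeholder computation standing in for a real module.
        double result = (task == null) ? 0.0 : task.length() * 0.5;

        response.setContentType("text/plain");
        PrintWriter out = response.getWriter();
        out.println("result=" + result);               // handed back to the caller
    }
}
```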
Now is, of course, a confusing time, for, as shown in Table 2, there are as many compute-web implementation strategies as there are major players in emerging Web technology - especially as we evolve from powerful but rather ad hoc server side CGI scripts to integrated dynamic Java client and server systems. Thus, now is not the time for ``final solutions,'' but rather for experimentation and flexibility, to examine and influence the key building blocks of future Web computers.

Table 2: WebVM Components: Implementation Options

(columns: Habanero | Jigsaw | Infospheres | JavaSoft | Netscape)

Module:         collaboratized applet | Resource | dapplet | Java Bean | LiveWire app, server
Port/Channel:   Java socket | any HTTP carrier | portlet | RMI | custom?
Message:        Marshalled Event or Action | Pickled Resource | any object bytestream | Serialized Object | JavaScript
Compute-Web:    Star topology | 2-node | any topology | - | -
Runtime:        Collaboratory server | Java HTTP server | dapplet manager? | Jeeves (Java server) | community or enterprise system
User Interface: AWT | Forms | Visual Authoring? | HotJava | Navigator
Coordination:   instantaneous broadcast | client-server | asynchronous multi-server | CORBA | multi-server
Persistency:    - | Resource Store | flat file? | JDBC | LiveWire -> DB
Publication:    - | - | - | javadoc | -

Note that the Web encourages new models for computation, with problems publishing their needs, Web compute engines advertising their capabilities, and dynamic matching of problems with compute resources.

An area related to this, and encompassed in a dynamic WebFlow model, is the potential for mobile code and a computational environment constructed by dynamic agents. Java has brought the mobility of executable code from the research labs to mainstream computing. At the moment, it is used mainly to move applets from servers to clients, but as we evolve from client-server towards server-server computing on the Web, we will soon witness a new family of Java agents that travel across the Web, spawn slaves, and scatter and gather to seek requested information and bring it back to the user. While industry will likely focus Web agent technologies on areas such as datamining, smart messaging, shopping, and banking, the computational science community will be able to build a new generation of smart computational environments on top of the emergent Web agent technologies.

Several years ago, in our early hypercube research, we developed a set of optimal algorithms for smart cooperative dataflow, relevant for a family of massively parallel computational problems. A similar but fuzzier version of this approach is naturally applicable to Web computing. Here, the base Internet protocol, or its higher level agent based smart messaging, can be viewed as a fuzzy version of our ``crystal router'' algorithm; we also developed collective dataflow strategies for computational problems such as the FFT and other hierarchical tree or network topologies. We are now planning a similar approach for mapping adaptive CFD and other PDE problems to the WebVM architecture. While deterministic synchronization is impossible in such environments, smart finite element nodes, patches, or domains will autonomously find their computational hosts, and the whole computational grid will continuously adapt to the varying available resources. Java mobility across a mesh of Java Web servers opens the door to this novel type of world-wide distributed computing, where individual host owners simply contribute some fraction of their CPUs to a computational community and smart adaptive WebFlow algorithms take care of the coordination and loose synchronization. Our RSA130 Factoring-By-Web was an early experiment of this type, but one can imagine a broad family of such problems continuously attacked by the computational Web community in areas such as environmental protection, weather forecasting, and economic growth simulation - to name just a few challenges where people will likely donate their cycles to enable new world-wide computation paradigms.

VIII. Java as the Language for Computational Science and Engineering

We recently held a workshop (http://www.npac.syr.edu/projects/javaforcse) on this theme at Syracuse. It covered generally the topics of the last two sections, where we saw Java as clearly attractive for user interfaces, wrappers, and metaproblem control. Here, we consider its possible role as the basic programming language for science and engineering - taking the role now played by Fortran 77, Fortran 90, and C++.

Java's most important advantage over other languages is that it will be learned and used by a broad group of users. Java is already being adopted in many entry level college programming courses and will surely be attractive for teaching in middle or high schools. Java is a very social language, as one naturally gets Web pages from one's introductory Java exercises that can be shared with one's peers. We have found this to be a helpful feature for introductory courses. Of course, the Web is the only real exposure to computers for many children, and the only languages they are typically exposed to are Java, JavaScript, and Perl. We find it difficult to believe that entering college students, fresh from their Java classes, will find it easy to accept Fortran, which will appear quite primitive in contrast. C++, as a more complicated systems building language, may well be a natural progression, but although quite heavily used, C++ has limitations as a language for simulation. In particular, it is hard for C++ to achieve good performance on sequential, let alone parallel, code, and we expect Java not to have these problems.

In fact, let us now discuss performance, which is a key issue for Java. As already shown in Figure 3, we suggest a multilevel scientific programming environment that would use purely scripted, applet mode, and purely compiled environments, with different tradeoffs in usability and performance. As discussed at our workshop, there seems little reason why native Java compilers, as opposed to current portable JavaVM interpreters or just-in-time (JIT) compilers, cannot obtain performance comparable to C or Fortran compilers. A major difficulty is the rich exception framework allowed by Java, which could restrict compiler optimizations. Users would need to avoid complex exception handlers in performance critical portions of a code.

An important feature of Java is its lack of pointers; their absence, of course, allows much more optimization for both sequential and parallel codes. Optimistically, we can say that Java shares the object oriented features of C++ and the performance features of Fortran.

One interesting area is the expected performance of Java interpreters (using just-in-time techniques) and compilers on the Java bytecodes (Virtual Machine). Here, we find today perhaps a factor of 4-10 lower performance from a PC JIT compiler compared to C compiled code (see Figure 15 and http://www.npac.syr.edu/users/gcf/edperf.html). The consensus at our workshop was that this performance degradation would be no worse than a factor of 2 for the portable applet mode. As described above, with some restrictions on programming style, we expect Java language or VM compilers to be competitive with the best Fortran and C compilers. Note that we can also expect a set of high performance ``native class'' libraries to be produced that can be downloaded and accessed by applets to improve performance in the usual areas where one builds scientific libraries.

One interesting omission in the framework of Figure 3 is a purely interpreted version of Java - level a). This would also be very helpful for teaching. JavaScript is interpreted, but we would view it as a ``little language'' for document handling, and not a general Java-like interpreted environment.

Finally, we will discuss parallelism in Java. Here, we return to the four categories of concurrency in Section IV.

  1. Data Parallelism
    This is supported in Fortran either by high level data parallel HPF or, at a lower level, by Fortran plus message passing (MPI). Java does not have any built-in parallelism of this type, but at least the lack of pointers means that natural parallelism is less likely to be obscured. There seems no reason why Java cannot be extended to a high level data parallel form (HPJava) in a similar way to Fortran (HPF) or C++ (HPC++). We have started a discussion and even a simple implementation, shown in Figures 16 and 17 (http://www.npac.syr.edu/users/gcf/hpjava3.html and http://www.npac.syr.edu/users/dbc/HPJava). At the lower message passing level, the situation is clearly satisfactory for Java, as the language naturally supports inter-program communication, and the standard capabilities of high-performance message passing are being implemented for Java (http://www.mcs.anl.gov/globus).

  2. Modest Grain Size Functional Parallelism
    This is built into the Java language with threads, whereas it has to be added explicitly with libraries for Fortran and C++. A small illustration using Java threads appears after this list.

  3. Object Parallelism
    This is quite natural for C++ or Java, where the latter can use the applet mechanism to portably represent objects. We have built a collaboration system, TANGOsim, in which a Java server controls a set of Java applets and other applications spawned from them (http://www.npac.syr.edu/projects/tango). We generalized the session manager present in collaborative systems to be a full event driven simulator. This illustrates the power of Java for this problem class and shows that it can unify traditional time stepped simulations (typical for data parallelism) with event driven forces modeling and other such simulations.

  4. Metaproblems
    We have already discussed in Section VII the power of Java in this case for overall coarse grain software integration.
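
To illustrate categories 1 and 2 with the machinery Java already provides, the sketch below partitions a simple data parallel array update across threads on a shared memory node. This is plain Java, not HPJava or any proposed extension; the array size and thread count are arbitrary choices for the example.

```java
// Data-parallel array update expressed with Java's built-in threads: each
// thread owns a contiguous block of the array (the "data decomposition").
// This is plain Java, not HPJava; it only illustrates that the thread
// machinery needed for shared-memory parallelism is part of the language.
public class ParallelUpdate {
    public static void main(String[] args) throws InterruptedException {
        final int n = 1000000;
        final int numThreads = 4;
        final double[] a = new double[n];

        Thread[] workers = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            final int lo = t * (n / numThreads);
            final int hi = (t == numThreads - 1) ? n : lo + n / numThreads;
            workers[t] = new Thread(new Runnable() {
                public void run() {
                    for (int i = lo; i < hi; i++)
                        a[i] = Math.sin(i * 0.001);   // independent work per element
                }
            });
            workers[t].start();
        }
        for (int t = 0; t < numThreads; t++)   // wait for all blocks to finish
            workers[t].join();

        System.out.println("a[12345] = " + a[12345]);
    }
}
```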

In summary, we see that Java has no obvious major disadvantages and some clear advantages compared to C++ and especially Fortran as a basic language for large scale simulation and modeling. Obviously, we should not and cannot port all our codes to Java. Rather, we can start using Java for wrappers and user interfaces. As compilers get better, we expect users will find it more and more attractive to use Java for new applications. Thus, we can expect to see a growing adoption by computational scientists of Web technology in all aspects of their work.

References

[Fox:95a] Fox, Geoffrey C., Furmanski, Wojtek, Chen, Marina, Rebbi, Claudio, and Cowie, James H., ``WebWork: Integrated Programming Environment Tools for National and Grand Challenges,'' Syracuse University Technical Report SCCS-715, June, 1995.

[Fox:95d] Fox, Geoffrey C., and Furmanski, Wojtek, ``The Use of the National Information Infrastructure and High Performance Computers in Industry,'' in Proceedings of the Second International Conference on Massively Parallel Processing using Optical Interconnections, IEEE Computer Society Press, Los Alamitos, CA, 298-312. Syracuse University Technical Report SCCS-732, October 1995.

[Fox:95f] Fox, Geoffrey C., ``High Performance Distributed Computing,'' Syracuse University Technical Report SCCS-750, December 1995. To appear in Encyclopedia of Computer Science and Technology.

[Fox:96b] Fox, Geoffrey C., ``An Application Perspective on High-Performance Computing and Communications,'' Syracuse University Technical Report SCCS-757, April 1996.

[Fox:96a] Fox, Geoffrey C., ``A Tale of Two Applications on the NII,'' Syracuse University Technical Report SCCS-756. In the proceedings of the 1996 Sixth Annual IEEE Dual-Use Technologies and Applications Conference, June 1996.

[Fox:96c] Fox, Geoffrey C., and Furmanski, Wojtek, ``SNAP, Crackle, WebWindows!,'' published as RCI Management White Paper Number 29, Syracuse University Technical Report SCCS-758, April 1996.

[Taubes:96] Taubes, Gary, ``Do-It-Yourself Supercomputers,'' Science, 274:5294, 1840, December 13, 1996.