HPCC and Java -- A Report by The Parallel Compiler Runtime Consortium (PCRC)

********* INCOMPLETE DRAFT * May 12 1996

This is: http://www.npac.syr.edu/users/gcf/hpjava3.html

Original is: http://www.npac.syr.edu/users/gcf/hpjava.html

Northeast Parallel Architectures Center at Syracuse University: Geoffrey Fox,Bryan Carpenter, Michael Chang, Kivanc Dincer, Wojtek Furmanski, Don Leskiw, Xiaoming Li
Cooperating Systems: Marina Chen, James Cowie
Harvard University: Thomas Cheatham, Amr Fahmy
Indiana University: Dennis Gannon, Rajeev Raje
Rice University: Ken Kennedy, Chuck Koelbel, John Mellor-Crummey
University of Maryland: Joel Saltz, Alan Sussman
University of Rochester: Wei Li, Michael Scott
University of Texas: Austin: Jim Browne

Proposed Writing Responsibilities

Northeast Parallel Architectures Center at Syracuse University:: 1.1, 1.2 ,2.1, 2.2, 2.3, 2.7, 6.3, 8.2
Cooperating Systems:: 4.1, 4.2 4.3
Indiana University:: 1.4, 3.3, 6.2, 7.2, 7.4, 8.3
Rice University:: 5.1, 5.2(shared), 5.3, 5.4
University of Maryland:: 2.4, 2.5, 3.4, 8.1
University of Rochester:: 2.6, 5.2(shared), 6.4, 7.1, 8.4
University of Texas:: 1.3, 3.1, 3.2, 6.1, 7.3

General Approach:

The preliminary planning and proposed draft is available -- this contains lots of useful words and references(links)
Current discussion is available using mailing list pcrc-java@npac.syr.edu or with Newsgroups and database support on our SP2.
March 13 :2 deep outline with assignments proposed
Mid April -- all assignments complete -- editing etc. starts
Draft must be ready for public but friendly reading by CRPC meeting in Mid May
Based on the manifesto we will call a meeting to decide what areas PCRC will focus on

Some comments from Geoffrey Fox:

There is some debate as to name we should use for the extensions of Java to link it to HPCC. Possibilities suggested are:
- HPCC and Java
- HPCCJava , HPJ or
- HPJava
- Java-powered HPCC or HPCC powered Java
- Parallel Computing and Java
Unless changes are suggested, I will use "HPCC and Java" in high-level places likes titles and the acronym HPJava where nifty short designations are appropriate.
The discussion of data parallelism in original contributions showed some disagreement as to whether this is a good, bad, feasible, infeasible idea. I suggest that this particular document make no decisions but simply state the pros and cons of particular approachs in the short and long term.

Current Draft Report

Note section numbers such as 2.1syr refer to text in original document and are hyperlinked to it

1: Introduction

1.1 What is Java and What is HPCC and why they need each other SYRACUSE

Java is rapidly becoming a dominant distributed computing language driven by the the breadth and depth of the World Wide Web. It implements a natural object or Applet distributed parallelism combined with a classic light weight thread mechanism within a given applet i.e. within a given (SMP) processor. HPCC has developed technology and the application pull for large scale computation with typically tighter synchronization constraints than those of Java. Further HPCC can benefit from the pervasive software base illustrated by Web in general and Java in particular. Correspondingly there are many emerging Web based applications which will need large synchronized computation.
For these reasons, we thought it useful to examine the confluence of HPCC and Java -- referred to as HPjava. In particular it is natural for PCRC to examine its software indrastructure and see how it should be structured to support HPJava.
This document is not a proposal or plan. Rather it is an often conflicting(!) study of issues that emerge when you place Java and HPCC next to each other.

1.2 The HPCC Application Motivation for Java -- Meta-Challenges

(2.1syr) SYRACUSE -- from first draft

2.1Syr) Introduction: First some preliminary remarks. We classify all (interesting) HPCC applications as belonging primarily to one of two domains: Grand Challenges or National Challenges -- the "HPCC-Challenges". Recognize that today's technology is blurring the distinctions between parallel and distributed processing, e.g., the use of ATM-LANs by networks of workstations/PCs. The "traditional" high-performance applications require finely tuned and optimized implementations of runtime support (intrinsics and communications) on their host processors. The current designer of runtime systems seeks to provide the application programmer with easier access the high performance potential of a given machine(s)/ architecture. Beyond speed, however, National Challenges can pose additional requirements such as trust or privacy, i.e., security, and here a runtime system designer must also consider metrics like "response time" or "transactions per second".

1.3 The Nature of Java Applications and their need for HPCC

(1.1tex, 1.7ind cf: sec. 2.7 below) TEXAS

The issue is not the nature of current Java applications but rather the range of applications implementable with Internet driven technology including Java and requiring HPCC as an integral component. These applications will all be some form of integration among HPCC programs or HPCC programs and humans or HPCC programs, humans and electromechanical systems. The nature of these applications is coupling of humans, electromechanical and computer simulated systems to represent total systems on a scale previously unattainable with any technology. Critical properties of these total systems will be dynamic structure and need for real-time response.

The goal of most of these integrations will be simulations of the execution behavior of extremely complex systems either for the purpose of generating understanding of the behavior of the systems or for training humans to function effectively as components of these systems. ( An example of a integration of humans and HPCC programs is battle simulation.) Another application domain will be the fusion of complex knowledge structures and computations to support realtime decision making. An example might be management of a large scale accident at a nuclear power plant. Another example is intelligent management of large scale experiments. This application may require integration of the control of large scale experiments with access to large scale databases and execution in realtime of large scale computations .

These applications can be regarded as Super Grand Challenges because they involve the integration of multiple grand challenge scale applications. (I prefer "Super Grand Challenges" to "Meta-Grand Challenges" since "meta" has the connotation of being a template for construction. But perhaps that is real meaning.) The essential message is that Internet-enabled applications consisting of grand challenge scale applications coupled by Java or Java-like technology carry the applications of computation into a new range of systems previously unreachable. This is the major concept we want to put forward.

1.4 Lessons from and Relation to HPF HPC++ Fortran-M and other HPCC Software Systems

(3.1ind, 3.2ind, 3.3ind, 4.4csc) INDIANA from first draft

1.4.1: From 3.1ind) Comparison with HPF:

On the HPF side it is possible to create a class of distributed objects that are similar to the HPF distributed array. Because it can be defined as a class, there would be no need to create a special extension of the array data type or an annotation to support distribution. this is what is done in HPC++. In our opinion, the HPF model of data parallelism may not be the most interesting thing to do with Java. If you want HPF features, then use HPF.

1.4.2:From 3.2ind) Comparison with HPC++:

Because Java is derived from a simple model of C++, some of the HPC++ ideas could move to java. For example, annotations for parallel loops. Other ideas, like the HPC++ parallel STL could only move to Java if

1) There was a template mechanism added to Java. this may not be a bad idea.
2) Because java has no pointers another mechanism must be found to replace the STL concept of a parallel iterator. This might be a smart array index object in Java.

1.4.3:From 3.3ind) Operator Overloading:

Also java does not provide an overloaded operator mechanism, so building a parallel array class is not as easy as in HPC++ (or even Fortran 90.)

1.4.4:From 4.4csc) Comparison with C++:

In C++, two flavors of extensions appeared: small tweaks for sugaring the syntax of parallel constructs, and deep changes for modifying the semantics of the language to support concurrency. In HPJava, a limited preprocessor which implements some shallow (sugaring) syntax changes may be a reasonable project at a later stage, after the library approach has matured enough to identify the most egregious kinds of syntactic clutter.

1.4.5: Channels and Fortran-M in Java (Syracuse)

Most of this section should be merged in section 4.

1.4.5.1:Introduction:

Fortran M incorporates a model of mobile point-to-point communication channels. (Similar ideas also emerged in work on dynamic extensions to the `occam' programming language.)

In this (established HPC) communication model a ``channel'' has two ends. A channel end can be bound to a ``port''. Channels and Ports are opaque primitives of the language (in an OO language they would probably be realised as classes). At any time a channel end is bound to at most one port.

Two processes holding ports bound to opposite ends of the same channel can communicate directly by passing data on the channel.

Crucially, a process can also send a channel-end across a channel. Suppose P and R are ports declared in process A, and Q and S are ports declared in process B. C and D are channels. The ends of C are bound to P and Q, and one end of D is bound to R, which S is presently null (no channel end is bound to it).

                     D
    A:      P     R----...
            |
          C |
            |
    B:      Q     S

(The other end of D may be bound to a port in A, or B, or any other
process).

     Suppose A and B respectively execute operations like

  P.send(R)

and

  Q.recv(S)

When the communication completes the port R becomes null and the end of
D is bound to S:

    A:      P     R
            |
          C |
            |        D
    B:      Q     S----...

(The other end of D is logically unaffected by the operation).

Because channels have only two ends, communication of a channel end can be achieved by a simple negotiation involving 3 processes (in the example A, B and the process at the other end of D) or at most 4, if both ends of a channel are moving at the same time.

1.4.5.2: Java and Channels

This channel communication mechanism could be adopted in extended Java applets.

A possible scenario is that several ``applets'' are created on cooperative servers. They have all been downloaded from some central application (the parent). The new applets have an initial channel connecting them to this parent application (an umbilical channel?). The parent application creates new channels locally, and sends the ends to different applets across the umbilical channel. In this way any pattern of channel connectivity can be achieved between the applets. The channel connectivity can, for example, support data parallel computation distributed across the applets.

In another scenario, two Web clients access a server providing some (unspecified) communication service. They receive an applet each, and negotiations with the server establish that the clients want to communicate directly, without intervention of the server. The server sends them the two ends of a channel it creates, and the client applets communicate over this channel.

One can carry on, and invent more interesting and dynamic scenarios, where the applets proceed to exchange channel ends amongst themselves.

1.4.5.3: Implementation

Is clearly possible, on top of standard internet interfaces. One subtly is the protocol for exchanging channels ends. In a reliable environment this isn't very difficult. Perhaps there are special considerations in an unreliable environment.

1.4.5.4: Related ideas

Related or analogous ideas: ports in the Hermes programming language, Milner's pi-calculus, ``capabilities'' (?).

2: Further Details of Motivating Applications: If gets too long, details can go into appendices

2.1 General Discussion of Meta-Challeges

(2.2syr) SYRACUSE from first draft

We define a new class of applications called "Meta-Challenges". These are loosely coupled sets of physically distributed instances of Grand or National Challenges. A DoD example of a Meta-Challenge is Distributed Interactive Simulation (DIS). DIS involves the modeling of forces (tanks, aircraft, satellites), environments, rules of engagement and tactics, etc. Simulation control and monitoring for red, blue, and white players are provided. A simulation may need to grow with the addition of "green", "yellow", or neutral forces, and have to scale form peace-time/peace-keeping through low-intensity-conflict (LIC), to massive conflict. Requirements for considering chemical, biological, or nuclear (CBN) environments are needed, for example, modeling operations in a nuclear environment (OPINE).

2.2 An Example -- Distributed Interactive Simulation DIS

(2.3syr, 2.4syr) SYRACUSE from first draft

2.2.1: From 2.3Syr) An Example Meta-Challenge: DIS:

Java offers a means to enhance the existing DIS multi-user protocol. A new DIS protocol, Advanced Distributed Simulation (ADS), is an attempt to introduce scalability. ADS breaks apart the DIS into an active-reactive part and an interactive-communicating part. Java is consistent with such an evolutionary approach. Java applets could update models on-the-fly, as with dynamic weather, battle damage, or threat-responsive counter-countermeasure techniques. One of the greatest challenges to the DIS designer is finding ways to interconnect independently built simulations having disparate modeling granularity (i.e., the ALS protocol, Aggregate Level Simulation), and then increase the number of models and improve fidelity, but with no-greater demand on network bandwidth. Weapon system design has also been evolving from integrating interoperable subsystems to influence management among ubiquitous supersystems. (An example of the latter would be Information Warfare (IW). Another is counter-OSINT --open source intelligence). BM/C3 is undergoing changes in definition: BM/C4 (computing), C3I with intelligence, and IS&R for surveillance and reconnaissance. Also, a fifth-C, C5I is sometimes added for emphasizing the collaboration among humans of any C3I system. (The relationship between an Air Traffic Controller, the ATC surveillance, tracking, processing and display system, and a pilot is another example). From a user's (Commander's) perspective, Java is extremely attractive for simulating (or even building) these systems since its roots are in the distributed, interactive, object-oriented, collaborative user domain. It is rapidly becoming the language of choice for programming the user-interface of on-line commercial multimedia systems.

2.2.2: From 2.4Syr) Java for DIS:

Java is intended to be machine independent and offers to challenge the need for hegemony among systems. For example, the ideal computing architecture for simulating military battle management including command and control is multiple instances of "loosely coupled sets of tightly coupled processors." Such always poses a severe configuration management problem: multiple hardware, operating system, and programming languages. Interoperability among systems of systems and the interfacing between "stovepipe" systems is related challenge. Java could provide an approach for "wrapping" existing or planned systems with a functionality layer for interoperability while ameliorating the bandwidth problem by intelligently transmitting only those parts of a message (applet) that are immediately needed by an application (user).

2.3 An Example -- Manufacturing

SYRACUSE

This section will describe the work with Madic that we formed for NASA to analyse the role of the NII in the Manufacturing of Complex systems. This involves a set of manufacturing companies -- Rockwell International, Northrop Grumman, McDoinnell Douglas, General Electric and General Motors and a particular MAD (Multidisciplinary Analysis and Design) system -- the so-called "Affordable Systems Optimization Process" (ASOP). The complexity of this NII application is indicated by the estimate that the next major aircraft to be built could involve:

6 major companies and 20,000 smaller supplier subcontractors
Number of engineers involved is about:
50 at conceptual design
200 at preliminary design
2000 at final design
upto 10,000 in manufacturing and development
The design could involve upto 10,000 separate programs running in small linked clusters which vary from
Airflow simulation around plane to expert system to plan location of inspection port to minimize maintenance costs
Critical is configuration management and system database

We see that this application involves a crucial mix of HPCC and distributed computation and could provide a critical motivator for HPJava. Note Java itself could could be very useful for the myriad of small applications and visualization of results. HPJava comes in when one integrates in the large simulations of either CFD or the whole complex system represented by an aircrft.

2.4 An Example -- Remote Sensing and Data Fusion

(2.5umd) (UMD)

There are many applications that are likely to require the generation of data products derived from multiple sensor databases. Such databases could include both static databases (such as databases that archive LANDSAT or AVHRR images) and databases that are periodically updated (like weather map servers).

Two examples of remote sensing defense applications are battle management and distributed simulation. A battle management application could take advantage of multiple sensing devices, both airborne and on satellites. Data from multiple remote sensing devices would be integrated to look for various untoward events, such as a collection of tanks on the move, aircraft taking off from an enemy runway, etc. A contemporary example would involve real time tracking of rocket attacks in northern Israel. One distributed simulation application would be to train soldiers in a hybrid environment, consisting of a combination of simulated and actual troops and equipment. For instance, a commander could command some number of real troops, tanks and planes, as well as many more simulated troops and pieces of equipment. Developing the technology to allow for such a hybrid training environment is already part of an important ARPA program that is well funded. At the last ARPA meeting, the claim was made that some form of hybrid battle training had been successfully tested in Europe.

The execution model for these applications is for a client to direct (High Performance) Java programs to run on the database servers. The Java programs would make use of the server's computational and data resources to carry out (at least) partial processing of data products. Portions of data processing that required the integration of data from several databases would be carried out on the client or via a Java program running on another computationally capable machine. The main question for such an model is whether the programs executing on the servers are solely Java programs, or Java programs that utilize the services of programs written in other languages (e.g. High Performance Fortran).

2.5 An Example -- Medical Interventional Informatics

(2.6umd) (UMD)

The first set of applications apply to medical emergencies associated with armed combat, terrorist actions, inadvertent releases of toxic chemicals, or major industrial plant explosions. In such scenarios, the goal is to rapidly acquire and analyze medical data as the crisis unfolds in order to determine how best to allocate scarce medical resources. The applications involve the discovery and integration of multiple sources of data that might be in different geographical locations. A doctor could, for instance, determine the best diagnosis and treatment for patients exposed to combinations of chemicals whose effects have not been well characterized. This class of scenarios is motivated by a recent ARPA-sponsored National Research Council workshop on crisis management.

Another set of applications involves discovering and exploiting clinical patt erns in patient populations. These scenarios involve study of health records to determine risk factors associated with morbidity and mortality for specific subpopulations (e.g. various substrata of armed forces personnel serving in a particular region). In such scenarios, the goal is to provide high quality, low cost care by exploring diverse clinical data sources for patterns and correlations and by ensuring that significant data for individual patients is not overlooked. For instance, a system of this type could rapidly track the spread of serious infectious diseases such as Hepatitis B or HIV. It could also be used to identify specific soldiers whose demographic profile or past medical history might place them at particularly high risk of contracting the disease.

The systems software requirements for these medical scenarios should be very similar to the requirements motivated by the sensor data fusion scenario, namely employing a client program to direct Java programs to run on the servers for the relevant databases. The data mining aspects of these applications could also require the client to direct Java programs to run on computational servers, which may not be at the same site as the database servers.

2.6 An Example -- Network-based data mining (Rochester)

With large volumes of routine data having been collected, many organizations are increasingly turning to the extraction of useful information from such databases. Such high-level inference processes may provide information on patterns from large databases of particular interest to defense analysis.

Data mining is an emerging research area, whose goal is to extract significant patterns or interesting rules from such large databases. Data mining is in fact a broad area which combines research in machine learning, statistics and databases. It can be broadly classified into three main categories: Classification -- finding rules that partition the database into disjoint classes; Sequences -- extracting commonly occurring sequences in temporal data; and Associations -- find the set of most commonly occurring groupings of items.

At Rochester, we have worked on the problem of mining association rules in parallel over basket data on both distributed shared-memory multiprocessors and a network of machines. Basket data usually consists of a record per customer with a transaction date, along with items bought by the customer. Many real-world applications can be formulated in such a manner. Our preliminary experimental results on a SGI Power Challenge shared-memory multiprocessor machine and a DEC Memory Channel cluster are very encouraging.

The next step is to consider scenarios where the database server is accessible via the Internet. The machine independent nature of Java makes it ideally suited to writing both the server and client code. The server could be a collection of machines, i.e., it could be a single tightly-coupled multi-processor shared or distributed memory machine, it could be a network of workstations, or any combination of these with heterogeneous processing elements, and heterogeneous interconnection network. The client side could be a heterogeneous collection of machines as well. We need compiler and run-time support to manage both the client and the server. The compiler should be able to insert code to exploit the local network configurations on both ends, while the run-time system monitors dynamic behavior and executes different actions according to the transient information. We have implemented a client-server data mining algorithm in Java. The preliminary results will be discussed in more detail in Section 8.5.

Data mining algorithms tend to have program behaviors quite different from traditional Grand Challenge problems in terms of both data access patterns and control flow. Unlike simple array structures used in scientific code, the data structures used in mining algorithms include hash trees, hash tables, and linked lists. Consider mining for associations: the first step consists of finding all the sets of items (called itemsets), that occur in the database with a certain user-specified frequency (called minimum support). Such itemsets are called large itemsets. An itemset of k items is called a k-itemset. The general structure of algorithms for mining associations are as follows. The initial pass over the database counts the support for all 1-itemsets. The large 1-itemsets are used to generate candidate 2-itemsets. The database is scanned again to obtain occurrence counts for the candidates, and the large 2-itemsets are selected for the next pass. This iterative process is repeated for k = 3,4, ... till there are no more large k-itemsets to be found. Effective parallelization and networked computing require new research in algorithms, compilers and runtime systems.

2.7 Java based Web Server and Client Agents

(An example of conventional Java application/applet needing HPCC -- filtering VRML for terrain data and Visible Human database is one possibility) -- cf: sec. 1.3 SYRACUSE

In many recent applications we have found need for filtering information in Web based applications. For instance in our VRML terrain renderer, we store terrain data in most efficient fashion in an Illustra object datbase. VRML has rather clumsy ways (such as Level of Detail -- LOD) of specifying what data needed and efficient representation. This is much better performed by flexible filters. Thus our model views VRML as display format but could use Java for control agents and filters. The computations can be very large scale and need distributed or parllel resources. An example is the extension of our 2D Visible Human Visualizer to 3D. This is 40 Gigabytes of data (male and female) with both image processing and rendering computations. Clearly HPjava is very relevant again.

3: Models for the HPJava Programming Environment

3.1 General Philosophy and Discussion of Possibilities

(1.2tex, 1.3tex, 4.2ind cf: sec 1.3 above) TEXAS

Java is primarily a compositional and coordination language. That is, the main thrust of the feature set is not computations or data transformations but rather composition of coherent program structures and well-ordered traversal of those structures. A programming environment for Java should extend and complement the language and should provide a framework for construction of large systems which use Java as a composition and coordination language.

There are numerous systems which can used as examples of compositional and coordination oriented programming environments. The conceptual framework of many of these systems is hierarchical directed graphs. This model is consistent with an objet-oriented approach as well. It also allows for specialization to application domains as well.

The major features should include:

a) A seamless interface for integrating models from across the full Internet into programs as they are composed.
b) A debugger which animates at both graphical and text language levels.
c) Capabilities for specifying and filling out templates for common program constructs. In particular templates for 1-n communication, n-1 communication, etc. should be included. A reperoire of common interaction protocol templates should also be included.
d) A powerful capability for expressing complex coordination structures by composition of simple elements. (More templates)

3.2 Object and Applet Parallelism

(1.2tex, 1.3tex, 7.4.6ind) TEXAS

3.2.1:a) Requirements

i) Java will incorporate modules which implement HPC-based simulations of physical systems but these modules will be programmed in other languages and linked together with Java programs. Java programs may, however, need to interact with these modules on a fairly fine-grained level.
ii) The model of parallel computation should focus on specification of interactions among modules and should not impede composition of program structures.
iii) The parallel structuring capabilities should be scalable in term of the number of objects to be controlled.

In sumary, object and applet level parallelism in Java will be used to specify interacting sets of intelligent agents.

3.2.2: b) A Proposal for Additional Specification of Parallelism (Actually agent coordination) for Java

i) Add an "Interactions" section to the definition of a class. The "interactions" specifications will give the conditions under which each method can be invoked or should not be invoked. This can be viewed as an extension of the "synchronized" method attribute. The interactions could be specified in terms of the data state (guard) or execution state (event expression).
ii) Extend the Interface Specifications to include any interactions which can take place under parallel execution, i.e. access to shared data or shared resources during method execution.
iii) Include a dynamic model of communication including multicast primitives.

(This topic will be discussed in a separate section.)

3.3 Task and Control Parallelism

(7.3.1ind) INDIANA from first draft

3.3.1: From 7.3.1)Limited Synchronization Capabilities of current Java:

Following the exact semantics of Fortran-M channels might be a challenge in java because of the limited synchronization features of java. See the Syracuse discussion in 1.4.5 of this draft.

3.4 Data Parallelism

(1.6tex, 1.6ind) (UMD)

There are a variety of ways in which Java could be used to support data parallel computing. We assume that Java may be extended in one of a variety of ways. In the following, we are also regarding any closely coupled set of machines as a parallel architecture.

Java could be used as a mechanism for invoking data parallel codes on remote compute or data servers. For instance, assume that a user has access to a number of parallel machines. On each of these machines, the user has already compiled a parallel code. The user would make use of an agent written in an extended version of Java to find a machine with the necessary available computational resources or necessary data, move the required data to that machine, run the data parallel program and move the computed results to the users machine or to another designated location.
If the user's code has not already been compiled on a given parallel machine, a Java agent could migrate the parallel source code and compile the user code on the remote machine. This scenario assumes that all necessary compilers, runtime libraries, etc. are either already available on the remote parallel machine or that the agent is able to install the necessary system software.
A Java code could be used to invoke data parallel objects and libraries on a parallel architecture. Assume that a known set of parallel libraries and parallel objects are available on each machine. Data parallel computing on a given machine would be carried out by running a copy of a Java bytecode program on each processor (SPMD). The SPMD Java program would make a series of calls to parallel libraries and objects.
A Java code could carry out a set of computations that make use of several different parallel machines. Java agents would target a collection of parallel architectures. The Java agents would identify necessary parallel libraries and objects on each architecture (and if necessary, install libraries and objects that were not already available on a given machine). Java agents would coordinate the computations carried out on the different parallel architectures as well as make use of communication libraries (such as Maryland's Meta-Chaos library) that are able to directly move data between parallel programs, even those running on different parallel architectures.
The Java language and Java compilers could be modified so that users could employ an extended version of Java to write data parallel programs. In this scenario, it would be necessary to define a data parallel dialect of Java. It appears that many of the lessons learned in developing and implementing High Performance Fortran would carry over to Java. The task of implementing a data parallel Java is simplified relative to what would be required for a data parallel dialect of C or C++, because the Java language does not include explicit pointers.

3.5 Remarks from Cooperating Systems on HPJava

As always, parallelism and performance trade off against security, portability, and code reuse. Initial HPC Java research should focus on quantifying these tradeoffs, identifying the software "markets" which can profit from the availability of HPC Java under each set of assumptions, and prototyping HPC Java tools and applications accordingly.

What we cannot afford to do at this stage is rule out alternative candidates for the software development market(s) and model(s) that we expect will prevail. At this stage in the evolution of HPC Java, we cannot say with any certainty whether the browser-level portability provided by Java is worth preserving in HPC Java, what level of HPC Java security represents a good tradeoff, what type of concurrency and parallelism should be supported/optimized/ignored, etc.

We suggest an evolutionary approach to develop HPC Java. We can quickly put together a stage 1 prototype which would allow us to experiment with applications such as those described by Syracuse and Maryland. Atop this basis, each group can pursue their own interesting ideas, based on local assumptions about the market for HPC Java software, which will yield other prototypes which can be evaluated and placed in larger context by the Consortium, and then validated by the market.

3.6 HPC Java v1.0 from Cooperating Systems

An HPC Java environment has two essential parts:

(1) HPC support within a given Java Virtual Machine (VM), and
(2) HPC support across multiple Java VM's.

HPC Java 1.0 can take the most straightforward approach to each of these parts, assimilating the most obvious historical precedents into a Java context: large concurrent data structures within a single Java VM, CORBA/ORB support for task-parallel coordination of remote Java processes, and management of large distributed data structures. These obvious first steps would simply identify some common infrastructure for follow-on experiments by HPC Java Consortium members.

3.6.1: HPC Java within a Java VM

We offer three observations:

Parallelism at this level relies on vendors' SMP implementation of a Java Virtual Machine.
Performance at this level relies on solving pure engineering problems: just-in-time code generation, efficient native threading, and intelligent (discriminate) use of Java's synchronization mechanisms.
All of these are highly likely to be resolved by the market (e.g., Sun, SGI, Borland, etc.), without our assistance, in the near future.

3.6.2: HPC Java across Multiple Java VM's

The market has produced an explosion of "obvious" translations of existing art to Java contexts. Remote procedure call, remote method invocation, object request brokering, etc. are all "solved." What remains to be worked out are the higher-level policy issues and services which will elaborate this basic infrastructure. Prototypes derived by HPC Java Consortium teams can assess the impact of multi-VM Java on security, performance, distribution, maintenance, and reuse of investments in HPC applications.

3.6.3: Foreign HPC Code

HPC Java code may consist entirely of Java for portability's sake, or foreign codes may be dynamically linked in, either (a) locally within one Java VM as shared libraries of native code, or (b) remotely among Java VMs and foreign objects, by using some kind of ORB for remote method invocation and namespace management.

Option (a) represents a straightforward trade of lost portability and security for increased performance and access to HPC libraries; option (b) takes advantage of the same existing interobject standards which form the basic infrastructure for interJava coordination. In section 4 we argue that providing direct Java support for large data structures is a reasonable alternative to simply using Java as the gateway to their foreign (native) versions.

4: Possible Java Language Enhancements and Support Classes for HPCC

4.1 Current Java Features for Concurrent and Distributed Computing

(1.4tex, 1.4ind, 1.5ind, 1.7ind) CSC from first draft

4.1.1: From 1.4tex) Current Features for Specification of Parallel Execution Behavior

Java currently has explicit threads of control and the "synchronized" property for methods as its means of specification of concurrent behavior. Attaching the "synchronized attribute to a method cause any execution ofa method on an object instance to be blocked if another method is active against the instance. This is equivalent to a "lock" on the instance.
Use of threads without adequate control over scheduling and sequencing is fraught with potential for inconsistency.
The "synchronized" attribute for modules can lead to deadlock and induces the "inheritance anomaly." The inheritance anomaly occurs when an inheritance hierarchy which is valid under sequential execution becomes invalid under parallel execution. (Note that under sequential execution methods interact only at the times of invocation and termination while under sequential execution methods may interact at any location in a program where a direct or indirect reference to a shared or potentially shared resource occurs.) The inheritance anomaly occurs when a redefinition of a computational specification in a method leads to a change in its interactions with other methods under parallel execution. It is easy to generate examples of the inheritance anomaly in Java.
PREDICTION - Any extensive use of the current Java with any degree of complex parallel structuring will lead to non-trivial problems. The current mechanisms are not scalable even without concern for integration of HPC- based modules.

4.1.2: From 1.4ind)Current Features of Java

Transparency being one of the key factors behind Java, any model of parallelism should not pose that great of a problem to Java.
That given, as you all know, there is already a model of parallelism based on the thread class library. However, this is based on a shared object name space model. It is not yet clear if any real multiprocessors allow true multithreaded implementations of the java abstract machine.

4.1.3: From 1.5ind)Extended Synchronization in Java:

Synchronization is based on the use of "synchronized" methods. For large scale parallel applications it *may* be necessary to add more synchronization types .

4.1.4: From 1.7ind)Current Strengths and Weaknesses of Java:

In general it may be best to strengthen Java's existing parallel and distributed computing mechanisms rather than impose data parallel constructs on the language. In our opinion, the areas to strengthen are:

a) the thread model. make sure it is possible to create lots of threads efficiently (an array of thread class objects for example) and make sure the Java abstract machine can handle them.
b) object name spaces and remote objects. java needs a model that allows an aplet to talk to an object running on another compute server. <more on this in 7.2 below>

4.2 Language Enhancements

(1.5tex, 1.5ind, 4.2csc, 4.3csc, 4.2ind ) CSC from first draft

4.2.1: From 1.5tex) A Proposal for Additional Specification of Parallelism for Java

a) Add an "Interactions" section to the definition of a class. The "interactions" specifications will give the conditions under which a method can be invoked or should not be invoked. This can be viewed as an extension of the "synchronized" method attribute. The interactions could be specified in terms of the data state (guard) or execution state (event expression).
b) Extend the Interface Specifications to include any interactions which can take place under parallel execution, i.e. access to shared data or shared resources during method execution.
c) Consider a dynamic model of communication. This topic will be discussed in a separate position paper.
These extensions are fully consistent with the philosophy and features of Java but add greatly to its power for coordinating parallel interactions without interfering with composition and reuse.
This proposal is intended as a thought piece. The details may differ but the principles are sound.

4.2.2: From 4.2csc) Extensions to the Java language.

Three categories:

(a) syntactic sugar for obvious dataparallel constructs,
(b) annotations to flag concurrency and suggest distribution strategies, and
(c) deep language additions to extend Java's support for explicitly managed concurrency.

These three alternatives generally mirror the history of proposed C++ language extensions.

Pros: Language extensions give designers more freedom to set "optimal syntax" for new parallel constructs, and enable some optimizations for the implementation of those constructs.
Cons: Obviously, the need to distribute a HPJava compiler, and the obligation to make that compiler (and hopefully its associated debugger) track the rapid evolution of the Java language and the Java virtual machine. It may not be too difficult to come up with a prototype. But on the Web, people expect more than research prototypes. Users will demand assurances (whether reasonable or not) that the additional HPJava compilation process will be at the same level of production-quality as the underlying Java compiler, and that the new HPJava features will not degrade Java's security or restrict the portability of their code.

4.2.3: from 4.3csc) Implications of Immature Language:

Because the Java language will continue to evolve, any preprocessor or compiler approach will have to track Java's evolution. There's a danger that a real and robust compiler cannot be delivered quickly, and that by the time HPJava extensions are defined, implemented in a preprocessor, and distributed, those tools will be at least one generation out of sync with the latest Java standards. Debugger support for Java is already fairly poor, even compared with existing HPC languages, so adding nonstandard extensions can only make the situation worse by confounding Java source/object code correspondence.

4.2.4: From 4.2ind)Implications of Java Philosophy:

One of the principles in the design of Java has been simplicity, Rajeev feels that language annotations may go against the original philosophy. Gannon is not sure.

4.3 Class and Interface Support

(4.1csc, 4.5ind) CSC From first draft

4.3.1: From 4.1csc) Extensions via class libraries.

Use Java's OO features to add new functionality in the form of redistributable Java classes.

Pros: Most significantly, no need to distribute a new Java compiler or preprocessor, and no need to modify the "standard" process for compiling and linking Java code. Because Java is class-extensible by design, library-based approaches shorten the development time for Java extensions.
Cons: Java lacks operator overloading (but history may ultimately view this as a gain for readability) and templates. As a result, library-based approaches may result in inherently cluttered code (no sugared syntax for the obvious data-parallel constructs). Lack of templates may make it hard to define collective classes with polymorphic base types, but the fact that all Java objects derive from Object may still make possible polymorphic approaches with reasonable overhead. Library-based extensions may defer some program analysis to runtime which could have been performed once at compile time.

4.3.2: From 4.5ind)High Performance Classes:

As there are lots of standard classes (java.net, java.awt, java.util, etc.), it is very intuitive to create high performance classes which can be precompiled and imported in the user programs. Also, users can extend these standard high performance classes to suit their needs.

Various data distribution schemes (supported by HPF) can be implemented as part of these high performance classes.

4.4 High Performance Array and Other Java Classes CSC

Task level concurrency is already supported by Java. Arrays of various objects or data types can be transparently supported by a Concurrent Array package. A concurrent array can be implemented in a straightforward way by mapping subgrids to threads; CSC has developed trivial applet codes that implement block-partitioned arrays using this technique.

Performance will initially be quite bad (an order of magnitude worse than compiled code one one processor, plus inability to exploit multiple processors on (nearly?) any SMP platform). But as we observed in sections 1.8csc and 1.9csc, these are purely engineering concerns which will be fixed by the market without our intervention. Neither represents a long-term obstacle to high performance java applications, and the bottom-line allure of universal portability and easy maintenance will help close the gap.

We are led to adopt the minority opinion, therefore, that native Java array classes are one worthwhile target for prototyping by the HPC Java Consortium. They represent one component of HPC's history which can be immediately and intuitively ported to a Java context, and combine reasonable performance with universal portability.

5. Compilation/Interpretation Issues

5.1. Compiled vs. interpreted

(6.1rice, 6.1ind, 5.1roc ) RICE from first draft

From 6.1ind) Performance Issues:

Performance is the issue and hence, shouldn't client as well as server have efficient compilers?

5.2. Compilation issues

(6.2rice, 5.2roc) RICE from first draft and ROCHESTER

6.2rice) In First Draft lists topics:

Analysis Techniques

On-the-fly Compilation
Back-end Optimizations
- Register Allocation
- Scheduling
Optimization vs. Garbage Collection
Preserving Security

Section 5.2 (Contribution from Rochester) Compilation Issues

The current Java compiler takes the hybrid approach of compilation and interpretation. At Rochester, we would like to exploit more options to generate efficient code for High Performance Java applications. We consider three scenarios:

(i) Compiling Java to bytecode.: In this approach HPCC Java could be interpreted. The compiler would generate Java VM code suitable for HPCC problems. In addition to HPCC-specific optimizations, this may require generating (or applying user-provided) annotations. There is no consensus yet, whether using HPCC-specific annotations is desirable.
(ii) Compiling Java to native code.: Here, we would treat Java programs in the same way as we treat programs in other languages (C++, Fortran). Due to the dynamic nature of Java, different approaches are possible when all source files are available and when we compile one class at a time. In the first case we may be able to eliminate some overheads (e.g., type checks, array range checks, virtual method invocations). In the latter case, only local optimizations would be possible.; This scenario gives us the most flexibility, but is not acceptable in some circumstances. In "network computing" we want applications to be distributed in the bytecode format. In that case we have to use approach (i) or (iii) or a combination of both. An orthogonal issue is how to handle java class libraries, which (if any) are implemented implicitly in the interpreter, these will probably have to be implemented in some HLL or directly in native code. This is also an issue in scenario (iii) below.
(iii) Compiling bytecode to native code.: This seems to promise a lot of power. We expect that interpreting the bytecode (even with quick variants of VM instructions or localized just-in-time compilation) will be inherently slower than running native code. Our early experiments with data mining in Java have confirmed this observation (cf. Sec. 8.5). This is especially true if we can compile a large part of the application (ideally, the whole application) at the same time. Then, we can apply optimizations like inlining or compile-time type checking which otherwise would be available only in scenario (ii).; A variant of this technique would transform bytecode to bytecode with the addition of quick, native methods for common operations and native implementation of common data structures (cf. Sec. 7.1). Such data structures may include arrays or lists. Such a data structure could be implemented with the internal representation which could be used directly by native methods implementing most common operations. An interface would be provided to access the data structure from Java.

At Rochester, we are implementing compiler infrastructure with multiple front- and back-ends which will uniformly treat all scenarios mentioned above (cf. Section 8.4). We expect to include traditional optimizations like cache optimizations, register allocations etc.

5.3. Interaction with runtime

(6.3rice) RICE from first draft

Function Dispatch
Coexistence with Interpreted Code
Method Caching
Dynamic Updating of Applications
Source-to-Source transformation

5.4. Tools interface

(6.4rice) RICE

6. Runtime Support

6.1. Dynamic communication and distributed nameservers

(7.1.1tex, 7.1.2tex, 7.1.3tex, 7.1.4tex, 7.1.5tex, 7.1.6tex) TEXAS

6.1.1:a). The Current Model of Communication

The model of communication for Java is implicitly defined. It is based on assigning a unique name to every communicating agent and/or every object used in communication.

The benefits of using this model of communication are that is familiar and that there is a large technology for support of implementation of this model of communication.

6.1.2:b). Desirable Properties of Model of Communication for Java

i) It must be simple and uniform.
ii) It should give convenient support for communication among dynamically constituted sets of agents.
iii) It should support continuous dynamic mapping of Java implemented agents and objects to dynamic resource configurations.
iv) It must be consistent with the philosophy and current features of Java and implementable in a consistent syntax and semantics

6.1.3:c.) Proposed Model of Communication

The proposed model of communication is state-based communication. The elements are as follows.

i) Each class will have another declaration called a "profile" declaration. The profile declaration will define the set of communications which an instance of the class is prepared to accept at a given time.
Profiles are dynamic and reflect the state of each object at a given time. The profile declaration includes a method for initializating and modifying profiles.
ii) Each set oriented communication will have two parts; a selector and some data. A selector is a template for a predicate over state variables which will occur in profiles.
iii) A communication is initiated by an object creating a communications object which includes a selector and the associated data. Creation of the communication object invokes an associative broadcast of the communication object. An associative broadcast determines all of the object profiles which match the selector and delivers the data to them.

A detailed definition of associative broadcast can be found in Bayerdorffer [BAY94]

6.1.4:d). Integration/ Implementation Concepts

This model of communication can readily be integrated into the concept framework for system services provided by Java. One possible integration follows.

i) This model of communication can be implemented in Java using the "implements" attribute in the same mode as is done for threads wit the class "Runnable."
ii) The interpretive model of execution allows for ready mapping of communication objects to mechanisms appropriate to local or remote communication.

6.1.5:e). Comments

i) This model of communication is familiar. It is run-time implementation of a name server. Linda [GER86] uses a model of communication in the same class of capabilities but with a different set of implementation concepts.
ii) Distributed resource management algorithms such as fault recovery, reconfiguration, addition and deletion of system resources are much simpler in a state-based model of communication than in a unique name- based model of communication.

6.2. Java and Distributed Object Systems Standards Including CORBA

(7.2.1ind, 7.2.2ind, 7.2.3ind) INDIANA

In this section we review the Common Object Request Broker Architecture (CORBA) standard for distributed object oriented systems and discuss the work that has been done on integrating CORBA with Java and what needs to be done in the future for HPJava and a future high performance version of CORBA that supports a parallel execution environment.

6.2.1: Distributed Object Oriented Systems

Common Object Request Broker Architecture (CORBA) One prominent characteristic of a distributed environment is heterogeneity. Different computers running a diverse spectrum of operating systems make inter-operability and security very important and difficult to achieve. In order to overcome the multiple platform problems, OMG (Object Management Group), a consortium of commercial vendors, jointly developed a common set of rules which define the way distributed objects should interact. OMG, which had 12 members initially has grown to more than 500 hundred members. CORBA (Common Object Request Broker Architecture), the architecture specified by OMG, has become the de-facto standard in the distributed object world. One noticeable exception being Microsoft, which has its own standard, COM and OLE. CORBA forces a clear distinction between the interfaces and the implementations of the objects. CORBA interfaces are specified in a language called IDL (Interface Definition Language). The implementations of the objects can be written in any language, as long as there exists a proper mapping between IDL and the implementation language.

Two main features of CORBA are inter-operability between applications on different machines in heterogeneous computing environments and seamless interconnection in multiple object systems. A client object, residing on one machine, may request that an operation be performed on an object whose implementation resides on a different machine. This is accomplished by making use of the Object Request Broker (ORB) which is responsible for all mechanisms required to find the object implementation, to prepare the object implementation to receive the request, and to communicate the data comprising the request. The client sees an interface that is independent of the location of the object, the programming language used for the implementation and any other aspect not visible in the interface. The object model described by OMG is a client-server model where a client requests a service from an object by sending a message to the object. The object interprets the message to decide what service or operation to perform. The object may or may not return results to the client via the ORB. The client identifies the server object by means of an object reference and the service by means of the operation to be performed on the identified object. If a requested object implementation is missing, the ORB generates an exception. OMG has also specified a IIOP (Internet Inter-ORB Protocol) for communication between different ORB implementations.

6.2.2:CORBA Implementations

Since the conception of CORBA by OMG, many commercial vendors have produced CORBA implementations. Iona's Orbix, PostModern's Orbeline and IBM's DSOM are a few examples of commercial implementations. In order to implement the CORBA architecture in a specific language, a mapping between IDL and that language is necessary. OMG has specified a mapping of IDL onto C/C++. The choice of C/C++ is not surprising because IDL is very similar to C/C++.

It is obvious that ORB is the most critical component of the CORBA architecture. It offers transparency to the application programmer and/or is responsible for tackling unexpected events such as hardware failures. In an environment running critical distributed applications, the ability of ORB to detect and recover from the unexpected failures is extremely important. Each of the commercial implementations support CORBA specifications and employ simplified fault detection and recovery strategies.

6.2.3:Java and CORBA

Java achieves inter-operability via the virtual machine. Applets are downloaded from servers onto the client machines as and when needed. The bytecode verifier checks these applets. If these are security compliant, then these are executed. In case of any security violations, appropriate exceptions are thrown. This mechanism forces two restrictions -- a client cannot initiate a remote invocation (or partial processing) of objects and the clients and servers are implemented in Java. However, these restrictions do present a secure framework for distributed computations.

CORBA, on the other hand, by separating the implementation from the interface, allows true inter-operability. The client and server implementations do not really matter at all, as long as the interfaces (or contracts) are clearly defined in IDL. The ORB relieves the client from locating, downloading and executing the server object on the local machine. Only the results of remote invocations are sent back to the client.

As CORBA does not restrict the choice of the implementation, it is obvious that objects can be implemented in Java. As long as a proper IDL to Java mapping is defined, Java objects can talk to any other CORBA objects and achieve inter-operability. This symbiosis enhances Java's model of computation with the expressibility of CORBA and at the same time the simplicity and robustness of Java can make the object implementations an easy task. Java has a C++ appearance and hence, IDL to Java mapping is a natural extension of OMG's IDL to C++ mapping.

All the commercial CORBA vendors have either developed Java ORBs or have announced their plans of introducing ones in the near future. PostModern has already developed BlackWidow, which is distributed free for educational institutions. BlackWidow includes development tools for building distributed Java applications. Applications written with BlackWidow execute inside Java enabled browsers. It also offers fault tolerant capabilities. BlackWidow applications communicate with C++ applications developed with ORBeline.

In addition Sunsoft also provides a CORBA ORB and IDL system for Java called JavaIDL and Joe which is a CORBA compliant object system for Solaris.

Iona Software has anounced Internet Object Request Broker which is a another commercial Java-CORBA system to add to their large CORBA software product line.

We experimented with the first version of BlackWidow on a cluster of Solaris machines. Programs with client and server implementations written in Java worked successfully, as long as the client and server objects were on different physical machines. In case of client and server objects running on the same machine, the broker produced an exception message indicating the absence of a server implementation! The problem was traced to the JDK (Java Development Kit). In order to experience the true inter-operability, the client implementation was written in C++ and the server was implemented in Java. The results were similar -- as long as the client and the server were running on different machines, the experiments were successful.

As an initial test of the existing Java-CORBA interface we are now integrating a Java front end to a distributed CORBA based collaboration tool called the Collabratorium. The Java tool allows a user at a remote site to be authenticated and check out and check in documents from a distributed repository.

6.2.4: HPJava -- What Should be the Influence of CORBA?

The motivating applications for HPJava, such as DIS, have an inherent degree of distribution. This makes remote invocations (as well as partial processing) a must. That leads to two possible options:

extend the built-in execution model of Java to allow remote object invocations or
use Java objects in conjunction with an appropriate ORB to achieve parallel processing.

The former approach can be implemented either by extending the syntax of Java (which may go against the simplified nature of Java!) or by using "remote invocation" classes. The later option is more direct, but needs a presence of a "reliable" ORB and a Java front end to it. Either of these options will allow the possibility of security violations. Consequently, we need to consider the addition of a distributed authentication system which works with the Java safety mechanisms.

One of the major shortcomings of CORBA as a model for highly parallel distributed computing is that the mechanisms for collective operations are non-existent. Extension to CORBA to allow for a type of ``collective object'' have been defined and are being prototyped now. Collective objects have the following properties.

A collective object is shared by a "registered" group of client objects.
Member functions of a collective object can be like normal CORBA members, or they can be "collective". Collective members allow a set of client objects to invoke the member function as a group. For example, reduction, broadcast and parallel prefix operations are collective operations. All members of a group must invoke the member before any member gets the the value returned.
A value returned from a collective operation may be a "future" value. That is, a future is a promise for the delivery of a value. By returning a future to the calling client object, that client can overlap computation with the time it takes a collective object to compute and return a value.

The collective object concept is easily extended to an interface type in HPJava. In addition, the introduction of futures into HPJava would allow multi-threaded applications much greater flexibility and potential concurrency than the existing Java "synchronized" class members.

6.2.5: HPJava -- Additional Observations

Another shortcoming of Java is that there is very little support for defining aggregate types. In the HPC++ project, the C++ standard template library has been extended to support the use of STL containers to describe distributed collections of objects. A simple type extension to Java would allow a interface for aggregates of "runnable" objects or aggregates of proxies to remote objects. Such an extension does not have to follow the C++ template style. The Pizza proposal shows an elegant solution to this problem.

6.3. Low level runtime issues -- PCRC Distributed Memory

(8.1.1syr, 8.1.2syr, 8.1.3syr, 8.2.4ind) SYRACUSE (Including Ranka at Florida)

6.3.1: PCRC Interface issues

We consider coding a PCRC-like high-performance runtime system (including Parti/Chaos) in Java. There are two basic interface issues:

1)The interface with the calling language.

Is the library supposed to be callable from languages other than Java e.g., Fortran with message passing (HPF) or C/C++ with message passing (HPC/C++)? This presents some problems:

How to manipulate the externally declared data arrays from within Java. Without low-level pointers this could be problematic.
Is the Java VM restarted whenever Java routines are entered, or is does it continue running in the background, somehow, whenever control returns to the calling program?

2)The interface with the underlying process control and communications library, e.g., PVM or MPI.

The new PCRC library is likely to consist of around 150 routines for regular computations, and about 60 Maryland CHAOS routines for handling irregular cases. A pre-requisite for porting these codes to Java is that a smooth interface between Java and an underlying communication system (e.g. MPI) be established.
At present, certain computational functions may have higher performance in other languages: e.g., MATMUL may be better left in Fortran, since Fortran still has better number-crunching performance, and more easily handles situations like COMPLEX data types.

6.3.2: Performance Issues

Java can be used in high performance environments in the following ways:

1. Used as a mechanism for invocation of codes or libraries already compiled on other machines.
2. Used as a mechanism for writing machine independent code by running Java bytecode on every processor.

Option 1 has potential portability problems while option 2 has potential performance problems. One of the main reasons for using a parallel server for data parallelism would be to achieve high performance. This is especially for the low level runtime library. Hence, an important issue to study would be the performance tradeoffs of using Java as compared with other contemporary languages for array manipulation routin es required for local computation (both for using the interpretive mode as well as compilation into native machine code).

The interface between Java and low level communication systems must be better understood and benchmarked. For instance, Java now can call a C function when that function has been specially written to be called by Java. We need to experiment and evaluate how the Java runtime environment can collaborate with the MPI runtime environment.

6.3.3: Advantages of the Java language

One advantage of coding in Java is the built-in multithreading. A run-time library to support full HPF should provide demand-driven access to remote memory. (A clear example is the case where multiple instances of a PURE function are executing in parallel in a MIMD fashion inside a FORALL. HPF allows these functions to asynchronously read distributed arrays in a common block. Similar situations could arise in translation of independent DO looks.)

This is a situation where a background memory-server thread is required on each processor. There a number of other situations in HPF translation where the library is simplified by use of multiple logical threads.

Non-premptive threads can be simulated if they are not provided by the language (this is explicitly done in the run-time library of the Southampton HPF system, for example). Using ``real'' threads would simplify the implementation.

6.4 Low level runtime issues -- Threads (Rochester)

(8.2.1roc, 8.2.2roc, 8.2.3ind)

Threads in Java are supported for concurrency. This is useful as perceived by a user, when response time is important, e.g. the HotJava browser where you can scroll a page while down-loading an image. It currently does not support threads running on separate processors. Our main objective in this section is to support parallel threads, at no expense to the Java programmer. We would prefer not to change, modify or limit the language as perceived by the user.

There seem to be two possible scenarios:

1) Replace the current Java thread package: implement a completely new native threads package for each architecture to be supported. Use this package instead of the Java thread class.: The advantage of this approach is that no special compiler support is required. It is less susceptible to changes in the language. The disadvantage is that this might not be possible. If the interpreter actually implicitly "implements" threads where it just branches from one context to another on a yield or a scheduling call, then replacement might not be feasible. This obviously requires a more complete knowledge of how the interpreter behaves w.r.t thread library calls.
2) Add a new parallel thread package: implement a parallel thread package for each architecture that is independent of the Java thread class. Maintaining the original classes may be useful for relatively cheap multi-threading on a single processor and also we may not have a choice!
The advantage is that we may be able to implement the parallel threads mechanism independent of the interpreter to the extent that a Java compiler can insert calls to the parallel threads package directly when appropriate. The disadvantage is that the compiler has to do more work distinguishing between parallel threads and normal Java threads and when it is legal to replace the latter with the former.

At Rochester, we are particularly interested in implementing a parallel thread package for Java with special compiler support on distributed shared-memory (DSM) machines. On scalable multiprocessors, communication is much more expensive than computation. Many different protocols have been proposed for behavior-driven locality management. None of them work best for all types of data sharing. Performance would be increased if the system could adapt to the program's memory access patterns. An example could be the use of multithreading to alleviate memory latency (especially on remote accesses) aggravated by delays introduced by the memory coherency protocol. We propose to analyze programs at compile time and annotate them with data access summary information. Exploiting this information will require the ability to switch among multiple data coherence protocols at run time.

7. Architecture Issues for Integration of Java with HPCC

7.1. Shared Memory Architecture Issues

(5.1roc, 5.2roc, 5.3roc, 5.4roc) ROCHESTER

A shared-memory programming model provides an easily accessible programming paradigm for users. Many future machines will support shared-memory, from small-scale multiprocessor workstations, tightly-coupled distributed shared-memory multiprocessors, to software-based virtual shared-memory over a network of workstations. Compiling Java programs on shared-memory architectures is critical for high-performance Java programs.

We foresee two basic ways to compile Java programs on shared-memory architectures (a combination of those two may also be an interesting option).

(i) Make the interpreter aware of the underlying implementation. One may expect that computer vendors will implement customized interpreters for their parallel machines. Such interpreters could either run Java threads in parallel or exploit parallelism in some other way.
(ii) Compile Java programs to native code and run them directly as parallel applications. Note that to do that, we do not have to compile Java source to native code directly (cf. Sec. 5.2). As a matter of fact we don't need access to the source at all -- it is possible to compile bytecode programs by either using just-in-time compilation, or by compiling them once into a.out file which can be run many times (this is a j2c-like approach).
(iii) A combination of those two techniques may use precompiled parallel versions of some common code fragments. We can provide native versions of whole classes which would be drop-in replacements for Java classes. We can also implement "parallel versions" of common data structures like arrays or lists. If the internal representation of those data structures was the same as in programming languages traditionally used for HPCC, we could leverage existing libraries with parallel implementations for shared-memory machines.

There are a number of shared-memory related issues which should be addressed for each of the approaches mentioned above. Locality optimizations would improve average memory latency for machines with memory hierarchies. Locality properties can be optimized by using data and loop transformations. We envision many compiler-time optimizations that will reduce the cost of communication by eliminating communication or overlapping communication and computation. Examples include {\em eager sending}, in which a producer thread injects newly-created data into the (remote) cache of a consumer thread as soon as that data is ready, and {\em self-invalidation}, in which a thread that does not expect to use data again before it is modified voluntarily removes the copy from its local cache, eliminating the need for later invalidation. Data access and computation summaries can also be generated at compile time, to help the run time system. Dynamic load balancing, Adaptive memory coherence, are two examples which can benefit from this type of information. This will be built on the current compiler project for DSM at Rochester.

7.2. Heterogeneity

(5.5roc, 5.6ind) INDIANA from first draft

From 5.5roc) Compiling Java for heterogeneous networks of machines:

There is heterogeneity in various aspects of parallel programming: program, processor, memory and network. A heterogeneous program has parallel loops with different amount of work in each iteration; heterogeneous processors have different speeds; heterogeneous memory refers to the different amount of user-available memory on the machines; and a heterogeneous network has different cost of communication between processors. We need both compiler and runtime support.

From 5.6ind)General Issues:

Interoperability/transparency is a strong point behind design of Java (creation of bytecode, etc), HP Java should well suit any target architecture.

Different precompiled classes could be a way to tackle this.

7.3. Fault-tolerance and Reconfigurability

(7.1.6tex) TEXAS

A major differnce between traditional HPCC applications and the Super Grand Challenge Applications which will result from integration as described in Section 1.3 is that they must continue to execute in the presence of reconfiguration of resources and agents and in the presence of faults and user errors. Fault-tolerance is itself a major research area. For Java, however, we need not worry about the semantics of applications but rather only about tolerance of faults which arise from coordination and synchronization errors. This is a much narrower and well-studied subject. The basic technology which is needed can be incorporated into the interaction specifications of Java objects.

Reconfiguration and recovery from faults have highly similar content. The algorithms for reconfiguring across dynamic resource sets are similar to those for continuous operation in the presence of faults. A critical element is a state-based model of communication as described in Section 6.1.

(more later- JCB)

7.4 Security

(7.2.3ind) INDIANA

8. Some Early Experiements.

Details could go into appendices

8.1.Migration and Mobile Programs

(7.4.1umd, 7.4.2umd, 7.4.3umd) (UMD)

8.1.1:Mobile Programs

The University of Maryland is pursuing a project designed to provide a single unified framework for remote access and remote execution in the form of a type of agent we call an itinerant program. Itinerant programs can execute on and move between the user's machine, the servers providing the datasets to be processed or third-party compute servers available on the network. The motion is not just client-server; it can also be between servers. Programs move either because the current platform does not have adequate resources or because the cost of computation on the current platform or at the current location in the network is too high. Various parts of a single program can be executed at different locations, depending upon cost and availability of resources as well as user direction.

The architecture also provides support for plaza servers. Plaza servers provide facilities for:

execution of itinerant programs,
storage of intermediate data,
monitoring cost, usage and availability of local and remote resources on behalf of itinerant programs and alerting them if these quantities cross user-specified bounds,
routing messages to and from itinerant programs, and
value-added access to other servers available on the network.

Examples of value-added functionality include conversions to and from standard formats and operations on data in its native format. In addition to allowing cost and performance optimization, itinerant programs and plaza servers also provide a flexible framework for extending the interface of servers, in particular legacy servers with large volumes of data as well as servers that are not under the control of the user. The University of Maryland has a prototype system that supports plaza servers and itinerant programs. Program migration is currently carried out either in response to an external signal or when a potentially itinerant program invokes the necessary runtime support. We are currently in the process of adapting resource monitoring software so that we can demonstrate the ability to carry out program migration in response to fluctuations in processor and network load.

8.1.2:Coupling Data Parallel Programs

Maryland has developed prototype software to demonstrate methods designed to flexibly and efficiently couple multiple data parallel and sequential programs at runtime. The program coupling software is used to allow the coordinated use of multiple separately developed parallel programs in solving complex problems. The program coupling software will also be used to couple itinerant programs.

Maryland has developed a layer of runtime support (called Meta-Chaos) to establish mappings and to carry out communication between data structures in different sequential or data parallel programs. Efficient data movement is achieved by pre-computing an optimized communication schedule. Our approach is to define a set of functions that each runtime library should export and to build Meta-Chaos on top of this minimal API.

At the level above Meta-Chaos, interaction between programs occurs through a user-specified consistency model. Mappings are established at runtime and can be added and deleted while the programs being coupled are in execution. Mappings, or the identity of the processors involved, do not have to be known at compile-time or even link-time. A-priori knowledge of consistency requirements by a runtime library allows buffering of data as well as concurrent execution of the coupled applications.

8.1.3:Periodic Data and Computation Servers

The integration of program coupling software into itinerant programs will make it straightforward to develop long-running itinerant programs that process periodically generated data. We anticipate that a collection of itinerant programs will process sequences of data from multiple sources where each source may have a different periodicity. An itinerant program can either visit each of the data sources in order or it can install a surrogate at each data source, which is activated every time the dataset is updated. A surrogate acts as a filter for the sequence of data generated by the data source and communicates appropriate updates to the parent itinerant program (possibly after preliminary processing). For complex processing on a number of datasets, a hierarchy of surrogates and result combining programs can be created. The result of such itinerant programs can either be a single accumulated result or a sequence of results.

8.2. Relevant Experiments at NPAC

(7.4.4syr, 7.4.5syr) SYRACUSE from first draft

8.2.1: from 7.4.4syr) WebWork: Integrated Programming Environment Tools for National and Grand Challenges

This is our joint work with Boston University and Cooperating systems with paper written before implications of Java as clear as they are now!

This was implemented as CGI (PERL) scripts and succesfully applied to factoring on the Web

8.2.3: From 7.4.5syr) Computational Web Multiserver - a proposal for Web based HPCC

This updates some of the ideas in WebWork

Web technology development has so far focused on the client side (Netscape2, plugins, Java applets, JavaScript)

Web servers, so far minimal (e.g. NCSA httpd), are expected to expand now with the growing demand on backend processing (database, collaboratory, imaging, compression, rendering, simulation, televirtuality).

Java offers a promising development platform for computationally extended Web servers. Current activities include:

Java based httpd prototypes
Java based distributed languages such as E by Electric Communities

Java is also an attractive platform for PCRC module packaging and unification of HP languages. Advantages of Java based HPruntime include:

secure, architecture neutral, OO but simpler than C++ framework
runtime modules dynamically downloadable (via portable opcodes)
native class support for performance critical (data parallel) ops
built-in OO support for networking and multithreading
Java compiler written in Java => portable HP extensibility

Computational Web server = Java based Web server with native HPCC kernel support (e.g. for matrix operations), extensible by PCRC modules

HPCC builds on pervasive Web technologies
Web browser based interactive insight into HPCC runtime

Higher, fully interpreted layer with intuitive scripting syntax required for rapid prototyping and high level runtime communication. JavaScript/LiveWire (now formalized/standardized by W3C) is one possible candidate.

However, more attractive approach is to develop high level VM on top of low level Java VM to support variety of interpreted protocols. (One existing example: postfix syntax based low Telescript layer). We call it WebScript and view JavaScript, LiveWire, VRML as specific instances/little languages. The resultant Computational Web Multiserver Architecture is:

runtime given by a mesh of computational web servers
several communication layers possible (HTTP, Java opcode passing, Java class passing, E-object message passing, WebScript passing)
applications written in conventional languages (HPF/C++) or HPJava, HPJavaScript (HPWebScript)
base runtime support server-resident, application-specific runtime libraries downloaded dynamically and shared

Summary of software layers:

a) Java VM
b) base Java classes (as in JDK 1.0)
c) native HPCC classes (compiler directive handlers, matrix algebra)
d) PCRC classes (Java or Java wrappers)
e) WebScript VM (in Java)
f) HP-WebScript instances (e.g. HP-JavaScript)
g) applications

At NPAC, we developed early protoypes of some of these concepts (with ARPA support) and demonstrated at SC'93. MOVIE system was an early example of computational "web" multiserver ("web" didn't exist before '93..). MOVIE + F90D based HPFI was an early example of high performance interpreted computational multiserver. More specifically, the layers a) - g) above were implemented as follows:

a) MOVIE/TCE kernel (C with object-based organization)
b) built-in (C coded) objects (string, dictionary, hashtable etc.)
c) matrix algebra objects (fields = polymorphic n-dim arrays)
d) F90D runtime, wrapped and linked to MOVIE via UNIX shared mem
e) MovieScript VM (part of MOVIE kernel)
f) HPMS (High Performance MovieScript = MovieScript + interpreted F90)
g) HPF test applications, converted by modified F90D frontend to HPMS

8.3. At Indiana -- Indiana Responsible for Name

(8.2.5ind) INDIANA from first draft

From 8.2.5ind) Implications of Current Indiana Research on HPC++ and CORBA:

Kaehey's Collective Shared Objects are objects that live on the net. they can be distributed or localized on a particular server. These object can have data members which may also be distributed or localized and they have two types of member functions: regular and collective. regular ones are norman member functions like standard java. collective member functions implment collective opeartions such as reductions, scans, broadcasts between participating clients.

A client applet must "register" with such an object before it would be able to talk to do collective member operations. It turns out that this is an interesting way to model distributed data bases, metachaos style shared arrays. it was the essential component of our I-way project for sc 95.

8.4 Early Experiments -- Java Compiler Infrastructure (Rochester)

As part of our effort to better understand the compilation issues, at Rochester we are developing a Java compiler which could be used to optimize Java programs.

The compiler would use a common intermediate representation and a collection of front- and back-ends to facilitate re-use of components and independence of optimization passes.

There are two possible front-ends: one would read Java source, the other one would read bytecode.

A set of back-ends would include Java source generation, bytecode generation, and "native" code generation (either in form of some target HLL like C, or actual native code for some architectures).


                                  /=====> Java
Java       ----\                 /
                ======> IR  =====-------> bytecode
bytecode   ====/                 \
                                  \-----> C (or other "native")

Our current development concentrates on a bytecode front-end and Java back-end (shown above with double lines). The choice was made, because using bytecode as an input is more general (if we have access to source, we also have access to its bytecode form -- the converse is not true). Using Java as output has a big advantage in the development phase.

The implementation is not complete: only a subset of legal bytecode constructs is implemented and the definition of our IR is not stable. However, preliminary results are promising. We are able to identify implicit high level constructs such as loops with good success. Also our high level IR makes implementation of code transformation relatively simple.

Code instrumentation is our first application of the compiler framework. We are developing a pass which inserts instrumentation calls. Currently we are instrumenting loops, if-then-else statements, and methods. We are planning to look into the performance of a range of Java applications. The objective is to determine if there are any performance issues that are unique to Java the language (the fact that it is interpreted), and/or applications written in Java (it could be the case that only certain kinds of applications are worth writing in Java).

An Example.

We are given a bytecode representation of a method corresponding to the
following disassembled code.

Method void paintAll(java.awt.Graphics)
   0 iconst_0
   1 istore_2
   2 goto 19
   5 aload_0
   6 aload_0
   7 getfield #7 <Field PCRCexample.CellList [I>
  10 iload_2
  11 iaload
  12 aload_1
  13 invokevirtual #6 <Method PCRCexample.plotCell(ILjava/awt/Graphics;)V>
  16 iinc 2 1
  19 iload_2
  20 aload_0
  21 getfield #4 <Field PCRCexample.n I>
  24 if_icmplt 5
  27 aload_0
  28 aload_1
  29 invokevirtual #3 <Method PCRCexample.updateStatusLine(Ljava/awt/Graphics;)V
>
  32 return

We can parse the class file and recover a high-level intermediate representation which corresponds to the following source (this was generated automatically by our Java-source back-end).


   public void paintAll (
         java.awt.Graphics a1
      )
   {
      int v0;
      v0 = 0;
      while((v0 < this.n)) {
         this.plotCell(this.CellList[v0], a1);
         v0 = (v0 + 1);
      } //while
      this.updateStatusLine(a1);
      return;
   } // paintAll

Since our IR explicitly identifies high-level constructs (e.g.,loops), it is easy to instrument the code and/or perform optimizing transformations.

Section 8.5 (Rochester) Early Experiments -- Network-based Data Mining

We have implemented a preliminary version of network-based data mining at Rochester. Currently, the server is a passive entity in terms of the data mining computation. It's only job is to accept client connections and ship off requested data. The mining process is carried out at the client end. Both the client and server are currently sequential.

We now present some timing results vis-a-vis a C++ implementation of the algorithm which doesn't transmit any data over the network. Below we show two synthetic databases from IBM we looked at:


Database      Num_Trans Avg_Trans_Size  Total_Size
-----------------------------------------------------------
T5.I2.D100K   100,000         5        2.615 Mbytes
T10.I4.D100K  100,000         10       4.31 Mbytes
These database are accessible only to the server through a
file-server. 

We now present the total execution time for the algorithm. The
Java_Time includes the server disk-read time, transmission time and
client computation time. C++_Time only gives the disk-read time and
computation time. 
Database      Total_Size    Java_Time   C++_Time  
--------------------------------------------------
T5.I2.D100K   2.615 Mbytes    367.7s    15.3s     
T10.I4.D100K  4.31 Mbytes     1847.2s   78.9s

The break-up of the execution time is as follows:
Database    Total_Size  Java:Reading Computation C++:Reading Computation  
------------------------------------------------------------------------
T5.I2.D100K   2.615 Mbytes    292.6s   75.1s     8.8s     6.5s  
T10.I4.D100K  4.31 Mbytes     478.4s   1368.8s   13.8s    65.1s

In Java the net disk-read time and shipping time for data is at the rate of
9Kb/sec vs. 0.31Mb/sec disc read time in C++.

From the experiments, we make two observations:

(1) interpretation is too slow for computationally intensive applications (in the case of database T10.I4.D100K, 1368s vs 65s);
(2) we should pay the network communication cost only when the server is too overloaded to perform the computation.

We will address these issues in our future research.

8.6: Cooperating Systems Experiments

These will be displayed at their Site.

References

BAY94] B. Bayerdorffer,,"Distributed Programming with Associative Broadcast" Proceedings 28th HICSS '95, Jan. 1995.

GEL85] D. Gelernter," Generative communication in linda", ACM Trans. Prog. Languages and Systems, vol. 7, no. 1, Jan. 1985