DRAFT: This document is not yet finalized. Please do not quote.

Java Grande Forum

Report of the Concurrency & Application Working Group

September 28, 1998

Preface
I. Critical JDK Issues

1. Performance of Object Serialization

1.1. Slim Encoding of Type Information
1.2. Efficient Serialization of Floats and Doubles
1.3. Reflection Enhancements
1.4. Implementation Improvements

2. Performance of RMI

2.1. Enhanced Connection Management
2.2. Careful Resource Consumption
2.3. Custom Transport

3. Other Suggestions

3.1 Source Code Availability
3.2 Class Compatibility
3.3 Dynamic Class Loading from Remote Hosts

II. Benchmarks
III. Seamless Computing
IV. Other Parallel Issues
V. Participants in the Applications & Concurrency Working Group
VI. References

Preface

By Dennis Gannon and George Thiruvathukal

The primary concern of the Java Grande Forum (hereafter JGF) is to ensure that the Java language, libraries and virtual machine can become the implementation vehicle of choice for future scientific and engineering applications. The first step in meeting this goal is to implement the complex and numerics proposals described in the report of the Numerics Working Group. Accomplishing this task provides the essential language semantics needed to write high quality scientific software. However, more will be required of the Java class libraries and runtime environment if we wish to capitalize on these language changes. The Java Grande Forum Applications & Concurrency Working Group (hereafter ACG) focuses on these issues.

It is possible that many of the needed improvements will be driven by commercial sector efforts to build server side enterprise applications. Indeed, the requirements of technical computing overlap with those of large enterprise applications in many ways. For example, both technical and enterprise computing applications can be very large and they will stress the memory management of the VM. The demand for very high throughput on network and I/O services is similar for both. Many of the features of the Enterprise Bean model will be of great importance to technical computing.

But there are also areas where technical computing is significantly different from Enterprise applications. For example, fine grained concurrency performance is substantially more critical in technical computing where a single computation may require 10,000 threads that synchronize in frequent, regular patterns. These computations would need to run on desktops as well as very large, shared memory multiprocessors. In technical applications, the same data may be accessed again and again, while in enterprise computing there is a great emphasis on transactions involving different data each time. Consequently, memory locality optimization may be more important for Grande applications than it is elsewhere in the Java world. Some technical applications will require the ability to link together multiple VMs concurrently executing on a dedicated cluster of processors which communicate through special high performance switches. On such a system, specialized, ultra low latency versions of the RMI protocol would be necessary. (In such an environment, an interface to shared memory, via RMI or the VM, would also be desirable.)

It is important to observe that there are problems which can be described as technical computing today which will become part of the enterprise applications of the future. For example, image analysis and computer vision are closely tied to application of data mining. The processing and control of data from arrays of sensors has important applications in manufacturing and medicine. The large scale simulation of non-linear mathematical systems is already finding its way into financial and marketing models. It is also the case that many technical computing applications do impact our day-to-day lives, such as aircraft simulation (the recent design of the Boeing 777) and weather forecasting. At least in the case of aircraft design, the industry has a valuation in the billions of dollars, which means it is far from being merely niche area being of limited interest.

This document is part of a larger report being written by the Java Grande Forum members. There are two major sections of this report: Numerics and Concurrency/Applications. For all practical purposes, each of these documents is self-contained and thus can be read separately.

Organization

This section of the Java Grande Report pertains to Concurrency/Applications. It is organized as follows:

critical JDK issues

benchmarks
seamless computing
other parallel and distributed computing issues

In this report, we present preliminary findings of the working group. We welcome a continuing discussion of these issues. Please send questions or comments to javagrandeforum@npac.syr.edu.

I. Critical JDK Issues

By Michael Philippsen and George Thiruvathukal

Sequential VM performance is of utmost importance to develop Grande applications. Since there are many groups working on this issue, the ACG simply provides some additional kernel benchmarks illustrating performance aspects in areas that are particularly important for Grande applications.

In addition to sequential VM performance, Grande applications require high performance for parallel and distributed computing. Although some more research is needed on other paradigms that might be better suited for parallelism in Java, this report will focus on RMI (Java's remote method invocation mechanism), since there is wide-spread agreement on both the general usefulness and the deficiencies of RMI.

In general, RMI provides the capability of allowing objects running in different JVMs to collaborate. Current RMI is specifically designed for peer-to-peer client/server-applications that communicate over the commodity Internet or over Ethernet. For high performance scientific applications, however, some of RMI's design goals and implementation decisions are inappropriate and cause serious performance limitations. This is especially troublesome on platforms targeted by the Java Grande community, i.e., closely connected environments, e.g. clusters of workstations and DMPs that are based on user-space interconnects like Myrinet, ParaStation, or SP2. On these platforms, a remote method invocation may not take longer than a few 10 microseconds.

The choice of a client/server model may prove too limited for many applications in scientific computing, which usually take advantage of collective communication as found in Message Passing Interface (MPI); however, the goal of this document is not to propose such sweeping changes to RMI. We rather choose to focus on how to make RMI, a client/server design, suitable for Grande applications with no changes to the core model itself. It is our hope that a better RMI design and implementation will stimulate community activities to support better communication models that are well-suited to solving community problems.

ISSUE 1 : Performance of Object Serialization

Requirement: Fast remote method invocations with low latency and high bandwidth are essential, especially in fine grained areas of science and engineering. Since object serialization and parameter marshaling are the mechanisms used for passing parameters to remote calls and the associated cost(s) amount to a significant portion of the cost of remote method invocation, serialization should be as fast as possible.

In an ideal solution, the exact byte representation of an object would be sent over the network and turned back into an object at the recipient's side without any unnecessary buffering and copying.

The ACG understands that the JDK's object serialization is used for several purposes, e.g. for long term storage of persistent objects and for dynamic class loading on remote hosts via http-servers. It is obvious that some of these special purpose uses require properties that are either costly to compute at runtime (latency) or that are verbose in their wire representation (bandwidth).

However, since some of these features are not used in Grande applications, there is room for improvement. The following subsections identify particular aspects of the current implementation of serialization that result in bad performance. The problems are described, and some solutions are suggested. Where possible, some benchmark results demonstrate the quantitative effects of the proposed solution.

Experiment: Experiments at Amsterdam [4] indicate that easily up to 30% of the run time of a remote method invocation are spent in the serialization, most of which can be avoided by compile time serialization.

1.1 : Slim Encoding of Type Information

Problem: For every type of object that is serialized, the current implementation prepends a complete description of the type, i.e., all fields of the type are described verbosely. For a single serialization connection, every type is marshaled only once. Subsequent objects of the same type use a reference number to refer to that type description. Type description is useful when objects are stored persistently and when the recipient does not have access to the byte code representation of the type.

When RMI uses the serialization for marshaling of method parameters, a new serialization connection is opened for every single method invocation. (More specifically, the reset method is called on the serialization stream.) Hence, type information is marshaled over and over again, thus consuming both latency and bandwidth. The current implementation cannot keep the connection open, because the serialization would otherwise refrain from re-sending arguments with modified instance variables. (Note, that the whole structure of objects reachable from argument objects is serialized; one of the objects deeply burried in that graph might have changed.)

Approach: For Grande applications it can be assumed that all JVMs that collaborate on a parallel application use the same file system, i.e., can load all classes from a common CLASSPATH. Furthermore, it is safe to assume that the life time of an object is within the boundaries of the overall runtime of the application. Hence, there is no need to completely encode and decode the type information in the byte stream and to transmit that information over the network.

Experiment: At Karlsruhe University, a slim type encoding has been implemented prototypically [1]. It has improved the performance of serialization significantly by avoiding the latency of complicated type encoding and decoding mechanisms. Moreover, some bandwidth can be saved due to the slimmer format. Figure AC-1 shows the runtime of standard serialization in the first/blue row and the runtime of the improved serialization with slim type encoding in the second/red row. The effect is much more prominent on the side of the reader (right two bars, 2) than on the side of the writer (left two bars, 1).

Figure AC-1: Serialization with slim type encoding

Solution: The ACG sees two options to avoid costly encoding of type information.

An additional ProtocolVersion is added to the implementation of serialization. Both the rmiregistry and the RMI user programs must select the new protocol version. For that purpose, the rmiregistry will need a new command line option (protocol number). The RMISocketFactory will need a new method setProtocolVersion(int) that RMI user programs must call.
Alternatively, some performance improvements would be reached if the object serialization would offer two types of reset methods. Similar to the current reset method, one method would reset the cached information on both types and objects. The other method would reset only the information on objects that have already been sent.
The more general solution is similar to the socket factory approach where the user can provide the name of the class that implements his own implementation of object serialization. Both the rmiregistry and the RMI user programs must select the new protocol version. For that purpose, the rmiregistry will need a new command line option (class name). Similar to RMISocketFactory, an RMISerializationFactory is added to the API that RMI user programs must initialize.

The ACG favors the first approach because it will require less changes in the current specification. Although the second approach would help a little, the complete type information would still be sent, yet less frequently.

1.2 : Handling of Floats and Doubles

Problem: Since in scientific applications, floats and arrays of floats are used frequently (the same holds for doubles), it is absolutely essential, that these data types are packed and unpacked efficiently. Nevertheless, the current serialization does not handle these primitive types efficiently.

The conversion of these primitive data types into their equivalent byte representation is (on most machines) a matter of a type cast. However, in the current implementation, the type cast is implemented in the JNI and hence requires various time consuming operations for check pointing and state recovery upon JNI entry and JNI exit. Moreover, the serialization of float arrays (and double arrays) currently invokes the above mentioned JNI routine for every single array element.

Approach:

The costly JNI-entry/exit can be avoided by teaching the JITs to recognize and inline the conversion of single floats or doubles into byte arrays. The conversion routine could even be included into the JVM.
For arrays, it is much faster to enter the JNI only once for the whole array (or at least for the section that still fits in the available communication packet).
If the JNI code must be retained, at a minimum we believe a single JNI call should be made to convert multiple scalars. See section on Reflection Enhancements.

Experiment: A prototype at Karlsruhe University (see Figure AC-2, [1]) stresses the effect of this approach.

Figure AC-2: Serialization of float arrays

Solution: It is absolutely essential that

floats and doubles are turned into their corresponding wire representation as efficiently as possible and that
arrays of floats and doubles are serialized efficiently.

And since there are easy ways to do it that do not require any change of an API, this should be done in the next release of the JDK.

1.3 : Reflection Enhancements

Problem: A byte array is used as communication packet. During serialization, every single instance variable is copied into/from that byte array individually. Most of this copying is done at the Java level using reflection. (As mentioned above, only for floats and doubles the JNI is asked to return the appropriate byte representation.)

In contrast to zero-copy protocols that are used (or attempted to be used) in other messaging protocols, where data is copied directly from user data structures to the network interface board, at least two complete copy operations (one at the sender side and one at the recipient side) are needed in every pure Java implementation. (Unfortunately, current implementations do much more copying that this.) Although the ACG regrets it, it seems very unlikely that future JVM implementations will closely interact with the communication mechanisms to allow for zero-copy protocols. This seems to be one of the price tags caused by Java's portability that can only be avoided by compile time serialization, see [4].

Approach: Obviously, there should be as few copy operations as possible. Those that remain should be performed as fast as possible.

Solution: Better performance can be achieved in two ways

The object stream opens up its communication buffer for public access so that the object itself can use a writeExternal routine to write into that buffer immediately, i.e., without reflection.

Since the ACG does not believe that the communication buffer will ever be made public, the following proposal seems to be more promising.

The reflection mechanism should be enhanced so that it can copy all instance variables into a buffer at once with a single method call. For example, class Class could be extended to return an object of a new class ClassInfo:
ClassInfo getClassInfo(Field[] fields);
The object of type ClassInfo then provides two routines that do the copying to/from the communication buffer.
int   toByteArray(Object obj, int objectoffset,
                  byte[] buffer, int bufferoffset);
int fromByteArray(Object obj, int objectoffset,
                  byte[] buffer, int bufferoffset);
The first routine copies the bytes that represent all the instance variables into the communication buffer (ideally the network interface board), starting at the given buffer offset. The first objectoffset bytes are left out. The routine returns the number of bytes that have actually been copied. Hence, if the communication buffer is too small to hold all bytes, the routine must be called again, with appropriately modified offsets.

Experiment: A serialization with better buffering and less copying has been implemented at Karlsruhe University [1]. The prototype is based on an enhanced reflection mechanism, that can deal with all instance variables of an object at once. (This has been hacked into native code that is called through the JNI.) The performance numbers are shown in figure AC-3 [1]. In the figures, "std" refers to the standard serialization, "ext" indicates the fact that the object provides a writeExternal routine, "buf" incorporates more efficient internal buffering, and "native" uses the hack that is supposed to be replaced by an enhance reflection.

Figure AC-3: Serialization of several instance variables

The ACG strongly recommends to allow for more efficient handling of the communication buffer, since performance of serialization can be increased by a factor of more than 10.

1.4 : Implementation Improvements

Problem: In addition to the suggestions for improved implementation of object serialization, object serialization should take about the same time for writing and reading. Otherwise, the pipeline between sender and receiver shows an imbalance resulting in wasted time.

Experiment: In the current implementation, it is much faster to produce the byte array representation of an object than it is to recreate the object. The severity of this problem, that is probably caused by the overhead of object creation in Java, can be seen in Figures AC-1 to AC-3 above.

Approach: The ACG cannot provide any specific suggestions here.

ISSUE 2 : Performance of RMI

Requirement: Fast remote method invocations with low latency and high bandwidth are essential, especially in fine grained areas of science and engineering. Apart from object serialization, there are other aspects of RMI that should be improved.

Experiment: Experiments at Amsterdam [4] indicate that easily up to 30% of the run time of a remote method invocation are spent in the RMI implementation. Other 30% are spent in the network. By compiling Java code to native, a latency of a few 10 microseconds for a remote method invocation can be achieved on Myrinet. About the same performance is needed for jitted pure RMI code on custom interconnection hardware.

2.1 : Improved Connection Management

Problem: The current RMI implementation closes and re-opens connections to other hosts too frequently. For object serialization streams, that causes the re-transmission of type information, as has been discussed above. Re-creation of socket connections is a costly operating system job.

Approach: Whereas RMI's approach seems to be useful in case of unstable wide area networks, for closely connected JVMs this approach is far too pessimistic. Socket connections should be left open for the duration of the user program. At least, it might make more sense to keep a working set and use a least recently used algorithm to kick connections out of the working set.

Solution: To achieve this, the user should have an option to switch RMI into that mode. Again, this requires a command line option for the rmiregistry and another method in the RMISocketFactory class.

2.2 : Careful Resource Consumption

Problem: In the current RMI implementation, every single remote object uses a port of its own. In addition, a thread is started monitoring that port. Every remote method invocation causes the creation of a new socket and a new thread. In addition, a watch dog thread monitors the state of a connection. Since object creation is known to be quite slow in current JVMs and since thread creation is even slower, this approach is not amenable for high performance comptuing, especially for fine grained problems.

Approach: Instead of creating new objects and threads almost on every RMI activity, the internal layers of RMI should reuse sockets. Moreover, worker threads from a thread pool should repeatedly service incoming requests. The underlying asumption is that the network is quite stable and communication failure will cause failure of the whole application.

Problem: Operating systems typically allow one thread to monitor several socket connections by means of a select statement. There is currently no way for a Java thread to do the same. Instead, one Java thread is needed for every single open socket connection. This is not only a significant overhead for the JVM's thread sceduler but it prevents efficient implementations of socket based custom transports layers for RMI.

Solution: The JPACWG strongly recommends that JDK's sockets should provide a select statement. The RMI transport implementation should then make use of it.

2.3 : Custom Transport

Requirement: Fast remote method invocations with low latency and high bandwidth are essential, especially in fine grained areas of science and engineering. It is absolutely essential that non-Ethernet special purpose high-end communication hardware can be used.

Problem: It is almost impossible, to use specialized, high performance network protocols through RMI. Although SCI, ATM AAL5, Myrinet, ParaStation, Active Messages are all used in technical applications, there is no straightforward way for Java applications to make use of them.

RMI is tied to TCP. The current RMI implementation is tightly connected to TCP sockets. Whenever UnicastRemoteObject and UnicastServerRef are used, a TCP transport is selected implicitly.
The transport cannot be replaced individually, since the implementation does not properly isolate the transport layer. More specifically, Unicast source code has lots of type casts that explicity use TCP sockets and their methods. Hence, to get rid of the TCP transport, a completely new type of server needs to be implemented by hand as well.
Based on the current implementation and documentation of RMI, this work is comparable to re-inventing most of RMI's functionality. Although deeply burried in the internal layers of the RMI implementation, the transport seems to be plugged in very flexibly, there is no way that flexibility can be exploited without proper documentation.
Non-Socket-Services. Most available high speed protocols do not operate on the socket level, e.g. FM and Nexus. Although RMI offers a socket factory where user defined socket classes can be plugged in, this is unsufficient because it would require a costly socket protocol to be implemented on top of the high speed base functionality.
There should be a way to make use of non-socket protocols in RMI. The following aspects need to be considered:
- For sockets (and TCP/IP) most of the necessary protocol initialization is done by the operating system. This explains that the current implementation does not offer appropriate interfaces to plug in specific protocol initialization code as needed by other protocols, especially by Nexus. It might be possible to use the static class initializer of a abstract transport protocol class for that purpose but that has to be determined.
- Another related problem that is caused by the fact that the current approach is based on sockets is the lack of an interface to provide addressing information to interact with other transport protocols. IP and DNS names are not the only meaningful forms. For example, on the SP2 or a Cray T3E, a combination of IP# and node number. (e.g., quad.mcs.anl.gov/25, for node 25 at quad.mcs.anl.gov) is used.
User-Level-Services. Some high speed protocols offer socket like functionality in the user level, i.e., the operating system is left out for performance reasons.
Since these protocols offer socket functionality, it seems to be easy to wrap them in a Java socket class that is then plugged into the RMI socket factory. Unfortunately, that approach does not work properly.
The problem is that SocketImpl returns file descriptors that are later on used by the JVM's thread scheduler. It issues calls on read/write/select methods which are only useful on the level of the operating system. For example, a select might be executed that waits for the arrival of a communication packet in the kernel although that packet has already arrived at the user level.
Hack around. The only reasonable way to go seems to be to load a dynamic library that replaces the standard operating system calls with those that can handle the high end communication hardware. This replacement is done unnoticed by the JVM. Such an approach has been successfully tested at the University of Karlsruhe on ParaStation hardware, it required an undesirable source code modification of the rmiregistry implementation (to load the dynamic library before main).

The JPACWG strongly recommends that RMI will be extended to allow for high-speed communication hardware to be used. Since the source code of the lower layers of the RMI implementation are not public, no specific suggestions can be made in this report. The minimal requirement is that the RMI group provides documentation on how to plug in alternative transport implementations.

ISSUE 4 : Other Suggestions

4.1 : Source Code Availability

Problem: The RMI implementation needs class files from several packages. A lot of the code is contained in sun.rmi.*. For this code, the Java source is not released. In particular, there is no source code for the various versions of JDK 1.2.

There are people who are willing to work on that code to improve it's runtime efficiency. Because of the missing source code (and the fact that de-compilers do not recreate the comments) the missing source code these people are prevented from doing work the ACG is interested in.

Solution: The ACG suggests that Sun will include the source code of current RMI implementations either in the standard JDK distribution or will create a process so that interested research groups can get access to it before a release gets final.

4.2 : Class Compatibility

Problem: The class compatibility test is too stringent. In the case where one is not using NFS (more common than the case where one is using it), it appears one can take the same code (unmodified), compile it with the same version of JDK (resulting in the same class files), and cause an exception, because the classes are deemed incompatible.

Solution: There may be no easy fix for this. Compatibility should probably be defined in terms of the version of Java being used as it is now; however, two classes with the same class and instance variables should be compatible, provided they are compiled with the same Java compiler.

Experiment: None known. Related to this problem is section 4.3 for which an experimental prototype has been constructed.

4.3 : Dynamic Class Loading from Remote Hosts

Problem: The RMI implementation includes a network class loader (for which very little documentation exists on its purpose or usage). The basic idea and purpose of a network class loader is to load classes from the network instead of a local file system. The essence of how to build a network class loader is shown in the Class Libraries Reference by Lee and Chan (accomplished by extending the ClassLoader class in Java).

The notion of loading classes from the network is already known to be useful in the case of applets, where the code lives on the server in a code directory (often called the code "base") and can be downloaded to a web browser and loaded, subject to being verified by the class verifier. This is well known not to be without problems, because different browsers may not support the same Java (see Class Compatibility above); however, in principle it makes the life of a client easy.

In the case where a browser is not being used, it is still useful to consider the notion of a network class loader. In scientific and technical computing involving a large number of nodes (as found in massively parallel machines and clusters), the use of a network class loader can facilitate the deployment of code on a large number of nodes where there is not a shared file system (e.g., NFS or AFS). As it currently stands, when file systems are not shared, the programmer is forced to copy the class files to all nodes in a network where file systems are not shared. This can be quite cumbersome, although systems such as Unix provide remote commands to make the task easier. For Windows users, there is presently inadequate support for remote copying of files and remote execution, meaning that Windows file sharing must be used (still leaving one without adequate remote execution facilities). When heterogeneous systems are involved, the problem of deployment is even more complex, where everyone involved needs to know system administration.

Experiment: Thiruvathukal, Thomas, and Korczynski of Loyola University and jhpc.org have shown how to extend the class loader for a version of RMI called RRMI [3], where the client can start very thin and most, if not all, classes are loaded from the network. The Globus project has shown the importance of having facilities for remote execution and job control.

II. Benchmarks

By Martin Westhead

The purpose of developing a suite of benchmark tests is to provide ways of measuring and comparing alternative Java execution environments. In constructing this benchmark the aim is to provide a series of tests which challenge the execution environments in ways which are important to Grande applications. The benchmark has been divided into three sections:

Low level operations - measuring the performance of low level operations such as serialization, RMI, garbage collection.
Kernels - short codes which carry out specific operations frequently used in grande applications.
Large scale applications - real grande codes, less useful for comparative performance studies but worthwhile to demonstrate the potential of Java for tackling real problems.

ACG feels that it is important wherever possible that the benchmarks should be open source to allow users to understand exactly what the benchmark is testing.

EPCC in the University of Edinburgh is coordinating this effort, at time of writing, and are in the process of collecting existing benchmark tests to form a coherent suite. The aim of this collation is to have a package that can be downloaded and run as a whole, with consistent output format and a consistent definition of terms. This suite can then be added to over time as more and better tests are developed. The work being done at Edinburgh is ongoing and details can be found at:

http://www.epcc.ed.ac.uk/research/javagrande/
Comments suggestions and contributions from the forum and the community at large are invited and any contributions would be warmly welcomed (please email: epcc-javagrande@epcc.ed.ac.uk).

The following sections elaborate on the proposed structure for the benchmark including examples of the tests that could be included. In most cases these tests are either available or have been promised.

It is anticipated that up to three different sizes of dataset should be available for section 2 (and maybe even two for section 3).

The user also should be able to run:

any benchmark individually, no matter what section
all of section 1
all of section 2 with a specific application size
all of section 3 with a specific application size

It seems unlikely that users would wish to run the entire benchmark suite in one go, but this facility could be added for completeness.

For timed benchmarks, the benchmarks will report a raw time (in seconds) and a performance measure in X/second, where the operation X is specific to the given benchmark (e.g. flops, method calls, iterations, etc.). Non-timing benchmarks will typically report a single figure (e.g. maximum size of object), but units should be consistent (all memory sizes in bytes, for example).

Low Level Operations

loop overhead
access variables and arrays
method invocation
execution of arithmetic operations
casting
object and array creation/instantiation
exception handling
memory management and garbage collection (Piyush Mehrorta)
thread creating / switching
threads - synchronization, scalability (Dennis Gannon)
I/O scalability (Dan Reed?)
RMI/serialization (Michael Philippsen/Bernhard Haumacher)

Kernels

FFT
Numerical integration
SOR
LU Factorisation
Sparse Matrix multiply
Video/audio (de)compression
Searching
Sorting(George Thiruvathukal)

Large Scale Applications

Parallel Geophysics Operations in Java(University of Karlsruhe)
Monte Carlo simulation (NCSA/EPCC)
Discrete Event Simulation (INRIA/EPCC)
Image Analysis, Radio Astronomy (NCSA)
Gravitational N-Body simulations (Indiana)
Computational Fluid Dynamics (Syracuse)

III. Seamless Computing

For the average scientist and engineer one of the greatest difficulties in doing large scale computation is the constant struggle requires to port applications to a new environment. This involves the following tasks:

Dealing with authentication and authorization at a remote site.
Finding the required libraries to be able to link the application.
Understanding the batch scheduler.
Moving files from one site to the new one and then moving results back.
Comprehending the local parallel file system.
Cataloging and recording application changes and experimental results.

A seamless technical computing environment would allow a Java based programming environment that could provide a uniform interface to all these remote resources. Java based agents can be installed at each site which cooperate with the user and guide him through the resource discovery and authorization process and provide an integrated development environment for using these remote resources.

It is possible that such a system can be built on top of some of the existing and emerging meta-computing infrastructures. Many of these provide the tool kit and components to build on, and a few have partial solutions to the problems listed above.

Related projects are:

Unicore
Globus
Legion
Condor
Sweb
Websubmit
Abatros

With a collective effort of the ACG it should be possible to do much more. The ACG envisions an interface that is widely supported by the hardware vendors, much like the JDBC interface for database access.

Such an interface should deal with the following issues:

Authentication
Resource discovery (metadata)
Resource allocation (Grande cache)
Accounting
Resource and job Scheduling
Job monitoring
Code/Data-filesystem
Connected to resource discovery service (JINI)

To define such an interface, a close examination of the various technologies is in order.

IV. Other Parallel Issues

The role parallel computation plays in high performance technical computing cannot be under estimated. There are several ways to building a Java parallel computing environment. The following approaches are controversial and require community research.

Grande parallelism design patterns. One approach is to take the experience of the last ten years of parallel programming and build a set of Grande parallelism design patterns that can be cast as a set of interfaces and base classes that simplify the task of writing parallel Grande applications. This API can then be hosted on either a set of concurrently executing JVMs or in an environment where large numbers of native threads are well supported. Such an API may be as simple as defining a truly object oriented version of MPI (see below), or it may define a new category of distributed object aggregates and collective operations.
Grande Beans. A second approach that may be more consistent with current Java directions would be to design a Grande Bean specification that extends the basic Bean model to one appropriate for technical applications. This would follow what has been done with Enterprise Beans for transaction oriented business applications. The Enterprise Beans model has allowed CORBA based resources to be woven into a unified component model. Grande beans can build upon this to incorporate high end, parallel computational modules and visualization and VR tools into a grid of resources controlled by the JVM on the users desktop system.
MPI. The ACG agrees that MPI is necessary in the Java environment. However, it is controversial how to make MPI available.

Pure bindings

Interface style issue
Not feasible or sensible to do whole MPI
Issues with heterogeneous language interaction

Extended MPI flavor, JMPI, truly object oriented

Object model: value semantic in procedure calls, remote reference management, problems for large objects [API]
Threads issues?

V. Participants in the Applications & Concurrency Working Group

The following individuals contributed to the development of this document at the Java Grande Forum meetings on May 9-10 and August 6-7 in Palo Alto, California.

Geoffrey Fox, University of Syracuse

Dennis Gannon, Indiana, Chair

Vladimir Getov, University of Westminster, UK

Piyush Mehrotra, ICASE

Michael Philippsen, University of Karlsruhe, Germany

Omer Rana, University of Wales, Cardiff, UK

Tony Skjellum. MPI-Softtech

Henry Sowizral, Sun

George Thiruvathukal, Loyola University, Chicago

Julien J.P. Vayssiere, Schlumberger House, Norway

Martin Westhead, EPCC, University of Edinburgh, UK

The following additional individuals also contributed comments which helped in the development of this document.

Henri Bal, Vrjie Universiteit Amsterdam, The Netherlands

Denis Caromel, INRIA, France

Bernhard Haumacher, University of Karlsruhe, Germany

VI. References

[1]	The benchmarks have been performed on a 300 Mhz UltraSparc II with JDK 1.2beta3, JIT enabled. JDK 1.2 (production release) shows numbers that are similar in ratio.
[2]	Fabian Breg, Shridhar Diwan, Juan Villacis, Jayashree Balasubramanian, Esra Akman, and Dennis Gannon: RMI Performance and Object Model Interoperability: Experiments with Java/HPC++. Concurrency: Practice and Experience, volume 10, 1998, to appear.
[3]	G.K. Thiruvathukal, L.S. Thomas, and A. T. Korczynski: Reflective Remote Method Invocation. Concurrency: Practice and Experience, volume 10, 1998, to appear.
[4]	Ronald Veldema, Rob van Nieuwpoort, Jason Maassen, Henri E. Bal, and Aske Plaat: Efficient Remote Method Invocation. Technical Report IR-450, Dept. of Computer Science, Vrjie Universiteit Amsterdam, The Netherlands, September 1998.
[5]	Jim Waldo: Remote Procedure Calls and Java Remote Method Invocation. IEEE Concurrency. Pages 5-7, September 1998.