Next: Overview of the Architecture Up: Thoughts on the structure Previous: Acknowledgements

Some design decisions

As noted above, an MPJ ``reference implementation'' can be implemented as Java wrappers to a native MPI implementation, or it can be implemented in pure Java. It could also be implemented principally in Java with a few simple native methods to optimize operations (like marshalling arrays of primitive elements) that are difficult to do efficiently in Java. In this note we will focus on the latter possibilities--essentially pure Java, although experience with DOGMA and other systems strongly suggests that optional native support for marshalling will be desirable. The aim is to provide an implementation of MPJ that is maximally portable.

We envisage that a user will download a jar-file of MPJ library classes onto machines that may host parallel jobs. Some installation ``script'' (preferably a parameterless script) is run on the potential host machines. This script installs a daemon on those machines (probably by registering a persistent activatable object with an existing rmid daemon). Parallel java codes are compiled on any host. An mpjrun program invoked on that host transparently loads all the user's class files into JVMs created on remote hosts by the MPJ daemons, and the parallel job starts. The only required parameters for the mpjrun program should be the class name for the application and the number of processors the application is to run on. These seem to be an irreducible minimum set of steps; a conscious goal is that the user need do no more than is absolutely necessary before parallel jobs can be compiled and run.

In light of this goal one can sensibly ask if the step of installing a daemon on each host is essential. On networks of UNIX workstations--an important target for us--packages like MPICH avoid the need for special daemons by using the rsh command and its associated system daemon. Dispensing with the need for special installation procedures on target hosts would be a significant gain in simplicity, so this option needs serious consideration. In the end we decided this is probably not the best approach for us. Important targets, notably networks of NT systems, do not provide rsh as standard, and often on UNIX systems the use of rsh is complicated by security considerations. Although neither RMI or Jini provide any magic mechanism for conjuring a process out of nothing on a remote host, RMI does provide a daemon called rmid for restarting activatable objects. These need only be installed on a host once, and can be configured to survive reboots of the host. We propose to use this Java-centric mechanism, on the optimistic assumption that rmid will become as widely run across Java-aware platforms as rshd is on current UNIX systems.

An implementation ought to be fault-tolerent in at least the following senses. If a remote host is lost during execution, either because a network connection breaks or the host system goes down, or if the JVM running the remote MPJ task halts for some other reason (eg, occurrence of a Java exception), or if the process that initiated the MPJ job is killed--in any of these circumstances--all processes associated with the particular MPJ job must shut down within some (preferably short) interval of time. On the other hand, unless it is explicitly killed or its host system goes down altogether, the MPJ daemon on a remote host should survive unexpected termination of any particular MPJ job. Concurrent tasks associated with other MPJ jobs should be unaffected, even if they were initiated by the same daemon. These requirements likely put some restrictions on the portability of the daemon. They probably imply at least the ability to create a new JVM on demand, for example by using Runtime.exec to execute the java command. This facility is available in the major operating systems we target (UNIX and NT).

In the initial reference implementation we will probably use Jini technology[1, 7] to facilitate location of remote MPJ daemons and to provide a framework for the required fault-tolerance. This choice rests on our guess that in the medium-to-long-term Jini will become a ubiquitous component in Java installations. Hence using Jini paradigms from the start should eventually promote interoperability and compatibility between our software and other systems. In terms of our aim to simplify using the system, Jini multicast discovery relieves the user of the need to create a ``hosts'' file defining where each process of a parallel job should be run. If the user actually wants to restrict the hosts, unicast discovery is available. Of course it has not escaped our attention that eventually Jini discovery may provide a basis for much more dynamic access to parallel computing resources.

Less fundamental assumptions bearing on the organization of the MPJ daemon are that standard output (and standard error) streams from all tasks in an MPJ job are merged non-deterministically and copied to the standard output of the process that initiates the job. No guarantees are made about other IO operations--for now these are system-dependent. Rudimentary support for global checkpointing and restarting of interrupted jobs would be useful, although we doubt that checkpointing would happen without explicit invocation in the user-level code, or that restarting would happen automatically.

The main role of the MPJ daemons and their associated infrastructure is thus to provide an environment consisting of a group of processes with the user-code loaded and running, and running in a reliable way. As indicated above, the process group is reliable in the sense that no partial failures should be visible to higher levels of the MPJ implementation or the user code. As already explained, partial failure is the situation where some members of a group of cooperating processes are unable to continue because other members of the group have crashed, or the network connection between members of the group has failed. To quote [11]: partial failure is a central reality of distributed computing. No software technology can guarantee the absence of total failures, in which the whole MPJ job dies at essentially the same time (and all resources allocated by the MPJ system to support the user's job are released). But total failure should be the only failure mode visible to the higher levels. Thus a principal role of the base layer is to detect partial failures and cleanly abort the whole parallel program when they occur.

Once a reliable cocoon of user processes has been created through negotiation with the daemons, we have to establish connectivity. In the reference implementation this will be based on Java sockets. Recently there has been interest in producing Java bindings to VIA [4, 12]. Eventually this may provide a better platform on which to implement MPI, but for now sockets are the only realistic, portable option. Between the socket API and the MPJ API there will be an intermediate ``MPJ device'' level. This is modelled on the abstract device interface of MPICH [10]. Although the role is slightly different here--we don't really anticipate a need for multiple device-specific implementations--this still seems like a good layer of abstraction to have in our design. The API is actually not modelled in detail on the MPICH device, but the level of operations is similar.

Next: Overview of the Architecture Up: Thoughts on the structure Previous: Acknowledgements

Bryan Carpenter
Tue Nov 23 10:50:26 EST 1999