For the sake of concrete discussion we assume that, in the kind of Grande applications where MPI is likely to be used, some of the most pressing performance issues concern arrays and multidimensional arrays of small objects--especially arrays of primitive elements such as ints and floats. Our benchmarks therefore concentrate on the overheads introduced by object serialization when the objects contain many arrays of primitive elements--specifically, on communication of two-dimensional arrays with primitive elements.
The ``ping-pong'' method was used to time point-to-point communication of an N by N array of primitive elements treated as a one-dimensional array of N objects, and to compare it with communication of an array of $N^2$ elements sent without serialization. As an intermediate case we also timed communication of a 1 by $N^2$ array treated as a one-dimensional (size 1) array of objects. This allows us to extract an estimate of the overhead to ``serialize'' an individual primitive element. The code for sending and receiving the various array shapes is given schematically in Figure 2.
Figure 2: Send and receive operations for various array shapes.
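As a rough sketch (not the figure itself) of how the three shapes might be expressed with the mpiJava point-to-point API--the buffer names, destination and tag below are ours--the send side could look like the following; the receive side mirrors these calls with Recv:

  import mpi.*;

  class ShapeSends {
      // Schematic send side of the three benchmark cases for float data.
      // N, dst and tag are assumed to be defined by the surrounding benchmark.
      static void send(int N, int dst, int tag) throws Exception {
          float[][] a = new float[N][N];          // N x N: N row objects, serialized
          MPI.COMM_WORLD.Send(a, 0, N, MPI.OBJECT, dst, tag);

          float[][] b = new float[1][N * N];      // 1 x N^2: a single serialized object
          MPI.COMM_WORLD.Send(b, 0, 1, MPI.OBJECT, dst, tag);

          float[] c = new float[N * N];           // N^2 floats sent directly, no serialization
          MPI.COMM_WORLD.Send(c, 0, N * N, MPI.FLOAT, dst, tag);
      }
  }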
As a crude timing model for these benchmarks, one can assume that there is a cost $t_{\rm ser}^T$ to serialize each primitive element of type $T$, an additional cost $t_{\rm ser}^{\rm vec}$ to serialize each subarray, similar constants $t_{\rm unser}^T$ and $t_{\rm unser}^{\rm vec}$ for unserialization, and a cost $t_{\rm com}^T$ to physically transfer each element of data. Then the total times for the benchmarked communications should be
\begin{eqnarray}
t_{T[N^2]}    &=& c_1 + t_{\rm com}^T N^2 \\
t_{T[1][N^2]} &=& c_2 + (t_{\rm ser}^T + t_{\rm unser}^T + t_{\rm com}^T) N^2 \\
t_{T[N][N]}   &=& c_3 + (t_{\rm ser}^{\rm vec} + t_{\rm unser}^{\rm vec}) N + (t_{\rm ser}^T + t_{\rm unser}^T + t_{\rm com}^T) N^2
\end{eqnarray}
These formulae do not attempt to explain the constant initial overheads, do not take into account the extra bytes of type description that serialization introduces into the stream, and ignore possible non-linear costs associated with analysing object graphs, etc. Empirically these effects are small for the range of N we consider.
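For concreteness, a small sketch of this model (with the constants $c_1$ to $c_3$ set to zero, as in the comparison of Figure 3) could be coded as follows; the class, field and method names are ours:

  // Predicted times (microseconds) for the three cases, constants c_i set to 0.
  // tSer, tUnser: per-element serialization/unserialization costs;
  // tVecSer, tVecUnser: per-subarray costs; tCom: per-element transfer cost.
  class SerializationModel {
      double tSer, tUnser, tVecSer, tVecUnser, tCom;

      double raw(int n)       { return tCom * n * n; }                       // t[T[N^2]]
      double oneObject(int n) { return (tSer + tUnser + tCom) * n * n; }     // t[T[1][N^2]]
      double nObjects(int n)  { return (tVecSer + tVecUnser) * n
                                     + (tSer + tUnser + tCom) * n * n; }     // t[T[N][N]]
  }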
All measurements were performed on a cluster of 2-processor, 200 MHz UltraSparc nodes connected through a SunATM-155/MMF network. The underlying MPI implementation was Sun MPI 3.0 (part of the Sun HPC package). The JDK was jdk1.2beta4. Shared-memory results quoted were obtained by running two processes on the processors of a single node; non-shared-memory results were obtained by running peer processes on different nodes.
In a series of measurements, the element serialization and unserialization timing parameters, $t_{\rm ser}^T$ and $t_{\rm unser}^T$, were estimated by independent benchmarks of the serialization code. The parameters $t_{\rm ser}^{\rm vec}$ and $t_{\rm unser}^{\rm vec}$ were estimated by plotting the difference between serialization and unserialization times for $T[1][N^2]$ and $T[N][N]$. The raw communication speed, $t_{\rm com}^T$, was estimated from ping-pong results for $T[N^2]$. Table 1 contains the resulting estimates of the various parameters for byte and float elements.
Table 1: Estimated parameters in the serialization and communication timing model. The two values quoted for $t_{\rm com}^T$ are respectively for non-shared-memory and shared-memory implementations of the underlying communication. All timings are in microseconds.
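As an illustration of how such estimates can be backed out of the measurements--a simplified reading on our part, since the element costs were actually obtained from independent benchmarks of the serialization code--one could fit the parameters from the three measured ping-pong times at a given N like this:

  // Rough extraction of model parameters (constants c_i neglected) from
  // measured ping-pong times at a given N.  Variable names are ours:
  // tRaw, tOneObj, tNObj are measured one-way times in microseconds for
  // T[N^2], T[1][N^2] and T[N][N] respectively.
  class ParameterFit {
      static double[] fit(int n, double tRaw, double tOneObj, double tNObj) {
          double tCom          = tRaw / ((double) n * n);             // per-element transfer
          double tSerPlusUnser = (tOneObj - tRaw) / ((double) n * n); // per-element (un)serialization
          double tVecPlusUnvec = (tNObj - tOneObj) / n;               // per-subarray (un)serialization
          return new double[] { tCom, tSerPlusUnser, tVecPlusUnvec };
      }
  }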
Figure 3 plots actual measured times from ping-pong benchmarks for the mpiJava sends and receives of arrays with byte and float elements. In the plots the array extent, N, ranges between 128 and 1024. The measured times for $T[N^2]$, $T[1][N^2]$ and $T[N][N]$ are compared with the formulae given above (setting the $c$ constants to zero). The agreement is good, so our parametrization is assumed to be realistic in the regime considered.
Figure 3: Communication times from ping-pong benchmark in non-shared-memory and shared-memory cases. The lines represent the model defined by Equations 1 to 3 in the text, with parameters from Table 1.
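For reference, a minimal version of the ping-pong timing pattern behind these measurements (our own sketch, not the original benchmark code) is given below: rank 0 sends the object array to rank 1, which echoes it back, and half the averaged round-trip time approximates the one-way communication time.

  import mpi.*;

  class PingPong {
      // Minimal ping-pong timing sketch; assumes exactly two processes.
      public static void main(String[] args) throws Exception {
          MPI.Init(args);
          int rank = MPI.COMM_WORLD.Rank();
          int N = 512, reps = 100, tag = 0;
          float[][] buf = new float[N][N];          // N x N case, sent as MPI.OBJECT

          double start = MPI.Wtime();
          for (int r = 0; r < reps; r++) {
              if (rank == 0) {
                  MPI.COMM_WORLD.Send(buf, 0, N, MPI.OBJECT, 1, tag);
                  MPI.COMM_WORLD.Recv(buf, 0, N, MPI.OBJECT, 1, tag);
              } else if (rank == 1) {
                  MPI.COMM_WORLD.Recv(buf, 0, N, MPI.OBJECT, 0, tag);
                  MPI.COMM_WORLD.Send(buf, 0, N, MPI.OBJECT, 0, tag);
              }
          }
          double elapsed = MPI.Wtime() - start;     // seconds for reps round trips
          if (rank == 0)
              System.out.println("one-way time (us): " + 0.5e6 * elapsed / reps);
          MPI.Finalize();
      }
  }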
According to Table 1 the overhead of Java serialization nearly always dominates other communication costs. In the worst case--floating point numbers--it takes around 2 microseconds to serialize each number and a smaller but comparable time to unserialize it, yet only a few hundredths of a microsecond to communicate the word through shared memory. Serialization slows communication by nearly two orders of magnitude. When the underlying communication is over a fast network rather than through shared memory, the raw communication time is still only a fraction of a microsecond, and serialization still dominates that time by about one order of magnitude. For byte elements serialization costs are smaller, but still larger than the communication costs over the fast network and much larger than the communication cost through shared memory. Serialization costs for int elements are intermediate.
The constant overheads for serializing each subarray, characterized by the parameters $t_{\rm ser}^{\rm vec}$ and $t_{\rm unser}^{\rm vec}$, are also quite large, although for the array sizes considered here they only make a dominant contribution for the byte arrays, where individual element serialization is relatively fast.