In concurrent computing environments based on heterogeneous processing elements interconnected by general-purpose networks, several classes of overheads contribute to lowered performance. The most obvious limitations are network throughput and latency, but certain other factors also play a significant role. In an attempt to gain some insight into the nature of these overheads, and to propose strategies to alleviate them, empirical measurements of native communication performance as well as application execution performance were conducted, using the PVM network computing system.
These experiments and our analyses have identified load imbalance, the
parallelism model adopted, communication delay and throughput, and
within-host overheads as the primary factors affecting performance in
cluster environments. Interestingly, we find that agenda parallelism
and load balancing strategies contribute significantly more to better
performance than improved communications or system tuning. Drawing
general conclusions on how these inefficiencies may be overcome is
inadvisable because of the tremendous variability of many parameters in
general-purpose network environments; we therefore propose several
potential approaches, including model selection criteria, partitioning
strategies, and software system heuristics, to reduce overheads and
enhance performance in network-based environments.
Short Code: [Schmidt:94a]
Reference: Vol. 6, No. 1, pp. 1--32 (C153)
Program unification is a technique for source-to-source transformation
of code for enhanced execution performance on vector and SIMD
architectures. This work focuses on simple examples of program
unification to explain the methodology and demonstrate its promise as a
practical technique for improved performance. Using simple examples to
explain how unification is done, we outline two experiments in simulation.
Empirical tests of unified code on a CRAY Y-MP
multiprocessor show that unification improves execution performance by a
factor of roughly 8 for the given application. The technique is general in
that it can be applied to computation-intensive programs in various
data-parallel application domains.
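To make the idea concrete, the following minimal sketch (Python, with an invented toy update rule rather than anything from the paper) unifies several instances of the same scalar computation into a single data-parallel program, which is the kind of transformation that lets vector and SIMD hardware process one instance per lane:

    import numpy as np

    def simulate_one(x0, rate, steps):
        """One instance of a toy computation: a scalar state updated step by step."""
        x = x0
        for _ in range(steps):
            x = x + rate * (1.0 - x)
        return x

    def simulate_unified(x0s, rates, steps):
        """The 'unified' version: the same update applied to all instances at once,
        so a vector or SIMD machine can process one instance per lane. A hedged
        sketch of the idea only; the paper's transformation works on real source code.
        """
        x = np.asarray(x0s, dtype=float)
        r = np.asarray(rates, dtype=float)
        for _ in range(steps):
            x = x + r * (1.0 - x)            # one vector operation replaces a loop over instances
        return x

    # Illustrative use: eight parameter sets run as a single vectorized program.
    x0s   = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
    rates = [0.05 * (i + 1) for i in range(8)]
    unified = simulate_unified(x0s, rates, steps=100)
    assert np.allclose(unified, [simulate_one(a, b, 100) for a, b in zip(x0s, rates)])
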
Short Code: [Chuang:94a]
Reference: Vol. 6, No. 1, pp. 33--53 (C159)
An algorithm is presented that solves a linear advection-diffusion
problem using a least-squares formulation and a conjugate gradient
method to solve the corresponding minimization problem. An
implementation in CM-Fortran on a Thinking Machines CM-2 is compared
with a serial implementation on an IBM RS6000. The maximum speed-up
obtained is a factor of 70. For fine grids, the CPU time scales
almost ideally when the number of processors is increased from 4096 to
8192.
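As an illustration of the solver component, here is a minimal conjugate gradient sketch in Python/NumPy; the toy tridiagonal operator and right-hand side are placeholders, not the paper's least-squares discretization of the advection-diffusion problem:

    import numpy as np

    def conjugate_gradient(apply_A, b, x0, tol=1e-8, max_iter=1000):
        """Solve A x = b for a symmetric positive-definite A given as a callback."""
        x = x0.copy()
        r = b - apply_A(x)          # residual
        p = r.copy()                # search direction
        rs_old = r @ r
        for _ in range(max_iter):
            Ap = apply_A(p)
            alpha = rs_old / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs_old) * p
            rs_old = rs_new
        return x

    # Illustrative use: normal equations A = L^T L of a least-squares formulation,
    # where L is a placeholder 1-D second-difference stencil (not the paper's operator).
    n = 32
    main = 2.0 * np.ones(n); off = -1.0 * np.ones(n - 1)
    L = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)
    A = L.T @ L
    b = A @ np.ones(n)                                   # exact solution = ones
    x = conjugate_gradient(lambda v: A @ v, b, np.zeros(n))
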
Short Code: [Berggren:94a]
Reference: Vol. 6, No. 1, pp. 55--68 (C160)
Recently developed block-iterative versions of some row-action algorithms for solving general systems of sparse linear equations allow parallelism in the computations when the underlying problem is appropriately decomposed. However, problems associated with the parallel implementation of these algorithms have to be addressed.
In this paper we present an implementation on distributed memory multiprocessors of a block version of the Kaczmarz row-action method. One of the main issues related to the efficient implementation of this method in a concurrent environment is to develop suitable communication schemes in order to reduce the amount of communication needed at each iteration.
We propose two data distribution strategies which lead to different computation and communication schemes.
To verify and compare the effectiveness of the proposed strategies,
numerical experiments have been carried out on a Symult S2010 and a
Meiko Computing Surface. The performance evaluation has been done
using a scaled efficiency model.
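For readers unfamiliar with the method, the following serial sketch shows the numerics of one block Kaczmarz sweep; the random system, the two row blocks and their assignment to notional processors are illustrative, and the communication schemes that are the subject of the paper are not modelled here:

    import numpy as np

    def block_kaczmarz_sweep(A, b, x, blocks):
        """One sweep of the block Kaczmarz (row-action) method: each block projects
        the current iterate onto the affine subspace defined by its rows,
        x <- x + pinv(A_i) @ (b_i - A_i @ x). Serial sketch only.
        """
        for rows in blocks:
            Ai, bi = A[rows], b[rows]
            x = x + np.linalg.pinv(Ai) @ (bi - Ai @ x)
        return x

    # Illustrative use on a small consistent system; each block stands in for the
    # rows held by one processor in a distributed decomposition.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((8, 6))
    x_true = rng.standard_normal(6)
    b = A @ x_true
    blocks = [np.arange(0, 4), np.arange(4, 8)]
    x = np.zeros(6)
    for _ in range(200):
        x = block_kaczmarz_sweep(A, b, x, blocks)
    print(np.linalg.norm(A @ x - b))      # residual shrinks over the sweeps
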
Short Code: [Apuzzo:94a]
Reference: Vol. 6, No. 1, pp. 69--84 (C77)
The paper presents a Conservative Time Window (CTW) algorithm for
parallel simulation of discrete event systems. The physical system to
be simulated is partitioned into $n$ disjoint sub-systems, each of
which is represented by an object. The CTW algorithm identifies a
time window for each object, such that events occurring in each window
are independent of events in other windows and thus they can be
processed concurrently. The CTW algorithm was implemented on a shared
memory multiprocessor, a Sequent Symmetry S81 with 16 processors. We
measured performance of the CTW algorithm on two types of network
topologies: feed-forward networks and networks with feedback loops.
We used three metrics to measure performance: speed-up, average number
of independent windows detected by the algorithm, and average number
of events occurring in each window. We also investigated the impact
of various event scheduling policies on performance. The results
obtained show that the CTW algorithm produces good performance in many cases.
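The window-identification idea can be sketched as follows (a hedged illustration only, not the paper's CTW algorithm: the lookahead-based safety bound, the object and topology encodings, and the example network are assumptions):

    def conservative_windows(objects, topology, lookahead):
        """Identify, for each object, a time window of events that are safe to
        process concurrently. A sketch of the conservative-window idea only.

        objects:   dict name -> sorted list of pending event timestamps
        topology:  dict name -> list of names that send events to it
        lookahead: dict name -> minimum delay an object adds to events it emits
        """
        windows = {}
        for obj, events in objects.items():
            if not events:
                continue
            senders = topology.get(obj, [])
            # Earliest timestamp a sender could still produce for this object.
            horizon = min((min(objects[s], default=float('inf')) + lookahead[s]
                           for s in senders), default=float('inf'))
            safe = [t for t in events if t < horizon]
            if safe:
                windows[obj] = (safe[0], horizon)     # events in [first, horizon)
        return windows

    # Illustrative use: a two-object feed-forward network.
    objects   = {'A': [1.0, 4.0], 'B': [2.0, 3.0, 9.0]}
    topology  = {'A': [], 'B': ['A']}
    lookahead = {'A': 5.0, 'B': 5.0}
    print(conservative_windows(objects, topology, lookahead))
    # B's window closes at A's next event time plus A's lookahead (6.0).
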
Short Code: [Ayani:94a]
Reference: Vol. 6, No. 2, pp. 119--142 (C164)
We develop a balanced, parallel quicksort algorithm for a hypercube
and compare it with a similar algorithm for a binary tree machine. The
performance of the hypercube algorithm is measured on a Computing
Surface.
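The following simulated sketch conveys the hypercube scheme: in each dimension the nodes of a subcube agree on a pivot and exchange keys with their partner across that dimension, so that after d exchange steps a local sort yields a globally sorted sequence. The pivot choice (subcube median) and the list-based "nodes" are simplifications, not the paper's algorithm:

    import random

    def hypercube_quicksort(local, d):
        """Simulated hyperquicksort on 2**d nodes: a sketch of the idea only; a
        real implementation broadcasts one node's median and exchanges keys
        with message passing.

        local: list of 2**d lists, local[i] holding node i's keys.
        Returns per-node lists that, concatenated in node order, are sorted.
        """
        p = 1 << d
        for dim in reversed(range(d)):
            step = 1 << (dim + 1)
            for root in range(0, p, step):                # one pivot per subcube
                cube = range(root, root + step)
                sample = sorted(x for i in cube for x in local[i])
                pivot = sample[len(sample) // 2] if sample else 0
                for i in cube:
                    partner = i ^ (1 << dim)
                    if i < partner:                       # handle each pair once
                        both = local[i] + local[partner]
                        local[i] = [x for x in both if x <= pivot]
                        local[partner] = [x for x in both if x > pivot]
        return [sorted(keys) for keys in local]

    # Illustrative use on a 3-dimensional hypercube (8 simulated nodes).
    random.seed(1)
    data = [[random.randint(0, 99) for _ in range(4)] for _ in range(8)]
    print(hypercube_quicksort(data, 3))
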
Short Code: [Hansen:94a]
Reference: Vol. 6, No. 2, pp. 143--151 (C199)
We present two strategies for the simulation of massive neural networks on message-passing MIMD machines. In the first strategy all interconnections between neurons are stored explicitly in interconnection matrices. During simulation, every processor is responsible for certain submatrices of these interconnection matrices. The fact that message-passing MIMD processors do not provide virtual memory seriously limits the size of the networks that can be simulated, since interconnection matrices require huge amounts of memory.
An alternative strategy is not to store the connections explicitly, but generate the interconnections as they are needed. This circumvents memory limitations, but because interconnections need to be generated multiple times, it is inherently slower than the first implementation.
This yields the connections dilemma: the choice between fast simulation of small networks as against slower simulation of massive networks.
We present, analyze and bench-mark parallel implementations for both strategies. An efficient connection-look-up algorithm, which can be used for any network with static interconnections, ensures that simulation times for the second strategy are only marginally longer than for the first strategy. We show that for our users the connections dilemma is no longer a dilemma: by means of our look-up algorithm the simulation of massive networks becomes possible; furthermore the time to design and construct a network, prior to simulation, is considerably shorter than it is for the matrix version, and in addition this time is independent of network size.
Although we have implemented both strategies on a parallel computer,
the algorithms presented here can be used on any machine with memory
limitations, such as personal computers.
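The second strategy can be sketched by regenerating each connection weight deterministically from its endpoint indices whenever it is needed, so that no interconnection matrix is ever stored; the hash-based generator below is an assumption for illustration and is not the paper's connection-look-up algorithm:

    import hashlib
    import struct

    def connection_weight(pre, post, seed=42):
        """Regenerate the weight of connection (pre -> post) on demand instead of
        storing it, using a deterministic hash of the indices. Sketch only; the
        paper's look-up algorithm and network topologies are more elaborate.
        """
        digest = hashlib.sha256(struct.pack('>III', seed, pre, post)).digest()
        (raw,) = struct.unpack('>Q', digest[:8])
        return raw / float(1 << 64) * 2.0 - 1.0      # uniform weight in [-1, 1)

    def activate(neuron, inputs, seed=42):
        """Weighted sum of one neuron's inputs, regenerating each weight."""
        return sum(connection_weight(pre, neuron, seed) * value
                   for pre, value in inputs.items())

    # Illustrative use: the same connection always yields the same weight,
    # so no interconnection matrix has to be kept in memory.
    assert connection_weight(3, 17) == connection_weight(3, 17)
    print(activate(17, {0: 1.0, 1: 0.5, 3: -0.25}))
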
Short Code: [Tollenaere:94a]
Reference: Vol. 6, No. 3, pp. 153--191 (C183)
In the paper three distinct approaches to the implementation of an
application in linear algebra on the AMT DAP 510 using the
Fortran-Plus Enhanced language and compiler are investigated. In the
first the implementation is tailored to fit a virtual array processor
whose edge size corresponds to that of the maximum dimension of the matrix
used in the application. The second approach is a special case of the
first, and corresponds to a situation in which the virtual array
processor is constrained to have the same dimensions as the AMT DAP
510. The third approach may be considered to be a hybrid approach.
In this case an implementation of the application is constructed which
incorporates the most efficient features of the first two. Finally,
comparisons of the performances of the three implementations, all of
which were compiled using the Fortran-Plus Enhanced compiler, are
presented.
Short Code: [Clint:94a]
Reference: Vol. 6, No. 3, pp. 193--204 (C196)
The paper discusses data management techniques for mapping a large
data space onto the memory hierarchy of a distributed memory MIMD
system. Experimental results for structural biology computations
using the Molecular Replacement Method are presented.
Short Code: [Cornea:94a]
Reference: Vol. 6, No. 3, pp. 205--229 (C172)
Two kinds of parallel computers exist: those with shared memory
and those without. The former are difficult to build but easy to
program. The latter are easy to build but difficult to program. In
this paper we present a hybrid model that combines the best
properties of each by simulating a restricted object-based shared
memory on machines that do not share physical memory. In this
model, objects can be replicated on multiple machines. An
operation that does not change an object can then be done locally,
without any network traffic. Update operations can be done using
the reliable broadcast protocol described in the paper. We have
constructed a prototype system, designed and implemented a new
programming language for it, and programmed various applications
using it. The model, algorithms, language, applications and
performance will be discussed.
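A minimal sketch of the read-locally/update-by-broadcast idea follows; the single-process simulation of "machines" and the class-level replica list are illustrative stand-ins for the reliable broadcast protocol and language support described in the paper:

    class ReplicatedObject:
        """Sketch of a replicated shared object: every machine holds a replica of
        the same logical object, reads are served locally, and updates are applied
        through a (here simulated) reliable, totally ordered broadcast."""

        replicas = []                       # stands in for the copies on all machines

        def __init__(self, value):
            self.value = value
            ReplicatedObject.replicas.append(self)

        def read(self):                     # no network traffic needed
            return self.value

        @classmethod
        def broadcast_update(cls, op):      # simulated reliable broadcast
            for replica in cls.replicas:    # every replica applies the same operation
                replica.value = op(replica.value)

    # Illustrative use: two replicas of one shared counter stay consistent.
    a, b = ReplicatedObject(0), ReplicatedObject(0)
    ReplicatedObject.broadcast_update(lambda v: v + 1)
    assert a.read() == b.read() == 1
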
Short Code: [Tanenbaum:94a]
Reference: Vol. 6, No. 4, pp. 235--249
With the current advances in computer and networking technology coupled
with the availability of software tools for parallel and distributed
computing, there has been increased interest in high-performance
distributed computing (HPDC). We envision that HPDC environments with
supercomputing capabilities will be available in the near future.
However, a number of issues have to be resolved before future
network-based applications can fully exploit the potential of the HPDC
environment. In the paper we present an architecture for a high-speed
local area network and a communication system that provides HPDC
applications with high bandwidth and low latency. We also characterize
the message-passing primitives required in HPDC applications and
develop a communication protocol that implements these primitives
efficiently.
Short Code: [Hariri:94c]
Reference: Vol. 6, No. 4, pp. 251--270
The Numerical Propulsion System Simulation (NPSS) project has been
initiated by NASA to explore the use of computer simulation in the
development of new aircraft propulsion technology. With this
approach, each engine component is modeled by a separate
computational code, with a simulation executive connecting the
codes and modeling component interactions. Since each code
potentially executes on a different machine in a network, a
simulation run is a heterogeneous distributed program in which
diverse software and hardware elements are incorporated into a
single computation. In the paper a prototype simulation executive
that supports this type of programming is described. The two
components of this executive are the AVS scientific visualization
system and the Schooner heterogeneous remote procedure call (RPC)
facility. In addition, the match between Schooner's capabilities
and the needs of NPSS is evaluated based on our experience with a
collection of test codes. The basic conclusion is that, while
Schooner fared well in general, it exhibited certain deficiencies
that dictated changes in its design and implementation. This
discussion not only documents the evolution of Schooner, but also
serves to highlight the practical problems that can be encountered
when dealing with heterogeneity and distribution in such
applications.
Short Code: [Homer:94a]
Reference: Vol. 6, No. 4, pp. 271--287
We study the use of non-volatile memory for caching in distributed file systems. This provides an advantage over traditional distributed file systems in that the load is reduced at the server without making the data vulnerable to failures. We propose the use of a small non-volatile cache for writes, at the client and the file server, together with a larger volatile read cache to keep the cost of the caches reasonable.
We use a synthetic workload developed from analysis of file I/O traces from commercial production systems and use a detailed simulation of the distributed environment. The service times for the resources of the system were derived from measurements performed on a typical workstation.
We show that non-volatile write caches at the clients and the file
server reduce the write response time and the load on the file
server dramatically, thus improving the scalability of the system.
We examine the comparative benefits of two alternative writeback
policies for the non-volatile write cache. We show that a proposed
threshold-based writeback policy is more effective than a periodic
writeback policy under heavy load. We also investigate the effect
of varying the write cache size and show that introducing a small
non-volatile cache at the client in conjunction with a moderate
sized non-volatile server write cache improves the write response
time by a factor of four at all load levels.
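A sketch of the two writeback policies for a non-volatile client write cache is given below; the cache size, threshold fraction and server interface are invented for illustration and do not reproduce the paper's simulation parameters:

    class NVWriteCache:
        """Sketch of a non-volatile write cache with threshold-based and periodic
        writeback policies. Sizes and thresholds are illustrative only."""

        def __init__(self, capacity_blocks, threshold_fraction, writeback):
            self.threshold = int(capacity_blocks * threshold_fraction)
            self.dirty = {}                 # block id -> data, held in NVRAM
            self.writeback = writeback      # function that ships blocks to the server

        def write(self, block, data):
            self.dirty[block] = data        # write completes once the block is in NVRAM
            if len(self.dirty) >= self.threshold:
                self.flush()                # threshold policy: purge when nearly full

        def periodic_tick(self):
            self.flush()                    # periodic policy: purge on a timer

        def flush(self):
            if self.dirty:
                self.writeback(dict(self.dirty))
                self.dirty.clear()

    # Illustrative use: under the threshold policy the server sees one batched
    # writeback when the cache fills, rather than a transfer on every timer expiry.
    sent = []
    cache = NVWriteCache(capacity_blocks=8, threshold_fraction=0.75,
                         writeback=sent.append)
    for i in range(6):
        cache.write(i, b'data')
    print(len(sent), 'writeback batch(es) of sizes', [len(b) for b in sent])
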
Short Code: [Biswas:94a]
Reference: Vol. 6, No. 4, pp. 289--323
The paper describes the approach taken for configuration management
in the Nexus distributed operating system. Nexus uses kernel-level
support for monitoring the failure or termination status of
distributed components of an application. Periodic user-level
messages are not required for status monitoring. Group and
dependency relationships between such components can be defined by
the programmer for the purpose of configuration monitoring and
management. An object belonging to a distributed application can
be monitored by its host kernel for system-defined exception
conditions. When any of these conditions arise, other objects are
notified through signals or messages, as specified by the
programmer.
Short Code: [Tripathi:94a]
Reference: Vol. 6, No. 4, pp. 325--338
Existing techniques for allocating processors in parallel and distributed systems are not suitable for use in large distributed systems. In such systems, dedicated multiprocessors should exist as an integral component of the distributed system, and idle processors should be available to applications that need them. The Prospero Resource Manager (PRM) is a scalable resource allocation system that supports the allocation of processing resources in large networks and on multiprocessor systems.
PRM employs three types of managers---the job manager, the system
manager and the node manager---to manage resources in a distributed
system. Multiple independent instances of each type of manager
exist, reducing bottlenecks. When making scheduling decisions each
manager utilizes information most closely associated with the
entities for which it is responsible.
Short Code: [Neuman:94a]
Reference: Vol. 6, No. 4, pp. 339--355
It has been shown that protocol processing represents a severe
bottle-neck for high speed computer networks. The disadvantages of
currently proposed solutions are their incompatibility with
existing standardized protocol implementations, their complexity
and/or their inflexibility. One method of alleviating these
limitations is to have an adaptable protocol stack, as proposed in
the paper. Preliminary results are presented which show that
significant gains in throughput can be achieved while creating a
framework suitable for future applications.
Short Code: [Richards:94a]
Reference: Vol. 6, No. 4, pp. 357--373
Requirements of emerging applications together with rapid changes in
networking technology towards gigabit speeds require adequate new
transport systems. Integrated designs of transport services, protocol
architecture and implementation platforms are required by forthcoming
applications in high-speed network environments. The transport
subsystem PATROCLOS (parallel transport subsystem for cell based
high-speed networks) is designed with special emphasis on a high degree
of inherent parallelism to allow efficient implementations on
multiprocessor architectures combined with specialized hardware for
very time-critical functions. The paper presents the new parallel
protocol architecture of PATROCLOS, an appropriate implementation
architecture based on transputer networks, and performance evaluation
results, which indicate high throughput values.
Short Code: [Braun:94a]
Reference: Vol. 6, No. 4, pp. 375--391
The paper presents a new approach that uses neural networks to predict the
performance of a number of dynamic decentralized load-balancing strategies.
A distributed multicomputer system using distributed load-balancing
strategies is represented by a unified analytical queuing model. A large
simulation data set is used to train a neural network using the
back-propagation learning algorithm based on gradient descent. The
performance model using the predicted data from the neural network produces
the average response time of various load balancing algorithms under various
system parameters. The validation and comparison with simulation data show
that the neural network is very effective in predicting the performance of
dynamic load-balancing algorithms. Our work leads to interesting techniques
for designing load balancing schemes (for large distributed systems) that are
computationally very expensive to simulate. One of the important findings is
that performance is affected least by the number of nodes, and most by the
number of links at each node in a large distributed system.
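As a generic illustration of the prediction machinery, the sketch below trains a one-hidden-layer network by gradient-descent back-propagation to regress a scalar response time from a few system parameters; the synthetic data, network size and learning rate are assumptions, not the paper's model:

    import numpy as np

    def train_mlp(X, y, hidden=8, lr=0.05, epochs=5000, seed=0):
        """Train a one-hidden-layer network with plain gradient-descent
        back-propagation to predict a scalar (e.g. average response time) from
        system parameters. Generic sketch; not the paper's topology or data."""
        rng = np.random.default_rng(seed)
        W1 = rng.standard_normal((X.shape[1], hidden)) * 0.5
        b1 = np.zeros(hidden)
        W2 = rng.standard_normal((hidden, 1)) * 0.5
        b2 = np.zeros(1)
        for _ in range(epochs):
            h = np.tanh(X @ W1 + b1)                  # forward pass
            pred = h @ W2 + b2
            err = pred - y
            gW2 = h.T @ err / len(X)                  # backward pass (mean squared error)
            gb2 = err.mean(axis=0)
            dh = (err @ W2.T) * (1 - h ** 2)
            gW1 = X.T @ dh / len(X)
            gb1 = dh.mean(axis=0)
            W1 -= lr * gW1; b1 -= lr * gb1
            W2 -= lr * gW2; b2 -= lr * gb2
        return lambda Xq: np.tanh(Xq @ W1 + b1) @ W2 + b2

    # Illustrative use with synthetic (not simulated-system) data.
    rng = np.random.default_rng(1)
    X = rng.uniform(0, 1, (200, 3))                   # e.g. scaled load, nodes, links
    y = 0.2 + X[:, :1] ** 2 + 0.1 * X[:, 1:2]         # stand-in response-time curve
    predict = train_mlp(X, y)
    print(predict(np.array([[0.5, 0.5, 0.5]]))[0, 0])
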
Short Code: [Ahmad:94a]
Reference: Vol. 6, No. 5, pp. 393--409 (C157)
The use of a massively parallel machine is aimed at the development of application programs that solve the most significant scientific, engineering, industrial and commercial problems.
High-performance computing technology has emerged as a powerful and indispensable aid to scientific and engineering research, product and process development, and all aspects of manufacturing. Such computational power can be achieved only by massively parallel computers. It also requires a new and more effective mode of interaction between the computational sciences and applications and those parts of computer science concerned with the development of algorithms and software. We are interested in using parallel processing to handle large numerical tasks such as linear algebra problems. Yet, programming such systems has proven to be very complicated, error-prone and architecture-specific.
One successful method for alleviating this problem, a method that worked well in the case of the massively pipelined supercomputers, is to use subprogram libraries. These libraries are built to perform some basic operations efficiently, while hiding low-level system specifics from the programmer. Efficiently porting a library to new hardware, be it a vector machine or a shared memory or message passing based multiprocessor, is a major undertaking. It is a slow process that requires an intimate knowledge of the hardware features and optimization issues.
We propose a scheme for the creation of portable implementations of such libraries. We present an implementation of BLAS (basic linear algebra subprograms), which is used as a standard linear algebra library. Our parallel implementation uses the virtual machine for multiprocessors (VMMP) (1990), which is a software package that provides a coherent set of services for explicitly parallel application programs running on diverse MIMD multiprocessors, both shared memory and message passing. VMMP is intended to simplify parallel program writing and to promote portable and efficient programming. Furthermore, it ensures high portability of application programs by implementing the same services on all target multiprocessors. Software created using this scheme is automatically efficient on both categories of MIMD machines, and on any hardware VMMP has been ported to.
An additional level of abstraction is achieved using C++, an object-oriented programming language (Eckel, 1989; Stroustrup, 1986). From the programmer using BLAS-3 it hides both the data structures used to define linear algebra objects and the parallel nature of the operations performed on these objects.
We coded BLAS on top of VMMP. This code was run without any modifications on
two shared memory machines---the commercial Sequent Symmetry and the
experimental Taunop. (The code should run on any machine VMMP has been ported
to, given the availability of a C++ compiler.) Performance results
for this implementation are given. The speed-up of the BLAS-3 routines,
tested on 22 processors of the Sequent, was in the range of 8.68 to 15.89.
Application programs (e.g., Cholesky factorization) using the library
routines achieved similar efficiency.
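The spirit of hiding the parallel machinery behind a sequential-looking library call can be sketched as follows; the thread pool and row-block partitioning here stand in for the VMMP services and are not the paper's implementation:

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def parallel_gemm(A, B, workers=4):
        """BLAS-3-style C = A @ B whose row-block partitioning and worker pool
        are hidden behind a sequential-looking call, in the spirit of a portable
        library layered over a virtual machine for multiprocessors. Sketch only;
        the real BLAS/VMMP interfaces are not reproduced here."""
        C = np.empty((A.shape[0], B.shape[1]))
        blocks = np.array_split(np.arange(A.shape[0]), workers)

        def compute(rows):
            C[rows] = A[rows] @ B           # each worker fills its own row block

        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(compute, blocks))
        return C

    # Illustrative use: callers see an ordinary matrix product.
    A = np.random.default_rng(0).standard_normal((256, 64))
    B = np.random.default_rng(1).standard_normal((64, 128))
    assert np.allclose(parallel_gemm(A, B), A @ B)
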
Short Code: [Averbuch:94a]
Reference: Vol. 6, No. 5, pp. 411--459 (C176)
Parallel computers will not become widely used until scientists and
engineers adopt a common programming language for publication of parallel
scientific algorithms. This paper describes the publication language
SuperPascal by examples. SuperPascal extends Pascal with deterministic
statements for parallel processes and synchronous message communication. The
language permits unrestricted combinations of recursive procedures and
parallel statements. SuperPascal omits ambiguous and insecure features of
Pascal. Restrictions on the use of variables enable a single-pass compiler
to check that parallel processes are disjoint, even if the processes use
procedures with global variables. A portable implementation of SuperPascal
has been developed on a Sun workstation under Unix.
Short Code: [Hansen:94b]
Reference: Vol. 6, No. 5, pp. 461--483 (C226)
We present three genetic algorithms (GAs) for allocating irregular
data sets to multiprocessors. These are a sequential hybrid GA, a
coarse-grain GA and a fine-grain GA. The last two are based on models
of natural evolution that are suitable for parallel implementation;
they have been implemented on a hypercube and a Connection Machine.
Experimental results show that the three GAs evolve good suboptimal
solutions which are better than those produced by other methods. The
GAs are also robust and do not show a bias towards particular problem
configurations. The two parallel GAs have reasonable execution times,
with the coarse-grain GA producing better solutions for the allocation
of loosely synchronous computations.
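A toy genetic algorithm for the allocation problem is sketched below to fix ideas; the fitness function (load imbalance plus communication cut), the operators and the example task graph are illustrative choices, not those of the paper's hybrid, coarse-grain or fine-grain GAs:

    import random

    def ga_allocate(task_costs, comm_pairs, n_procs, pop=30, gens=200, seed=0):
        """Toy GA mapping tasks to processors so as to reduce load imbalance plus
        cut communication. Sketch only; operators and fitness are illustrative.

        task_costs: list of per-task computation costs
        comm_pairs: list of (task_i, task_j, traffic) tuples
        """
        rng = random.Random(seed)
        n = len(task_costs)

        def cost(alloc):
            loads = [0.0] * n_procs
            for t, p in enumerate(alloc):
                loads[p] += task_costs[t]
            cut = sum(w for i, j, w in comm_pairs if alloc[i] != alloc[j])
            return max(loads) + cut                     # imbalance + communication

        population = [[rng.randrange(n_procs) for _ in range(n)] for _ in range(pop)]
        for _ in range(gens):
            population.sort(key=cost)
            parents = population[:pop // 2]             # truncation selection
            children = []
            while len(children) < pop - len(parents):
                a, b = rng.sample(parents, 2)
                point = rng.randrange(1, n)
                child = a[:point] + b[point:]           # one-point crossover
                if rng.random() < 0.2:                  # mutation
                    child[rng.randrange(n)] = rng.randrange(n_procs)
                children.append(child)
            population = parents + children
        return min(population, key=cost)

    # Illustrative use: 12 tasks communicating in a ring, mapped to 3 processors.
    costs = [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]
    pairs = [(i, (i + 1) % 12, 1.0) for i in range(12)]
    print(ga_allocate(costs, pairs, n_procs=3))
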
Short Code: [Mansour:94a]
Reference: Vol. 6, No. 6, pp. 485--504 (C190)
The paper examines the `Distributed Summation' problem, and its
solution in Linda. A number of problems with the current set of
primitives are examined, and a new primitive, collect, is
proposed.
Short Code: [Butcher:94a]
Reference: Vol. 6, No. 6, pp. 505--516 (C211)
In this paper, we present a parallel search scheme for model-based
interpretation of aerial images, following a focus-of-attention
paradigm. Interpretation is performed using the gray level image of
an aerial scene and its segmentation into connected components of
almost constant gray level. Candidate objects are generated from the
window as connected combinations of its components. Each candidate is
matched against the model by checking if the model constraints are
satisfied by the parameters computed from the region. The problem of
candidate generation and matching is posed as searching in the space
of combinations of connected components in the image, with finding an
(optimally) successful region as the goal. Our implementation
exploits parallelism at multiple levels by parallelizing the
management of the open list and other control tasks as well as the
task of model matching. We discuss and present the implementation of
the interpretation system on a Connection Machine CM-2. The
implementation reported a successful match in a few hundred
milliseconds whenever one existed.
Short Code: [Narayana:94a]
Reference: Vol. 6, No. 6, pp. 517--541 (C175)
The paper describes Parallel Universal Matrix Multiplication Algorithms
(PUMMA) on distributed memory concurrent computers. The PUMMA package
includes not only the non-transposed matrix multiplication routine
C=A.B, but also transposed multiplication routines C=A^T.B, C=A.B^T,
and C=A^T.B^T, for a block cyclic data distribution. The routines
perform efficiently for a wide range of processor configurations and
block sizes. The PUMMA routines together provide the same functionality as the
Level 3 BLAS routine xGEMM. Details of the parallel implementation of
the routines are given, and results are presented for runs on the Intel
Touchstone Delta computer.
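The block cyclic layout underlying the routines can be sketched with two small mapping functions; the square blocks and the 2 x 2 processor grid in the example are illustrative:

    def block_cyclic_owner(i, j, block, P, Q):
        """Map global matrix entry (i, j) to its owner in a P x Q processor grid
        under a block cyclic distribution with square blocks (a sketch of the
        layout assumed by such routines; block shape and grid are illustrative).
        """
        return ((i // block) % P, (j // block) % Q)

    def local_blocks(p, q, n, block, P, Q):
        """Global block coordinates stored by processor (p, q)."""
        nb = (n + block - 1) // block
        return [(bi, bj) for bi in range(p, nb, P) for bj in range(q, nb, Q)]

    # Illustrative use: an 8 x 8 matrix, 2 x 2 blocks, on a 2 x 2 processor grid.
    print(block_cyclic_owner(5, 2, block=2, P=2, Q=2))   # -> (0, 1)
    print(local_blocks(0, 0, n=8, block=2, P=2, Q=2))
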
Short Code: [Choi:94a]
Reference: Vol. 6, No. 7, pp. 543--570 (C177)
Matrix multiplication is a key primitive in block matrix algorithms
such as those found in LAPACK. We present results from our study of
matrix multiplication algorithms on the Intel Touchstone Delta, a
distributed memory message-passing architecture with a two-dimensional
mesh topology. We analyze and compare three algorithms and obtain an
implementation, BiMMeR, that uses communication primitives highly
suited to the Delta and exploits the single node assembly-coded matrix
multiplication. Our algorithm is completely general, i.e. able to
deal with various data layouts as well as arbitrary mesh aspect ratios
and matrix dimensions, and has achieved parallel efficiency of 86%,
with overall peak performance in excess of 8 Gflops on 256 nodes for
an $8800 \times 8800$ matrix. We describe BiMMeR's design and
implementation and present performance results that demonstrate
scalability and robust behavior over varying mesh topologies.
Short Code: [Huss:94a]
Reference: Vol. 6, No. 7, pp. 571--594 (C180)
The paper looks at the problem of ensuring the performance of
real-time applications hosted on Galactica Net, a mesh-based
distributed cache coherent shared memory multiprocessing system. A
method for determining strict upper bounds on worst case latencies in
wormhole routed networks of known or unknown communication patterns is
presented. From this, a tool for determining upper bounds for shared
memory update latencies is developed, and it is shown that the update
latency of Galactica Net is deterministic. The analytical bounds are
then compared with maximum latencies observed in simulations of GNet,
with which they compare favorably. Finally, it is shown that the tool
for determining update latency bounds is useful for comparing
differing GNet system configurations in order to minimize update
latency bounds.
Short Code: [Clayton:94a]
Reference: Vol. 6, No. 7, pp. 595--611 (C161)
Dynamic load balancing is an important technique when developing
applications with unpredictable load distribution on distributed
memory multicomputers. A tool, Dynamo, that supports dynamic load
balancing is presented. This tool separates the
application from the load balancer and thus makes it possible to
easily exchange the load balancer of a given application and
experiment with different load balancing strategies. A prototype of
Dynamo has been implemented in the C language on an Intel iPSC/2
Hypercube. Dynamo is demonstrated by two example programs. The first
program solves the $N$ queen problem using a backtracking algorithm
and the second solves a 0--1 knapsack problem using a depth-first
branch and bound algorithm.
Short Code: [Tarnvik:94a]
Reference: Vol. 6, No. 8, pp. 613--639 (C185)
The paper presents the implementation of a new class of massively
parallel algorithms for solving certain time-dependent partial
differential equations (PDEs) on massively parallel supercomputers.
Such PDEs are usually solved numerically, by discretization in time
and space, and by applying a time-stepping procedure to data and
algorithms potentially parallelized in the spatial domain. In a
radical departure from such a strictly sequential temporal paradigm,
we have developed a concept of time-parallel algorithms, which allows
the marching in time to be fully parallelized. This is achieved by
using a set of transformations based on eigenvalue-eigenvector
decomposition of the matrices involved in the discrete formalism. Our
time-parallel algorithms possess a highly decoupled structure, and can
therefore be efficiently implemented on emerging, massively parallel,
high-performance supercomputers, with a minimum of communication and
synchronization overhead. We have successfully carried out a
proof-of-concept demonstration of the basic ideas using a
two-dimensional heat equation example implemented on the Intel
Touchstone Delta supercomputer. Our results indicate that linear, and
even superlinear, speed-up can be achieved and maintained for a very
large number of processor nodes.
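The decoupling can be sketched for a linear one-step scheme $u_{n+1} = M u_n$: with the eigen-decomposition $M = V \Lambda V^{-1}$, every time level $u_n = V \Lambda^n V^{-1} u_0$ is independent work. The tiny explicit heat-equation matrix below is illustrative and not the paper's discrete formalism:

    import numpy as np

    def time_parallel_march(M, u0, steps):
        """Compute all time levels u_n = M^n u_0 of a linear time-stepping scheme
        independently of one another via the eigen-decomposition M = V diag(w) V^-1,
        so u_n = V diag(w**n) V^-1 u_0. Sketch of the time-parallel idea only."""
        w, V = np.linalg.eig(M)
        c = np.linalg.solve(V, u0)                 # coefficients V^-1 u_0
        # Each time level below is independent work: on a parallel machine every
        # processor evaluates its own subset of the n values.
        return [np.real(V @ (w ** n * c)) for n in range(1, steps + 1)]

    # Illustrative use: explicit scheme for a tiny 1-D heat equation.
    N, r = 16, 0.2                                 # grid points, diffusion number
    M = np.eye(N) * (1 - 2 * r) + np.eye(N, k=1) * r + np.eye(N, k=-1) * r
    u0 = np.zeros(N); u0[N // 2] = 1.0             # initial heat spike
    levels = time_parallel_march(M, u0, steps=50)
    assert np.allclose(levels[-1], np.linalg.matrix_power(M, 50) @ u0)
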
Short Code: [Toomarian:94a]
Reference: Vol. 6, No. 8, pp. 641--652 (C195)
The paper deals with the implementation of global time in
multicomputer systems. After a formalization of the synchronization
problem, techniques to estimate the synchronization delay and to
compensate the drift error are proposed. Then SYNC_WAVE, a clock
synchronization algorithm where the values of a reference clock are
diffused in a wave-like manner, is described. SYNC_WAVE has no
provision for fault-tolerance and is specially designed to introduce
low CPU and communication overhead, in order to support performance
analysis applications efficiently. An implementation of the devised
algorithm in a transputer-based system is presented, showing the
accuracy results obtained. Finally SYNC_WAVE is compared to other
synchronization algorithms and several of its possible applications
are suggested.
Short Code: [Pietro:94a]
Reference: Vol. 6, No. 8, pp. 653--671 (C222)
The paper is concerned with the design and implementation of a
parallel dynamic programming algorithm for use in ship voyage
management. The basic concepts are presented in terms of a simple
model for weather routing. Other factors involved in voyage
management, and their inclusion in a more comprehensive algorithm, are
also discussed. The algorithms have been developed and implemented
using a transputer-based distributed-memory parallel machine
using the high-level communication harness CS Tools. Trial
calculations over grids of up to 282 nodes have been carried out and
the results are presented. Good speed-ups for the calculations have
been attained, and the factors affecting the efficiency of the
parallel computations are reviewed. These trial calculations indicate
that a ship voyage management system based on parallel dynamic
programming is likely to be beneficial.
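The dynamic-programming recurrence at the heart of such a scheme can be sketched as below; the staged waypoint grid and the crude weather-penalty cost model are invented for illustration and do not reflect the paper's voyage-management model:

    def voyage_dp(stages, leg_time):
        """Dynamic-programming recurrence for a ship routed through successive
        stages of candidate waypoints: best[s][j] is the minimum passage time to
        waypoint j of stage s. Within a stage every waypoint can be evaluated on
        a different processor. Sketch only; leg_time stands in for the
        weather-dependent costs used in real weather routing.

        stages: list of lists of waypoints (stage 0 holds the departure point)
        """
        best = [{stages[0][0]: 0.0}]
        for s in range(1, len(stages)):
            best.append({
                w: min(best[s - 1][p] + leg_time(p, w) for p in stages[s - 1])
                for w in stages[s]              # independent, parallelizable work
            })
        return best[-1]

    # Illustrative use: waypoints as (x, y) positions, cost = distance with a
    # crude "head-wind" penalty in the upper half of the grid.
    def leg_time(a, b):
        dist = ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
        return dist * (1.3 if b[1] > 1 else 1.0)

    stages = [[(0, 1)], [(1, 0), (1, 1), (1, 2)], [(2, 0), (2, 1), (2, 2)], [(3, 1)]]
    print(voyage_dp(stages, leg_time))
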
Short Code: [Barbier:94a]
Reference: Vol. 6, No. 8, pp. 673--696 (C223)