In concurrent computing environments based on heterogeneous processing elements interconnected by general-purpose networks, several classes of overheads contribute to lowered performance. The most obvious limitations are network throughput and latency, but certain other factors also play a significant role. In an attempt to gain some insight into the nature of these overheads, and to propose strategies to alleviate them, empirical measurements of native communication performance as well as application execution performance were conducted, using the PVM network computing system.
These experiments and our analyses have identified load imbalance, the
parallelism model adopted, communication delay and throughput, and
within-host overheads as the primary factors affecting performance in
cluster environments. Interestingly, we find that agenda parallelism
and load balancing strategies contribute significantly more to better
performance than improved communications or system tuning. Drawing
general conclusions on how these inefficiencies may be overcome is
inadvisable because of the tremendous variability of many parameters in
general-purpose network environments; we therefore propose several
potential approaches, including model selection criteria, partitioning
strategies, and software system heuristics, to reduce overheads and
enhance performance in network-based environments.
Short Code: [Schmidt:94a]
Reference: Vol. 6, No. 1, pp. 1--32 (C153)
Program unification is a technique for source-to-source transformation
of code for enhanced execution performance on vector and SIMD
architectures. This work focuses on simple examples of program
unification to explain the methodology and demonstrate its promise as a
practical technique for improved performance. Using simple examples to
explain how unification is done, we outline two experiments in simulation.
Empirical tests of unified code on a CRAY Y-MP
multiprocessor show that unification improves execution performance by a
factor of roughly 8 for the given application. The technique is general in
that it can be applied to computation-intensive programs in various
data-parallel application domains.
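To make the idea concrete, the following minimal sketch (Python, with an invented toy update rule rather than anything from the paper) unifies several instances of the same scalar computation into a single data-parallel program, which is the kind of transformation that lets vector and SIMD hardware process one instance per lane:

    import numpy as np

    def simulate_one(x0, rate, steps):
        """One instance of a toy computation: a scalar state updated step by step."""
        x = x0
        for _ in range(steps):
            x = x + rate * (1.0 - x)
        return x

    def simulate_unified(x0s, rates, steps):
        """The 'unified' version: the same update applied to all instances at once,
        so a vector or SIMD machine can process one instance per lane. A hedged
        sketch of the idea only; the paper's transformation works on real source code.
        """
        x = np.asarray(x0s, dtype=float)
        r = np.asarray(rates, dtype=float)
        for _ in range(steps):
            x = x + r * (1.0 - x)            # one vector operation replaces a loop over instances
        return x

    # Illustrative use: eight parameter sets run as a single vectorized program.
    x0s   = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
    rates = [0.05 * (i + 1) for i in range(8)]
    unified = simulate_unified(x0s, rates, steps=100)
    assert np.allclose(unified, [simulate_one(a, b, 100) for a, b in zip(x0s, rates)])
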
Short Code: [Chuang:94a]
Reference: Vol. 6, No. 1, pp. 33--53 (C159)
An algorithm is presented that solves a linear advection-diffusion
problem using a least-squares formulation and a conjugate gradient
method to solve the corresponding minimization problem. An
implementation in CM-Fortran on a Thinking Machines CM-2 is compared
with a serial implementation on an IBM RS6000. The maximum speed-up
obtained is a factor of 70. For fine grids, the CPU time scales
almost ideally when the number of processors is increased from 4096 to
8192.
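As an illustration of the solver component, here is a minimal conjugate gradient sketch in Python/NumPy; the toy tridiagonal operator and right-hand side are placeholders, not the paper's least-squares discretization of the advection-diffusion problem:

    import numpy as np

    def conjugate_gradient(apply_A, b, x0, tol=1e-8, max_iter=1000):
        """Solve A x = b for a symmetric positive-definite A given as a callback."""
        x = x0.copy()
        r = b - apply_A(x)          # residual
        p = r.copy()                # search direction
        rs_old = r @ r
        for _ in range(max_iter):
            Ap = apply_A(p)
            alpha = rs_old / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs_old) * p
            rs_old = rs_new
        return x

    # Illustrative use: normal equations A = L^T L of a least-squares formulation,
    # where L is a placeholder 1-D second-difference stencil (not the paper's operator).
    n = 32
    main = 2.0 * np.ones(n); off = -1.0 * np.ones(n - 1)
    L = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)
    A = L.T @ L
    b = A @ np.ones(n)                                   # exact solution = ones
    x = conjugate_gradient(lambda v: A @ v, b, np.zeros(n))
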
Short Code: [Berggren:94a]
Reference: Vol. 6, No. 1, pp. 55--68 (C160)
Recently developed block-iterative versions of some row-action algorithms for solving general systems of sparse linear equations allow parallelism in the computations when the underlying problem is appropriately decomposed. However, problems associated with the parallel implementation of these algorithms have to be addressed.
In this paper we present an implementation on distributed memory multiprocessors of a block version of the Kaczmarz row-action method. One of the main issues related to the efficient implementation of this method in a concurrent environment is to develop suitable communication schemes in order to reduce the amount of communication needed at each iteration.
We propose two data distribution strategies which lead to different computation and communication schemes.
To verify and compare the effectiveness of the proposed strategies,
numerical experiments have been carried out on a Symult S2010 and a
Meiko Computing Surface. The performance evaluation has been done
using a scaled efficiency model.
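For readers unfamiliar with the method, the following serial sketch shows the numerics of one block Kaczmarz sweep; the random system, the two row blocks and their assignment to notional processors are illustrative, and the communication schemes that are the subject of the paper are not modelled here:

    import numpy as np

    def block_kaczmarz_sweep(A, b, x, blocks):
        """One sweep of the block Kaczmarz (row-action) method: each block projects
        the current iterate onto the affine subspace defined by its rows,
        x <- x + pinv(A_i) @ (b_i - A_i @ x). Serial sketch only.
        """
        for rows in blocks:
            Ai, bi = A[rows], b[rows]
            x = x + np.linalg.pinv(Ai) @ (bi - Ai @ x)
        return x

    # Illustrative use on a small consistent system; each block stands in for the
    # rows held by one processor in a distributed decomposition.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((8, 6))
    x_true = rng.standard_normal(6)
    b = A @ x_true
    blocks = [np.arange(0, 4), np.arange(4, 8)]
    x = np.zeros(6)
    for _ in range(200):
        x = block_kaczmarz_sweep(A, b, x, blocks)
    print(np.linalg.norm(A @ x - b))      # residual shrinks over the sweeps
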
Short Code: [Apuzzo:94a]
Reference: Vol. 6, No. 1, pp. 69--84 (C77)
The paper presents a Conservative Time Window (CTW) algorithm for
parallel simulation of discrete event systems. The physical system to
be simulated is partitioned into $n$ disjoint sub-systems, each of
which is represented by an object. The CTW algorithm identifies a
time window for each object, such that events occurring in each window
are independent of events in other windows and thus they can be
processed concurrently. The CTW algorithm was implemented on a shared
memory multiprocessor, a Sequent Symmetry S81 with 16 processors. We
measured performance of the CTW algorithm on two types of network
topologies: feed-forward networks and networks with feedback loops.
We used three metrics to measure performance: speed-up, average number
of independent windows detected by the algorithm, and average number
of events occurring in each window. We also investigated the impact
of various event scheduling policies on performance. The results
obtained show that the CTW algorithm produces good performance in many cases.
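The window-identification idea can be sketched as follows (a hedged illustration only, not the paper's CTW algorithm: the lookahead-based safety bound, the object and topology encodings, and the example network are assumptions):

    def conservative_windows(objects, topology, lookahead):
        """Identify, for each object, a time window of events that are safe to
        process concurrently. A sketch of the conservative-window idea only.

        objects:   dict name -> sorted list of pending event timestamps
        topology:  dict name -> list of names that send events to it
        lookahead: dict name -> minimum delay an object adds to events it emits
        """
        windows = {}
        for obj, events in objects.items():
            if not events:
                continue
            senders = topology.get(obj, [])
            # Earliest timestamp a sender could still produce for this object.
            horizon = min((min(objects[s], default=float('inf')) + lookahead[s]
                           for s in senders), default=float('inf'))
            safe = [t for t in events if t < horizon]
            if safe:
                windows[obj] = (safe[0], horizon)     # events in [first, horizon)
        return windows

    # Illustrative use: a two-object feed-forward network.
    objects   = {'A': [1.0, 4.0], 'B': [2.0, 3.0, 9.0]}
    topology  = {'A': [], 'B': ['A']}
    lookahead = {'A': 5.0, 'B': 5.0}
    print(conservative_windows(objects, topology, lookahead))
    # B's window closes at A's next event time plus A's lookahead (6.0).
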
Short Code: [Ayani:94a]
Reference: Vol. 6, No. 2, pp. 119--142 (C164)
We develop a balanced, parallel quicksort algorithm for a hypercube
and compare it with a similar algorithm for a binary tree machine. The
performance of the hypercube algorithm is measured on a Computing
Surface.
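The following simulated sketch conveys the hypercube scheme: in each dimension the nodes of a subcube agree on a pivot and exchange keys with their partner across that dimension, so that after d exchange steps a local sort yields a globally sorted sequence. The pivot choice (subcube median) and the list-based "nodes" are simplifications, not the paper's algorithm:

    import random

    def hypercube_quicksort(local, d):
        """Simulated hyperquicksort on 2**d nodes: a sketch of the idea only; a
        real implementation broadcasts one node's median and exchanges keys
        with message passing.

        local: list of 2**d lists, local[i] holding node i's keys.
        Returns per-node lists that, concatenated in node order, are sorted.
        """
        p = 1 << d
        for dim in reversed(range(d)):
            step = 1 << (dim + 1)
            for root in range(0, p, step):                # one pivot per subcube
                cube = range(root, root + step)
                sample = sorted(x for i in cube for x in local[i])
                pivot = sample[len(sample) // 2] if sample else 0
                for i in cube:
                    partner = i ^ (1 << dim)
                    if i < partner:                       # handle each pair once
                        both = local[i] + local[partner]
                        local[i] = [x for x in both if x <= pivot]
                        local[partner] = [x for x in both if x > pivot]
        return [sorted(keys) for keys in local]

    # Illustrative use on a 3-dimensional hypercube (8 simulated nodes).
    random.seed(1)
    data = [[random.randint(0, 99) for _ in range(4)] for _ in range(8)]
    print(hypercube_quicksort(data, 3))
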
Short Code: [Hansen:94a]
Reference: Vol. 6, No. 2, pp. 143--151 (C199)
We present two strategies for the simulation of massive neural networks on message-passing MIMD machines. In the first strategy all interconnections between neurons are stored explicitly in interconnection matrices. During simulation, every processor is responsible for certain submatrices of these interconnection matrices. The fact that message-passing MIMD processors do not provide virtual memory seriously limits the size of the networks that can be simulated, since interconnection matrices require huge amounts of memory.
An alternative strategy is not to store the connections explicitly, but generate the interconnections as they are needed. This circumvents memory limitations, but because interconnections need to be generated multiple times, it is inherently slower than the first implementation.
This yields the connections dilemma: the choice between fast simulation of small networks as against slower simulation of massive networks.
We present, analyze and bench-mark parallel implementations for both strategies. An efficient connection-look-up algorithm, which can be used for any network with static interconnections, ensures that simulation times for the second strategy are only marginally longer than for the first strategy. We show that for our users the connections dilemma is no longer a dilemma: by means of our look-up algorithm the simulation of massive networks becomes possible; furthermore the time to design and construct a network, prior to simulation, is considerably shorter than it is for the matrix version, and in addition this time is independent of network size.
Although we have implemented both strategies on a parallel computer,
the algorithms presented here can be used on any machine with memory
limitations, such as personal computers.
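The second strategy can be sketched by regenerating each connection weight deterministically from its endpoint indices whenever it is needed, so that no interconnection matrix is ever stored; the hash-based generator below is an assumption for illustration and is not the paper's connection-look-up algorithm:

    import hashlib
    import struct

    def connection_weight(pre, post, seed=42):
        """Regenerate the weight of connection (pre -> post) on demand instead of
        storing it, using a deterministic hash of the indices. Sketch only; the
        paper's look-up algorithm and network topologies are more elaborate.
        """
        digest = hashlib.sha256(struct.pack('>III', seed, pre, post)).digest()
        (raw,) = struct.unpack('>Q', digest[:8])
        return raw / float(1 << 64) * 2.0 - 1.0      # uniform weight in [-1, 1)

    def activate(neuron, inputs, seed=42):
        """Weighted sum of one neuron's inputs, regenerating each weight."""
        return sum(connection_weight(pre, neuron, seed) * value
                   for pre, value in inputs.items())

    # Illustrative use: the same connection always yields the same weight,
    # so no interconnection matrix has to be kept in memory.
    assert connection_weight(3, 17) == connection_weight(3, 17)
    print(activate(17, {0: 1.0, 1: 0.5, 3: -0.25}))
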
Short Code: [Tollenaere:94a]
Reference: Vol. 6, No. 3, pp. 153--191 (C183)
In the paper three distinct approaches to the implementation of an
application in linear algebra on the AMT DAP 510 using the
Fortran-Plus Enhanced language and compiler are investigated. In the
first the implementation is tailored to fit a virtual array processor
whose edge size corresponds to that of the maximum dimension of the matrix
used in the application. The second approach is a special case of the
first, and corresponds to a situation in which the virtual array
processor is constrained to have the same dimensions as the AMT DAP
510. The third approach may be considered to be a hybrid approach.
In this case an implementation of the application is constructed which
incorporates the most efficient features of the first two. Finally,
comparisons of the performances of the three implementations, all of
which were compiled using the Fortran-Plus Enhanced compiler, are
presented.
Short Code: [Clint:94a]
Reference: Vol. 6, No. 3, pp. 193--204 (C196)
The paper discusses data management techniques for mapping a large
data space onto the memory hierarchy of a distributed memory MIMD
system. Experimental results for structural biology computations
using the Molecular Replacement Method are presented.
Short Code: [Cornea:94a]
Reference: Vol. 6, No. 3, pp. 205--229 (C172)
Two kinds of parallel computers exist: those with shared memory
and those without. The former are difficult to build but easy to
program. The latter are easy to build but difficult to program. In
this paper we present a hybrid model that combines the best
properties of each by simulating a restricted object-based shared
memory on machines that do not share physical memory. In this
model, objects can be replicated on multiple machines. An
operation that does not change an object can then be done locally,
without any network traffic. Update operations can be done using
the reliable broadcast protocol described in the paper. We have
constructed a prototype system, designed and implemented a new
programming language for it, and programmed various applications
using it. The model, algorithms, language, applications and
performance will be discussed.
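A minimal sketch of the read-locally/update-by-broadcast idea follows; the single-process simulation of "machines" and the class-level replica list are illustrative stand-ins for the reliable broadcast protocol and language support described in the paper:

    class ReplicatedObject:
        """Sketch of a replicated shared object: every machine holds a replica of
        the same logical object, reads are served locally, and updates are applied
        through a (here simulated) reliable, totally ordered broadcast."""

        replicas = []                       # stands in for the copies on all machines

        def __init__(self, value):
            self.value = value
            ReplicatedObject.replicas.append(self)

        def read(self):                     # no network traffic needed
            return self.value

        @classmethod
        def broadcast_update(cls, op):      # simulated reliable broadcast
            for replica in cls.replicas:    # every replica applies the same operation
                replica.value = op(replica.value)

    # Illustrative use: two replicas of one shared counter stay consistent.
    a, b = ReplicatedObject(0), ReplicatedObject(0)
    ReplicatedObject.broadcast_update(lambda v: v + 1)
    assert a.read() == b.read() == 1
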
Short Code: [Tanenbaum:94a]
Reference: Vol. 6, No. 4, pp. 235--249
With the current advances in computer and networking technology coupled
with the availability of software tools for parallel and distributed
computing, there has been increased interest in high-performance
distributed computing (HPDC). We envision that HPDC environments with
supercomputing capabilities will be available in the near future.
However, a number of issues have to be resolved before future
network-based applications can fully exploit the potential of the HPDC
environment. In the paper we present an architecture for a high-speed
local area network and a communication system that provides HPDC
applications with high bandwidth and low latency. We also characterize
the message-passing primitives required in HPDC applications and
develop a communication protocol that implements these primitives
efficiently.
Short Code: [Hariri:94c]
Reference: Vol. 6, No. 4, pp. 251--270
The Numerical Propulsion System Simulation (NPSS) project has been
initiated by NASA to explore the use of computer simulation in the
development of new aircraft propulsion technology. With this
approach, each engine component is modeled by a separate
computational code, with a simulation executive connecting the
codes and modeling component interactions. Since each code
potentially executes on a different machine in a network, a
simulation run is a heterogeneous distributed program in which
diverse software and hardware elements are incorporated into a
single computation. In the paper a prototype simulation executive
that supports this type of programming is described. The two
components of this executive are the AVS scientific visualization
system and the Schooner heterogeneous remote procedure call (RPC)
facility. In addition, the match between Schooner's capabilities
and the needs of NPSS is evaluated based on our experience with a
collection of test codes. The basic conclusion is that, while
Schooner fared well in general, it exhibited certain deficiencies
that dictated changes in its design and implementation. This
discussion not only documents the evolution of Schooner, but also
serves to highlight the practical problems that can be encountered
when dealing with heterogeneity and distribution in such
applications.
Short Code: [Homer:94a]
Reference: Vol. 6, No. 4, pp. 271--287
We study the use of non-volatile memory for caching in distributed file systems. This provides an advantage over traditional distributed file systems in that the load is reduced at the server without making the data vulnerable to failures. We propose the use of a small non-volatile cache for writes, at the client and the file server, together with a larger volatile read cache to keep the cost of the caches reasonable.
We use a synthetic workload developed from analysis of file I/O traces from commercial production systems and use a detailed simulation of the distributed environment. The service times for the resources of the system were derived from measurements performed on a typical workstation.
We show that non-volatile write caches at the clients and the file
server reduce the write response time and the load on the file
server dramatically, thus improving the scalability of the system.
We examine the comparative benefits of two alternative writeback
policies for the non-volatile write cache. We show that a proposed
threshold-based writeback policy is more effective than a periodic
writeback policy under heavy load. We also investigate the effect
of varying the write cache size and show that introducing a small
non-volatile cache at the client in conjunction with a moderate
sized non-volatile server write cache improves the write response
time by a factor of four at all load levels.
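A sketch of the two writeback policies for a non-volatile client write cache is given below; the cache size, threshold fraction and server interface are invented for illustration and do not reproduce the paper's simulation parameters:

    class NVWriteCache:
        """Sketch of a non-volatile write cache with threshold-based and periodic
        writeback policies. Sizes and thresholds are illustrative only."""

        def __init__(self, capacity_blocks, threshold_fraction, writeback):
            self.threshold = int(capacity_blocks * threshold_fraction)
            self.dirty = {}                 # block id -> data, held in NVRAM
            self.writeback = writeback      # function that ships blocks to the server

        def write(self, block, data):
            self.dirty[block] = data        # write completes once the block is in NVRAM
            if len(self.dirty) >= self.threshold:
                self.flush()                # threshold policy: purge when nearly full

        def periodic_tick(self):
            self.flush()                    # periodic policy: purge on a timer

        def flush(self):
            if self.dirty:
                self.writeback(dict(self.dirty))
                self.dirty.clear()

    # Illustrative use: under the threshold policy the server sees one batched
    # writeback when the cache fills, rather than a transfer on every timer expiry.
    sent = []
    cache = NVWriteCache(capacity_blocks=8, threshold_fraction=0.75,
                         writeback=sent.append)
    for i in range(6):
        cache.write(i, b'data')
    print(len(sent), 'writeback batch(es) of sizes', [len(b) for b in sent])
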
Short Code: [Biswas:94a]
Reference: Vol. 6, No. 4, pp. 289--323
The paper describes the approach taken for configuration management
in the Nexus distributed operating system. Nexus uses kernel-level
support for monitoring the failure or termination status of
distributed components of an application. Periodic user-level
messages are not required for status monitoring. Group and
dependency relationships between such components can be defined by
the programmer for the purpose of configuration monitoring and
management. An object belonging to a distributed application can
be monitored by its host kernel for system-defined exception
conditions. When any of these conditions arise, other objects are
notified through signals or messages, as specified by the
programmer.
Short Code: [Tripathi:94a]
Reference: Vol. 6, No. 4, pp. 325--338
Existing techniques for allocating processors in parallel and distributed systems are not suitable for use in large distributed systems. In such systems, dedicated multiprocessors should exist as an integral component of the distributed system, and idle processors should be available to applications that need them. The Prospero Resource Manager (PRM) is a scalable resource allocation system that supports the allocation of processing resources in large networks and on multiprocessor systems.
PRM employs three types of managers---the job manager, the system
manager and the node manager---to manage resources in a distributed
system. Multiple independent instances of each type of manager
exist, reducing bottlenecks. When making scheduling decisions each
manager utilizes information most closely associated with the
entities for which it is responsible.
Short Code: [Neuman:94a]
Reference: Vol. 6, No. 4, pp. 339--355
It has been shown that protocol processing represents a severe
bottle-neck for high speed computer networks. The disadvantages of
currently proposed solutions are their incompatibility with
existing standardized protocol implementations, their complexity
and/or their inflexibility. One method of alleviating these
limitations is to have an adaptable protocol stack, as proposed in
the paper. Preliminary results are presented which show that
significant gains in throughput can be achieved while creating a
framework suitable for future applications.
Short Code: [Richards:94a]
Reference: Vol. 6, No. 4, pp. 357--373
Requirements of emerging applications together with rapid changes in
networking technology towards gigabit speeds require adequate new
transport systems. Integrated designs of transport services, protocol
architecture and implementation platforms are required by forthcoming
applications in high-speed network environments. The transport
subsystem PATROCLOS (parallel transport subsystem for cell based
high-speed networks) is designed with special emphasis on a high degree
of inherent parallelism to allow efficient implementations on
multiprocessor architectures combined with specialized hardware for
very time-critical functions. The paper presents the new parallel
protocol architecture of PATROCLOS, an appropriate implementation
architecture based on transputer networks, and performance evaluation
results, which indicate high throughput values.
Short Code: [Braun:94a]
Reference: Vol. 6, No. 4, pp. 375--391
The paper presents a new approach that uses neural networks to predict the
performance of a number of dynamic decentralized load-balancing strategies.
A distributed multicomputer system using distributed load-balancing
strategies is represented by a unified analytical queuing model. A large
simulation data set is used to train a neural network using the
back-propagation learning algorithm based on gradient descent. The
performance model using the predicted data from the neural network produces
the average response time of various load balancing algorithms under various
system parameters. The validation and comparison with simulation data show
that the neural network is very effective in predicting the performance of
dynamic load-balancing algorithms. Our work leads to interesting techniques
for designing load balancing schemes (for large distributed systems) that are
computationally very expensive to simulate. One of the important findings is
that performance is affected least by the number of nodes, and most by the
number of links at each node in a large distributed system.
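As a generic illustration of the prediction machinery, the sketch below trains a one-hidden-layer network by gradient-descent back-propagation to regress a scalar response time from a few system parameters; the synthetic data, network size and learning rate are assumptions, not the paper's model:

    import numpy as np

    def train_mlp(X, y, hidden=8, lr=0.05, epochs=5000, seed=0):
        """Train a one-hidden-layer network with plain gradient-descent
        back-propagation to predict a scalar (e.g. average response time) from
        system parameters. Generic sketch; not the paper's topology or data."""
        rng = np.random.default_rng(seed)
        W1 = rng.standard_normal((X.shape[1], hidden)) * 0.5
        b1 = np.zeros(hidden)
        W2 = rng.standard_normal((hidden, 1)) * 0.5
        b2 = np.zeros(1)
        for _ in range(epochs):
            h = np.tanh(X @ W1 + b1)                  # forward pass
            pred = h @ W2 + b2
            err = pred - y
            gW2 = h.T @ err / len(X)                  # backward pass (mean squared error)
            gb2 = err.mean(axis=0)
            dh = (err @ W2.T) * (1 - h ** 2)
            gW1 = X.T @ dh / len(X)
            gb1 = dh.mean(axis=0)
            W1 -= lr * gW1; b1 -= lr * gb1
            W2 -= lr * gW2; b2 -= lr * gb2
        return lambda Xq: np.tanh(Xq @ W1 + b1) @ W2 + b2

    # Illustrative use with synthetic (not simulated-system) data.
    rng = np.random.default_rng(1)
    X = rng.uniform(0, 1, (200, 3))                   # e.g. scaled load, nodes, links
    y = 0.2 + X[:, :1] ** 2 + 0.1 * X[:, 1:2]         # stand-in response-time curve
    predict = train_mlp(X, y)
    print(predict(np.array([[0.5, 0.5, 0.5]]))[0, 0])
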
Short Code: [Ahmad:94a]
Reference: Vol. 6, No. 5, pp. 393--409 (C157)
The use of a massively parallel machine is aimed at the development of application programs that solve the most significant scientific, engineering, industrial and commercial problems.
High-performance computing technology has emerged as a powerful and indispensable aid to scientific and engineering research, product and process development, and all aspects of manufacturing. Such computational power can be achieved only by massively parallel computers. It also requires a new and more effective mode of interaction between the computational sciences and applications and those parts of computer science concerned with the development of algorithms and software. We are interested in using parallel processing to handle large numerical tasks such as linear algebra problems. Yet, programming such systems has proven to be very complicated, error-prone and architecture-specific.
One successful method for alleviating this problem, a method that worked well in the case of the massively pipelined supercomputers, is to use subprogram libraries. These libraries are built to perform some basic operations efficiently, while hiding low-level system specifics from the programmer. Efficiently porting a library to new hardware, be it a vector machine or a shared memory or message passing based multiprocessor, is a major undertaking. It is a slow process that requires an intimate knowledge of the hardware features and optimization issues.
We propose a scheme for the creation of portable implementations of such libraries. We present an implementation of BLAS (basic linear algebra subprograms), which is used as a standard linear algebra library. Our parallel implementation uses the virtual machine for multiprocessors (VMMP) (1990), which is a software package that provides a coherent set of services for explicitly parallel application programs running on diverse MIMD multiprocessors, both shared memory and message passing. VMMP is intended to simplify parallel program writing and to promote portable and efficient programming. Furthermore, it ensures high portability of application programs by implementing the same services on all target multiprocessors. Software created using this scheme is automatically efficient on both categories of MIMD machines, and on any hardware VMMP has been ported to.
An additional level of abstraction is achieved using C++, an object-oriented programming language (Eckel, 1989; Stroustrup, 1986). From the programmer using BLAS-3 it hides both the data structures used to define linear algebra objects and the parallel nature of the operations performed on these objects.
We coded BLAS on top of VMMP. This code was run without any modifications on
two shared memory machines---the commercial Sequent Symmetry and the
experimental Taunop. (The code should run on any machine VMMP has been ported
to, given the availability of a C++ compiler.) Performance results
for this implementation are given. The speed-up of the BLAS-3 routines,
tested on 22 processors of the Sequent, was in the range of 8.68 to 15.89.
Application programs (e.g., Cholesky factorization) using the library
routines achieved similar efficiency.
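The spirit of hiding the parallel machinery behind a sequential-looking library call can be sketched as follows; the thread pool and row-block partitioning here stand in for the VMMP services and are not the paper's implementation:

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def parallel_gemm(A, B, workers=4):
        """BLAS-3-style C = A @ B whose row-block partitioning and worker pool
        are hidden behind a sequential-looking call, in the spirit of a portable
        library layered over a virtual machine for multiprocessors. Sketch only;
        the real BLAS/VMMP interfaces are not reproduced here."""
        C = np.empty((A.shape[0], B.shape[1]))
        blocks = np.array_split(np.arange(A.shape[0]), workers)

        def compute(rows):
            C[rows] = A[rows] @ B           # each worker fills its own row block

        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(compute, blocks))
        return C

    # Illustrative use: callers see an ordinary matrix product.
    A = np.random.default_rng(0).standard_normal((256, 64))
    B = np.random.default_rng(1).standard_normal((64, 128))
    assert np.allclose(parallel_gemm(A, B), A @ B)
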
Short Code: [Averbuch:94a]
Reference: Vol. 6, No. 5, pp. 411--459 (C176)
Parallel computers will not become widely used until scientists and
engineers adopt a common programming language for publication of parallel
scientific algorithms. This paper describes the publication language
SuperPascal by examples. SuperPascal extends Pascal with deterministic
statements for parallel processes and synchronous message communication. The
language permits unrestricted combinations of recursive procedures and
parallel statements. SuperPascal omits ambiguous and insecure features of
Pascal. Restrictions on the use of variables enable a single-pass compiler
to check that parallel processes are disjoint, even if the processes use
procedures with global variables. A portable implementation of SuperPascal
has been developed on a Sun workstation under Unix.
Short Code: [Hansen:94b]
Reference: Vol. 6, No. 5, pp. 461--483 (C226)
We present three genetic algorithms (GAs) for allocating irregular
data sets to multiprocessors. These are a sequential hybrid GA, a
coarse-grain GA and a fine-grain GA. The last two are based on models
of natural evolution that are suitable for parallel implementation;
they have been implemented on a hypercube and a Connection Machine.
Experimental results show that the three GAs evolve good suboptimal
solutions which are better than those produced by other methods. The
GAs are also robust and do not show a bias towards particular problem
configurations. The two parallel GAs have reasonable execution times,
with the coarse-grain GA producing better solutions for the allocation
of loosely synchronous computations.
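A toy genetic algorithm for the allocation problem is sketched below to fix ideas; the fitness function (load imbalance plus communication cut), the operators and the example task graph are illustrative choices, not those of the paper's hybrid, coarse-grain or fine-grain GAs:

    import random

    def ga_allocate(task_costs, comm_pairs, n_procs, pop=30, gens=200, seed=0):
        """Toy GA mapping tasks to processors so as to reduce load imbalance plus
        cut communication. Sketch only; operators and fitness are illustrative.

        task_costs: list of per-task computation costs
        comm_pairs: list of (task_i, task_j, traffic) tuples
        """
        rng = random.Random(seed)
        n = len(task_costs)

        def cost(alloc):
            loads = [0.0] * n_procs
            for t, p in enumerate(alloc):
                loads[p] += task_costs[t]
            cut = sum(w for i, j, w in comm_pairs if alloc[i] != alloc[j])
            return max(loads) + cut                     # imbalance + communication

        population = [[rng.randrange(n_procs) for _ in range(n)] for _ in range(pop)]
        for _ in range(gens):
            population.sort(key=cost)
            parents = population[:pop // 2]             # truncation selection
            children = []
            while len(children) < pop - len(parents):
                a, b = rng.sample(parents, 2)
                point = rng.randrange(1, n)
                child = a[:point] + b[point:]           # one-point crossover
                if rng.random() < 0.2:                  # mutation
                    child[rng.randrange(n)] = rng.randrange(n_procs)
                children.append(child)
            population = parents + children
        return min(population, key=cost)

    # Illustrative use: 12 tasks communicating in a ring, mapped to 3 processors.
    costs = [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]
    pairs = [(i, (i + 1) % 12, 1.0) for i in range(12)]
    print(ga_allocate(costs, pairs, n_procs=3))
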
Short Code: [Mansour:94a]
Reference: Vol. 6, No. 6, pp. 485--504 (C190)
The paper examines the `Distributed Summation' problem, and its
solution in Linda. A number of problems with the current set of
primitives are examined, and a new primitive, collect, is
proposed.
Short Code: [Butcher:94a]
Reference: Vol. 6, No. 6, pp. 505--516 (C211)
In this paper, we present a parallel search scheme for model-based
interpretation of aerial images, following a focus-of-attention
paradigm. Interpretation is performed using the gray level image of
an aerial scene and its segmentation into connected components of
almost constant gray level. Candidate objects are generated from the
window as connected combinations of its components. Each candidate is
matched against the model by checking if the model constraints are
satisfied by the parameters computed from the region. The problem of
candidate generation and matching is posed as searching in the space
of combinations of connected components in the image, with finding an
(optimally) successful region as the goal. Our implementation
exploits parallelism at multiple levels by parallelizing the
management of the open list and other control tasks as well as the
task of model matching. We discuss and present the implementation of
the interpretation system on a Connection Machine CM-2. The
implementation reported a successful match in a few hundred
milliseconds whenever one existed.
Short Code: [Narayana:94a]
Reference: Vol. 6, No. 6, pp. 517--541 (C175)
The paper describes Parallel Universal Matrix Multiplication Algorithms
(PUMMA) on distributed memory concurrent computers. The PUMMA package
includes not only the non-transposed matrix multiplication routine
C=A.B, but also transposed multiplication routines C=A^T.B, C=A.B^T,
and C=A^T.B^T, for a block cyclic data distribution. The routines
perform efficiently for a wide range of processor configurations and
block sizes. The PUMMA routines together provide the same functionality as the
Level 3 BLAS routine xGEMM. Details of the parallel implementation of
the routines are given, and results are presented for runs on the Intel
Touchstone Delta computer.
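The block cyclic layout underlying the routines can be sketched with two small mapping functions; the square blocks and the 2 x 2 processor grid in the example are illustrative:

    def block_cyclic_owner(i, j, block, P, Q):
        """Map global matrix entry (i, j) to its owner in a P x Q processor grid
        under a block cyclic distribution with square blocks (a sketch of the
        layout assumed by such routines; block shape and grid are illustrative).
        """
        return ((i // block) % P, (j // block) % Q)

    def local_blocks(p, q, n, block, P, Q):
        """Global block coordinates stored by processor (p, q)."""
        nb = (n + block - 1) // block
        return [(bi, bj) for bi in range(p, nb, P) for bj in range(q, nb, Q)]

    # Illustrative use: an 8 x 8 matrix, 2 x 2 blocks, on a 2 x 2 processor grid.
    print(block_cyclic_owner(5, 2, block=2, P=2, Q=2))   # -> (0, 1)
    print(local_blocks(0, 0, n=8, block=2, P=2, Q=2))
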
Short Code: [Choi:94a]
Reference: Vol. 6, No. 7, pp. 543--570 (C177)
Matrix multiplication is a key primitive in block matrix algorithms
such as those found in LAPACK. We present results from our study of
matrix multiplication algorithms on the Intel Touchstone Delta, a
distributed memory message-passing architecture with a two-dimensional
mesh topology. We analyze and compare three algorithms and obtain an
implementation, BiMMeR, that uses communication primitives highly
suited to the Delta and exploits the single node assembly-coded matrix
multiplication. Our algorithm is completely general, i.e. able to
deal with various data layouts as well as arbitrary mesh aspect ratios
and matrix dimensions, and has achieved parallel efficiency of 86%,
with overall peak performance in excess of 8 Gflops on 256 nodes for
an $8800 \times 8800$ matrix. We describe BiMMeR's design and
implementation and present performance results that demonstrate
scalability and robust behavior over varying mesh topologies.
Short Code: [Huss:94a]
Reference: Vol. 6, No. 7, pp. 571--594 (C180)
The paper looks at the problem of ensuring the performance of
real-time applications hosted on Galactica Net, a mesh-based
distributed cache coherent shared memory multiprocessing system. A
method for determining strict upper bounds on worst case latencies in
wormhole routed networks of known or unknown communication patterns is
presented. From this, a tool for determining upper bounds for shared
memory update latencies is developed, and it is shown that the update
latency of Galactica Net is deterministic. The analytical bounds are
then compared with maximum latencies observed in simulations of GNet,
with which they compare favorably. Finally, it is shown that the tool
for determining update latency bounds is useful for comparing
differing GNet system configurations in order to minimize update
latency bounds.
Short Code: [Clayton:94a]
Reference: Vol. 6, No. 7, pp. 595--611 (C161)
Dynamic load balancing is an important technique when developing
applications with unpredictable load distribution on distributed
memory multicomputers. A tool, Dynamo, that supports dynamic load
balancing is presented. This tool separates the
application from the load balancer and thus makes it possible to
easily exchange the load balancer of a given application and
experiment with different load balancing strategies. A prototype of
Dynamo has been implemented in the C language on an Intel iPSC/2
Hypercube. Dynamo is demonstrated by two example programs. The first
program solves the $N$ queen problem using a backtracking algorithm
and the second solves a 0--1 knapsack problem using a depth-first
branch and bound algorithm.
Short Code: [Tarnvik:94a]
Reference: Vol. 6, No. 8, pp. 613--639 (C185)
The paper presents the implementation of a new class of massively
parallel algorithms for solving certain time-dependent partial
differential equations (PDEs) on massively parallel supercomputers.
Such PDEs are usually solved numerically, by discretization in time
and space, and by applying a time-stepping procedure to data and
algorithms potentially parallelized in the spatial domain. In a
radical departure from such a strictly sequential temporal paradigm,
we have developed a concept of time-parallel algorithms, which allows
the marching in time to be fully parallelized. This is achieved by
using a set of transformations based on eigenvalue-eigenvector
decomposition of the matrices involved in the discrete formalism. Our
time-parallel algorithms possess a highly decoupled structure, and can
therefore be efficiently implemented on emerging, massively parallel,
high-performance supercomputers, with a minimum of communication and
synchronization overhead. We have successfully carried out a
proof-of-concept demonstration of the basic ideas using a
two-dimensional heat equation example implemented on the Intel
Touchstone Delta supercomputer. Our results indicate that linear, and
even superlinear, speed-up can be achieved and maintained for a very
large number of processor nodes.
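The decoupling can be sketched for a linear one-step scheme $u_{n+1} = M u_n$: with the eigen-decomposition $M = V \Lambda V^{-1}$, every time level $u_n = V \Lambda^n V^{-1} u_0$ is independent work. The tiny explicit heat-equation matrix below is illustrative and not the paper's discrete formalism:

    import numpy as np

    def time_parallel_march(M, u0, steps):
        """Compute all time levels u_n = M^n u_0 of a linear time-stepping scheme
        independently of one another via the eigen-decomposition M = V diag(w) V^-1,
        so u_n = V diag(w**n) V^-1 u_0. Sketch of the time-parallel idea only."""
        w, V = np.linalg.eig(M)
        c = np.linalg.solve(V, u0)                 # coefficients V^-1 u_0
        # Each time level below is independent work: on a parallel machine every
        # processor evaluates its own subset of the n values.
        return [np.real(V @ (w ** n * c)) for n in range(1, steps + 1)]

    # Illustrative use: explicit scheme for a tiny 1-D heat equation.
    N, r = 16, 0.2                                 # grid points, diffusion number
    M = np.eye(N) * (1 - 2 * r) + np.eye(N, k=1) * r + np.eye(N, k=-1) * r
    u0 = np.zeros(N); u0[N // 2] = 1.0             # initial heat spike
    levels = time_parallel_march(M, u0, steps=50)
    assert np.allclose(levels[-1], np.linalg.matrix_power(M, 50) @ u0)
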
Short Code: [Toomarian:94a]
Reference: Vol. 6, No. 8, pp. 641--652 (C195)
The paper deals with the implementation of global time in
multicomputer systems. After a formalization of the synchronization
problem, techniques to estimate the synchronization delay and to
compensate the drift error are proposed. Then SYNC_WAVE, a clock
synchronization algorithm where the values of a reference clock are
diffused in a wave-like manner, is described. SYNC_WAVE has no
provision for fault-tolerance and is specially designed to introduce
low CPU and communication overhead, in order to support performance
analysis applications efficiently. An implementation of the devised
algorithm in a transputer-based system is presented, showing the
accuracy results obtained. Finally SYNC_WAVE is compared to other
synchronization algorithms and several of its possible applications
are suggested.
Short Code: [Pietro:94a]
Reference: Vol. 6, No. 8, pp. 653--671 (C222)
The paper is concerned with the design and implementation of a
parallel dynamic programming algorithm for use in ship voyage
management. The basic concepts are presented in terms of a simple
model for weather routing. Other factors involved in voyage
management, and their inclusion in a more comprehensive algorithm, are
also discussed. The algorithms have been developed and implemented
using a transputer-based distributed-memory parallel machine
using the high-level communication harness CS Tools. Trial
calculations over grids of up to 282 nodes have been carried out and
the results are presented. Good speed-ups for the calculations have
been attained, and the factors affecting the efficiency of the
parallel computations are reviewed. These trial calculations indicate
that a ship voyage management system based on parallel dynamic
programming is likely to be beneficial.
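The dynamic-programming recurrence at the heart of such a scheme can be sketched as below; the staged waypoint grid and the crude weather-penalty cost model are invented for illustration and do not reflect the paper's voyage-management model:

    def voyage_dp(stages, leg_time):
        """Dynamic-programming recurrence for a ship routed through successive
        stages of candidate waypoints: best[s][j] is the minimum passage time to
        waypoint j of stage s. Within a stage every waypoint can be evaluated on
        a different processor. Sketch only; leg_time stands in for the
        weather-dependent costs used in real weather routing.

        stages: list of lists of waypoints (stage 0 holds the departure point)
        """
        best = [{stages[0][0]: 0.0}]
        for s in range(1, len(stages)):
            best.append({
                w: min(best[s - 1][p] + leg_time(p, w) for p in stages[s - 1])
                for w in stages[s]              # independent, parallelizable work
            })
        return best[-1]

    # Illustrative use: waypoints as (x, y) positions, cost = distance with a
    # crude "head-wind" penalty in the upper half of the grid.
    def leg_time(a, b):
        dist = ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
        return dist * (1.3 if b[1] > 1 else 1.0)

    stages = [[(0, 1)], [(1, 0), (1, 1), (1, 2)], [(2, 0), (2, 1), (2, 2)], [(3, 1)]]
    print(voyage_dp(stages, leg_time))
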
Short Code: [Barbier:94a]
Reference: Vol. 6, No. 8, pp. 673--696 (C223)