Published Abstracts for 1997

Concurrency: Practice and Experience

1997 Publications


Application-Level Load Migration and Its Implementation on Top of PVM

J. Song, H. K. Choo, and K. M. Lee

The development and experimental evaluation of a load (process) migration scheme, conceptually similar to moving house, are described. The basic idea is to migrate a process by starting a new process on another processor with checkpoint data prepared by the old process itself but transferred automatically by the migration system. The new process will then unpack the data and resume the computation. The migration mechanism of our facility is implemented by a set of library calls on top of PVM. It performs functions such as freezing and unfreezing communications, checking load conditions, selecting destination processors, starting new processes, and receiving migrated data. Before migrating, a process needs to freeze communication, handle pending messages in the receive buffer, and pack checkpoint data. Besides the usual merits of concurrency, location transparency, and the absence of residual dependency, our scheme solves the incoming message problem at the application level and is portable and easy to use in a heterogeneous environment. Our experiment shows that our facility can help to utilize 74% of idle CPU cycles of a network of workstations with less than 6% overhead to their normal operations.
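The pack-checkpoint-and-restart idea described above can be sketched in a few lines. This is an illustrative sketch only, not the paper's PVM library; the function names and the use of `pickle` are assumptions made for the example:

```python
import pickle

# Illustrative names, not the paper's actual migration API.

def pack_checkpoint(state):
    # The old process itself decides what to save (as in the paper,
    # checkpoint data is prepared by the application).
    return pickle.dumps(state)

def resume_from_checkpoint(blob):
    # Executed by the new process on the destination host: unpack the
    # transferred data and resume the computation from it.
    return pickle.loads(blob)

blob = pack_checkpoint({"iteration": 42, "partial_sum": 3.14})
resumed = resume_from_checkpoint(blob)
```

In the paper's scheme the byte transfer between the two steps is handled automatically by the migration system, after communication has been frozen and pending messages drained.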

Short Code:    [Song:97a]

Reference:    Vol. 9, No. 1, pp. 1-19 (C277)


Variable Instruction Scheduling for MIMD Interpretation on Pipelined SIMD Machines and for Compositional Instruction Sets

Nael B. Abu-Ghazaleh, and Philip A. Wilsey

Functional parallelism may be supported on SIMD machines by interpretation. The programs and data of each function are loaded on the processing elements (PEs) and the control unit of the machine executes a central control algorithm that causes the concurrent interpretation of these functions. The performance of this paradigm has been shown to benefit considerably from a variable instruction issue schedule that delays execution of expensive and rarely occurring operations. Two new features of the interpretation paradigm, namely pipelined SIMD machines and compositional instruction sets, change the nature of the mathematical model used for variable instruction scheduling significantly. In this paper, a previously developed mathematical model of the interpretation process is extended to allow for compositional instructions and pipelining. We develop and present algorithms that produce variable instruction schedules for the extended model and investigate whether variable instruction issue is useful for these cases. We show that variable instruction issue improves the performance of pipelined machines but is not very effective for compositional instruction sets, especially when the composition matrix is not sparse.

Short Code:    [Abu-Ghazaleh:97a]

Reference:    Vol. 9, No. 1, pp. 21-39 (C274)


Dimension-Exchange-Based Global Load Balancing on Injured Hypercubes

Jie Wu

A study is made of a global load balancing scheme on hypercubes with faulty links based on dimension exchange, where each node exchanges workloads with its neighbors along a selected dimension in such a way that their workloads become equal. A global load balancing algorithm that can tolerate n-1 faulty links is first presented. It is then extended to connected hypercubes with up to 2n-3 faulty links. Comparisons between the proposed scheme with the regular dimension-exchange-based scheme are also presented. Simulation results show that the average number of message exchanges required in the proposed scheme is very close to the one obtained from the regular dimension-exchange-based scheme.
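The dimension-exchange idea underlying the scheme above can be sketched for the fault-free case (the paper's contribution, tolerating faulty links, is omitted here). A minimal sketch, assuming a hypercube with node ids whose bits index the dimensions:

```python
def dimension_exchange(loads, n):
    # Sweep the n dimensions of a fault-free hypercube; in each round
    # every node averages its workload with its neighbor along the
    # current dimension. After n rounds all loads equal the global mean.
    for d in range(n):
        for node in sorted(loads):
            partner = node ^ (1 << d)   # neighbor differs in bit d
            if node < partner:          # balance each pair once
                avg = (loads[node] + loads[partner]) / 2.0
                loads[node] = loads[partner] = avg
    return loads

loads = dimension_exchange(
    {0: 8.0, 1: 0.0, 2: 4.0, 3: 0.0, 4: 2.0, 5: 6.0, 6: 0.0, 7: 4.0}, 3)
```

With faulty links some of these pairwise exchanges are unavailable, which is exactly the situation the paper's algorithms are designed to work around.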

Short Code:    [Wu:97a]

Reference:    Vol. 9, No. 1, pp. 41-61 (C275)


GLU: A High-Level System for Granular Data-Parallel Programming

R. Jagannathan, C. Dodd, and I. Agi

We describe a high-level system for granular data-parallel programming called GLU in which parallel applications are described as succinct implicitly parallel intensional compositions using sequential imperative functions. We show how different architecture-independent parallel programs can be generated from the high-level application description. We also show how GLU enables functional debugging of parallel applications without requiring their parallel execution. To illustrate the efficiency of parallel programs generated in GLU, we consider the results of a performance study of three real parallel GLU applications, executed on two different parallel computers. Finally, we compare GLU to other very high-level systems for granular data-parallel programming.

Short Code:    [Jagannathan:97a]

Reference:    Vol. 9, No. 1, pp. 63-83 (C192)


Connection Resource Management for Compiler-Generated Communication

Susan Hinrichs

Traditionally parallel compilers have targeted a standard message passing communication library when generating communication code (e.g. PVM, MPI). The standard message passing model dynamically reserves communication resources for each message. For regular, repeating communication patterns, a static communication resource reservation model can be more efficient. By reserving resources once for many communication exchanges, the communication startup time is better amortized. Plus, with a global view of communication, the static model has a wider choice of routes. While the static resource reservation model can be a more efficient communication target for the compiler, this model reveals the problems of scheduling use of limited communication resources. This paper uses the abstraction of a communication resource to define two resource management problems and presents three algorithms that can be used by the compiler to address these problems. Initial measures of the effectiveness of these algorithms are presented from two programs for an 8x8 iWarp system.

Short Code:    [Hinrichs:97a]

Reference:    Vol. 9, No. 2, pp. 85-112 (C276)


Parallel Multidimensional Bisection

William Baritompa, and Sami Viitanen

A parallel framework of the multidimensional bisection global optimization method of Wood is presented. The idea is to split the bracketing region between processors, thus achieving parallel function evaluations. Using an OCCAM-2 implementation on the Hathi-2 multitransputer system we explore some different strategies for load balancing and present some results.
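The core parallelization idea, splitting the bracketing region so function evaluations proceed concurrently, can be illustrated with a toy one-dimensional search. This is not Wood's multidimensional bisection; it is a hedged sketch of the region-splitting strategy only:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_bracket_search(f, lo, hi, workers=4, samples=64):
    # Illustrative only: split the bracketing interval among workers so
    # function evaluations run concurrently, then keep the best sample.
    width = (hi - lo) / workers

    def best_in(k):
        a = lo + k * width
        xs = [a + width * i / samples for i in range(samples)]
        return min((f(x), x) for x in xs)

    with ThreadPoolExecutor(max_workers=workers) as ex:
        return min(ex.map(best_in, range(workers)))
```

The load-balancing strategies the paper explores address the case where, unlike here, the work per sub-region is not uniform.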

Short Code:    [Baritompa:97a]

Reference:    Vol. 9, No. 2, pp. 113-121 (C206)


Contextual Debugging and Analysis of Multithreaded Applications

M. Bednorz, A. Gwozdowski, and K. Zielinski

Multithreaded programs are especially difficult to test and debug. The aim of the paper is to present a new concept of multithreaded program analysis and debugging based on contextual visualisation of the program components that influence thread execution. For this purpose, a dedicated software package called MTV (multithreading viewer) has been designed and implemented. It operates above the run-time library level, and hence only a programmer's view of multiple threads of control may be analyzed. The paper presents the instrumentation of the tested program code and the communication and synchronization between the instrumented program and MTV. Next, a general concept of contextual visualisation of multithreaded programs is elaborated. A scheme of MTV's cooperation with the monitored program is discussed, and the user interface is described. A representation of the multithreaded program state is shown, and the capability of MTV to recognize certain classes of errors is specified and illustrated by a few examples. These examples are not intended to be exhaustive; rather, they indicate the opportunities to exploit MTV for the analysis of complex applications. A short evaluation of the proposed contextual visualisation techniques as applied to multithreaded program analysis concludes the paper.

Short Code:    [Bednorz:97a]

Reference:    Vol. 9, No. 2, pp. 123-139 (C258)


Message Based Cooperation Between Parallel Depth and Intensity Matching Algorithms

W. J. Austin, and A. M. Wallace

A parallel vision system for object recognition and location based on cooperative depth and intensity processing is described. The parallel algorithm for intensity data processing is based on generation of hypothesised matches between line junctions in the image and space curve intersections in the model. These hypotheses lead to back projection of the model and verification of promising hypotheses. The parallel algorithm for depth data processing is based on a tree search algorithm constrained by pairwise geometry between primitives. As each algorithm proceeds, partial results are interchanged to direct the other concurrent process to a more promising or more viable solution. The architecture has been implemented and evaluated on a multi-transputer machine, and is illustrated by several examples of pose definition of a test object.

Short Code:    [Austin:97a]

Reference:    Vol. 9, No. 2, pp. 141-162 (C298)


A PRAM Oriented Programming System

C. Leon, C. Rodriguez, F. Garcia, and F. de Sande

A PRAM-oriented programming language called ll and its implementation on transputer networks are presented. The approach taken is a compromise between efficiency and simplicity. The ll language has been conceived as a tool for the study, design, analysis, verification, and teaching of parallel algorithms. A method for the complexity analysis of ll programs called PRSW is introduced. The ll compiler guarantees the conservation of the PRSW complexity of the algorithms translated. Furthermore, the computational results illustrate the good behaviour of the system for PRAM algorithms.

Short Code:    [Leon:97a]

Reference:    Vol. 9, No. 3, pp. 163-179 (C299)


Load Balancing Schemes for Extrapolation Methods

Thomas Rauber and Gudula Rünger

Solving initial value problems (IVPs) for ordinary differential equations (ODEs) has long been believed to be an inherently sequential procedure. But IVP solvers using the extrapolation method provide high quality solutions and offer a large potential for parallelism. In this paper, we present algorithms for extrapolation methods on distributed memory multiprocessors that combine different levels of parallelism. These algorithms differ mainly in the partitioning of the processors into groups which are responsible for the execution of the independent tasks of the extrapolation method. We present the algorithms in a compute-communicate scheme using appropriate primitives for the communication. A detailed analysis shows that a sophisticated load balancing scheme is required to achieve a good speedup. We describe an optimal method based on Lagrange multipliers, investigate several simple heuristic schemes, and compare the heuristic schemes with an estimation for the optimal solution. An implementation of these schemes on an Intel iPSC/860 confirms the predicted runtimes.

Short Code:    [Rauber:95a]

Reference:    Vol. 9, No. 3, pp. 181-202 (C300)


Contour Ranking on Coarse Grained Machines: A Case Study for Low-level Vision Computations

F. Hameed, S. E. Hambrusch, A. A. Khokhar, and J. N. Patel

In this paper we present parallel solutions for performing image contour ranking on coarse-grained machines. In contour ranking, a linear representation of the edge contours is generated from the edge contours of a raw image. We describe solutions that employ different divide-and-conquer approaches and that use different communication patterns. The combining step of the divide-and-conquer solutions uses efficient sequential techniques for merging information about subimages. The proposed solutions are implemented on Intel Delta and Intel Paragon machines. We discuss performance results and present scalability analysis using different image and machine sizes.

Short Code:    [Hameed:95a]

Reference:    Vol. 9, No. 3, pp. 203-221 (C263)


Experiences with a Wide Area Network Metacomputing Management Tool using IBM SP2 Parallel Systems

R. Baraglia, G. Faieta, M. Formica, and D. Laforenza

The need to deal with highly complex scientific problems has led some researchers to use geographically distributed computing resources as a single powerful parallel machine. The word metacomputing has been coined to describe this new approach. This paper outlines some experiences in metacomputing carried out on a wide area network. A brief overview is given of the main issues concerning metacomputing and some experiments in this field are reported. WAMM, the metacomputer visual interface we have developed, is described. Astrophysics is a particularly fruitful field for numerically intensive computer simulation, because the systems under study, stars and galaxies, are not amenable to controlled laboratory experiments. An astrophysical application of the gravitational n-body problem, developed in conjunction with the Scuola Normale Superiore of Pisa and used to test WAMM's utility in metacomputer management, is described, along with some performance figures from the case study. Work related to WAMM and conclusions are presented.

Short Code:    [Baraglia:97a]

Reference:    Vol. 9, No. 3, pp. 223-239 (C272)


Parallel Generation of Adaptive Multiresolution Structures for Image Processing

Xi Li, Sotirios G. Ziavras, and Constantine N. Manikopoulos

In this paper we present an algorithm for creating region-adjacency-graph (RAG) pyramids on TurboNet, an experimental parallel computer system. Each level of these hierarchies of irregular tessellations is generated by independent stochastic processes that adapt the structure of the pyramid to the content of the image. RAGs can be used in multiresolution image analysis to extract connected components from labeled images. The implementation of the algorithm is discussed and performance results are presented for three different communication techniques which are supported by the TurboNet's hybrid architecture. The results indicate that efficient communications are vital to good performance of the algorithm.

Short Code:    [Li:97a]

Reference:    Vol. 9, No. 4, pp. 241-254 (C260)


SUMMA: Scalable Universal Matrix Multiplication Algorithm

R. A. van de Geijn, and J. Watts

In this paper, we give a straightforward, highly efficient, scalable implementation of common matrix multiplication operations. The algorithms are much simpler than previously published methods, yield better performance, and require less work space. MPI implementations are given, as are performance results on the Intel Paragon system.
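The structural idea behind SUMMA can be shown serially: the product is accumulated as a sequence of rank-1 (outer-product) updates, which is what the parallel algorithm distributes via row and column broadcasts over the process grid. A minimal serial sketch, with the broadcasts elided:

```python
import numpy as np

def summa_serial(A, B):
    # Form C = A B as a sum of rank-1 updates, one per inner-dimension
    # index. In the parallel algorithm, each step broadcasts a column
    # panel of A along process rows and a row panel of B along process
    # columns, followed by a local multiply-add.
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for p in range(k):
        C += np.outer(A[:, p], B[p, :])
    return C
```

In practice panels of several columns/rows are used per step to amortize broadcast startup cost; the single-column form above is only the simplest instance.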

Short Code:    [Geijn:97a]

Reference:    Vol. 9, No. 4, pp. 255-274 (C267)


Automatic Selection of Load Balancing Parameters using Compile-time and Run-time Information

B. S. Siegell, and P. A. Steenkiste

Clusters of workstations are emerging as an important architecture. Programming tools that aid in distributing applications on workstation clusters must address the problems of mapping the application, handling heterogeneity, and maximizing system utilization in the presence of varying resource availability. Both computation and communication capabilities may vary with time due to other applications competing for resources, so dynamic load balancing is a key requirement. For greatest benefit, the tool must support a relatively wide class of applications running on clusters with a range of computation and communication capabilities. We have developed a system that supports dynamic load balancing of distributed applications consisting of parallelized DOALL and DOACROSS loops. The focus of this paper is on how the system automatically determines key load balancing parameters using run-time information and information provided by programming tools such as a parallelizing compiler. The parameters discussed are the grain size of the application, the frequency of load balancing, and the parameters that control work movement. Our results are supported by measurements on an implementation for the Nectar system at Carnegie Mellon University and by simulation.

Short Code:    [Siegell:97a]

Reference:    Vol. 9, No. 4, pp. 275-317 (C305)


A Comparison of Optimization Heuristics for the Data Mapping Problem

Nikos Chrisochoides, Nashat Mansour, and Geoffrey Fox

In this paper we compare the performance of six heuristics with suboptimal solutions for the data distribution of two-dimensional meshes that are used for the numerical solution of partial differential equations (PDEs) on multicomputers. The data mapping heuristics are evaluated with respect to seven criteria covering load balancing, interprocessor communication, flexibility and ease of use for a class of single-phase iterative PDE solvers. Our evaluation suggests that the simple and fast block distribution heuristic can be as effective as the other five complex and computationally expensive algorithms.
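The block distribution heuristic singled out above is simple enough to state in a few lines. A hedged sketch of the usual block mapping (the paper's exact variant may differ in details such as block-size rounding):

```python
import math

def block_distribution(n_rows, n_cols, p_rows, p_cols):
    # Assign each mesh point to a processor by contiguous blocks:
    # point (i, j) goes to processor (i // rb, j // cb), where rb and
    # cb are the block heights and widths.
    rb = math.ceil(n_rows / p_rows)
    cb = math.ceil(n_cols / p_cols)
    return {(i, j): (i // rb, j // cb)
            for i in range(n_rows)
            for j in range(n_cols)}
```

Its appeal for single-phase iterative PDE solvers is that nearest-neighbor mesh points mostly land on the same processor, so communication is confined to block boundaries.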

Short Code:    [Chrisochoides:97a]

Reference:    Vol. 9, No. 5, pp. 319-343 (C215)


A Poly-Algorithm for Parallel Dense Matrix Multiplication on Two-Dimensional Process Grid Topologies

Jin Li, Anthony Skjellum, and Robert D. Falgout

In this paper, we present several new and generalized parallel dense matrix multiplication algorithms on two-dimensional process grid topologies. These algorithms can deal with rectangular matrices distributed on rectangular grids. We classify these algorithms coherently into three categories according to the communication primitives used, and thus we offer a taxonomy for this family of related algorithms. All these algorithms are expressed in a data-distribution-independent approach and thus do not require a specific data distribution for correctness. The algorithmic compatibility condition result shown here ensures the correctness of the matrix multiplication. We define and extend the data distribution functions and introduce permutation compatibility and algorithmic compatibility. We also discuss permutation compatible data distribution (modified virtual 2D data distribution). We conclude that no single algorithm always achieves the best performance on different matrix and grid shapes. A practical approach to resolve this dilemma is to use poly-algorithms. We analyze the characteristics of each of these matrix multiplication algorithms and provide initial heuristics for using the poly-algorithm. All these matrix multiplication algorithms have been tested on the IBM SP2 system. The experimental results are presented in order to demonstrate their relative performance characteristics, motivating the combined value of the taxonomy and new algorithms introduced here.

Short Code:    [Li:97b]

Reference:    Vol. 9, No. 5, pp. 345-389 (C202)


A Block Jacobi Method on a Mesh of Processors

Domingo Giménez, Vicente Hernández, Robert van de Geijn, and Antonio M. Vidal

In this paper, we study the parallelization of the Jacobi method to solve the symmetric eigenvalue problem on a mesh of processors. To solve this problem with a theoretical efficiency of 100%, it is necessary to exploit the symmetry of the matrix. The only previous algorithm we know of that exploits the symmetry on multicomputers is that of van de Geijn (1991), but it uses a storage scheme suited to a logical ring of processors and therefore has low scalability. In this paper we show how matrix symmetry can be exploited on a logical mesh of processors, obtaining higher scalability than that of van de Geijn's algorithm. In addition, we show how the storage scheme exploiting the symmetry can be combined with a scheme by blocks to obtain a highly efficient and scalable Jacobi method for solving the symmetric eigenvalue problem on distributed memory parallel computers. We report performance results from the Intel Touchstone Delta, the iPSC/860, the Alliant FX/80 and the PARSYS SN-1040.
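For readers unfamiliar with the method being parallelized, a minimal serial sketch of cyclic Jacobi for the symmetric eigenvalue problem follows. This shows only the numerical kernel; the paper's contribution, the symmetric block storage scheme on a processor mesh, is not represented here:

```python
import numpy as np

def jacobi_eigen(A, sweeps=10):
    # Cyclic Jacobi: zero each off-diagonal entry in turn with a plane
    # rotation; the diagonal converges to the eigenvalues.
    A = A.astype(float).copy()
    n = A.shape[0]
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p, q]) < 1e-12:
                    continue
                # choose theta so the rotated A[p, q] becomes zero
                theta = 0.5 * np.arctan2(2.0 * A[p, q], A[q, q] - A[p, p])
                c, s = np.cos(theta), np.sin(theta)
                J = np.eye(n)
                J[p, p] = J[q, q] = c
                J[p, q], J[q, p] = s, -s
                A = J.T @ A @ J     # similarity transform preserves spectrum
    return np.sort(np.diag(A))
```

Parallel Jacobi methods apply many such rotations concurrently on disjoint index pairs, which is why the data layout across the mesh matters so much for efficiency.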

Short Code:    [Gimenez:97a]

Reference:    Vol. 9, No. 5, pp. 391-411 (C270)


Notebook Interfaces for Networked Scientific Computing: Design and WWW Implementation

S. Weerawarana, A. Joshi, E. N. Houstis, J. R. Rice, and A. C. Catlin

Advances in wired and wireless networking technologies are making networked computing the most common form of high performance computing. Similarly, software like Mosaic and Netscape have not only unified the networked computing landscape, but they made it available to the masses in a simple, machine independent way. These developments are changing the way we do computational science, learn, research, collaborate, access information and resources, the maintain local and global relations. We envision a scenario where large scale computational science and engineering applications like virtual classrooms and laboratories are ubiquitous, and information resources are accessible on-demand from anywhere. In this paper we present the design of a user interface that will be appropriate to this scenario. We argue that interfaces modeled on the pen and paper paradigm are suited in this context. Specifically, we present the software architecture of a notebook interface. We lay down the requirements for such an interface and present its implementation using the World Wide Web. A realization of the notebook model is presented for a problem solving environment (PDELab) to support the numerical simulation of PDE based applications on a network of heterogeneous high performance machines.

Short Code:    [Weerawarana:97a]

Reference:    Vol. 9, No. 7, pp. 675-695 (C280)


Performance Modeling of Dense Cholesky Factorization on the MasPar MP-2

Vivek Garg and David E. Schimmel

In this paper we study various implementations of Cholesky factorization on SIMD architectures. A submatrix algorithm is implemented on the MasPar MP-2 using both block and torus-wrap data mappings. Both LL^T and LDL^T (square-root-free) implementations of the algorithm are investigated. The execution times and performance results for the MP-2 are presented. The performance of these algorithms is characterized in terms of the problem size, machine size, and other machine dependent communication and computation parameters. Analysis for the communication and computation complexities for these algorithms is also presented, and models to predict the performance are derived. The torus-wrap mapped implementations outperformed the block approach for all problem sizes. The LDL^T implementation outperformed LL^T for small to medium problem sizes.
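The square-root-free variant mentioned above factors A = LDL^T with unit lower triangular L and diagonal D, avoiding the square roots of the LL^T form. A serial sketch of the factorization (the paper's SIMD submatrix formulation and data mappings are not shown):

```python
import numpy as np

def ldlt(A):
    # Square-root-free factorization A = L D L^T of a symmetric
    # positive-definite matrix; L is unit lower triangular, d holds
    # the diagonal of D. No sqrt appears, unlike LL^T Cholesky.
    n = A.shape[0]
    L = np.eye(n)
    d = np.zeros(n)
    for j in range(n):
        d[j] = A[j, j] - np.sum(L[j, :j] ** 2 * d[:j])
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - np.sum(L[i, :j] * L[j, :j] * d[:j])) / d[j]
    return L, d
```

On machines where square roots are expensive relative to multiplies, this trade is what makes LDL^T competitive for small to medium problems, consistent with the result reported above.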

Short Code:    [Garg:97a]

Reference:    Vol. 9, No. 7, pp. 697-719 (C273)


Parallel Program Analysis on Workstation Clusters: Speedup Profiling and Latency Hiding

Roberto Togneri

Parallel programming on workstation clusters is subject to many factors and problems which determine the potential success or failure of any individual implementation. The most obvious problems are the difficulty in developing parallel algorithms and the high communication latency which may render such algorithms inefficient. In an attempt to address some of these issues we propose a strategy for estimating the potential speedup of a parallel program based on computation and communication profiling. We show that our proposed strategy yields accurate estimates of the speedup. We also propose a complete communication model so that the speedup can be estimated under different program inputs, and show that moderately accurate estimates can be obtained. High communication latency is the major problem with workstation cluster computing. We examine this problem from the system-level point of view and show experimentally that latency hiding can allow almost full utilisation of the CPU resource even though individual programs may suffer from high communication latencies.

Short Code:    [Togneri:97a]

Reference:    Vol. 9, No. 7, pp. 721-751 (C294)


Algorithms for Solving a Spatial Optimisation Problem on a Parallel Computer

F. George, N. Radcliffe, M. Smith, M. Birkin, and M. Clarke

In a collaborative project between GMAP Ltd. and EPCC, an existing heuristic optimisation scheme for strategic resource planning was parallelised to run on the data parallel Connection Machine CM-200. The parallel software was found to run over 2,700 times faster than the original workstation software. This has allowed the exploration of complex business planning strategies at a national, rather than regional, level for the first time. The availability of a very fast evaluation program for planning solutions also enabled an investigation of the use of genetic algorithms in place of GMAP's existing heuristic optimisation scheme. The results of this study show that genetic algorithms can provide better quality solutions in terms of both predicted profit from the solution, and spatial diversity to provide a range of possible solutions. This paper discusses both the parallelisation of the original optimisation scheme and the use of genetic algorithms in place of this method.

Short Code:    [George:97a]

Reference:    Vol. 9, No. 8, pp. 753-780 (C301)


Performance Comparison of a Set of Periodic and Non-Periodic Tridiagonal Solvers on SP2 and Paragon Parallel Computers

Xian-He Sun, and Stuti Moitra

Various tridiagonal solvers have been proposed in recent years for different parallel platforms. In this paper, the performance of three tridiagonal solvers, namely, the parallel partition LU algorithm, the parallel diagonal dominant algorithm, and the reduced diagonal dominant algorithm, is studied. These algorithms are designed for distributed-memory machines and are tested on Intel Paragon and IBM SP2 machines. Measured results are reported in terms of execution time and speedup. An analytical study is conducted for different communication topologies and for different tridiagonal systems. The measured results match the analytical results closely. In addition to addressing implementation issues, performance considerations such as problem sizes and models of speedup are also discussed.
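All of the parallel solvers compared above build on the sequential tridiagonal recurrence each partition applies locally. A sketch of that building block, the classical Thomas algorithm (the partitioning and inter-processor reduction steps of the paper's algorithms are not shown):

```python
def thomas(a, b, c, d):
    # Sequential Thomas algorithm: a = sub-diagonal (a[0] unused),
    # b = diagonal, c = super-diagonal (c[-1] unused), d = right-hand
    # side. Returns x solving the tridiagonal system.
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):           # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

The recurrence is inherently sequential, which is why the partition-based algorithms studied in the paper restructure the elimination rather than simply distributing this loop.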

Short Code:    [Sun:97a]

Reference:    Vol. 9, No. 8, pp. 781-801 (C279)


A Benchmark Study Based on the Parallel Computation of the Vector Outer-Product A = uv^T Operation

Rudnei Dias da Cunha

In this paper we benchmark the performance of the Cray T3D, IBM 9076 SP/1 and Intel Paragon XP/S parallel computers, using implementations of parallel algorithms for the computation of the vector outer-product A = uv^T operation. The vector outer-product operation, although very simple in nature, requires the computation of a large number of floating-point operations and its parallelization induces a great level of communication between the processors. It is thus suited to measure the relative speed of the processor, memory subsystem and network capabilities of a parallel computer. It should not be considered a "toy problem," since it arises in numerical methods in the context of the solution of systems of non-linear equations, still a difficult problem to solve. We present algorithms for both the explicit shared-memory and message-passing programming models together with theoretical computation models for those algorithms. Actual experiments were run on those computers, using Fortran 77 implementations of the algorithms. The results obtained with these experiments show that due to the high degree of communication between the processors one needs a parallel computer with fast communications and carefully implemented data exchange routines. The theoretical computation model allows prediction of the speed-up to be obtained for some problem size on a given number of processors.
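The operation being benchmarked decomposes naturally by rows: each processor can own a chunk of u and form its block of rows of A = uv^T after receiving v. A hedged sketch of that decomposition, simulated sequentially (the paper's Fortran 77 shared-memory and message-passing codes are not reproduced):

```python
import numpy as np

def outer_product_blocks(u, v, p):
    # Row-block decomposition of A = u v^T: "processor" k holds the
    # k-th chunk of u and computes the corresponding block of rows of
    # A once v has been communicated to it.
    rows = np.array_split(np.arange(len(u)), p)
    return np.vstack([np.outer(u[r], v) for r in rows])
```

The communication cost of distributing v (or gathering A) relative to the O(mn) flops of the update is exactly what makes this kernel a sensitive probe of a machine's network, as argued above.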

Short Code:    [Cunha:97a]

Reference:    Vol. 9, No. 8, pp. 803-819 (C295)


Parallel Join for IBGF Partitioned Relational Databases

M. Bozyigit, S. A. Mohammed, and M. Al-Tayyeb

This study is concerned with a parallel join operation where the subject relations are partitioned according to an Interpolation Based Grid File (IBGF) scheme. The partitioned relations and directories are distributed over a set of independently accessible external storage units, together with the partitioning control data. The join algorithms executed by a mesh type parallel computing system allow handling of uniform as well as nonuniformly partitioned relations. Each processor locates and retrieves the data partitions it is to join at each step of the join process, in synchronization with other processors. The approach is found to be feasible as the speedup and efficiency results found by simulation are consistent with theoretical bounds. The algorithms are tuned to join-key distributions, so that effective load balancing is achieved during the actual join.

Short Code:    [Bozyigit:97a]

Reference:    Vol. 9, No. 8, pp. 821-836 (C324)