Concurrency: Practice and Experience
Published Abstracts for 1996
Exploiting Application Parallelism in Knowledge-based
Systems: An Experimental Method
J. W. H. Daniel and M. R. Moulding
A method is presented which aims to enhance the runtime performance of
real-time production systems by utilizing natural concurrency in the
application knowledge base. This exploiting application parallelism (EAP)
method includes an automated analysis of the knowledge base and the use of
this analysis information to partition and execute rules on a novel parallel
production system (PPS) architecture. Prototype analysis tools and a PPS
simulator have been developed for the Inference ART environment in order to
apply the method to a Naval data-fusion problem. The results of this
experimental investigation revealed that an average maximum of 12.06
rule-firings/cycle was possible but that, owing to serial bottlenecks inherent
in the data-fusion problem, only up to 2.14 rule-firings/cycle were achieved
overall. Limitations of the EAP method are discussed within the context of
the experimental results and an enhanced method is investigated.
Short Code: [Daniel:96a]
Reference: Vol. 8, No. 1, pp. 1--18 (C257)
Native and Generic Parallel Programming Environments
on a Transputer and a PowerPC Platform
A. G. Hoekstra, P. M. A. Sloot, F. van der Linden, M.
van Muiswinkel, J. J. J. Vesseur, and L. O. Hertzberger
Genericity of parallel programming environments, enabling
development of portable parallel programs, is expected to result in
performance penalties. Furthermore, programmability and tool support
of programming environments are important issues if a choice between
programming environments has to be made. In this paper we propose a
methodology to compare native and generic parallel programming
environments, taking into account such competing issues as portability
and performance. As a case study, this paper compares the
Iserver-Occam, Parix, Express, and PVM parallel programming
environments on a 512-node Parsytec GCel. Furthermore, we apply our
methodology to compare Parix and PVM on a new architecture, a 32-node
Parsytec PowerXplorer, which is based on the PowerPC chip. In our
approach we start with a representative application and isolate the
basic, environment-dependent building blocks. These basic building
blocks, which depend on floating-point performance and communication
capabilities of the environments, are analyzed independently. We have
measured point-to-point communication times, global communication times
and floating-point performance. All information is combined into a
time complexity analysis, allowing the environments to be compared
at different degrees of functionality. Together with demands for
portability of the code and development time (i.e., programmability),
an overall judgment of the environments is given.
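As an illustration of how such point-to-point building blocks are typically
measured, the following sketch times a ping-pong exchange. It assumes Python
with the mpi4py package, purely for illustration; the environments compared in
the paper (Iserver-Occam, Parix, Express and PVM) each provide their own
equivalent primitives.

    # Hypothetical ping-pong timing sketch (Python + mpi4py assumed);
    # run with at least two MPI processes.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    reps = 1000
    buf = np.zeros(1024, dtype='b')      # a 1 KB message

    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(reps):
        if rank == 0:
            comm.Send(buf, dest=1)
            comm.Recv(buf, source=1)
        elif rank == 1:
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
    t1 = MPI.Wtime()

    if rank == 0:
        # One-way time is half the averaged round-trip time.
        print("one-way time: %.3e s" % ((t1 - t0) / (2 * reps)))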
Short Code: [Hoekstra:96a]
Reference: Vol. 8, No. 1, pp. 19--46 (C241)
Benchmarking the Computation and Communication
Performance of CM-5
Kivanc Dincer, Zeki Bozkus, Sanjay Ranka, and Geoffrey Fox
Thinking Machines' CM-5 is a distributed-memory,
message-passing computer. In this paper we devise a performance
benchmark for the base and vector units and the data communication
networks of the CM-5 machine. We model the communication
characteristics such as communication latency and bandwidths of
point-to-point and global communication primitives. We show, on a
simple Gaussian elimination code, that an accurate static performance
estimation of parallel algorithms is possible by using those basic
machine properties connected with computation, vectorization,
communication, and synchronization. Furthermore, we describe the
embedding of meshes or hypercubes on the CM-5 fat-tree topology and
illustrate the performance results of their basic communication
primitives.
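A minimal sketch of such a static estimate, with placeholder machine constants
rather than measured CM-5 figures: per elimination step, the model charges a
pivot-row broadcast plus each processor's share of the local update work.

    import math

    def comm_time(n_bytes, latency=1e-5, bandwidth=1e7):
        # Simple latency/bandwidth model for one point-to-point message;
        # the constants are placeholders, not measured CM-5 values.
        return latency + n_bytes / bandwidth

    def gauss_elim_estimate(n, p, flop_rate=1e7):
        # Rough model of row-blocked Gaussian elimination: at step k the
        # pivot row is broadcast (modelled as log2(p) point-to-point hops)
        # and each processor updates its share of the remaining rows.
        total = 0.0
        for k in range(n - 1):
            row_bytes = 8 * (n - k)
            bcast = comm_time(row_bytes) * math.ceil(math.log2(p))
            update = 2.0 * (n - k) * ((n - k - 1) / p) / flop_rate
            total += bcast + update
        return total

    print("estimated time: %.3f s" % gauss_elim_estimate(n=512, p=32))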
Short Code: [Dincer:96a]
Reference: Vol. 8, No. 1, pp. 47--69 (C156)
Building a Global Clock for Observing Computations in Distributed
Memory Parallel Computers
Jean-Marc Jézéquel and Claude Jard
A common time reference (i.e., a global clock) is needed for observing
the behavior of a distributed algorithm on a distributed computing
system. This paper presents a pragmatic algorithm to build a global
time on any distributed system, which is optimal for homogeneous
distributed memory parallel computers (DMPC). In order to observe and
sort concurrent events in common DMPCs, we need a global clock with a
resolution finer than the message transfer time variance, which is
better than what deterministic and fault-tolerant algorithms can obtain.
Thus a statistical method is chosen as a building block to derive an
original algorithm valid for any topology. Its main originality over
related approaches is to cope with the problem of clock granularity in
computing frequency offsets between local clocks to achieve a
resolution comparable with the resolution of the physical clocks. This
algorithm is particularly well suited for debugging distributed
algorithms by means of trace recordings because after its acquisition
step, it does not induce message overhead: the perturbation induced on
the execution remains as small as possible. It has been implemented on
various DMPCs: Intel iPSC/2 hypercube and Paragon XP/S, Transputer-based
networks and Sun networks, so we can provide some data about its
behavior and performance on these DMPCs.
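As a generic illustration (not the authors' exact algorithm), the statistical
building block can be pictured as a least-squares fit through many
(local time, remote time) samples taken from timestamped messages; the slope
gives the frequency offset and the intercept the clock offset used to map
local timestamps onto the common time base.

    def estimate_drift_and_offset(samples):
        # samples: list of (t_local, t_remote) pairs gathered from
        # timestamped messages during the acquisition step.
        # Least-squares fit of t_remote ~= a * t_local + b, where a is the
        # relative clock rate (frequency offset) and b the clock offset.
        n = len(samples)
        sx = sum(t for t, _ in samples)
        sy = sum(r for _, r in samples)
        sxx = sum(t * t for t, _ in samples)
        sxy = sum(t * r for t, r in samples)
        a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
        b = (sy - a * sx) / n
        return a, b

    def to_global_time(t_local, a, b):
        # Map a local timestamp onto the common (global) time base.
        return a * t_local + b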
Short Code: [Jezequel:96a]
Reference: Vol. 8, No. 1, pp. 71--89 (C201)
Portable Parallelizing Fortran Compiler
A. Averbuch, R. Dekel, and E. Gabber
The Portable Parallelizing Fortran Compiler (PPFC) is an
additional component for the portable programming environment
developed in Tel-Aviv University for scientific code. This environment
supports portable and efficient programming of diverse MIMD
multiprocessors, both distributed- and shared-memory.
Until now this environment consisted of two tools: the Virtual
Machine for MultiProcessors (VMMP) and the Portable Parallelizing
Pascal Compiler. We added the PPFC, which is an automatic
parallelizing compiler for the Fortran language. The compiler is fully
automatic (it requires no additional declarations to assist
parallelization), is suited to code characterized by loops operating on
regular data structures, and produces efficient and portable code for a
variety of multiprocessors from the same serial code.
The parallel implementation uses the VMMP which is a software package
that provides a coherent set of services for explicitly parallel
application programs running on diverse MIMD multiprocessors. VMMP is
intended to simplify parallel program writing and to promote portable
and efficient programming.
The PPFC parallelized 12 out of the 24 Livermore Loops. It
was also applied to all 14 Fortran application
programs that had been parallelized by the Pascal compiler, achieving the
same speedups and efficiencies. In most examples, the PPFC
achieved high speedups and efficiencies on all target
multiprocessors.
The PPFC emphasizes efficiency and code portability. Although
PPFC employs a relatively simple data flow analysis, it produces
efficient code for various widely used application programs.
Short Code: [Averbuch:96a]
Reference: Vol. 8, No. 2, pp. 91--123 (C214)
Reliable Parallel Software Construction using PARSE
Ian Gorton, Innes Jelly, Jonathan Gray, and Toong Shoon Chan
The PARSE design methodology provides a hierarchical, object-based
approach to the development of parallel software systems. A system
design is initially structured into a collection of concurrently
executing objects which communicate via message-passing. A graphical
notation known as the process graph is then used to capture the
structural and important dynamic properties of the system. Process
graph designs can then be semi-mechanically transformed into complete
Petri nets to give a detailed, executable and potentially verifiable
design specification. From a complete design, translation rules for
target programming languages are defined to enable the implementation
to proceed in a systematic manner. The paper describes the steps in
the PARSE methodology and the process graph notation, and illustrates
the application of PARSE from design specification to program code
using a network protocol example.
Short Code: [Gorton:96a]
Reference: Vol. 8, No. 2, pp. 125--146 (C234)
An Experiment to Measure the Usability of Parallel Programming
Systems
Duane Szafron and Jonathan Schaeffer
The growth of commercial and academic interest in parallel and
distributed computing during the past fifteen years has been
accompanied by a corresponding increase in the number of available
parallel programming systems (PPS). However, little work has been
done to evaluate their usability, or to develop criteria for such
evaluations. As a result, the usability of a typical PPS is based on
how easily a small set of trivially parallel algorithms can be
implemented by its authors.
This paper discusses the design and results of an experiment to
compare objectively the usability of two PPSs. Half of the students
in a graduate parallel and distributed computing course solved a
problem using the Enterprise PPS while the other half used a PVM-like
library of message-passing routines. The objective was to measure
usability. This experiment provided valuable feedback as to what
features of PPSs are useful and the benefits they provide during the
development of parallel programs. Although many usability experiments
have been conducted for sequential programming languages and
environments, they are rare in the parallel programming domain. Such
experiments are necessary to help narrow the gap between what parallel
programmers want, and what current PPSs provide.
Short Code: [Szafron:96a]
Reference: Vol. 8, No. 2, pp. 147--166 (C235)
Convergence of Concurrent Markov Chain Monte Carlo Algorithms
Maurits Malfait and Dirk Roose
We examine the parallel execution of a class of stochastic
algorithms called Markov Chain Monte Carlo (MCMC) algorithms. We
focus on MCMC algorithms in the context of image processing, using
Markov Random Field models. Our parallelization approach is based on
several, concurrently running, instances of the same stochastic
algorithm that deal with the whole data set. First we show that the
speed-up of the parallel algorithm is limited because of the
statistical properties of the MCMC algorithm. We examine coupled MCMC
as a remedy for this problem. Secondly, we exploit the parallel
execution to monitor the convergence of the stochastic algorithms in
a statistically reliable manner. This new convergence measure for
MCMC algorithms performs well, and is an improvement on known
convergence measures. We also link our findings with recent work in
the statistical theory of MCMC.
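As a generic illustration of how several concurrently running chains can be
monitored, the sketch below computes a between-chain/within-chain variance
ratio in the spirit of the Gelman-Rubin diagnostic; the paper's own measure is
developed in the full text and need not coincide with this one.

    import numpy as np

    def potential_scale_reduction(chains):
        # chains: array of shape (m, n) -- m concurrently run chains of a
        # scalar statistic, n samples each.  Values close to 1 indicate
        # that the chains have forgotten their starting points.
        chains = np.asarray(chains, dtype=float)
        m, n = chains.shape
        within = chains.var(axis=1, ddof=1).mean()
        between = n * chains.mean(axis=1).var(ddof=1)
        pooled = (n - 1) / n * within + between / n
        return np.sqrt(pooled / within)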
Short Code: [Malfait:96a]
Reference: Vol. 8, No. 3, pp. 167--189 (C236)
Towards a Complete Framework for Parallel Implementation of
Logic Languages: The Data Parallel Implementation of SEL
Giancarlo Succi and Carl Uhrik
Although logic languages, due to their declarative nature, are
widely proclaimed to be conducive in theory to parallel
implementation, in fact there appears to be insufficient practical
evidence to stimulate further developments in this field. The paper
puts forward various complications which arise in assuming a solely
process parallel approach as a possible explanation for this
situation.
As an alternative, data parallelism is posited as an underutilized
forte of logic programming. This paper illustrates a data parallel
implementation of a particular language called SEL which is based on
sets. Thus, SEL (Set Equational Language) is introduced as an example
of a logic language that lends itself to an efficient data parallel
implementation.
The strategy of this implementation assumes an abstract machine
called SAM (set-oriented abstract machine) which is based on the WAM
(Warren Abstract Machine). SAM serves as an intermediary between the
SEL language and the target machine for the implementation, the
Connection Machine. Finally, some preliminary benchmarks are
presented.
Short Code: [Succi:96a]
Reference: Vol. 8, No. 3, pp. 191--204 (C165)
Do-Loop-Surface: An Abstract Representation of Parallel Program
Performance
Oscar Naím, Tony Hey, and Ed Zaluska
Performance is a critical issue in current massively parallel
processors. However, delivery of adequate performance is not automatic
and performance evaluation tools are required in order to help the
programmer to understand the behavior of a parallel program.
In recent years, a wide variety of tools have been developed for this
purpose including tools for monitoring and evaluating performance and
visualization tools. However, these tools do not provide an abstract
representation of performance. Massively parallel processors can
generate a huge amount of performance data and sophisticated methods
for representing and displaying these data (e.g., visual and aural) are
required. Performance views are not scalable in general and do not
represent an abstraction of the performance data.
The Do-Loop-Surface display is proposed as an abstract representation
of the performance of a particular do-loop in a program. It has been
used to improve the performance of a Matrix Multiply parallel algorithm
as well as to understand the behavior of the following applications:
Matrix Transposition (TRANS1) and Fast Fourier Transform (FFT1) from
the Genesis Benchmarks, and the kernel of a fluid dynamics package
(FIRE). These experiments were performed on a CM-5, Meiko CS-1, and a
PARSYS Supernode. The examples demonstrate that the Do-Loop-Surface
display is a useful way to represent performance. It is implemented
using AVS (Application Visualization System), a standard data
visualization package.
Short Code: [Naim:96a]
Reference: Vol. 8, No. 3, pp. 205--234 (C246)
Dynamics Simulation of Multibody Chains on a Transputer System
B. Pond and I. Sharf
This paper describes the implementation on a transputer system of a
novel parallel algorithm for dynamics simulation of a multibody chain.
The algorithm is formulated at a level of parallelism which is natural
for the problem but is essentially unavailable to other simulation
dynamics algorithms. The experimental results demonstrate that one can
improve efficiency of computation by exploiting this level of
parallelism. However, analysis of the performance shows that the
serial component of the resulting parallel algorithm grows to be a
large fraction of the total parallel execution time and therefore
limits the speedup that can be achieved with this approach.
Short Code: [Pond:96a]
Reference: Vol. 8, No. 3, pp. 235--249 (C230)
A Parallel Algorithm for the Integer Knapsack Problem
D. Morales, J. L. Roda, C. Rodriguez, F. Almeida, and F. Garcia
A sequential algorithm for the Integer Knapsack Problem is presented,
where M is the capacity of the knapsack and n the number
of objects. The algorithm admits an efficient parallelization on a
p-processor ring machine. The resulting parallel algorithm
is compared with a version of the well-known Lee algorithm adapted to
the Integer Knapsack Problem. Computational results on both a local
area network and a transputer network are reported.
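For background only, the textbook dynamic program for the integer (unbounded)
knapsack problem runs in O(nM) time and is sketched below; the paper presents
its own sequential algorithm and its ring parallelization, which are not
reproduced here.

    def integer_knapsack(capacity, weights, values):
        # best[c] = maximum value achievable with total weight at most c,
        # allowing unlimited copies of each object (O(n*M) time).
        best = [0] * (capacity + 1)
        for c in range(1, capacity + 1):
            for w, v in zip(weights, values):
                if w <= c and best[c - w] + v > best[c]:
                    best[c] = best[c - w] + v
        return best[capacity]

    print(integer_knapsack(10, [2, 3, 4], [3, 5, 6]))   # prints 16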
Short Code: [Morales:96a]
Reference: Vol. 8, No. 4, pp. 251--260 (C242)
On Estimating the Useful Work Distribution of Parallel
Programs under P3T:
A Static Performance Estimator
Thomas Fahringer
In order to improve a parallel program's performance it is critical to
evaluate how even the work contained in a program is distributed over
all processors dedicated to the computation. Traditional work
distribution analysis is commonly performed at the machine level. The
disadvantage of this method is that it cannot identify whether the
processors are performing useful or redundant (replicated) work. This
paper describes a novel method of statically estimating the
useful work distribution of distributed memory parallel programs at
the program level, which carefully distinguishes between useful and
redundant work.
The amount of work contained in a parallel program, which correlates
with the number of loop iterations to be executed by each processor,
is estimated by accurately modeling loop iteration spaces, array
access patterns and data distributions. A cost function defines the
useful work distribution of loops, procedures and the entire
program. Lower and upper bounds of the described parameter are
presented.
The computational complexity of the cost function is independent of
the program's problem size, statement execution and loop iteration
counts. As a consequence, estimating the work distribution based on
the described method is considerably faster than simulating or
actually compiling and executing the program.
Automatically estimating the useful work distribution is fully
implemented as part of P3T,
which is a static, parameter-based
performance prediction tool under the Vienna Fortran Compilation
System (VFCS). The Lawrence Livermore Loops are used as a test-case
to verify the approach.
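A toy illustration (not the paper's cost function): for a block-distributed
loop, a static work-distribution estimate can simply count the iterations each
processor owns and report how far the busiest processor deviates from the
mean.

    def block_owned_iterations(n_iters, n_procs, proc):
        # Iterations owned by 'proc' under a simple block distribution.
        block = -(-n_iters // n_procs)          # ceiling division
        lo = proc * block
        hi = min(n_iters, lo + block)
        return max(0, hi - lo)

    def work_distribution(n_iters, n_procs):
        # Ratio of the busiest processor's share to the mean share;
        # 1.0 means a perfectly even distribution.
        counts = [block_owned_iterations(n_iters, n_procs, p)
                  for p in range(n_procs)]
        mean = sum(counts) / n_procs
        return max(counts) / mean if mean else 1.0

    print(work_distribution(1000, 16))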
Short Code: [Fahringer:96a]
Reference: Vol. 8, No. 4, pp. 261--282 (C265)
Fail-Safe Concurrency in the EcliPSe System
F. Knop, V. Rego, and V. Sunderam
Local or wide-area heterogeneous workstation clusters are relatively
cheap and highly effective, though inherently unstable operating
environments for long-running distributed computations. We found this
to be the case in early experiments with a prototype of the
EcliPSe system, a software toolkit for replicative applications on
heterogeneous workstation clusters. Hardware or network failures in
computations that executed for over a day were not uncommon. In this
work, a variety of features for the incorporation of failure resilience
in the EcliPSe system are described. Key characteristics of
this fault-tolerant system are ease of use, low state-saving cost,
system scalability, and good performance. We present results of some
experiments demonstrating low state-saving overheads and small system
recovery times as a function of the amount of state saved.
Short Code: [Knop:96a]
Reference: Vol. 8, No. 4, pp. 283--312 (C237)
MP: An Application Specific Concurrent Language
J. S. Reeve and M. C. Rogers
In this paper the authors present the definition and implementation of a
concurrent language MP (Message Passer), which is aimed at programming
embedded multi-microprocessor systems. The novel features of the
language include the ability to program and develop the entire
distributed system as a single unit, and the provision of simple
restricted concurrency constructs that make such systems modular and
deadlock free. MP programs are invariant with respect to the
characteristics of a particular target machine, for instance, the
number of processors, and their mutual connectivity. Programs written
in MP can be transformed to a restricted set of Communicating
Sequential Processes (CSP), in which particular specifications can be
shown to be satisfied.
The restricted model of concurrency does not prevent the language
from being used to express all of the known coarse-grained parallel
programming paradigms. However, the structured nature of the code
makes the model particularly attractive in concurrent embedded
systems, where deadlock freedom and program correctness are prime
issues.
An MP program is a collection of objects which execute concurrently,
maintain their own local state and connect with other objects via
bound variables. Explicit communication between objects is totally
transparent to the application programmer.
A compiler for the language has been built and a multiprocessor
run-time environment has been established for any system with a C
compiler and the message-passing standard MPI library. Several
applications have been tested, ranging from a clocked digital circuit
simulation, to a simple event-driven process controller.
Short Code: [Reeve:96a]
Reference: Vol. 8, No. 4, pp. 313--333 (C245)
Performance Modelling of Three Parallel Sorting Algorithms on a
Pipelined Transputer Network
V. Lakshmi Narasimhan and J. Armstrong
The implementation of three parallel sorting algorithms, namely, binary
sort, odd-even transposition sort, and bitonic sort, on a network of
transputers is analyzed in this paper. The variation in the
performance of these algorithms as the number of processors and sort
size are changed is investigated. Experimental results show that when
up to eight transputers are used, connected as a linear pipeline
configuration, all three algorithms can achieve reasonable speedup
ratios. The bitonic sort, binary sort, and odd-even transposition sort
achieve speedup ratios of 5, 4.4, and 4, respectively, when eight processors
are used to sort 100,000 integers. Analytical models are derived which
can be used to predict the performance of the three algorithms when a
linear pipeline configuration is used. The predicted performance of
the algorithms is compared with the experimental performance in order
to validate the model. When the models are used to predict the
performance using 16 transputers, it is found that the speedup does not
significantly improve compared to the performance achieved with 8
transputers. This shows that interprocessor communication has a
significant effect on the algorithmic performance when a larger number
of processors are used. The conclusions reinforce the fact that the
binary and bitonic sorting algorithms are not well-suited to a linear
pipeline configuration and that they may perform better if a different
topology were used, for example a mesh or a cube connection scheme.
Further, the analytical technique used for performance modelling as
elaborated in the paper can be employed for other multiprocessor systems
as well.
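For reference, the sequential form of one of the three algorithms, odd-even
transposition sort, is sketched below; in the pipelined transputer
implementation each compare-exchange phase takes place between neighbouring
processors.

    def odd_even_transposition_sort(a):
        # n alternating phases of compare-exchanges on even and odd pairs.
        a = list(a)
        n = len(a)
        for phase in range(n):
            start = phase % 2
            for i in range(start, n - 1, 2):
                if a[i] > a[i + 1]:
                    a[i], a[i + 1] = a[i + 1], a[i]
        return a

    print(odd_even_transposition_sort([5, 1, 4, 2, 8, 0]))   # [0, 1, 2, 4, 5, 8]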
Short Code: [Narasimhan:96a]
Reference: Vol. 8, No. 5, pp. 335--355 (C232)
Performance Analysis of Distributed Implementations of a Fractal
Image Compression Algorithm
D. J. Jackson and G. S. Tinney
Fractal image compression provides an innovative approach to lossy
image encoding, with a potential for very high compression ratios.
Because of prohibitive compression times, however, the procedure has
proved feasible in only a limited range of commercial applications.
In the paper the authors demonstrate that, due to the independent nature of
fractal transform encoding of individual image segments, fractal image
compression performs well in a coarse-grain distributed processing
system. A sequential fractal compression algorithm is optimized and
parallelized to execute across distributed workstations and an SP2
parallel processor using the parallel virtual machine (PVM) software.
The system utilizes both static and dynamic load allocation to obtain
substantial compression time speedup over the original, sequential
encoding implementation. Considerations such as workload granularity
and compression time versus number of processors and RMS tolerance
values are also presented.
Short Code: [Jackson:96a]
Reference: Vol. 8, No. 5, pp. 357--380 (C253)
Parallel DSP Algorithms on TurboNet: An Experimental System with
Hybrid Message-Passing/Shared-Memory Architecture
Xi Li, Sotirios G. Ziavras, and Constantine N. Manikopoulos
This paper presents several parallel DSP (Digital Signal Processing)
algorithms and their performance analysis, targeting a hybrid
message-passing and shared-memory architecture that has been built at
New Jersey Institute of Technology. The current version of our system
contains eight powerful TMS320C40 processors. The algorithms are
implemented on our system using message-passing only, shared-memory
only, and, if possible, a combination of both of these parallel
processing paradigms. Comparisons show that TurboNet's robust, hybrid
architecture results in significant performance gains because of the
flexibility it introduces.
Short Code: [Li:96b]
Reference: Vol. 8, No. 5, pp. 387--411 (C252)
The W-Network: A Low-cost Fault-tolerant Multistage
Interconnection Network for Fine-grain Multiprocessing
Kevin B. Theobald
Large-scale multiprocessors require an efficient interconnection
network to achieve good performance. This network, like the rest of
the system, should be fault-tolerant (able to continue
operating even when there are hardware failures). This paper presents
the W-Network, a low-cost fault-tolerant MIN which is
well-suited to a large multiprocessor running fine-grain parallel
programs. It tolerates all single faults without any increases in
latency or decreases in bandwidth following a fault, because it
behaves just like the fault-free network even when there is a single
fault. It requires only one extra port per chip, which makes it
practical for a VLSI implementation. In addition, extra ports can be
added for replacing faulty processors with spares.
Short Code: [Theobald:96a]
Reference: Vol. 8, No. 6, pp. 415--428 (HPCS-6)
High-performance Computing using a Reconfigurable Accelerator
Reiner W. Hartenstein, Jürgen Becker, Rainer Kress, and Helmut
Reinig
This paper introduces the MoM-3 as a reconfigurable accelerator for
high performance computing at a moderate price. By using a new
machine paradigm to trigger the operations in the MoM-3, this
accelerator is especially suited to scientific algorithms, where the
hardware structure can be configured to match the structure of the
algorithm. The MoM-3 efficiently uses reconfigurable logic devices to
provide a fine-grain parallelism, and multiple address generators to
have the complete memory bandwidth free for data transfers (instead of
fetching address computing instructions). Speed-up factors up to 82,
compared to state-of-the-art workstations, are demonstrated by means
of an Ising spin system simulation example. Adding the MoM-3 as an
accelerator enables achievement of supercomputer performance from a
low-cost workstation.
Short Code: [Hartenstein:96a]
Reference: Vol. 8, No. 6, pp. 429--443 (HPCS-1)
Scanning Parameterized Polyhedron using Fourier-Motzkin
Elimination
Marc Le Fur
The paper presents two algorithms for computing a control structure
whose execution enumerates the integer vectors of a parameterized
polyhedron defined in a given context. Both algorithms reconsider the
successive projection method, based on Fourier-Motzkin pairwise
elimination, defined by Ancourt and Irigoin. The way redundant
constraints are removed in their algorithm is revisited in order to
improve the computation time for the enumeration code of higher order
polyhedrons as well as their execution time. The algorithms presented
here are at the root of the code generation in the HPF compiler
PANDORE developed at IRISA, France; a comparison of these algorithms
with the one defined by Ancourt and Irigoin is given in the class of
polyhedrons manipulated by the PANDORE compiler.
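A minimal sketch of Fourier-Motzkin pairwise elimination over rational
inequalities, without the redundant-constraint removal that the paper
revisits: to project out one variable, every constraint that bounds it from
above is combined with every constraint that bounds it from below.

    from fractions import Fraction

    def eliminate(constraints, k):
        # constraints: list of (coeffs, bound) meaning sum(coeffs[i]*x[i]) <= bound.
        # Returns an equivalent system with variable k projected out.
        zero, lower, upper = [], [], []
        for c, b in constraints:
            if c[k] == 0:
                zero.append((c, b))
            elif c[k] > 0:
                upper.append((c, b))            # yields an upper bound on x_k
            else:
                lower.append((c, b))            # yields a lower bound on x_k
        for cu, bu in upper:
            for cl, bl in lower:
                fu = Fraction(1) / cu[k]        # scale x_k coefficient to +1
                fl = Fraction(-1) / cl[k]       # scale x_k coefficient to -1
                coeffs = [fu * u + fl * l for u, l in zip(cu, cl)]
                coeffs[k] = Fraction(0)         # the x_k terms cancel
                zero.append((coeffs, fu * bu + fl * bl))
        return zero

    # Example: x + y <= 4, -x <= 0, -y <= 0; eliminating y (index 1)
    # leaves -x <= 0 and x <= 4.
    print(eliminate([([1, 1], 4), ([-1, 0], 0), ([0, -1], 0)], k=1))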
Short Code: [Fur:96a]
Reference: Vol. 8, No. 6, pp. 445--460 (HPCS-2)
A Family of Parallel QR Factorization Algorithms
Gerard G. L. Meyer and Mike Pascale
Rapid computation of the QR factorization of a matrix is fundamental
to many scientific and engineering problems. The paper presents a
family of algorithms, parameterized by the number of available processors
P, arithmetic grain aggregation parameters, and a communication grain
aggregation parameter h, that compute the QR factorization of a matrix
with minimal latency. The approach is particularly well suited for
dedicated distributed memory architectures such as linear arrays of
INMOS Transputers, Texas Instruments C40s or Analog Devices 21060s.
Short Code: [Meyer:96a]
Reference: Vol. 8, No. 6, pp. 461--473 (HPCS-5)
Visibility Analysis on a Massively Data-parallel Computer
Daniel Germain, Denis Laurendeau, and Guy Vézina
Visibility analysis algorithms use digital elevation models (DEMs),
which represent terrain topography, to determine visibility at each
point on the terrain from a given location in space. This analysis
can be computationally very demanding, particularly when manipulating
high resolution DEMs accurately at interactive response rates.
Massively data-parallel computers offer high computing capabilities
and are very well-suited to handling and processing large regular
spatial data structures. In the paper, the authors present a new
scanline-based data-parallel algorithm for visibility analysis.
Results from an implementation on a MasPar massively data-parallel
SIMD computer are also presented.
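The core per-ray operation that such algorithms reorganize for data-parallel
execution is a line-of-sight test along a terrain profile; a simple sequential
sketch, assuming uniform grid spacing, follows.

    def visible_along_profile(heights, viewer_height):
        # heights: DEM elevations sampled outward from the viewpoint at
        # uniform spacing.  A sample is visible if its elevation angle
        # (here, its slope) is at least the largest slope seen so far.
        visible = []
        max_slope = float('-inf')
        for d, h in enumerate(heights, start=1):
            slope = (h - viewer_height) / d
            visible.append(slope >= max_slope)
            max_slope = max(max_slope, slope)
        return visible

    print(visible_along_profile([1.0, 3.0, 2.0, 5.0, 4.0], viewer_height=2.0))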
Short Code: [Germain:96a]
Reference: Vol. 8, No. 6, pp. 475--487 (HPCS-3)
Parallelizing a Powerful Monte Carlo Method for Electron Beam
Dose Calculations
W. Volken, P. Schwab, H. Neuenschwander, and P. G. Kropf
The Macro Monte Carlo (MMC) method has been developed to improve the
speed of traditional Monte Carlo calculations without significant loss
in accuracy. MMC runs about one order of magnitude faster for
clinically relevant irradiation situations. For routine use in a
clinical environment a further speedup is necessary. The MMC code was
therefore parallelized and implemented on PowerPC based Parsytec
PowerXplorer and GC Power Plus systems. The performance of the
parallel code is presented and compared to the sequential
implementations on standard hardware platforms. Almost linear speedup
is achieved for the parallel sections of the code. Furthermore, the
performance of the interprocessor communication based on a virtual
topology is demonstrated for the two different machine architectures.
Short Code: [Volken:96a]
Reference: Vol. 8, No. 6, pp. 489--498 (HPCS-4)
Effective Data Parallel Computation using the Psi Calculus
L. M. R. Mullin and M. A. Jenkins
Large scale scientific computing necessitates finding a way to match
the high level understanding of how a problem can be solved with the
details of its computation in a processing environment organized as
networks of processors. Effective utilization of parallel
architectures can then be achieved by using formal methods to describe
both computations and computational organizations within these
networks. By returning to the mathematical treatment of a problem as
a high level numerical algorithm we can express it as an algorithmic
formalism that captures the inherent parallelism of the computation.
We then give a meta description of an architecture followed by the use
of transformational techniques to convert the high level description
into a program that utilizes the architecture effectively. The hope
is that one formalism can be used to describe both computations as
well as architectures and that a methodology for automatically
transforming computations can be developed. The formalism and
methodology presented in the paper are a first step toward the
ambitious goals described above. It uses a theory of arrays, the Psi
calculus, as the formalism, and two levels of conversions---one for
simplification and another for data mapping.
Short Code: [Mullin:96a]
Reference: Vol. 8, No. 7, pp. 499--515 (C218)
PB-BLAS: A Set of Parallel Block Basic Linear Algebra Subprograms
J. Choi, J. J. Dongarra, and D. W. Walker
We propose a new software package which would be very useful for
implementing dense linear algebra algorithms on block-partitioned
matrices. The routines are referred to as the block basic linear
algebra subprograms, and their use is restricted to computations in
which one or more of the matrices involved consists of a single row or
column of blocks, and in which no more than one of the matrices
consists of an unrestricted two-dimensional array of blocks. The
functionality of the block BLAS routines can also be provided by
Level 2 and 3 BLAS routines. However, for non-uniform memory access
machines, the use of the block BLAS permits certain memory access
optimizations to be exploited. This is particularly true for
distributed memory machines, for which the block BLAS are referred to
as the parallel block basic linear algebra subprograms
(PB-BLAS). The PB-BLAS are the main focus of this paper, and for a
block-cyclic data distribution, a single row or column of blocks lies
in a single row or column of the processor template.
The PB-BLAS consist of calls to the sequential BLAS for local
computations, and calls to the BLACS for communication. The PB-BLAS
are the building blocks for implementing ScaLAPACK, the
distributed-memory version of LAPACK, and provide the same ease-of-use
and portability for ScaLAPACK that the BLAS provide for LAPACK.
The PB-BLAS consist of all Level 2 and 3 BLAS routines for dense
matrix computations (not for banded matrices) and four auxiliary routines
for transposing and copying a vector and/or a block vector. The
PB-BLAS are currently available for all numeric data types, i.e., single
and double precision, real and complex.
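The block-cyclic distribution underlying the PB-BLAS can be illustrated by
the one-dimensional global-to-local index mapping sketched below (block size
nb, p processors); this is a generic sketch of the mapping, not PB-BLAS code.

    def global_to_local(i_global, nb, p):
        # 1-D block-cyclic mapping: global index -> (owning processor,
        # local index), for block size nb and p processors.
        block = i_global // nb
        proc = block % p                  # blocks are dealt out cyclically
        local_block = block // p          # position among this processor's blocks
        return proc, local_block * nb + i_global % nb

    # Example: 10 elements, block size 2, 3 processors.
    for i in range(10):
        print(i, global_to_local(i, nb=2, p=3))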
Short Code: [Choi:96a]
Reference: Vol. 8, No. 7, pp. 517--535 (C239)
SPEED: A Parallel Platform for Solving and Predicting the Performance
of PDEs on Distributed Systems
C.-C. Hui, M. Hamdi, and I. Ahmad
Distributed systems such as networks of workstations are becoming
an increasingly viable alternative to traditional supercomputer
systems for running complex scientific applications. A
large number of these applications require solving a set of partial
differential equations (PDEs). In this paper, we describe the
implementation and performance of SPEED (Scalable Partial Differential
Equation Environment on Distributed systems), a parallel platform which
provides an efficient solution for time-dependent PDEs. SPEED allows
the inclusion of a wide range of parameters and programming aids. PVM
is employed as the underlying message-passing system. The parallel
implementation has been performed using two algorithms. The first
algorithm is a two-phase scheme which uses the conventional technique of
alternating phases of computation and communication. The second
algorithm employs a pre-computation technique that allows overlapping
of computation and communication. Both methods yield significant
speedups. The pre-computation technique reduces the communication time
between the workstations, but incurs additional overhead in buffer
management. Hence, if the saving in communication time is larger than
the overhead, the pre-computation technique outperforms the two-phase
algorithm. SPEED also provides a performance prediction methodology
that can accurately predict the performance of a given application on
the system before running the application. This methodology allows
the user to tune various parameters in order to identify system
bottlenecks and maximize the performance.
Short Code: [Hui:96a]
Reference: Vol. 8, No. 7, pp. 537--568 (C238)