Concurrency and Computation: Practice and Experience
Grid Computing Computational Environments Special Issue

This homepage supports referees. Related resources:
- Instructions for the Grid Computing Computational Environments Special Issue
- Standard Referee Form
- Journal Vision and Instructions for Formatting etc.
- Wiley Journal Home Page
- Editorial Board
- CiteSeer Citation Finder from NEC Research
Below you will find active abstracts.
Please send email to Geoffrey Fox,
fox@csit.fsu.edu, if you wish to referee any article.
We will send you a link to the full text online.
Articles under Consideration for the Journal
- C467: VGDS: A Distributed Data Structure Framework for
Scientific Computation
- Abstract: This paper gives an overview of the VGDS (Virtual Global Data Structure) project. The VGDS effort focuses on developing an integrated, distributed environment that allows fast prototyping of a diverse set of simulation problems in irregular scientific and engineering domains, focusing on computations with irregular and adaptive structures. The framework defines two base libraries, unstructured mesh and adaptive tree, that capture the major data structures involved in irregular scientific computation. The framework defines multiple layers of class libraries which work together to provide data-parallel representations to application developers while encapsulating parallel implementation details in lower layers of the framework. The layered approach enables easy extension of the base libraries to a variety of application-specific data structures. Experimental results on a network of workstations are reported.
- Pangfeng Liu, Jan-Jan Wu
- mailto:wuj@iis.sinica.edu.tw, mailto:pangfeng@cs.ccu.edu.tw
- Submitted April 21, 2000
- Comments To Authors October 1 2000
- C468: cJVM: A Cluster JVM Architecture for Single System
Image
- Abstract: cJVM is a Java Virtual Machine (JVM) which provides a single system image of a traditional JVM while executing in a distributed fashion on the nodes of a cluster. cJVM virtualizes the cluster, supporting any pure Java application without requiring that applications be tailored specifically for it. The aim of cJVM is to obtain improved scalability for a class of Java server applications by distributing the application's work among the cluster's computing resources. cJVM is based on a novel object model which distinguishes between an application's view of an object (e.g., every object is a unique data structure) and its implementation (e.g., objects may have consistent replications on different nodes). This enables us to exploit knowledge of the usage of individual objects to improve performance (e.g., using object replications to increase locality of access to objects). We have completed a prototype which runs pure Java applications on a cluster of NT workstations connected via a Myrinet fast switch. The prototype provides a single system image to applications, distributing the application's threads and objects over the cluster. We have used cJVM to run, without change, a real Java server application containing over 10K LOC and achieved high scalability for it on a cluster. We also achieved linear speedup for another application with a large number of independent threads. This paper discusses cJVM's architecture and implementation. It focuses on achieving a single system image for a traditional JVM on a cluster, while briefly describing how we aim to obtain scalability. (An illustrative sketch of the view/implementation split follows this entry.)
- Yariv Aridor, Michael Factor and Avi Teperman
- mailto:teperman@il.ibm.com
- Submitted May 1, 2000
- Comments To Authors October 1 2000; Accepted November 23
2000
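A minimal Java sketch of the object-model idea described in the C468 abstract above. This is illustrative only: the class names and the caching scheme are invented here, not taken from cJVM, but they show how an application's view of a single logical object can be backed by a master copy on one node plus node-local proxies.

    // Illustrative only -- invented names, not cJVM's object model or API.
    // The application sees one logical object; the cluster-side implementation
    // keeps a master copy on one node and node-local proxies that cache reads
    // for locality and forward writes to the master.
    interface RemoteCell {
        int get();
        void set(int v);
    }

    // Master copy, owned by a single node.
    class MasterCell implements RemoteCell {
        private int value;
        public synchronized int get() { return value; }
        public synchronized void set(int v) { value = v; }
    }

    // Node-local proxy: reads come from a cached replica; writes go to the
    // master (the consistency protocol is elided in this sketch).
    class ProxyCell implements RemoteCell {
        private final MasterCell master;    // stands in for a remote reference
        private volatile int cachedValue;

        ProxyCell(MasterCell master) { this.master = master; }

        public int get() { return cachedValue; }
        public void set(int v) { master.set(v); cachedValue = v; }
    }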
- C474: Parallel solution of rotating flows in cavities
- Abstract: In this paper, we investigate the parallel solution of rotating internal flow problems, using the Navier-Stokes equations as proposed in [16] and [15]. A Runge-Kutta time-stepping scheme was applied to the equations, and both sequential and message-passing implementations were developed, the latter using MPI, and were tested on an SGI Origin200 distributed, global shared memory parallel computer. The results show that our approach to parallelizing the sequential implementation requires little effort whilst providing good results, even for medium-sized problems, on this particular computer.
- Rudnei Dias da Cunha and Alvaro Luiz de Bortoli
- mailto:rudnei@mat.ufrgs.br
- Submitted May 14, 2000
- Comments To Authors October 1 2000 and Accepted October 9
2000
- C475: Analysis and Measurement of the Effect of Kernel
Locks in SMP Systems
- Abstract: This paper reports the use of case studies to evaluate the performance degradation caused by the kernel-level lock. We define the lock ratio as the ratio of the execution time spent in critical sections to the total execution time of a parallel program. The kernel-level lock ratio determines how effectively programs work on Symmetric MultiProcessor (SMP) systems. We have measured the lock ratios and the performance of three types of parallel programs on SMP systems with Linux 2.0: matrix multiplication, parallel make, and WWW server programs. Experimental results show that the higher the lock ratio of parallel programs, the worse their performance becomes. (A worked restatement of the lock ratio follows this entry.) Keywords: SMP Systems, Operating Systems, Parallel Programs, Performance Evaluation, Kernel Lock
- Akihiro Kaieda and Yasuichi Nakayama; Atsuhiro Tanaka, Takashi
Horikawa, Toshiyasu Kurasugi and Issei Kino
- mailto:yasu@cs.uec.ac.jp
- Submitted May 24, 2000
- Accepted October 1 2000
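The lock ratio defined in the C475 abstract can be restated as a formula (the symbols below are chosen here for illustration):

    \[
      R_{\mathrm{lock}} \;=\; \frac{T_{\mathrm{critical}}}{T_{\mathrm{total}}}
    \]

where T_critical is the execution time spent inside kernel critical sections (i.e., holding the kernel lock) and T_total is the total execution time of the parallel program. For example, a run that spends 0.2 s of a 1.0 s execution holding the kernel lock has a lock ratio of 0.2; the abstract's finding is that performance worsens as this ratio grows.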
- C476: Effective Multicast Programming in Large Scale Distributed Systems: The DACE Approach
- Abstract: Many distributed applications have a strong requirement for efficient dissemination of large amounts of information to widely spread consumers in large networks. These include applications in e-commerce and telecommunication. Publish/subscribe is considered one of the most important interaction styles to model communication at large scale. Producers publish information for a topic and consumers subscribe to the topics they wish to be informed of. The decoupling of producers and consumers in time and in space makes the publish/subscribe paradigm very attractive for large scale distribution, especially in environments like the Internet. This paper describes the architecture and implementation of DACE (Distributed Asynchronous Computing Environment), a framework for publish/subscribe communication based on an object-oriented programming abstraction in the form of Distributed Asynchronous Collections (DACs). DACs capture the different variations of publish/subscribe, without blurring their respective advantages. The architecture we present is tolerant to network partitions and crash failures. The underlying model is based on the notion of Topic Membership: a weak membership for the parties involved in a topic. We present how Topic Membership enables the realization of a robust and efficient reliable multicast for large scale. The protocol ensures that, inside a topic, even a subscriber that is temporarily partitioned away eventually receives a published message. (A generic publish/subscribe sketch follows this entry.)
- Romain Boichat, Patrick Th. Eugster, Rachid Guerraoui, Joe
Sventek
- Swiss Federal Institute of Technology, Lausanne and Agilent
Laboratories Scotland, Edinburgh
- Email:Patrick.Eugster@lsesun6.epfl.ch
- Submitted July 17, 2000
- Comments to Author November 23 2000
- Accepted February 3 2001
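A generic topic-based publish/subscribe sketch in Java, mirroring the interaction style the C476 abstract describes. The names and interfaces are invented for illustration and are not the DACE/DAC API; in particular, the reliable multicast and Topic Membership machinery of the paper is reduced here to a simple local dispatch.

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.CopyOnWriteArrayList;

    // Producers and consumers are decoupled in time and space: a publisher
    // never learns who the subscribers of a topic are.
    interface Subscriber<T> {
        void notify(String topic, T event);
    }

    class TopicBroker<T> {
        private final Map<String, List<Subscriber<T>>> subscribers = new ConcurrentHashMap<>();

        public void subscribe(String topic, Subscriber<T> s) {
            subscribers.computeIfAbsent(topic, t -> new CopyOnWriteArrayList<>()).add(s);
        }

        public void publish(String topic, T event) {
            // A real implementation would use a reliable multicast that survives
            // crashes and partitions; this sketch simply dispatches locally.
            for (Subscriber<T> s : subscribers.getOrDefault(topic, List.of())) {
                s.notify(topic, event);
            }
        }
    }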
- C486: Parallel Versions of Stone's Strongly Implicit (SIP) Algorithm
- Abstract: In this paper, we describe various methods of deriving a parallel version of Stone's Strongly Implicit Procedure (SIP) for solving sparse linear equations arising from finite difference approximation to partial differential equations (PDEs). Sequential versions of this algorithm have been very successful in solving semiconductor, heat conduction and flow simulation problems, and an efficient parallel version would enable much larger simulations to be run. An initial investigation of various parallelising strategies was undertaken using a version of High Performance Fortran (HPF), and the best methods were reprogrammed using the MPI message passing libraries for increased efficiency. Early attempts concentrated on developing a parallel version of the characteristic wavefront computation pattern of the existing sequential SIP code. However, a red-black ordering of grid points, similar to that used in parallel versions of the Gauss-Seidel algorithm, is shown to be far more efficient. The results of both the wavefront and red-black MPI-based algorithms are reported for various problem sizes and numbers of processors on a sixteen-node IBM SP2. (A red-black ordering sketch follows this entry.)
- J.S. Reeve, A.D. Scurr and J.H. Merlin
- Department of Electronics and Computer Science, University of
Southampton
- Email:jsr@ecs.soton.ac.uk
- Submitted August 24, 2000
- Comments to Authors February 4 2001
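A red-black ordering sketch in Java, illustrating why the ordering mentioned in the C486 abstract parallelizes well. The update below is a Gauss-Seidel-style 5-point stencil chosen for brevity, not Stone's SIP factorization itself; the point is that all points of one colour depend only on points of the other colour, so each colour can be updated fully in parallel, unlike the wavefront order.

    // Grid points with (i + j) % 2 == 0 are "red", the rest "black".
    class RedBlackSweep {
        static void sweep(double[][] u, double[][] f, double h) {
            int n = u.length;                                  // square grid assumed
            for (int colour = 0; colour < 2; colour++) {       // 0 = red, 1 = black
                for (int i = 1; i < n - 1; i++) {
                    for (int j = 1; j < n - 1; j++) {
                        if ((i + j) % 2 != colour) continue;
                        // 5-point stencil update for the Poisson problem -Laplace(u) = f
                        u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                        + u[i][j - 1] + u[i][j + 1]
                                        + h * h * f[i][j]);
                    }
                }
                // An MPI version would exchange halo rows/columns of the colour
                // just updated with neighbouring ranks here.
            }
        }
    }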
- C487: Real-time Multi-spectral Image Fusion
- Abstract:This paper describes a novel real-time
multi-spectral imaging capability for surveillance applications. The capability
combines a new high-performance multi-spectral camera system with a distributed
algorithm that computes a spectral-screening Principal Component Transform
(PCT). The camera system uses a novel filter wheel design together with a
high-bandwidth CCD camera to allow image cubes to be delivered at 110 frames
per second with spectral resolution between 400 and 1000 nm. The filters used
in a particular application are selected to highlight a particular object based
on its spectral signature. The distributed algorithm allows image streams from
a dispersed collection of cameras to be disseminated, viewed, and interpreted
by a distributed group of analysts in real-time. It operates on networks of
commercial-off-the-shelf multiprocessors connected with high-performance (e.g.
gigabit) networking, taking advantage of multi-threading where appropriate. The
algorithm uses a concurrent formulation of the PCT to de-correlate and compress
a multi-spectral image cube. Spectral screening is used to give features that
occur infrequently (e.g. mechanized vehicles in a forest) equal importance to
those that occur frequently (e.g. trees in the forest). A human-centered
color-mapping scheme is used to maximize the impact of spectral contrast on the
human visual system. To demonstrate the efficacy of the multi-spectral system,
plant-life scenes with both real and artificial foliage are used. These scenes
demonstrate the system's ability to distinguish elements of a scene, based on
spectral contrast, that cannot be distinguished with the naked eye. The
capability is evaluated in terms of visual performance, scalability, and
real-time throughput. Our previous work on predictive analytical modeling is
extended to answer practical design questions such as: for a specified cost, what system should be constructed and what performance will it attain?
- Tiranee Achalakul, Stephen Taylor
- Department of Computer Science, Syracuse University
- Email:tachalak@syr.edu
- Submitted September 14, 2000
- C488: Efficient Communication Using Message Prediction for Clusters of Multiprocessors
- Abstract: With the increasing uniprocessor and SMP computation power available today, interprocessor communication has become an important factor that limits the performance of clusters of workstations. Many factors, including communication hardware overhead, communication software overhead, and the user environment overhead (multithreading, multiuser), affect the performance of the communication subsystems in such systems. A significant portion of the software communication overhead is due to message copying. Ideally, it is desirable to have a true zero-copy protocol where the message is moved directly from the send buffer in its user space to the receive buffer in the destination without any intermediate buffering. However, because message-passing applications at the send side do not know the final receive buffer addresses, early-arriving messages have to be buffered in a temporary area. In this paper, we show that there is a message reception communication locality in message-passing applications. We have utilized this communication locality and devised different message predictors at the receiver sides of communications. In essence, these message predictors can be efficiently used to drain the network and cache the incoming messages even if the corresponding receive calls have not been posted yet. The performance of these predictors, in terms of hit ratio, on some parallel applications is quite promising and suggests that prediction has the potential to eliminate most of the remaining message copies. We also show that the proposed predictors are not sensitive to the starting message reception call, and that they perform better than (or at least equal to) our previously proposed predictors in [3]. (A simple predictor sketch follows this entry.)
- Ahmad Afsahi, Nikitas J. Dimopoulos
- Queen's University and University of Victoria, Canada
- Email:ahmad@ee.queensu.ca
- Submitted September 14, 2000
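A simple receiver-side message predictor sketch in Java, illustrating the kind of mechanism the C488 abstract describes. This is not one of the paper's predictors: it just remembers, per observed message, which (source, tag) signature followed it last time and predicts that the same one will follow again; a messaging layer could use such a prediction to set aside a buffer and cache an early-arriving message before the matching receive call is posted.

    import java.util.HashMap;
    import java.util.Map;

    class MessagePredictor {
        record Signature(int source, int tag) {}

        private final Map<Signature, Signature> successor = new HashMap<>();
        private Signature last;
        private long hits, total;

        // Called whenever a message reception is observed.
        void observe(Signature current) {
            if (last != null) {
                total++;
                if (current.equals(successor.get(last))) hits++;
                successor.put(last, current);   // learn / update the pattern
            }
            last = current;
        }

        // Prediction for the next expected message, or null if unknown.
        Signature predictNext() {
            return last == null ? null : successor.get(last);
        }

        // The hit ratio used in the abstract to evaluate predictors.
        double hitRatio() {
            return total == 0 ? 0.0 : (double) hits / total;
        }
    }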
- C489: Implementing a Dynamic Processor Allocation Policy for Multiprogrammed Parallel Applications in the Solaris™ Operating System
- Abstract: Parallel applications typically do not perform well in a multiprogrammed environment that uses time-sharing to allocate processor resources to the applications' parallel threads. Coscheduling related parallel threads, or statically partitioning the system, often can reduce the applications' execution times, but at the expense of reducing the overall system utilization. To address this problem, there has been increasing interest in dynamically allocating processors to applications based on their resource demands and the dynamically varying system load. The Loop-Level Process Control (LLPC) policy [16] dynamically adjusts the number of threads an application is allowed to execute based on the application's available parallelism and the overall system load. This study demonstrates the feasibility of incorporating the LLPC strategy into an existing commercial operating system and parallelizing compiler, and provides further evidence of the performance improvement that is possible using this dynamic allocation strategy. In this implementation, applications are automatically parallelized and enhanced with the appropriate LLPC hooks so that each application interacts with the modified version of the Solaris operating system. The parallelism of the applications is then adjusted dynamically and automatically when they are executed in a multiprogrammed environment so that all applications obtain a fair share of the total processing resources. (A simplified allocation-policy sketch follows this entry.)
- Kelvin K. Yue and David J. Lilja
- Sun Microsystems and University of Minnesota
- Email:lilja@ece.umn.edu
- Submitted October 14, 2000
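A much-simplified sketch, in Java, of the kind of decision a dynamic processor allocation policy makes. This is not the actual LLPC implementation described in the C489 abstract (which lives inside the Solaris kernel and the parallelizing compiler); it only illustrates the policy's intent: bound each application's thread count by both its available loop-level parallelism and a fair share of the processors under the current system load.

    class DynamicAllocationPolicy {
        private final int totalProcessors;

        DynamicAllocationPolicy(int totalProcessors) {
            this.totalProcessors = totalProcessors;
        }

        // availableParallelism: maximum useful threads for the next parallel loop.
        // runningApplications: parallel applications currently active on the system.
        int threadsToUse(int availableParallelism, int runningApplications) {
            int fairShare = Math.max(1, totalProcessors / Math.max(1, runningApplications));
            return Math.max(1, Math.min(availableParallelism, fairShare));
        }
    }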
- C490: Lesser Bear: A Light-weight Process Library for SMP Computers
- Abstract: We have designed and implemented a light-weight process (thread) library called Lesser Bear for SMP computers. Lesser Bear has high portability and thread-level parallelism. Creating UNIX processes as virtual processors and a memory-mapped file as a huge shared-memory space enables Lesser Bear to execute threads in parallel. Lesser Bear requires exclusive operation between peer virtual processors, and treats a shared-memory space as a critical section for synchronization of threads. Therefore, thread functions of the previous Lesser Bear are serialized. In this paper, we present a scheduling mechanism to execute thread functions in parallel. In the design of the proposed mechanism, we divide the entire shared-memory space into partial spaces for virtual processors, and prepare two queues (Protect Queue and Waiver Queue) for each partial space. We adopt an algorithm in which lock operations are not necessary for enqueueing. This algorithm allows us to propose a scheduling mechanism that can reduce the scheduling overhead. The mechanism is applied to Lesser Bear and evaluated by experimental results. (A two-queue scheduling sketch follows this entry.)
- Hisashi Oguma and Yasuichi Nakayama
- The University of Electro-Communications, Tokyo
- Email:oguma-h@igo.cs.uec.ac.jp
- Submitted October 24, 2000
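A two-queue scheduling sketch in Java, loosely modelled on the Protect Queue / Waiver Queue split described in the C490 abstract. It does not reproduce Lesser Bear's lock-free enqueue algorithm; the Waiver Queue here is simply a java.util.concurrent queue. The structural idea shown is that the owning virtual processor schedules from its private Protect Queue without locking, while peer virtual processors hand threads over through the Waiver Queue, which the owner drains before dequeuing.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    class VirtualProcessorRunQueue<T> {
        private final Deque<T> protectQueue = new ArrayDeque<>();           // owner-only
        private final Queue<T> waiverQueue = new ConcurrentLinkedQueue<>(); // written by peers

        // Called by the owning virtual processor.
        void enqueueLocal(T thread) {
            protectQueue.addLast(thread);
        }

        // Called by any other virtual processor; no lock shared with the owner.
        void enqueueRemote(T thread) {
            waiverQueue.add(thread);
        }

        // Called by the owner: absorb handed-over threads, then pick the next one.
        T next() {
            T t;
            while ((t = waiverQueue.poll()) != null) {
                protectQueue.addLast(t);
            }
            return protectQueue.pollFirst();   // null if nothing is runnable
        }
    }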
- C492: PoLAPACK: Parallel Factorization Routines with
Algorithmic Blocking
- Abstract: LU, QR, and Cholesky factorizations are the most widely used methods for solving dense linear systems of equations, and have been extensively studied and implemented on vector and parallel computers. Most of these factorization routines are implemented with block-partitioned algorithms in order to perform matrix-matrix operations, that is, to obtain the highest performance by maximizing reuse of data in the upper levels of memory, such as cache. Since parallel computers have different performance ratios of computation and communication, the optimal computational block sizes differ from machine to machine in order to generate the maximum performance of an algorithm. Therefore, the data matrix should be distributed with the machine-specific optimal block size before the computation. Too small or too large a block size makes getting good performance on a machine nearly impossible. In such a case, getting better performance may require a complete redistribution of the data matrix. In this paper, we present parallel LU, QR, and Cholesky factorization routines with "algorithmic blocking" on a 2-dimensional block-cyclic data distribution. With algorithmic blocking, it is possible to obtain near-optimal performance irrespective of the physical block size. The routines are implemented on the Intel Paragon and the SGI/Cray T3E and compared with the corresponding ScaLAPACK factorization routines. (The standard block-cyclic mapping is restated after this entry.)
- Jaeyoung Choi
- School of Computing, Soongsil University, 1-1 Sangdo-Dong, Dongjak-Ku, Seoul 156-743, KOREA
- Email:choi@comp.soongsil.ac.kr
- Submitted November 14, 2000
- Comments to Authors February 4 2001
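For reference, the physical data distribution the C492 abstract refers to is the standard 2-dimensional block-cyclic mapping used by ScaLAPACK-style routines (restated here; this is the distribution itself, not PoLAPACK's algorithmic blocking). With 0-based indices, entry (i, j) of a global matrix distributed with block size n_b over a P x Q process grid lives on process (p, q), in local block (l, m), at offset (x, y):

    \begin{aligned}
      p &= \lfloor i / n_b \rfloor \bmod P,  &  q &= \lfloor j / n_b \rfloor \bmod Q,\\
      l &= \lfloor i / (P\,n_b) \rfloor,     &  m &= \lfloor j / (Q\,n_b) \rfloor,\\
      x &= i \bmod n_b,                      &  y &= j \bmod n_b.
    \end{aligned}

The algorithmic blocking of the paper decouples the block size used inside the factorization from this physical n_b, which is how near-optimal performance can be obtained irrespective of how the data happen to be distributed.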
- C493: Space-Time Tradeoffs for Parallel 3D Reconstruction
Algorithms for Virus Structure Determination
- Abstract:The 3D electron density determination of viruses,
from experimental data provided by electron microscopy, is a data intensive
computation that requires the use of clusters of PCs or parallel computers. In
this paper we report on three parallel algorithms used for 3D reconstruction of asymmetric objects from their 2D projections. We discuss their
computational, communication, I/O, and space requirements and present some
performance data. The algorithms are general and can be used for 3D
reconstruction of asymmetric objects for applications other than structural
biology.
- Dan C. Marinescu, Yongchang Ji, and Robert E. Lynch
- Department of Computer Sciences Purdue University West Lafayette,
IN 47907
- Email:[dcm, ji,
rel]@cs.purdue.edu
- Submitted November 17, 2000
- C494: A New Parallel Domain Decomposition Method for the
Adaptive Finite Element Solution of Elliptic Partial Differential Equations
- Abstract:We present a new domain decomposition algorithm
for the parallel finite element solution of elliptic partial differential
equations. As with most parallel domain decomposition methods, each processor is assigned one or more subdomains, and an iteration is devised which allows the processors to solve their own subproblem(s) concurrently. The novel feature of this algorithm, however, is that each of these subproblems is defined over the
entire domain -- although the vast majority of the degrees of freedom for each
subproblem are associated with a single subdomain (owned by the corresponding
processor). This ensures that a global mechanism is contained within each of
the subproblems tackled and so no separate coarse grid solve is required in
order to achieve rapid convergence of the overall iteration. Furthermore, by
following the paradigm introduced in [5], it is demonstrated that this domain
decomposition solver may be coupled easily with a conventional mesh refinement
code, thus allowing the accuracy, reliability and efficiency of mesh adaptivity
to be utilized in a well load-balanced manner. Finally, numerical evidence is
presented which suggests that this technique has significant potential, both in
terms of the rapid convergence properties and the efficiency of the parallel
implementation.
- Randolph E. Bank and Peter K. Jimack
- Dept. of Mathematics, University of California at San Diego, La Jolla, CA 92093, USA. School of Computer Studies, University of Leeds, Leeds LS2 9JT, UK.
- Email: pkj@comp.leeds.ac.uk
- Received 19 May 2000; Accepted December 7 2000
- C495: Development and Performance Analysis of Real-World
Applications for Distributed and Parallel Architectures
- Abstract:Several large real-world applications have been
developed for distributed and parallel architectures. We examine two different
program development approaches: First, the usage of a high-level programming
paradigm which reduces the time to create a parallel program dramatically but
sometimes at the cost of reduced performance. A source-to-source compiler has been employed to automatically compile programs written in a
high-level programming paradigm into message passing codes. Second,
manual program development by using a low-level programming paradigm
such as message passing enables the programmer to fully exploit a given
architecture at the cost of a time-consuming and error-prone effort.
Performance tools play a central role to support the performance-oriented
development of applications for distributed and parallel architectures. SCALA, a portable instrumentation, measurement, and post-execution performance analysis system for distributed and parallel programs, has been used to
analyse and to guide the application development by selectively instrumenting
and measuring the code versions, by comparing performance information of
several program executions, by computing a variety of important performance
metrics, by detecting performance bottlenecks, and by relating performance
information back to the input program. We show several experiments of SCALA
when applied to real-world applications. These experiments are conducted on a NEC Cenju-4 distributed-memory machine and a cluster of heterogeneous
workstations and networks.
- T. Fahringer, P. Blaha, A. Hossinger, J. Luitz, E. Mehofer, H.
Moritsch, B. Scholz
- Institute for Software Science, University of Vienna
Liechtensteinstrasse 22, A-1092, Vienna, Austria. Also Vienna University of
Technology and Dept. of Business administration at University of Vienna
- Email: tf@par.univie.ac.at
- Received August 1 1999; Revised November 20 2000; Accepted December 18 2000
- C496: Open Consensus
- Abstract: This paper presents the abstraction of open consensus and argues for its use as an effective component for building reliable agreement protocols in practical asynchronous systems where processes and links can crash and recover. The specification of open consensus has a decoupled, on-demand and re-entrant behaviour that makes its use very efficient, especially in terms of forced logs, which are known to be major sources of overhead in distributed systems. We illustrate the use of open consensus as a basic building block to develop a modular, yet efficient, total order broadcast protocol. Finally, we describe our Java implementation of our open consensus abstraction and we convey our efficiency claims through some practical performance measures. (A textbook-style consensus-to-total-order sketch follows this entry.)
- Romain Boichat, Svend Frølund, Rachid Guerraoui
- Swiss Federal Institute of Technology, CH 1015, Lausanne and
Hewlett-Packard Laboratories, 1501 Page Mill Rd, Palo Alto
- Email: Romain.Boichat@epfl.ch
- Received January 17 2001
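A textbook-style sketch, in Java, of reducing total order broadcast to a sequence of consensus instances, the building-block use that the C496 abstract describes. This is not the paper's open consensus protocol and omits its on-demand/re-entrant behaviour, forced-log optimizations, and crash-recovery handling; it only shows the shape of the reduction.

    import java.util.ArrayList;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;

    // Hypothetical consensus interface: all correct processes decide the same
    // value for a given instance number k.
    interface Consensus {
        List<String> propose(int k, List<String> value);
    }

    class TotalOrderBroadcast {
        private final Consensus consensus;
        private final Set<String> pending = new LinkedHashSet<>();
        private final Set<String> delivered = new LinkedHashSet<>();
        private int instance = 0;

        TotalOrderBroadcast(Consensus consensus) { this.consensus = consensus; }

        // Called when a message is broadcast locally or received from the network.
        synchronized void receive(String msg) {
            if (!delivered.contains(msg)) pending.add(msg);
        }

        // Runs one consensus instance and returns messages in delivery order.
        synchronized List<String> deliverNextBatch() {
            List<String> decided = consensus.propose(instance++, new ArrayList<>(pending));
            List<String> ordered = new ArrayList<>(decided);
            ordered.sort(String::compareTo);      // deterministic order within the batch
            ordered.removeAll(delivered);
            delivered.addAll(ordered);
            pending.removeAll(ordered);
            return ordered;
        }
    }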
- C497: Java for High-Performance Computing
- Abstract:There has been an increasing research interest in
extending the use of Java towards high-performance demanding applications such
as scalable web servers, multimedia applications, and large-scale scientific
applications. However, given the low performance of current Java
implementations, these application domains pose new challenges to both the
application designer and systems developer. In this paper we describe and
classify several important proposals and environments that tackle Java's
performance bottlenecks in order to make the language an effective option for
high-performance computing. We further survey the most significant performance
issues while exposing the potential benefits and limitations of current
solutions in such a way that a framework for future research efforts can be
established. We show that most of the proposed solutions can be classified
according to some combination of three basic parameters: the model adopted
for inter-process communication, language extensions, and the implementation
strategy. In addition, we examine other relevant issues, such as
interoperability, portability, and garbage collection.
- M. Lobosco, C. Amorim, and O. Loques
- Instituto de Computação, Universidade Federal Fluminense, Niterói, Rio de Janeiro, Brazil
- Email: loques@ic.uff.br
- Received January 15 2001