Basic Concurrency: Practice and Experience Papers
Standard Referee Form
- C458: PFPC: An Infrastructure for Research on Parallelizing
Compilers
- Abstract: Parallelizing compilers analyze sequential programs, in
particular their loops, to detect hidden parallelism and automatically
restructure sequential programs into parallel subtasks executed on
multiprocessors. This paper describes the design and implementation of an
efficient and precise parallelizing compiler that parallelizes loops and
achieves high speedup rates on multiprocessor systems. The execution
efficiency of a loop can be enhanced if the loop is executed in parallel or
partially in parallel, as a DOALL or DOACROSS loop. This paper also reports
on a practical parallel loop detector (PPD), implemented in PFPC, for finding
the parallelism in loops. The PPD can extract the potential DOALL and
DOACROSS loops in a program by verifying array subscripts. In addition, a new
model using a knowledge-based approach is proposed to exploit more loop
parallelism. The knowledge-based approach integrates existing loop
transformations and loop scheduling algorithms to make good use of their
ability to extract loop parallelism. Two rule-based systems, called KPLT
and IPLS, are then developed using repertory grid analysis and attribute
ordering tables, respectively, to construct the knowledge bases. Finally, a
runtime technique based on the inspector/executor scheme is proposed
for finding available parallelism in loops. Our inspector can determine the
wavefronts of a loop with any complex indirect array indexing pattern by
building a DEF-USE table. The inspector is fully parallel, without any
synchronization. Experimental results show that the new method can handle
complex data dependence patterns that could not be handled by previous
research. As an ultimate goal, a high-performance and portable FORTRAN
parallelizing compiler for shared-memory multiprocessors will be constructed.
We believe that our research may provide more insight into the development of
high-performance parallelizing compilers.
- Chao-Tung Yang, Shian-Shyong Tseng, Yun-Woei Fann, Ting-Ku Tsai, Ming-Huei
Hsieh, Cheng-Tien Wu
- ROCSAT Ground Section, National Space Program Office, Hsinchu, Taiwan 300,
ROC and Dept. of Computer & Information Science, National Chiao Tung
University, Hsinchu, Taiwan 300, ROC
- Submitted November 18, 1999
- mailto:ctyang@nspo.gov.tw or
mailto:sstseng@cis.nctu.edu.tw
- C458ctyang/c458ctyang.pdf
- Submit Info:C458ctyang
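The inspector/executor idea in this abstract can be illustrated with a small sequential sketch. This is a generic, simplified illustration of wavefront computation for a loop with indirect array indexing, not the paper's parallel, synchronization-free inspector; the function and variable names are ours.

```python
def compute_wavefronts(writes, reads):
    """Toy inspector for a loop like `a[w[i]] = f(a[r[i]])` with indirect
    indexing: assign each iteration a wavefront number so that iterations in
    the same wavefront are independent and may run in parallel, while
    successive wavefronts respect flow, anti, and output dependences."""
    last_write_wf = {}  # array element -> wavefront of the last write to it
    last_read_wf = {}   # array element -> latest wavefront that read it
    wavefronts = []
    for w, r in zip(writes, reads):
        wf = 1 + max(
            last_write_wf.get(r, 0),  # flow dependence: read after write
            last_write_wf.get(w, 0),  # output dependence: write after write
            last_read_wf.get(w, 0),   # anti dependence: write after read
        )
        wavefronts.append(wf)
        last_write_wf[w] = wf
        last_read_wf[r] = max(last_read_wf.get(r, 0), wf)
    return wavefronts
```

If every iteration lands in wavefront 1, the loop is a DOALL loop; multiple wavefronts correspond to DOACROSS-style staged execution.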
- C459: A Parallel Viterbi Decoding Algorithm
- Abstract: In this paper we express the Viterbi algorithm as a
matrix-vector reduction in which multiplication is replaced by addition and
addition by minimisation. The resulting algorithm is then readily parallelised
in a form suitable for implementation on a systolic processor array. We
describe the algorithm for BCH codes, which have a task graph with valence
restricted to four inputs and four outputs. The method is also applicable to
convolutional codes, but the complexity of the task graph increases with the
number of input bits for these codes. Results for BCH codes are given for two
general-purpose parallel machines, an IBM SP2 and a Meiko CS2.
- Dr. J.S. Reeve, Concurrent Computation Group, Department of Electronics and
Computer Science, University of Southampton, Southampton SO17 1BJ, UK
- mailto:jsr@ecs.soton.ac.uk
- Submitted November 24, 1999
- http://www.ecs.soton.ac.uk/~jsr
- C459reeve/c459reeve.pdf
- Submit Info: C459reeve
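The matrix-vector reduction described in this abstract can be sketched in a few lines: path metrics are updated by a matrix-vector product in the (min, +) semiring, where multiplication becomes addition and addition becomes minimisation. This is a generic illustration of the semiring formulation, not the paper's systolic-array implementation; the function names are ours.

```python
INF = float("inf")

def min_plus_matvec(B, d):
    """One Viterbi step as a matrix-vector product in the (min, +) semiring:
    B[i][j] is the branch metric for a transition from state j to state i
    (INF if no such transition), and d[j] is the current path metric of j."""
    n = len(d)
    return [min(B[i][j] + d[j] for j in range(n)) for i in range(n)]

def viterbi_path_metrics(branch_metrics, d0):
    """Fold the per-symbol branch-metric matrices over the initial state
    metrics. Each output row of the min-plus product is independent, which
    is what makes the step straightforward to parallelise."""
    d = d0
    for B in branch_metrics:
        d = min_plus_matvec(B, d)
    return d
```

Recovering the surviving path additionally requires recording, per state, the argmin of each minimisation; the sketch above computes only the metrics.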
- C460: Inverse Toeplitz Eigenproblem on Personal Computer
Networks
- Abstract:
- J.M. Badia, A.M. Vidal, Dept. of Computer Science, Univ. Jaume I, 12071
Castellón, Spain
- mailto:jbadia@inf.uji.es
- Submitted September 21, 1999
- No URL
- No electronic version
- Submit Info: C460balia
- C461: Emmerald: A fast matrix-matrix multiply using INTEL SIMD
technology
- Abstract:
- Mr. Douglas Aberdeen, Computer Sciences Laboratory, Research School of
Information Science and Engineering, Australian National University, Canberra,
A.C.T. 2601, Australia
- mailto:daa@csl.anu.edu.au
- Submitted September 16, 1999
- No URL
- No electronic version
- Submit Info: C461aberdeen
- C462: Experiences from Integrating Algorithmic and Systemic Load
Balancing Strategies
- Abstract: Load balancing increases the efficient use of existing
resources for parallel and distributed applications. At a coarse level of
granularity, advances in runtime systems for parallel programs have been
proposed in order to control available resources as efficiently as possible by
utilizing idle resources and using task migration. Simultaneously, at a finer
granularity level, advances in algorithmic strategies for dynamically balancing
computational loads by data redistribution have been proposed in order to
respond to variations in processor performance during the execution of a given
parallel application. Combining strategies from each level of granularity can
result in a system which delivers advantages of both. The resulting integration
is systemic in nature and transfers the responsibility of efficient resource
utilization from the application programmer to the runtime system. This paper
presents the design and implementation of a system that combines an algorithmic
fine-grained data parallel load balancing strategy with a systemic
coarse-grained task-parallel load balancing strategy, and reports on recent
experimental results of running a computationally intensive scientific
application under this integrated system. The experimental results indicate
that a distributed runtime environment which combines both task and data
migration can provide performance advantages with little overhead. The paper
also presents proposals for performance enhancements of the implementation, as
well as directions for future work on effective resource management.
- Ioana Banicescu, Samuel H. Russ, Sheikh Ghafoor, Vijay Velusamy and Mark
Bilderback
- mailto:ioana@cs.msstate.edu
- Submitted December 17, 1999
- URL: http://www.cs.msstate.edu/ioana
- Electronic version:C462banicescu/pc_98.pdf
- Submit Info: C462banicescu
- C464: A Survey of Concurrent Object-Oriented Languages
- Abstract: During the last decade object-oriented programming has
grown from marginal influence into widespread acceptance. During the same
period, progress in hardware and networking has changed the computing
environment from sequential to parallel. Multi-processor workstations and
clusters are now quite common. Innumerable proposals have been made to combine
both developments. The prime objective has always been to provide the
advantages of object-oriented software design at the increased power of
parallel machines. However, combining both concepts has proven to be
notoriously difficult. Depending on the approach, key characteristics of
either the object-oriented paradigm or key performance factors of parallelism
are often sacrificed, resulting in unsatisfactory languages. This survey first
recapitulates well-known characteristics of both the object-oriented paradigm
and parallel programming, and then marks out the design space of possible
combinations by identifying various interdependencies of key concepts. The
design space is then filled with data points: for 111 proposed languages we
provide brief characteristics and feature tables. The feature tables, the
comprehensive bibliography, and web addresses may help in identifying open
questions and preventing reinvention.
- Michael Philippsen
- mailto:phlipp@icsi.berkeley.edu
- Submitted February 14, 2000; Accepted March 2, 2000
- URL: http://wwwipd.ira.uka.de/~phlipp/mypapers/cool-survey.ps.gz
- Electronic version:C464mphillippsensurvey/cool-survey.pdf
- Submit Info: C464mphillippsensurvey
- C465: The Software Architecture of a Distributed Problem-Solving
Environment
- Abstract: This paper describes the functionality and software
architecture of a generic problem-solving environment (PSE) for collaborative
computational science and engineering. A PSE is designed to provide transparent
access to heterogeneous distributed computing resources, and is intended to
enhance research productivity by making it easier to construct, run, and
analyze the results of computer simulations. Although implementation details
are not discussed in depth, the role of software technologies such as CORBA,
Java, and XML is outlined. An XML-based component model is presented. The main
features of a Visual Component Composition Environment for software
development, and of an Intelligent Resource Management System for scheduling
components, are described. Some prototype implementations of PSE sub-systems
are also presented.
- D.W. Walker, M. Li, O.F. Rana, M.S. Shields, and Y. Huang
- mailto:walkerdw@ornl.gov
- Submitted March 1, 2000
- Electronic Version: C465walkermarch00/cpandeinitial.pdf
- Submit Info: C465walkermarch00
- C466: Redistribution strategies for portable parallel FFT: A case
study
- Abstract: The best approach to parallelize multidimensional FFT
algorithms has long been under debate. Distributed transposes are widely used,
but they also vary in communication policies and hence performance. In this
work we analyze the impact of different redistribution strategies on the
performance of parallel FFT, on various machine architectures. We found that
some redistribution strategies were consistently superior, while some others
were unexpectedly inferior. An in-depth investigation into the reasons for this
behavior is included in this work.
- Anshu Dubey and Daniele Tessera
- mailto:dubey@tagore.uchicago.edu
- Submitted March 6, 2000
- Submit Info: C466dubeymarch00
- Note: No Electronic Version
- C467: VGDS: A Distributed Data Structure Framework for Scientific
Computation
- Abstract: This paper gives an overview of the VGDS (Virtual Global
Data Structure) project. The VGDS effort focuses on developing an integrated,
distributed environment that allows fast prototyping of a diverse set of
simulation problems in irregular scientific and engineering domains, focusing
on computations with irregular and adaptive structures. The framework defines
two base libraries, unstructured mesh and adaptive tree, that capture the
major data structures involved in irregular scientific computation. The
framework defines multiple layers of class libraries which work together to
provide data-parallel representations to application developers while
encapsulating parallel implementation details in the lower layers of the
framework. The layered approach enables easy extension of the base libraries
to a variety of application-specific data structures. Experimental results on
a network of workstations are reported.
- Pangfeng Liu, Jan-Jan Wu
- mailto:wuj@iis.sinica.edu.tw or
mailto:pangfeng@cs.ccu.edu.tw
- Submitted April 21, 2000
- Electronic version:C467wuapril00/paper.pdf
- Submit Info: C467wuapril00
- C468: cJVM: A Cluster JVM Architecture for Single System
Image
- Abstract: cJVM is a Java Virtual Machine (JVM) which provides a
single system image of a traditional JVM while executing in a distributed
fashion on the nodes of a cluster. cJVM virtualizes the cluster, supporting any
pure Java application without requiring that applications be tailored
specifically for it. The aim of cJVM is to obtain improved scalability for a
class of Java server applications by distributing the application's work among
the cluster's computing resources. cJVM is based on a novel object model which
distinguishes between an application's view of an object (e.g., every object is
a unique data structure) and its implementation (e.g., objects may have
consistent replications on different nodes). This enables us to exploit
knowledge of the usage of individual objects to improve performance (e.g.,
using object replication to increase locality of access to objects). We have
already completed a prototype which runs pure Java applications on a cluster
of NT workstations connected via a Myrinet fast switch. The prototype provides
a single system image to applications, distributing the application's threads
and objects over the cluster. We have used cJVM to run, without change, a real
Java server application containing over 10K lines of code, and achieved high
scalability for it on a cluster. We also achieved linear speedup for another
application with a large number of independent threads. This paper discusses
cJVM's architecture and implementation. It focuses on achieving a single
system image for a traditional JVM on a cluster, while briefly describing how
we aim to obtain scalability.
- Yariv Aridor, Michael Factor and Avi Teperman
- mailto:teperman@il.ibm.com
- Submitted May 1, 2000
- Electronic version:C468teperman/cjvm.pdf
- Submit Info:C468teperman
Special Issue on Java and Message Passing
C469 to C473 Edited by Mark Baker
- C469: An open Java System for SPMD Programming
- C470: Heterogeneous Parallel Computing using Java and WMPI
- Luis M. Silva, Paulo Martins, João Gabriel Silva
- C471: Java on Networks of Workstations (JavaNOW): A Parallel
Computing Framework Inspired by Linda, Actors, and the Message Passing
Interface (MPI)
- C472: Aspects of Portability and Distributed Execution for
JNI-Wrapped Message Passing Libraries
- Vladimir S. Getov, Paul A. Gray, Vaidy S. Sunderam
- C473: MPJ: MPI-like Message Passing for Java
- Bryan Carpenter, Vladimir Getov, Glenn Judd, Anthony Skjellum and Geoffrey
Fox
- C474: Parallel solution of rotating flows in cavities
- Abstract: In this paper, we investigate the parallel solution of
rotating internal flow problems, using the Navier-Stokes equations as proposed
in [16] and [15]. A Runge-Kutta time-stepping scheme was applied to the
equations, and both sequential and message-passing implementations were
developed, the latter using MPI; both were tested on an SGI Origin200
distributed, global shared-memory parallel computer. The results show that our
approach to parallelizing the sequential implementation requires little effort
whilst providing good results, even for medium-sized problems, on this
particular computer.
- Rudnei Dias da Cunha and Alvaro Luiz de Bortoli
- mailto:rudnei@mat.ufrgs.br
- Submitted May 14, 2000
- Electronic version:C474dacunha/rotflow.pdf
- Submit Info:C474dacunha
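The Runge-Kutta time-stepping named in this abstract can be sketched generically. This is the classical fourth-order scheme for a scalar ODE, shown only to illustrate the method; it is not the paper's Navier-Stokes discretisation, and the function names are ours.

```python
def rk4_step(f, t, y, h):
    """One classical fourth-order Runge-Kutta step for dy/dt = f(t, y):
    four stage evaluations combined with weights 1/6, 2/6, 2/6, 1/6."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
```

In a flow solver the same stage structure is applied pointwise to the discretised field, which is what makes the scheme attractive for both sequential and message-passing implementations: each stage is a local update plus a boundary exchange.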
- C475: Analysis and Measurement of the Effect of Kernel Locks in
SMP Systems
- Abstract: This paper reports the use of case studies to evaluate the
performance degradation caused by kernel-level locks. We define the lock
ratio as the ratio of the execution time spent in critical sections to the
total execution time of a parallel program. The kernel-level lock ratio
determines how effectively programs work on Symmetric MultiProcessor (SMP)
systems. We have measured the lock ratios and the performance of three types
of parallel programs on SMP systems with Linux 2.0: matrix multiplication,
parallel make, and WWW server programs. Experimental results show that the
higher the lock ratio of a parallel program, the worse its performance
becomes.
- Keywords: SMP Systems, Operating Systems, Parallel Programs, Performance
Evaluation, Kernel Lock
- Akihiro Kaieda and Yasuichi Nakayama; Atsuhiro Tanaka, Takashi Horikawa,
Toshiyasu Kurasugi and Issei Kino
- mailto:yasu@cs.uec.ac.jp
- Submitted May 24, 2000
- Electronic version:C475nakayama/kaieda.pdf
- Submit Info:C475nakayama
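The lock ratio defined in this abstract is straightforward to compute from measured times. The Amdahl-style speedup bound below is our illustrative addition, not from the paper: it treats the locked fraction as fully serialised, which gives an intuition for why a higher lock ratio predicts worse SMP performance.

```python
def lock_ratio(critical_section_time, total_time):
    """Lock ratio as defined in the abstract: fraction of a parallel
    program's total execution time spent inside kernel-level critical
    sections."""
    if total_time <= 0:
        raise ValueError("total execution time must be positive")
    return critical_section_time / total_time

def amdahl_bound(r, p):
    """Illustrative upper bound on speedup on p processors if the locked
    fraction r of the execution serialises and the rest parallelises
    perfectly (a simplification; real lock contention behaves worse)."""
    return 1.0 / (r + (1.0 - r) / p)
```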