Concurrency and Computation: Practice and
Experience
- Here are the instructions for the 2002
Java Grande/ISCOPE Special Issue, due January 15 2003 -- CLOSED
- The Grid 2002: Architectures, Environments and Applications
Book and Special Issue home page --
CLOSED
This homepage supports referees.
- Standard Referee Form
- Journal Vision
- WILEY Journal Home Page
- Editorial Board
- Useful CiteSeer Citation Finder from NEC Research
Below you will find active abstracts.
Please send email to Geoffrey Fox,
gcf@indiana.edu, if you wish to referee any article.
We will send you a link to the full text online.
Articles under Consideration for the Journal,
Submitted in 2002-2003
- C588: Coordinating Components in Middleware Systems
- Abstract: Configuration and coordination are central issues
in the design and implementation of middleware systems, and are one of the
reasons why building such systems is more difficult and complex than constructing
stand-alone sequential programs. Through configuration, the structure of the
system is established: which elements it contains, where they are located,
and how they are interconnected. Coordination is concerned with the interaction
of the various components: when an interaction takes place, which parties
are involved, and what protocols are followed. Its purpose is to coordinate the
behaviour of the various components in a way that meets the overall system
specification. The open and adaptive nature of middleware systems makes the task
of configuration and coordination particularly challenging. We propose a model
that can operate in such an environment and enables the dynamic integration and
coordination of components through observation of their behaviour.
- Matthias Radestock, Susan Eisenbach
- Trans Enterprise Computer Communications, The Lee Valley
Tecnopark, Ashley Road London N17 9LN; Department of Computing, Imperial
College of Science, Technology and Medicine, London SW7 2BZ, UK
- sue@doc.ic.ac.uk
- Received January 8 2002, Comments to Authors April 18 2002,
Accepted August 6 2002
- C589: DEADLOCK DETECTION IN MPI PROGRAMS
- Abstract:Message Passing Interface (MPI) is commonly used
to write parallel programs for distributed memory parallel computers. MPI-CHECK
is a tool developed to aid in the debugging of MPI programs that are written in
free or fixed format Fortran 90 and Fortran 77. This paper presents the methods
used in MPI-CHECK 2.0 to detect many situations where actual and potential
deadlocks occur when using blocking and non-blocking point-to-point routines as
well as when using collective routines.
- Glenn Luecke, Yan Zou, James Coyle, Jim Hoekstra, Marina
Kraeva
- High Performance Computing Group, Iowa State University, Ames, Iowa
50011-2251
- grl@iastate.edu
- Received January 22 2002, Comments to Authors April 18 2002,
accepted May 29 2002
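The core condition such a tool looks for is a cyclic wait among blocking point-to-point calls. The sketch below (illustrative Python, not MPI-CHECK's implementation; the `waits_for` representation is an assumption made here) models each process by the rank it is blocked on and detects a cycle in that "waits-for" graph:

```python
# Illustrative sketch of deadlock detection via a waits-for graph; this is
# NOT MPI-CHECK's code. Each rank maps to the rank it is blocked on after
# issuing a blocking call (None if it is not blocked). A cycle in this
# graph is exactly the deadlock condition a checker must flag.

def find_deadlock(waits_for):
    """waits_for maps rank -> rank it is blocked on, or None if not blocked.
    Returns the list of ranks forming a cycle, or None if there is none."""
    for start in waits_for:
        seen = []
        rank = start
        while rank is not None and rank not in seen:
            seen.append(rank)
            rank = waits_for.get(rank)
        if rank is not None:                 # loop closed on a seen rank
            return seen[seen.index(rank):]   # the cycle itself
    return None

# Ranks 0 and 1 both issue a blocking send to each other first: each waits
# on the other -- the classic head-to-head deadlock.
print(find_deadlock({0: 1, 1: 0}))      # [0, 1]
# Rank 1 is already receiving while rank 0 sends: no cycle, no deadlock.
print(find_deadlock({0: 1, 1: None}))   # None
```

MPI-CHECK additionally has to handle potential (timing-dependent) deadlocks, non-blocking calls, and collectives, which a static waits-for snapshot like this cannot capture.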
- C590: Research Agenda for the Semantic Grid: A Future
e-Science Infrastructure
- Abstract:e-Science offers a promising vision of how
computer and communication technology can support and enhance the scientific
process. It does this by enabling scientists to generate, analyse, share and
discuss their insights, experiments and results in a more effective manner. The
underlying computer infrastructure that provides these facilities is commonly
referred to as the Grid. At this time, there are a number of grid applications
being developed and there is a whole raft of computer technologies that provide
fragments of the necessary functionality. However there is currently a major
gap between these endeavours and the vision of e-Science in which there is a
high degree of easy-to-use and seamless automation and in which there are
flexible collaborations and computations on a global scale. To bridge this
practice-aspiration divide, this report presents a research agenda whose
aim is to move from the current state of the art in e-Science infrastructure,
to the future infrastructure that is needed to support the full richness of the
e-Science vision. Here the future e-Science research infrastructure is termed
the Semantic Grid (Semantic Grid to Grid is meant to connote a similar
relationship to the one that exists between the Semantic Web and the Web). In
more detail, this document analyses the state of the art and the research
challenges that are involved in developing the computing infrastructure needed
for e-Science. In so doing, a conceptual architecture for the Semantic Grid is
presented. This architecture adopts a service oriented perspective in which
distinct stakeholders in the scientific process provide services to one another
in various forms of marketplace. The view presented in the report is holistic,
considering the requirements of e-Science and the e-Scientist at the
data/computation, information and knowledge layers. The data, computation and
information aspects are discussed from a distributed systems viewpoint and in
the particular context of the Web as an established large scale infrastructure.
A clear characterisation of the knowledge grid is also presented. This
characterisation builds on the emerging metadata infrastructure with knowledge
engineering techniques. These techniques are shown to be the key to working
with heterogeneous information and also to working with experts and
establishing communities of e-Scientists. The underlying fabric of the Grid,
including the physical layer and associated technologies, is outside the scope
of this document. Having completed the analysis, the report then makes a number
of recommendations that aim to ensure the full potential of e-Science is
realised and that the maximum value is obtained from the endeavours associated
with developing the Semantic Grid. These recommendations relate to the
following aspects:
- The research issues associated with the technical and
conceptual infrastructure of the Semantic Grid
- The research issues associated with the content
infrastructure of the Semantic Grid
- The bootstrapping activities that are necessary to ensure the
UK's grid and e-Science infrastructure is widely disseminated and
exemplified
- The human resource issues that need to be considered in order
to make a success of the UK e-Science and grid efforts
- The issues associated with the intrinsic process of
undertaking e-Science
- The future strategic activities that need to be undertaken to
maximise the value from the various Semantic Grid endeavours
- David De Roure, Nicholas Jennings and Nigel Shadbolt
- Department of Electronics and Computer Science, University of
Southampton, Southampton SO17 1BJ, UK
- dder@ecs.soton.ac.uk
- Received January 23 2002, Comments to Authors May 29 2002; paper
split up and accepted in Grid Special issue July 9 2002
- C591: Feature Extraction and Visualization of Clusters with
Parallel Processor from Large-Data Sets from Particle Simulations
- Abstract: Simulating natural phenomena at greater accuracy
results in an explosive growth of data. Large-scale simulations with
particles currently involve ensembles consisting of 10^6-10^9 particles over
10^5-10^6 timesteps. Thus, the data file produced in a single run ranges from
tens of gigabytes to hundreds of terabytes. This data bank allows one to
reconstruct the spatio-temporal evolution of both the particle system as a whole
and each particle separately. Realistically, looking at a large data set at full
resolution at all times is not necessary. We show that the agglomerative
clustering technique, based on the mutual nearest neighborhood (MNN) concept,
can be easily adapted for efficient visualization of huge data sets from
large-scale simulations with particles at different resolution levels. We
present the parallel algorithm for MNN clustering and its timings on the IBM SP
and SGI/Origin 3800 multiprocessor systems. The high efficiency obtained is
mainly due to the similarity in the algorithmic structure of MNN clustering and
particle methods. We show examples drawn from MNN application in visualization
and analysis on the order of a few hundred gigabytes of data from discrete
particle simulations, using dissipative particle dynamics and fluid particle
model. Besides material sciences, this clustering procedure can be applied to
other fields, such as earthquake events and stellar populations in nebula
clusters.
- Krzysztof Boryczko, Witold Dzwinel, David A. Yuen
- AGH Institute of Computer Science, al. Mickiewicza 30, 30-059,
Kraków, Poland; Minnesota Supercomputer Institute, University of
Minnesota, Minneapolis, Minnesota 55415-1227, USA
- dwitek@msi.umn.edu
- Received January 29 2002, Comments to Authors May 29 2002,
Accepted June 14 2002
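The mutual-nearest-neighbour merge at the heart of the technique can be shown in a few lines. The sequential toy below (2-D points and centroid distance are assumptions made here for illustration; the paper's parallel algorithm and merge criteria are not reproduced) merges clusters that are each other's nearest neighbour:

```python
# Illustrative sketch of MNN-driven agglomerative clustering; NOT the
# paper's parallel algorithm. Two clusters merge only when each is the
# other's nearest neighbour (by centroid distance here).

def dist2(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def centroid(cluster):
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

def mnn_cluster(points, target):
    """Merge mutual nearest neighbours until `target` clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > target:
        cents = [centroid(c) for c in clusters]
        # nearest neighbour of each cluster, by centroid distance
        nn = [min((j for j in range(len(clusters)) if j != i),
                  key=lambda j: dist2(cents[i], cents[j]))
              for i in range(len(clusters))]
        merged = False
        for i in range(len(clusters)):
            j = nn[i]
            if nn[j] == i and i < j:      # mutual pair: merge it
                clusters[i] = clusters[i] + clusters[j]
                del clusters[j]
                merged = True
                break
        if not merged:    # defensive guard; for a metric distance a
            break         # mutual pair always exists
    return clusters

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(sorted(len(c) for c in mnn_cluster(pts, 2)))   # [2, 2]
```

The paper's observation that MNN clustering shares its algorithmic structure with particle methods is what makes the parallel version efficient; this sketch only shows the merge rule itself.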
- C592: The Real-Time Message Passing Interface Standard
(MPI/RT-1.1)
- Abstract:The MPI/RT standard is the product of the work of
many people working in an open community standards group over a period of six
plus years. The purpose of this archival publication is to preserve the
significant knowledge and experience that was developed in real-time message
passing systems as a consequence of the R&D effort as well as in the
specification of the standard. Interestingly, several implementations of MPI/RT
(as well as comprehensive test suites) have been created in industry and
academia over the period during which the standard was created. MPI/RT is
likely to gain adoption interest over time, and this adoption may be driven by
the promulgation of the standard including this publication. We expect that,
when people are interested in understanding options for reliable, QoS-oriented
parallel computing with message passing, MPI/RT will serve as a foundation for
such a study, whether or not its complete formalism is accepted into other
systems or standards. MPI/RT is an offshoot of MPI-1, and retains many of
the communication patterns of MPI-1. However, MPI/RT has investigated issues
of fine-grain concurrency and highest achievable performance in many ways that
were evidently inappropriate for MPI-1 to do in the scientific computing space
in which it resides, with its much broader audience. MPI/RT focuses on
early-binding (planned transfer), concurrent message passing, while integrating
multiple real-time models: time-based, event-driven, and priority-oriented
channels. Group admission control and declarative (deferred early binding)
semantics support the goal of hard-real-time for the message-passing component
of computation. Importantly, MPI/RT emphasizes the decoupling of message
transfer and process/thread scheduling as part of its contribution to parallel
processing with QoS. Buffer management and state transition diagrams are also
integral to the notion of streaming data into and out of processors in a way
that is consistent with QoS, and friendly to zero-copy approaches to
communication. MPI/RT has also made strides in the direction of a parallel
middleware specification by emphasizing an object-oriented design for the API
compared to an object-based API or ad hoc API. The advantages of these are
plain in the standard, in that the functionality has useful polymorphic
adaptations where needed. Furthermore, the concepts that derive from MPI-1
(such as collective operations) appear as objects in MPI/RT. This modification
has allowed for the removal of certain constructs in MPI-1 (such as the
communicator), in favor of a specification and implementation phase for objects
that describe communication in MPI/RT. Overall, a cleaner, more extensible
design exists, which does not utilize more resources per se than those which
are needed to admit the required channels for a program. Both offline and
online admission control is contemplated, and multiple modes are supported,
albeit weakly. MPI/RT-1.1, the standard version described here, does not cover
all possible real-time parallel programming possibilities. It is silent
concerning process/thread scheduling, so in some sense, still has strong
aspects of "best effort," in terms of process scheduling. Leaving process
scheduling as an orthogonal concern was intentional, so that the best concepts
in these areas would be used in concert with MPI/RT, rather than offering a
monolith. Furthermore, MPI/RT-1.1 does not explicitly address mode changes
(with guaranteed mode-change QoS) between sets of channels, with invariants and
non-invariants among the resources consumed. This remains important work for
the future. The object-oriented, resource-conscious approach of MPI/RT
naturally extends to multiple modes. Work to realize this in a standard or in
prototypes remains for future work, although it was discussed and prototyped
extensively during standardization. It is interesting to consider whether the
connection-oriented, but limited QoS and resource specification of MPI/RT leads
to a more scalable or less scalable system environment than that posed by MPI
and similar middleware. While connections themselves indicate that resources
will be assigned per connection, only those connections that are
program-mandated are actually built. By way of contrast, in MPI, it is
necessary to offer a virtual all-to-all communication topology, and introduce
overheads associated with either the static realization of such a topology, or
else the dynamic build-up/tear down of connections in constrained environments
seeking to scale. Events over the past six years, involving the evolution of
networking technology make MPI/RT as interesting as it was when started, and
possibly of more ubiquitous application in the long term. Infiniband, Rapid
I/O, 3GIO, and other System Area Network standards are likely to offer
rudimentary QoS over time in real applications. Likewise, the production of
massively concurrent supercomputers (10^4 nodes or more) is likely to drive the
need for predictable message passing in the runtime aspects of such systems.
These events are likely to cause the ideas and concepts defined in this
standard to have impact in areas far broader than originally anticipated.
- Anthony Skjellum, Arkady Kanevsky, Yoginder S. Dandass, Jerrell
Watts, Steve Paavola, Dennis Cottel, Greg Henley, L. Shane Hebert, Zhenqian
Cui, Anna Rounbehler
- Mississippi State University; MPI Software Technology, Inc.;
Network Appliance; California Institute of Technology; Syracuse University; SKY
Computers; SPAWAR Systems Center, US Navy; Raytheon Company; Real-Time Message
Passing Interface (MPI/RT) Forum http://www.mpirt.org
- tony@hpcl.msstate.edu
- Received February 9 2002, Comments to Authors May 29 2002,
accepted December 31 2002
- C593: DYNAMIC CLASS MODEL IN OBJECT-ORIENTED DESIGN
- Abstract: This paper proposes design with the help of
object-oriented modelling. The class model of the designed object can be dynamically
changed by the system. Thanks to the specificity of the class model we obtain
new emergent solutions and a guarantee of space saving. The advantage of the
proposed approach is a well-grounded theoretical base and polynomial time of
operations. This opens the possibility of practical applications and extensions of
the proposed model.
- Marcin Skowron
- Institute of Computer Science Jagiellonian University Nawojki 11,
30-072 Kraków, Poland
- Marcin.Skowron.Student@softlab.ii.uj.edu.pl
- Received 12 February 2002, Comments to Author July 4 2002
- Full Paper:
../CCPEwebresource/c593skowron/c593skowron.pdf
- C594: A parallel data assimilation model for oceanographic
observations
- Abstract:In this paper we describe the development of a
program that aims at the optimal integration of observed data in an
oceanographic model describing the water transport phenomena in the Agulhas
area at the tip of South Africa. Two parallel implementations, MPI and OpenMP,
are described, and experiments with respect to speed and scalability on a Compaq
AlphaServer SC and an SGI Origin3000 are reported.
- Fons van Hees, Aad J. van der Steen; Peter Jan van Leeuwen
- Computational Physics, Utrecht University P.O. Box 80195, 3508 TD
Utrecht; Institute for Marine and Atmospheric Research Utrecht, Utrecht
University P.O. Box 80005, 3508 TA Utrecht
- A.vanderSteen@phys.uu.nl
- Received 13 February 2002, Comments to Author July 4 2002,
accepted August 6 2002
- Full Paper:
../CCPEwebresource/c594vandersteen/c594join.pdf
- C595: A Comparison of Task Pools for Dynamic Load Balancing
of Irregular Algorithms
- Abstract: Since a static data distribution does not give
satisfactory results for parallel irregular algorithms, there is a need for a
dynamic distribution of data that can be adapted to the current runtime
behavior of the algorithm. Task pools are data structures which can distribute
data dynamically to different processors.
This paper discusses the
characteristics of task-based algorithms and describes the implementation of
selected types of task pools for shared-memory multiprocessors. Several task
pools have been implemented in C with POSIX threads and in Java. Results of
these implementations measured on three different shared-memory systems are
shown for a synthetic algorithm and the parallel hierarchical radiosity
method.
- Matthias Korch and Thomas Rauber
- Martin-Luther-Universität Halle-Wittenberg, Fachbereich
Mathematik und Informatik, Institut für Informatik, D-06099 Halle (Saale),
Germany
- korch@informatik.uni-halle.de
- Received 26 February 2002, Comments to Author July 4 2002,
accepted December 31 2002
- Full Paper: ../CCPEwebresource/c595korch/c595tplong.pdf
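The idea of a task pool that redistributes dynamically spawned work among threads can be sketched compactly. The toy below (illustrative Python with a single central queue; the paper's C/Pthreads and Java pools, including distributed variants with task stealing, are not reproduced) lets any task put new tasks back into the shared pool:

```python
# Illustrative central task pool for dynamic load balancing on shared
# memory; NOT the paper's implementation. Worker threads repeatedly grab
# tasks, and a task may spawn new tasks back into the pool -- this is what
# makes the data distribution dynamic rather than static.
import queue
import threading

def run_task_pool(initial_tasks, worker_count=4):
    """Run tasks from a shared pool. A task is a callable returning
    (result, [new_tasks]). Returns all results (order unspecified)."""
    pool = queue.Queue()
    results, lock = [], threading.Lock()
    for t in initial_tasks:
        pool.put(t)

    def worker():
        while True:
            task = pool.get()            # blocks; threads are daemons
            result, new_tasks = task()
            with lock:
                results.append(result)
            for t in new_tasks:          # dynamically spawned work
                pool.put(t)
            pool.task_done()             # after the puts, so join() is safe

    for _ in range(worker_count):
        threading.Thread(target=worker, daemon=True).start()
    pool.join()                          # wait until every task is done
    return results

# An irregular workload: each task may spawn children up to a depth limit,
# so the amount of work is unknown at start time.
def make_task(depth):
    def task():
        children = [make_task(depth - 1) for _ in range(2)] if depth > 0 else []
        return depth, children
    return task

print(sorted(run_task_pool([make_task(2)])))   # [0, 0, 0, 0, 1, 1, 2]
```

A single central queue serializes all accesses, which is exactly the contention the paper's distributed task pools with stealing are designed to avoid; this sketch shows only the programming model.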
- C596: Towards an Improved Memory Model for Java
- Abstract:The Java Language Specification contains a
collection of rules that describe how a thread's working memory should interact
with the main memory in a multi-threaded, shared variables environment. We
formalize and analyze the Java memory model rules from the original
specification. We are able to remove many unnecessary constraints and show
deficiencies. We show that sequential consistency does not hold in the original
memory model and propose a correction. We also add rules for the prevention of
deadlock. Finally we discuss possible compiler optimizations based on our
memory model.
- Vishnu Kotrarajas and Susan Eisenbach
- Department of Computing, Imperial College, 189 Queen's Gate,
London SW7 2BZ, U.K.
- sue@doc.ic.ac.uk
- Received 20 March 2002, Comments to Author July 4 2002
- C597: Multi-Wavelength Image Space: Another Grid-Enabled
Science
- Abstract:We describe how the Grid enables new research
possibilities in astronomy through multi-wavelength images. To see sky images
in the same pixel space, they must be projected to that space, a
compute-intensive process. There is thus a virtual data space induced that is
defined by an image and the applied projection. This virtual data can be
created and replicated with Planners and Replica catalog technology developed
under the GriPhyN project. We plan to deploy our system (MONTAGE) on the US
Teragrid. Grid computing is also needed for ingesting data -- computing
background correction on each image -- which forms a separate virtual data
space. Multi-wavelength images can be used for pushing source detection and
statistics by an order of magnitude from current techniques; for optimization
of multi-wavelength image registration for detection and characterization of
extended sources; and for detection of new classes of essentially
multi-wavelength astronomical phenomena. The paper discusses both the grid
architecture and the scientific goals.
- Roy Williams, Bruce Berriman, Ewa Deelman, John Good, Joseph
Jacob, Carl Kesselman, Carol Lonsdale, Seb Oliver, Thomas A. Prince
- CACR, California Institute of Technology, USA, IPAC, California
Institute of Technology, USA, ISI, University of Southern California, USA, JPL,
California Institute of Technology, USA, Astronomy Centre, University of
Sussex, UK
- roy@cacr.caltech.edu
- Received May 24 2002, Comments to Authors August 6 2002, Accepted
August 27 2002
- C605: Guiding RMI Synthesis with Interactive Graph
Placement
- Abstract: CentiJ is a metacomputing system that forwards
messages through an RMI (Remote Method Invocation) layer for distributed
computation, using a bridge-pattern code synthesizer. CentiJ reuses original
implementations without altering the source code. It also provides a means
of partitioning a sequential program across a network of workstations (NOWs) by
using interactive graph placement. The graph placement algorithm is
force-based, treating nodes as charged bodies and arcs as springs. Interaction
is used to take advantage of the programmer's knowledge about a program.
The assumption is that the programmer can do a better job at coarse-grained
problem partitioning than any modern parallelizing compiler.
- Douglas Lyon
- Computer Engineering Dept. Fairfield University, Fairfield CT
06430
- lyon@docjava.com
- Received June 12 2002, Comments to Author August 14 2002
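The "charged bodies plus springs" placement the abstract describes is a standard force-directed layout. The sketch below (illustrative Python; the constants, function names, and the omission of the interactive part are assumptions made here, not CentiJ's code) runs a few damped iterations in 2-D:

```python
# Illustrative force-directed placement: nodes repel like charged bodies,
# arcs pull like springs. NOT CentiJ's implementation; constants are
# arbitrary and the interactive adjustment step is omitted.
import math

def layout(nodes, edges, steps=300, repel=1.0, spring=0.05, dt=0.05):
    # start nodes at distinct points on the unit circle
    pos = {n: (math.cos(i), math.sin(i)) for i, n in enumerate(nodes)}
    for _ in range(steps):
        force = {n: [0.0, 0.0] for n in nodes}
        for a in nodes:                      # pairwise "charge" repulsion
            for b in nodes:
                if a == b:
                    continue
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                d2 = dx * dx + dy * dy + 1e-9
                force[a][0] += repel * dx / d2
                force[a][1] += repel * dy / d2
        for a, b in edges:                   # spring attraction along arcs
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            force[a][0] += spring * dx
            force[a][1] += spring * dy
            force[b][0] -= spring * dx
            force[b][1] -= spring * dy
        pos = {n: (pos[n][0] + dt * force[n][0], pos[n][1] + dt * force[n][1])
               for n in nodes}
    return pos

p = layout(["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("c", "d")])
dist = lambda u, v: math.hypot(p[u][0] - p[v][0], p[u][1] - p[v][1])
# connected nodes settle closer together than unconnected ones
print(dist("a", "b") < dist("a", "d"))
```

In CentiJ the resulting node positions are then adjusted interactively by the programmer before the program is partitioned across the NOW.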
- C608: A Unified Peer-to-Peer Database Framework and its
Application for Scalable Service Discovery
- Abstract:In a large distributed system spanning many
administrative domains such as a DataGrid, it is often desirable to maintain
and query dynamic and timely information about active participants such as
services, resources and user communities. However, in such a database system,
the set of information tuples in the universe is partitioned over one or more
distributed nodes, for reasons including autonomy, scalability, availability,
performance and security. It is not obvious how to enable general-purpose
discovery query support and collective collaborative functionality that operate
on the distributed system as a whole, rather than on a given part of it.
Further, it is not obvious how to allow for search results that are fresh,
allowing dynamic content. It appears that a Peer-to-Peer (P2P) database network
may be well suited to support dynamic distributed database search, for example
for service discovery. In this paper, we take the first steps towards unifying
the fields of database management systems and P2P computing, which so far have
received considerable, but separate, attention. We extend database concepts and
practice to cover P2P search. Similarly, we extend P2P concepts and practice to
support powerful general-purpose query languages such as XQuery and SQL. As a
result, we devise the Unified Peer-to-Peer Database Framework (UPDF), which
allows one to express specific applications for a wide range of data types, node
topologies (e.g. ring, tree, graph), query languages (e.g. XQuery, SQL), query
response modes (e.g. Routed, Direct and Referral Response), neighbor selection
policies (in the form of an XQuery), pipelining, timeout and other scope
characteristics.
- Wolfgang Hoschek
- CERN IT Division, European Organization for Nuclear Research,
1211 Geneva 23, Switzerland
- wolfgang.hoschek@cern.ch
- Received June 21 2002, Comments to Author 14 2002
- C609: The Web Service Discovery Architecture
- Abstract: An enabling step towards increased Internet
software execution flexibility is the web services vision of distributed
computing, where programs are no longer configured with static information.
Rather, the promise is that programs are made more flexible and powerful by
querying Internet databases (registries) at runtime in order to discover
information and network attached third-party building blocks. Services can
advertise themselves and related metadata via such databases, enabling the
assembly of distributed higher-level components. In this paper, we propose and
specify a discovery architecture that supports this vision, the so-called Web
Service Discovery Architecture (WSDA). WSDA promotes an interoperable web
service layer by defining appropriate services, interfaces, operations and
protocol bindings for discovery. It embraces and integrates solid industry
standards. It is modular because it defines a small set of orthogonal
multi-purpose communication primitives (building blocks) for discovery. These
primitives cover service identification, service description retrieval, data
publication as well as minimal and powerful query support. Each communication
primitive is designed to avoid any unnecessary complexity. WSDA is open and
flexible because each primitive can be used, implemented, customized and
extended in many ways. It is powerful because the individual primitives can be
combined and plugged together by specific clients and services to yield a wide
range of behaviors and emerging synergies. It is unified because it subsumes an
array of disparate concepts, interfaces and protocols under a single
semi-transparent umbrella. It is non-intrusive because it offers interfaces but
does not mandate that every service comply with these interfaces.
- Wolfgang Hoschek
- CERN IT Division, European Organization for Nuclear Research,
1211 Geneva 23, Switzerland
- wolfgang.hoschek@cern.ch
- Received June 10 2002, Comments to Author June 28 2002, Accepted
August 14 2002
- C610: Impact of Mixed-Parallelism on Parallel
Implementations of Strassen and Winograd Matrix Multiplication Algorithms
- Abstract:In this paper we study the impact of the
simultaneous exploitation of data- and task-parallelism, so-called mixed
parallelism, on Strassen and Winograd matrix multiplication algorithms. For
each of these algorithms, we propose two mixed-parallel implementations. The
former follows the phases of the original algorithms while the latter has been
designed as the result of a list scheduling algorithm. We give a theoretical
comparison.
- F. Desprez and F. Suter
- LIP, ENS Lyon, 46 allee d'Italie, 69364 Lyon Cedex 07,
France
- Frederic.Desprez@ens-lyon.fr
- Received July 1 2002, Comments to Author October 21 2002
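The task structure being parallelized is easiest to see in the recursion itself: Strassen replaces the eight recursive products of the naive algorithm with seven independent products M1..M7, each of which can become a task, while each product is internally data-parallel. The sequential sketch below (illustrative Python for n x n matrices with n a power of two; not the authors' mixed-parallel implementation) shows those seven products:

```python
# Illustrative sequential Strassen recursion; NOT the paper's mixed-parallel
# code. The seven products M1..M7 are mutually independent, which is the
# task-parallelism the paper exploits.

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def split(A):
    n = len(A) // 2
    return ([r[:n] for r in A[:n]], [r[n:] for r in A[:n]],
            [r[:n] for r in A[n:]], [r[n:] for r in A[n:]])

def strassen(A, B):
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    M1 = strassen(add(A11, A22), add(B11, B22))   # 7 independent products
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(sub(add(M1, M3), M2), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot

print(strassen([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]
```

Winograd's variant reduces the 18 additions/subtractions above to 15; the seven-product task graph, which is what mixed parallelism schedules, is the same.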
- C614: Smart Data: The Source of - and the Solution to -
Security Risks
- Abstract:Note (from editor) this is a "vision" paper. It
is suggested that online data, as it is annotated in the years to come by
experts, will incrementally become far more usable by all users, including
those who violate security controls. In fact, the main source of security in
data, its natural heterogeneity and the resulting difficulty in making use of
it, will sharply diminish in the years to come. But this "smartness" of data
can also be exploited in order to greatly enhance data security. Further, it is
suggested that the development of four different software technologies, if
guided by an integrated vision, could serve as a broad-based and powerful
platform for the collaborative specification and enforcement of security
controls, and for the maintenance of these controls through multiple cycles of
data transformation, integration, and reuse.
- Roger (Buzz) King
- Dept. of Computer Science, University of Colorado, Boulder CO,
80309
- roger@cs.colorado.edu
- Received July 3 2002, Comments to Author October 21 2002,
Accepted December 31 2002
- C624: Mobile Agents and Web Databases: Models and
Experimentation
- Abstract:In this paper we present practical experiences
gathered from the employment of two popular Java-based mobile-agent platforms,
IBM's Aglets and Mitsubishi's Concordia. We present some basic distributed
computing models and describe their adaptation to the mobile-agent paradigm.
Upon these models we develop a set of frameworks for distributed database
access over the World-Wide Web, using IBM's Aglets and Mitsubishi's Concordia
platforms. We compare the two platforms both quantitatively and qualitatively.
For the quantitative comparison, we propose, employ, and validate an approach
to evaluate and analyze mobile-agent framework performance. For the qualitative
assessment, we present our observations about the programmability and
robustness of, and mobility provided by, the two platforms.
- G. Samaras, M. Dikaiakos, C. Spyrou
- Department of Computer Science, University of Cyprus, P.O. Box
20537, 1678 Nicosia, Cyprus
- cssamara@ucy.ac.cy
- Received July 11 2002, Comments to Authors December 31 2002
- Cluster 2001 Special Issue: 6 papers in final accepted form at
../CCPEwebresource/c635toc640clustersi
- C641: A Comparison of Concurrent Programming and
Cooperative Multithreading under Load Balancing Applications
- Abstract: Two models of thread execution are the general
concurrent programming execution model (CP) and the cooperative multithreading
execution model (CM). CP provides nondeterministic process execution where
context switches occur arbitrarily. CM provides threads that execute one at a
time until they explicitly choose to yield the processor. This paper uses two
classic applications to reveal the advantages and disadvantages of load
balancing during process execution under CP and CM styles. These applications
are programmed in two different languages (SR and Dynamic C) on different
hardware (standard PCs and embedded system controllers). An SR-like run-time
system, DesCaRTeS, was developed to provide inter-process communication for the
Dynamic C implementations. This paper compares load-balancing and
non-load-balancing implementations; it also compares CP-style and CM-style implementations.
The results show that: in cases of very high or very low workloads, load
balancing slightly hindered performance; and in cases of moderate workload,
both SR and Dynamic C implementations of load balancing generally performed
well. Further, for these applications, CM style programs outperform CP style
programs in some cases, but the opposite occurs in some other cases. This paper
also discusses qualitative tradeoffs between CM style programming and CP style
programming for these applications.
- Justin T. Maris, Aaron W. Keen, Takashi Ishihara, and Ronald A.
Olsson
- Department of Computer Science, University of California, Davis,
CA 95616 USA
- olsson@cs.ucdavis.edu
- Received July 31 2002, Comments to Authors December 31 2002,
Accepted February 14 2003
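The CM model in the abstract (a thread runs until it explicitly yields the processor) can be sketched with generators driven by a round-robin scheduler. This is an illustrative Python analogy only; the paper's work is in SR and Dynamic C, and the scheduler and names below are assumptions made here:

```python
# Illustrative cooperative-multithreading (CM) sketch; NOT the paper's SR
# or Dynamic C code. Each "thread" is a generator; context switches happen
# only at `yield`, so code between yields runs atomically with respect to
# the other threads -- the key semantic difference from preemptive CP.

def cooperative_run(thread_fns):
    """Round-robin scheduler: each thread runs until its next yield."""
    threads = [fn() for fn in thread_fns]
    while threads:
        still_alive = []
        for t in threads:
            try:
                next(t)              # run this thread up to its next yield
                still_alive.append(t)
            except StopIteration:
                pass                 # thread finished
        threads = still_alive

counter = {"n": 0}

def make_worker(increments):
    def worker():
        for _ in range(increments):
            # this read-modify-write needs no lock under CM: no context
            # switch can occur before the explicit yield below
            counter["n"] += 1
            yield
    return worker

cooperative_run([make_worker(3), make_worker(2)])
print(counter["n"])   # 5
```

Under a CP model the same increment would need a lock, since a preemptive context switch could interleave two read-modify-write sequences; that trade-off between simpler reasoning and the obligation to yield is the qualitative comparison the paper draws.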
- C642 (Performance Section): Comparing the Performance of
MPICH with Cray's MPI and with SGI's MPI
- Abstract: The purpose of this paper is to compare the
performance of MPICH with the vendor MPI on a Cray T3E-900 and an SGI Origin
3000. Seven basic communication tests, which include basic point-to-point and
collective MPI communication routines, were chosen to represent commonly used
communication patterns. Cray's MPI performed better (and sometimes
significantly better) than MSU's MPICH for small and medium messages. They both
performed about the same for the large messages. However for 3 tests MSU's
MPICH was about 20 percent faster than Cray's MPI. SGI's MPI performed and
scaled better (and sometimes significantly better) than MPICH for all messages,
except for the scatter test where MPICH outperformed SGI's MPI for 1 Kbyte
messages. The poor scalability of MPICH on the Origin 3000 suggests there may
be scalability problems with MPICH.
- Glenn R. Luecke, Marina Kraeva, Lili Ju
- Iowa State University Ames Iowa
- grl@iastate.edu
- Received 16th October 2001, Revised 2nd July 2002, Accepted 1st
Aug 2002
- C643 (Performance Section): A Scalable HPF Implementation
of a Finite Volume CEM Application on a CRAY T3E Parallel System
- Abstract: The time-dependent Maxwell equations are one of
the most important approaches to describing dynamic or wide-band frequency
electromagnetic phenomena. A sequential finite volume, characteristic-based
procedure for solving the time-dependent, three-dimensional Maxwell equations
has previously been implemented successfully in Fortran. Due to its large
memory requirements and high demand for CPU time, it is impossible to
test the code with large array sizes. Hence, it is essential to implement the
code on a parallel computing system. In this paper, we discuss an efficient
and scalable parallelization of the sequential Fortran time-dependent Maxwell
equations solver using High Performance Fortran (HPF). The background of the
project, the theory behind the efficiency being achieved, the parallelization
methodologies employed, and the experimental results obtained on the Cray T3E
massively parallel computing system will be described in detail. Experimental
runs show that the execution time is reduced drastically through parallel
computing. The code is scalable up to 98 processors on the Cray T3E and has a
similar performance compared to an MPI implementation. Based on the
experimentation carried out in this research, we believe that a high level
parallel programming language such as High Performance Fortran is a fast,
viable and economical approach to parallelizing many existing sequential codes
which exhibit a lot of parallelism.
- Yi Pan, Joseph J.S. Shang and Minyi Guo
- Department of Computer Science, Georgia State University,
Atlanta, GA 30303, USA; Air Vehicle Directorate Air Force Research Laboratory
WPAFB, OH 45433-7521, USA; Department of Computer Software, The University of
Aizu, Aizu-Wakamatsu City, Fukushima 965-8580 Japan
- pan@cs.gsu.edu
- Received 16th October 2001, Revised 1st August 2002, Accepted 8th
Aug 2002
- C650: A Summary of Grid Computing Environments
- Abstract: This short paper summarizes a set of 28 papers
gathered together by the GCE (Grid Computing Environment) group of the Global
Grid Forum. This set is published in 2002 as a special issue of Concurrency and
Computation: Practice and Experience. The papers are listed at the end of this
report, together with associated papers in a Grid Computing book.
- Geoffrey Fox, Dennis Gannon, Mary Thomas
- Department of Computer Science, 215 Lindley Hall, 150 S. Woodlawn
Ave., Bloomington IN 47405-7104; Texas Advanced Computing Center, The
University of Texas at Austin, 10100 Burnet Road, Austin, Texas 78758
- gcf@indiana.edu
- Accepted August 6 2002
- Full Paper:
../CCPEwebresource/c650gridgcespcissueoverview/c650gcesurvey.pdf
- C651: OpenMP-Oriented Applications for Distributed Shared
Memory Architectures
- Abstract: The rapid emergence of OpenMP as the preferred
parallel programming paradigm for small-to-medium scale parallelism could
decline unless OpenMP demonstrates that it can be the model of choice for the
large-scale high-performance parallel computing of the next decade. The main
stumbling block to adapting OpenMP for distributed shared memory (DSM)
machines, which are based on cc-NUMA-like architectures, is the absence of
mechanisms for placing data among processors and threads to achieve data
locality. The absence of such mechanisms causes remote memory accesses and
inefficient use of cache memory, both of which lead to poor performance. This
paper presents a simple software programming approach called
Copy-inside-Copy-back (CC) that exploits the privatization mechanism of OpenMP
for data placement and re-placement. This technique distributes data without
taking control and flexibility away from the programmer, and is thus an
alternative to the traditional implicit and explicit approaches. Moreover, the
CC approach enables an SPMD style of programming that makes the development of
an OpenMP application more structured and simpler to modify and debug. The CC
technique was tested and analyzed using the NAS Parallel Benchmarks on an SGI
Origin 2000 multiprocessor machine. The lesson learnt from this study is that
OpenMP can deliver the desired large-scale parallelism, although a fast copy
mechanism is essential.
- Ami Marowka, Zhenying Liu, Barbara Chapman
- Department of Computer Science, The University of Houston,
Texas
- amimar2@yahoo.com
- Received August 9 2002, Comments to Authors December 31 2002,
Accepted February 14 2003
- C652: Parallel Operation of CartaBlanca on Shared and
Distributed Memory Computers
- Abstract:We describe the parallel performance of the pure
Java CartaBlanca code on heat transfer and multiphase fluid flow problems.
CartaBlanca is designed for parallel computations on partitioned unstructured
meshes. It uses Java's thread facility to manage computations on each of
the mesh partitions. Inter-partition communications are handled by two compact
objects for node-by-node communication along partition boundaries and for
global reduction calculations across the entire mesh. For distributed
calculations, the JavaParty package from the University of Karlsruhe is
demonstrated to work with CartaBlanca.
- N. T. Padial-Collins, W. B. VanderHeyden, D. Zhang, E. D. Dendy,
D. Livescu
- Los Alamos National Laboratory Theoretical Division and Los
Alamos Computer Science Institute Los Alamos, NM 87545
- nelylanl@lanl.gov
- Received August 13 2002, Comments to Authors December 31 2002,
Accepted January 12 2003
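The thread-per-partition pattern described in the abstract can be sketched in plain Java; the class and variable names below are purely illustrative and do not reflect CartaBlanca's actual API. One worker thread sums its own mesh partition, and a shared reduction object combines the partial results:

```java
// Hypothetical sketch: one worker thread per mesh partition, with a
// global reduction combining per-partition partial sums.
import java.util.concurrent.atomic.DoubleAdder;

public class PartitionedSolver {
    public static void main(String[] args) throws InterruptedException {
        double[][] partitions = { {1.0, 2.0}, {3.0, 4.0}, {5.0, 6.0} };
        DoubleAdder globalSum = new DoubleAdder(); // global reduction object

        Thread[] workers = new Thread[partitions.length];
        for (int p = 0; p < partitions.length; p++) {
            final double[] local = partitions[p];
            workers[p] = new Thread(() -> {
                double partial = 0.0;
                for (double v : local) partial += v; // per-partition work
                globalSum.add(partial);              // contribute to reduction
            });
            workers[p].start();
        }
        for (Thread t : workers) t.join();
        System.out.println(globalSum.sum()); // 21.0
    }
}
```

Inter-partition boundary exchange, which the paper handles with a dedicated communication object, is omitted here for brevity.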
- C653: Parallel Program Debugging by Specification
- Abstract:Most message passing parallel programs employ
logical process topologies with regular characteristics to support their
computation. Since process topologies define the relationship between
processes, they present an excellent opportunity for debugging. The primary
benefit is that process behaviours can be correlated, allowing expected
behaviour to be abstracted and identified, and undesirable behaviour reported.
However, topology support is inadequate in most message passing programming
environments, including the popular Message Passing Interface (MPI) and the
Parallel Virtual Machine (PVM). Programmers are forced to implement topology
support themselves, increasing the possibility of introducing errors.
This paper introduces a trace- and topology-based contract-like approach to
parallel program debugging, driven by four distinct types of specification.
Trace interpretation specifications allow trace data from a variety of sources
and message passing libraries to be interpreted in an abstract manner; topology
specifications address the lack of explicit topology knowledge and facilitate
the construction of user-consistent views of the debugging activity; loop
specifications express topology-consistent behaviour, allowing conformance
testing of associated trace data; and error specifications specify undesirable
event interactions, including mismatched message sizes and mismatched
communication pairs. Both loop and error specifications are simplified by
knowledge of the actual topologies being debugged.
The wealth of new debugging views and techniques made possible by our
contract-like approach is also discussed.
- Simon Huband and Chris McDonald
- Department of Computer Science and Software Engineering, The
University of Western Australia, Crawley, Western Australia 6009,
Australia.
-
huey@csse.uwa.edu.au
- Received August 16 2002, Comments to Authors December 31 2002,
Accepted March 2 2003
- C655: Improving the official specification of Java bytecode
verification
- Abstract: Bytecode verification is the main mechanism to
ensure type safety in the Java Virtual Machine. Inadequacies in its official
specification may lead to incorrect implementations where security can be
broken and/or certain legal programs are rejected. This paper provides a
comprehensive analysis of the specification along with concrete suggestions for
improvement.
- Alessandro Coglio
- Kestrel Institute, 3260 Hillview Avenue, Palo Alto, CA 94304, USA
- coglio@kestrel.edu
- Received 8 January 2002, Revised 4 August 2002, Accepted August
15 2002
- C658: Modular specification of frame properties in JML
- Abstract:We present a modular specification technique for
frame properties. The technique uses modifies clauses and abstract fields with
declared dependencies. Modularity is guaranteed by a programming model that
enforces data abstraction by preventing representation and argument exposure, a
semantics of modifies clauses that uses a notion of ``relevant location,'' and
by modularity rules for dependencies. For concreteness, we adapt this technique
to the Java Modeling Language, JML.
- P. Muller, A. Poetzsch-Heffter, and G. T. Leavens
- Dept. of Computer Science, Iowa State University, 229 Atanasoff
Hall, Ames, Iowa, 50011-1040, USA
-
leavens@cs.iastate.edu
- Received October 1 2001, Revised October 2 2002, Accepted October
5 2002
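As a minimal illustration of a frame property, the JML-style example below (our own hypothetical snippet, not taken from the paper) uses a modifies clause to state that set() may change only the location value. JML annotations live in specially-marked comments, so the class compiles and runs as ordinary Java:

```java
// Hypothetical JML-style example of a frame property: the modifies
// clause in the //@ annotation comment limits which locations set()
// is allowed to change. The annotation is inert for plain javac.
public class Counter {
    private int value;

    //@ modifies value;
    public void set(int v) { value = v; }

    public int get() { return value; }

    public static void main(String[] args) {
        Counter c = new Counter();
        c.set(7);
        System.out.println(c.get()); // 7
    }
}
```

A JML checker would reject an implementation of set() that wrote to any location other than value, which is exactly the modular reasoning the paper's dependency rules are designed to preserve.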
- C659: Parallel LU Factorization of Sparse Matrices on
FPGA-Based Configurable Computing Engines
- Abstract:Reconfigurable computing has demonstrated over the
last decade the capability to significantly improve the performance of a wide
range of computation-intensive applications. With steady advances in silicon
technology, as predicted by Moore's law, Field-Programmable Gate Array (FPGA)
technologies have enabled the implementation of System-On-a-Programmable-Chip
(SOPC) computing platforms, which, in turn, have given a significant boost to
the field of reconfigurable computing. It is now possible to implement various
specialized parallel machines within one silicon die. For example, LU
factorization is widely used in engineering and science to solve efficiently
large systems of linear equations. In this paper, we describe our design and
implementation of a parallel machine on an SOPC development board, using
multiple copies of the Altera soft configurable processor, namely Nios(r); we
use this machine for LU factorization. Our implementation facilitates the
efficient solution of linear equations at a cost much lower than that of
supercomputers and networks of workstations. The intricacies of our FPGA-based
design are presented along with tradeoff choices made for the purpose of
illustration. Performance results prove the viability of our approach.
- Xiaofang Wang and Sotirios G. Ziavras
- Department of Electrical and Computer Engineering New Jersey
Institute of Technology University Heights Newark, New Jersey 07102, USA
- ziavras@njit.edu
- Received October 19 2002, Comments to Authors December 31 2002,
Accepted January 17 2003
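For reference, the kernel being parallelized here is ordinary LU factorization with partial pivoting. The sketch below is a plain sequential Java rendering of the textbook algorithm, not the authors' FPGA design:

```java
// Sequential LU factorization with partial pivoting (textbook version).
// A is factored in place: multipliers (the L part) overwrite the strictly
// lower triangle, U overwrites the upper triangle.
public class LuDemo {
    static int[] lu(double[][] a) {
        int n = a.length;
        int[] perm = new int[n];
        for (int i = 0; i < n; i++) perm[i] = i;
        for (int k = 0; k < n; k++) {
            int pivot = k; // partial pivoting: pick largest |a[i][k]|
            for (int i = k + 1; i < n; i++)
                if (Math.abs(a[i][k]) > Math.abs(a[pivot][k])) pivot = i;
            double[] tmpRow = a[k]; a[k] = a[pivot]; a[pivot] = tmpRow;
            int tp = perm[k]; perm[k] = perm[pivot]; perm[pivot] = tp;
            for (int i = k + 1; i < n; i++) {
                a[i][k] /= a[k][k];               // multiplier stored in L
                for (int j = k + 1; j < n; j++)
                    a[i][j] -= a[i][k] * a[k][j]; // update trailing block
            }
        }
        return perm; // row permutation applied during pivoting
    }

    public static void main(String[] args) {
        double[][] a = { {2, 1}, {4, 3} };
        lu(a); // rows are swapped so the 4 becomes the pivot
        System.out.println(a[1][0]); // multiplier 2/4 = 0.5
    }
}
```

The trailing-submatrix update in the inner loops is where the bulk of the work lies, and it is this part that a parallel machine, whether FPGA-based or conventional, distributes across processing elements.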
- C660: Frameworks for Incorporating Semantic Relationships
into Object-Oriented Database Systems
- Abstract: A semantic relationship is a data modeling
construct that connects a pair of classes or categories and has inherent
constraints and other functionalities that precisely reflect the
characteristics of the specific relationship in an application domain. Examples
of semantic relationships include part-whole, ownership, materialization, and
role-of. Such relationships are important in the construction of information
models for advanced applications, whether one is employing traditional data
modeling techniques, knowledge-representation languages, or object-oriented
modeling methodologies. This paper focuses on the issue of providing built-in
support for such constructs in the context of object-oriented database (OODB)
systems. Most of the popular object-oriented modeling approaches include some
semantic relationships in their repertoire of data modeling primitives.
However, commercial OODB systems, which are frequently used as implementation
vehicles, tend not to do the same. We will present two frameworks by which a
semantic relationship can be incorporated into an existing OODB system. The
first only requires that the OODB system support manifest type with respect to
its instances. The second assumes that the OODB system has a special kind of
metaclass facility. The two frameworks are compared and contrasted. In order to
ground our work in existing systems, we show the addition of a part-whole
semantic relationship both to the ONTOS DB/Explorer OODB system and the VODAK
Model Language (VML).
- Michael Halper, Li-min Liu, James Geller, and Yehoshua Perl
- Dept. of Mathematics & Computer Science, Kean University,
Union, NJ 07083 USA; Dept. of Applied Mathematics Chung Yuan Christian
University, Chung-Li, Taiwan; CS Dept. NJIT Newark, NJ 07102 USA
- mhalper@kean.edu
- Received February 15 2002, Revised October 28 2002, Accepted
November 2 2002
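To make the idea of a built-in semantic relationship concrete, the sketch below enforces one typical constraint of a part-whole relationship, exclusiveness of ownership, in plain Java. The class and method names are our own invention, not ONTOS DB/Explorer or VML constructs:

```java
// Hypothetical illustration of a part-whole semantic relationship with
// an exclusiveness constraint: a part may belong to at most one whole.
import java.util.ArrayList;
import java.util.List;

public class PartWholeDemo {
    static class Part {
        Whole owner; // null while unowned
    }

    static class Whole {
        final List<Part> parts = new ArrayList<>();
        void addPart(Part p) {
            if (p.owner != null) // the built-in constraint check
                throw new IllegalStateException("part already owned");
            p.owner = this;
            parts.add(p);
        }
    }

    public static void main(String[] args) {
        Whole engine = new Whole();
        Part piston = new Part();
        engine.addPart(piston);
        System.out.println(engine.parts.size()); // 1
        try {
            new Whole().addPart(piston); // violates exclusiveness
        } catch (IllegalStateException e) {
            System.out.println("rejected");
        }
    }
}
```

In the frameworks the paper describes, checks like this would be supplied by the OODB system itself rather than hand-coded in every application class.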
- C661: Enhancing OODB Semantics to Support Browsing in an
OODB Vocabulary Representation
- Abstract: In previous work, we have modeled a vocabulary
given as a semantic network by an OODB (Object-Oriented Database). The OODB
schema thus obtained provides a compact abstract view of the vocabulary. This
enables fast traversal of the vocabulary by a user. In the semantic network
vocabulary, the IS-A relationships express the specialization hierarchy. In our
OODB modeling of the vocabulary, the SUBCLASS relationship expresses the
specialization hierarchy of the classes and supports the inheritance of their
properties. A typical IS-A path in the vocabulary has a corresponding shorter
SUBCLASS path in the OODB schema. In the current paper we expose several cases
where the SUBCLASS hierarchy fails to fully correspond to the IS-A hierarchy of
the vocabulary. In these cases there exist traversal paths in the semantic
network for which there are no corresponding traversal paths in the OODB
schema. The reason for this failure is the existence of IS-A relationships
between concepts of two classes that are not connected by a SUBCLASS
relationship. This phenomenon weakens the accuracy of our modeling. To rectify
the situation we introduce a new OODB semantic relationship IS-A' to represent
the existence of IS-A relationships between concepts of a pair of classes which
are not connected via a SUBCLASS relationship. The resulting schema contains
both SUBCLASS relationships and IS-A' relationships, which together completely
model the IS-A hierarchy of the vocabulary. We define a mixed class-level
traversal path
to contain either SUBCLASS or IS-A' relationships. Consequently, each traversal
path in the semantic network has a corresponding mixed traversal path in the
OODB schema. Hence the introduction of the semantic OODB IS-A' relationship
improves the modeling of semantic network vocabularies by OODBs.
- Li-min Liu, James Geller and Yehoshua Perl
- Department of Applied Mathematics, Chung Yuan Christian University,
22 Pu-Zen, Pu-Chung Li, Chung-Li, Taiwan, R.O.C.; Computer Science Department,
New Jersey Institute of Technology, Newark, NJ 07102 USA
-
thescientist00@yahoo.com
- Received February 22 2002, Revised October 30 2002, Accepted
November 2 2002
- C662: Efficient Parallel Implementations of Unstructured
Mesh Generation with High Performance Fortran
- Abstract:Unstructured mesh generation exhibits highly
irregular computation patterns, which poses a challenge for implementing
triangulation algorithms on parallel machines. This paper reports on an
efficient parallel implementation of locally Delaunay triangulation with High
Performance Fortran (HPF). Our algorithm exploits embarrassing parallelism by
performing sub-block triangulation and boundary merge independently at the same
time. The sub-block triangulation is by a divide and conquer Delaunay algorithm
known for its sequential efficiency, and the boundary triangulation is by an
incremental construction algorithm with low overhead. Compared to prior work,
our parallelization method is both simple and efficient. In the paper, we also
describe a solution to the collinear points problem that usually arises in
large data sets. Our experiences with the HPF implementation show that with
careful control of the data distribution, we are able to parallelize the
program using HPF's standard directives and extrinsic procedures. Experimental
results on several parallel platforms, including an IBM SP2 and a DEC Alpha
farm, show that a parallel efficiency of 31%-66% can be achieved for an 8-node
distributed memory system. We also compare the efficiency of the HPF
implementation to that of a similarly hand-coded MPI implementation.
- Min-Bin Chen, Tyng-Ruey Chuang and Jan-Jan Wu
- Department of Management Information Systems, Chung Kuo Institute
of Technology, Mucha, Taipei 116, Taiwan; Institute of Information Science,
Academia Sinica, Nankang, Taipei
- cmb@mail.ckitc.edu.tw
- Received November 23 2002, Comments to Authors March 8 2003
- C663: A PC Cluster System Employing the IEEE 1394
- Abstract:In this paper, we describe the design and
evaluation of a PC cluster system in which IEEE 1394 is applied. Networks for
parallel cluster computing require low latency and high bandwidth. It is also
important that the networks be commercially available at low cost. Few network
devices satisfy all of the above requirements. However, the IEEE 1394 standard
provides a good compromise for fulfilling these requirements. We have used IEEE
1394, which supports a 400 Mbps data transfer rate, to connect the nodes of a
PC cluster system which we have designed and implemented. We have implemented
two communication libraries. One is a fast communication library called CF for
IEEE 1394. The other is an MPI layer library on the CF library. Experimental
results show that CF achieves a 17.2 microsecond round-trip time. On
application benchmarks, the system was considerably faster than TCP/IP over
Fast Ethernet. Even though the system was constructed at very low cost, it
provides good performance. Using the IEEE 1394 standard is thus a good solution
for low-cost cluster systems.
- Kazuki Hyoudou, Ryota Ozaki and Yasuichi Nakayama
- Department of Computer Science, The University of
Electro-Communications Chofu, Tokyo 182-8585 Japan
- hyoudo-k@igo.cs.uec.ac.jp
- Received November 29 2002, Comments to Authors March 8 2003
- C664: Data Structures in Java for Matrix Computations
- Abstract:In this paper it is shown how to utilize Java
arrays for matrix computations. We discuss the disadvantages of Java arrays
when used as two-dimensional arrays for dense matrix computation, and how to
improve the performance. We show how to create efficient dynamic data
structures for sparse matrix computations using Java's native arrays. We
construct a data structure for large sparse matrices that is unique to Java.
This data structure is shown to be more dynamic and efficient than the
traditional storage schemes for large sparse matrices. Numerical results show
that this new data structure, called Java Sparse Array, is competitive with the
traditional Compressed Row Storage scheme on matrix computation routines. Java
gives increased flexibility without losing efficiency. It is shown that the
Java Sparse Array has the same flexibility as other object-oriented data
structures.
- Geir Gundersen and Trond Steihaug
- Department of Informatics, University of Bergen, Norway
-
Trond.Steihaug@ii.uib.no
- Received December 10 2002, Comments to Authors March 8 2003
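The central idea, giving each row its own pair of native arrays so rows can grow or be replaced independently, can be sketched as follows. The method names are ours, and the layout is only a simplified reading of the scheme described in the abstract:

```java
// Simplified sketch of the Java Sparse Array layout: per-row native
// arrays of column indices and values (no global CRS-style offsets).
public class JsaDemo {
    // Linear scan of row `row` for column `col`; absent entries are 0.
    static double getEntry(int[][] cols, double[][] vals, int row, int col) {
        for (int k = 0; k < cols[row].length; k++)
            if (cols[row][k] == col) return vals[row][k];
        return 0.0; // structural zero
    }

    public static void main(String[] args) {
        // 3x3 sparse matrix: row i holds only its nonzeros
        int[][]    cols = { {0, 2}, {1}, {0} };
        double[][] vals = { {1.0, 5.0}, {2.0}, {4.0} };
        System.out.println(getEntry(cols, vals, 0, 2)); // 5.0
        System.out.println(getEntry(cols, vals, 2, 2)); // 0.0
    }
}
```

Because each row is an independent object, inserting a nonzero only requires reallocating that one row's arrays, which is the dynamic behavior the abstract contrasts with Compressed Row Storage.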
- C665(Performance Section): Performance and Scalability of
MPI on PC Clusters
- Abstract:The purpose of this paper is to compare the
communication performance and scalability of MPI communication routines on an
NT cluster, a Myrinet Linux cluster, an Ethernet Linux cluster, a Cray T3E-600,
and an SGI Origin 2000. All tests in this paper were run for various
numbers of processors and two message sizes. For most of the MPI tests used in
this paper, the T3E-600 and Origin 2000 outperform the NT cluster, the Myrinet
and Ethernet Linux clusters. In spite of the fact that the Cray T3E-600 is
about 5 years old, it performs best of all machines for most of the tests. For
mpi_bcast, mpi_allgather, and mpi_alltoall, the Myrinet Linux cluster
outperforms the NT cluster. For all other MPI collective routines, the NT
cluster outperforms the Myrinet Linux cluster. For all MPI collective routines,
the Myrinet Linux cluster performs significantly better than the Ethernet Linux
cluster.
- Glenn R. Luecke, Jing Yuan, Silvia Spanoyannis, Marina
Kraeva
- 292 Durham Center Iowa State University Ames, Iowa 50011,
USA
- grl@iastate.edu
- Received February 6 2001, Revised November 14 2002, Accepted
December 3 2002
- Full Paper:../CCPEwebresource/c665pemcs16/c665pemcspaper16.pdf
- C666: A role for Pareto optimality in mining performance
data
- Abstract:Improvements in performance modeling and in the
identification of computational regimes within software libraries are a
critical first step in developing software libraries that are truly agile with
respect to the application as well as to the hardware. It is shown here that
Pareto ranking, a concept from multi-objective optimization, can be an
effective tool for mining large performance datasets. The approach is
illustrated using software performance data gathered with both the public
domain LAPACK library and an asynchronous communication library based on the
IBM LAPI active message library.
- Joël M. Malard
- Pacific Northwest National Laboratory Battelle Boulevard, P.O.
Box 999 Richland, WA 99352
- JM.Malard@pnl.gov
- Received December 31 2002, Comments to Authors March 8 2003
- C667: A Performance Study of Job Management Systems
- Special Issue: International Workshop on Performance
Modeling, Evaluation, and Optimization of Parallel and Distributed Systems
(PMEO-PDS'02), held in conjunction with IEEE IPDPS 2002, April 15-19, 2002,
Fort Lauderdale, USA; editors Mohamed Ould-Khaoua and Geyong Min
- T. El-Ghazawi
- The George Washington University, George Mason University
- tarek@seas.gwu.edu
- Received: May 2002, Revised: Dec. 2002, Accepted: Dec. 2002
- C668: Supporting Bulk Synchronous Parallelism with a High
Bandwidth Optical Interconnect
- Special Issue: International Workshop on Performance
Modeling, Evaluation, and Optimization of Parallel and Distributed Systems
(PMEO-PDS'02), held in conjunction with IEEE IPDPS 2002, April 15-19, 2002,
Fort Lauderdale, USA; editors Mohamed Ould-Khaoua and Geyong Min
- I. Gourlay
- Informatics Research Institute, University of Leeds, Leeds LS2
9JT
-
iain@comp.leeds.ac.uk
- Received: May 2002, Revised: Oct. 2002, Accepted: Dec. 2002
- C669: Towards a More Realistic Comparative Analysis of
Multicomputer Networks
- Special Issue: International Workshop on Performance
Modeling, Evaluation, and Optimization of Parallel and Distributed Systems
(PMEO-PDS'02), held in conjunction with IEEE IPDPS 2002, April 15-19, 2002,
Fort Lauderdale, USA; editors Mohamed Ould-Khaoua and Geyong Min
- H. Sarbazi-Azad
- Department of Computing Science, University of Glasgow, Glasgow
G12 8QQ.
- azad@sharif.edu
- Received: May 2002, Revised: Nov. 2002, Accepted: Dec. 2002
- C670: RF-MVTC: An Efficient Risk-Free Multiversion
Concurrency Control Algorithm
- Special Issue: International Workshop on Performance
Modeling, Evaluation, and Optimization of Parallel and Distributed Systems
(PMEO-PDS'02), held in conjunction with IEEE IPDPS 2002, April 15-19, 2002,
Fort Lauderdale, USA; editors Mohamed Ould-Khaoua and Geyong Min
- A. Boukerche
- Department of Computer Sciences, University of North Texas
-
boukerche@cs.unt.edu
- Received: June 2002, Revised: Nov. 2002, Accepted: Dec. 2002
- C671: On the Performance of Circuit-Switched Networks in
the Presence of Correlated Traffic
- Special Issue: International Workshop on Performance
Modeling, Evaluation, and Optimization of Parallel and Distributed Systems
(PMEO-PDS'02), held in conjunction with IEEE IPDPS 2002, April 15-19, 2002,
Fort Lauderdale, USA; editors Mohamed Ould-Khaoua and Geyong Min
- G. Min
- Department of Computing, School of Informatics, University of
Bradford, Bradford, BD7 1DP, U.K.
- g.min@brad.ac.uk
- Received: May 2002, Revised: Dec. 2002, Accepted: Dec. 2002
- C672: Performance Evaluation of a Random-Walk-Based
Algorithm for Embedding Dynamically Evolving Trees in Hypercubic Networks
- Special Issue: International Workshop on Performance
Modeling, Evaluation, and Optimization of Parallel and Distributed Systems
(PMEO-PDS'02), held in conjunction with IEEE IPDPS 2002, April 15-19, 2002,
Fort Lauderdale, USA; editors Mohamed Ould-Khaoua and Geyong Min
- K. Li
- Department of Computer Science, State University of New York, New
Paltz, New York 12561-2499
-
li@mcs.newpaltz.edu
- Received: May 2002, Revised: Oct. 2002, Accepted: Dec. 2002
- C673: On the Scalability of Many-to-many Reliable Multicast
Sessions
- Special Issue: International Workshop on Performance
Modeling, Evaluation, and Optimization of Parallel and Distributed Systems
(PMEO-PDS'02), held in conjunction with IEEE IPDPS 2002, April 15-19, 2002,
Fort Lauderdale, USA; editors Mohamed Ould-Khaoua and Geyong Min
- W. Y. Yoon
- Samsung Advanced Institute of Technology, South Korea
-
wyyoon@sait.samsung.co.kr
- Received: June 2002, Revised: Dec. 2002, Accepted: Dec. 2002
- C674: One-to-All Broadcasting Scheme for Arbitrary Static
Interconnection Networks with Bursty Background Traffic
- Special Issue: International Workshop on Performance
Modeling, Evaluation, and Optimization of Parallel and Distributed Systems
(PMEO-PDS'02), held in conjunction with IEEE IPDPS 2002, April 15-19, 2002,
Fort Lauderdale, USA; editors Mohamed Ould-Khaoua and Geyong Min
- Demetres Kouvatsos
- Performance Modelling and Engineering Research Group, Department
of Computing, University of Bradford BD7 1DP, Bradford, UK
-
d.d.kouvatsos@scm.brad.ac.uk
- Received October 19 2002
- C675: Editorial on Benchmarking Special Issue
- Special Issue: International Workshop on Performance
Modeling, Evaluation, and Optimization of Parallel and Distributed Systems
(PMEO-PDS'02), held in conjunction with IEEE IPDPS 2002, April 15-19, 2002,
Fort Lauderdale, USA; editors Mohamed Ould-Khaoua and Geyong Min
- Mohamed Ould-Khaoua and Geyong Min
- Glasgow University; Bradford University
- mohamed@dcs.gla.ac.uk
and G.Min@Bradford.ac.uk
- Received February 16 2003
- C676: Toward the Design of a Parallel, Linear Algebraic
System
- Abstract:We describe a generic software framework for the
solution of linear algebraic systems, sufficient to support their use in a
parallel, iterative context. Some base assumptions about the structure of the
underlying problem and problem domain, as well as the initial distribution of
the parallel structures are described to provide a framework for the discussion
of an implementation using templated C++ code; however, the design has been
made sufficiently flexible that it could easily be applied to other
implementations of the base data structures.
- W. D. Turner and J. E. Flaherty
- Visualization and Computer Vision, GE Global Research,
Schenectady, NY 12301; Scientific Computation Research Center, and Department
of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180
- flahej@rpi.edu
- Received February 13 2003, Comments to Authors April 28 2003
- C677: Linda implementations in Java for concurrent systems
- Abstract: This paper surveys a number of the
implementations of Linda that are available in Java. It provides some
discussion of their strengths and weaknesses, and presents the results from
benchmarking experiments using a network of commodity workstations. Some
extensions to the original Linda programming model are also presented and
discussed, together with examples of their application to parallel processing
problems.
- G. C. Wells, A. G. Chalmers, and P. G. Clayton
- Department of Computer Science, Rhodes University, Grahamstown,
6140, South Africa; Department of Computer Science, University of Bristol,
Merchant Venturers Building, Bristol BS8 1UB, U.K.
- G.Wells@ru.ac.za
- Received February 13 2003, Comments to Authors April 28 2003
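The core Linda operations the paper surveys can be illustrated with a toy tuple space in plain Java. This sketch is our own and is not any of the surveyed implementations; it supports only out() and a tag-matched, blocking in():

```java
// Minimal tuple-space sketch in the spirit of Linda: out() deposits a
// tuple; in() blocks until a tuple whose first field matches the given
// tag is available, then removes and returns it.
import java.util.ArrayList;
import java.util.List;

public class TupleSpace {
    private final List<Object[]> space = new ArrayList<>();

    public synchronized void out(Object... tuple) {
        space.add(tuple);
        notifyAll(); // wake any in() waiting for a match
    }

    public synchronized Object[] in(Object tag) throws InterruptedException {
        while (true) {
            for (Object[] t : space)
                if (t.length > 0 && t[0].equals(tag)) {
                    space.remove(t);
                    return t;
                }
            wait(); // no match yet: block until out() adds a tuple
        }
    }

    public static void main(String[] args) throws InterruptedException {
        TupleSpace ts = new TupleSpace();
        new Thread(() -> ts.out("result", 42)).start();
        Object[] t = ts.in("result"); // blocks until the worker deposits
        System.out.println(t[1]); // 42
    }
}
```

Real Java Linda systems add rd() (non-destructive read), richer template matching, and distribution across hosts; the decoupling shown here, where producer and consumer never reference each other directly, is the property that makes the model attractive for the coordination problems the paper discusses.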
- C678: Studying Protein Folding on the Grid: Experiences
Using CHARMM on NPACI Resources under Legion
- Abstract:One benefit of a computational grid is the
ability to run high-performance applications over distributed resources simply
and securely. We demonstrated this benefit with an experiment in which we
studied the protein-folding process with the CHARMM molecular simulation
package over a grid managed by Legion, a grid operating system.
High-performance applications can take advantage of grid resources if the grid
operating system provides both low-level functionality as well as high-level
services. We describe the nature of services provided by Legion for
high-performance applications. Our experiences indicate that human factors
continue to play a crucial role in the configuration of grid resources,
underlying resources can be problematic, grid services must tolerate underlying
problems or inform the user, and high-level services must continue to evolve to
meet user requirements. Our experiment not only helped a scientist perform an
important study, but also showed the viability of an integrated approach such
as Legion's for managing a grid.
- Anand Natrajan, Nancy Wilkins-Diehr, Anthony D. Fox, Michael
Crowley, Marty A. Humphrey, Andrew S. Grimshaw, Charles L. Brooks III
- Department of Computer Science University of Virginia; San Diego
Supercomputing Center University of California at San Diego; Department of
Molecular Biology The Scripps Research Institute
-
an4m@cs.virginia.edu
- Received February 6 2003, Comments to Author February 6 2003,
Revised February 27 2003, Accepted March 2 2003
- C679: Abstracting Asynchronous Object Interaction using
First-Class Functions and Continuations in an Event-Driven System
- Abstract: The shift in focus of distributed computing from
local-area networks to geographically-distributed environments highlights the
need for new distributed programming models. In this work we explore the
combination of a single-threaded event-driven execution model with a small
number of concepts traditionally found in functional languages. We describe a
distributed programming system, called Orfeo, based on a simple yet powerful
procedural language and on asynchronous remote function calls. We discuss how
to use continuations and first-class function values to construct several
useful control abstractions, showing how a small set of powerful concepts can
provide the basis for a highly asynchronous and flexible distributed
programming system.
- Maria Julia de Lima, Noemi Rodriguez, Roberto Ierusalimschy
- PUC-Rio
-
noemi@inf.puc-rio.br
- Received March 14 2003
- C680: Parallel Object-Oriented Framework Optimization
- One of ten papers in Special Edition of Concurrency:
Practice and Experience dedicated to selected papers presented at the ninth
Compilers for Parallel Computers workshop held in Edinburgh, UK in June, 2001
(CPC 2001).
- Daniel P. Quinlan, Markus Schordan, Brian Miller and Markus
Kowarschik
- Center for Applied Scientific Computing, Lawrence Livermore
National Laboratory, Livermore CA USA; System Simulation Group, Department of
Computer Science, University of Erlangen-Nuremberg, Germany
- dquinlan@llnl.gov
- Received Sept 20 2001, Revised May 29 2002, Accepted March 14
2003
- C681: A compiler for multiple memory models
- One of ten papers in Special Edition of Concurrency:
Practice and Experience dedicated to selected papers presented at the ninth
Compilers for Parallel Computers workshop held in Edinburgh, UK in June, 2001
(CPC 2001).
- S. P. Midkiff, J. Lee and D.A. Padua
- School of Electrical and Computer Engineering, Purdue University;
School of Computer Science and Engineering, Seoul National University;
Department of Computer Science, University of Illinois at Urbana-Champaign
-
smidkiff@purdue.edu
- Received Sept 20 2001, Revised May 29 2002, Accepted July 1st
2002
- C682: Compiling Data-Parallel Programs for Clusters of
SMPs
- One of ten papers in Special Edition of Concurrency:
Practice and Experience dedicated to selected papers presented at the ninth
Compilers for Parallel Computers workshop held in Edinburgh, UK in June, 2001
(CPC 2001).
- Siegfried Benkner and Thomas Brandes
- Institute of Software Science, University of Vienna Austria;
Fraunhofer Institute for Algorithms and Scientific Computation, St. Augustin
Germany
- sigi@ieee.org
- Received Sept 20 2001, Revised May 29 2002, Accepted July 1st
2002
- C683: Group-SPMD programming with orthogonal processor
groups
- One of ten papers in Special Edition of Concurrency:
Practice and Experience dedicated to selected papers presented at the ninth
Compilers for Parallel Computers workshop held in Edinburgh, UK in June, 2001
(CPC 2001).
- Thomas Rauber, Robert Reilein and Gudula Rünger
- Fakultät für Mathematik und Physik, Universität Bayreuth,
95440 Bayreuth, Germany; Fakultät für Informatik, Technische Universität
Chemnitz, 09107 Chemnitz, Germany.
-
rauber@uni-bayreuth.de
- Received Sept 20 2001, Revised May 29 2002, Accepted March 14
2003
- C684: The Effect of Cache Models on Iterative Compilation
for Combined Tiling and Unrolling
- One of ten papers in Special Edition of Concurrency:
Practice and Experience dedicated to selected papers presented at the ninth
Compilers for Parallel Computers workshop held in Edinburgh, UK in June, 2001
(CPC 2001).
- P.M.W. Knijnenburg, T. Kisuki, K. Gallivan and M.F.P.
O'Boyle
- LIACS, Leiden University, Niels Bohrweg 1, 2333 CA Leiden, the
Netherlands peterk@liacs.nl Phone: +31 71 5277063 Fax: +31 71 5276985;
Department of Computer Science, Florida State University; Institute for
Computing Systems Architecture, Edinburgh University
- peterk@liacs.nl
- Received Sept 20 2001, Revised May 29 2002, Accepted March 14
2003
- C685: SWARP: A Retargetable Preprocessor for Multimedia
Instructions
- One of ten papers in Special Edition of Concurrency:
Practice and Experience dedicated to selected papers presented at the ninth
Compilers for Parallel Computers workshop held in Edinburgh, UK in June, 2001
(CPC 2001).
- Gilles Pokam, Stephane Bihan, Julien Simonnet, Francois
Bodin
- IRISA, Campus Universitaire de Beaulieu, 35042 Rennes Cedex,
France
- gpokam@irisa.fr
- Received Sept 20 2001, Revised May 29 2002, Accepted March 14
2003
- C686: Space-Time Mapping and Tiling: A Helpful
Combination
- One of ten papers in Special Edition of Concurrency:
Practice and Experience dedicated to selected papers presented at the ninth
Compilers for Parallel Computers workshop held in Edinburgh, UK in June, 2001
(CPC 2001).
- Martin Griebl, Peter Faber, Christian Lengauer
- Fakultät für Mathematik und Informatik, Universität Passau,
D-94030 Passau, Germany
-
griebl@fmi.uni-passau.de
- Received Sept 20 2001, Revised May 29 2002, Accepted March 14
2003
- C687: Data Partitioning-Based Parallel Irregular
Reductions
- One of ten papers in Special Edition of Concurrency:
Practice and Experience dedicated to selected papers presented at the ninth
Compilers for Parallel Computers workshop held in Edinburgh, UK in June, 2001
(CPC 2001).
- Eladio Gutierrez, Oscar Plata, Emilio L. Zapata
- Department of Computer Architecture, University of Malaga
E-29071, Malaga, Spain
- eladio@ac.uma.es
- Received Sept 20 2001, Revised May 29 2002, Accepted March 14
2003
- C688: A Fast and Accurate Method for Evaluating the
Upper-Bound of Memory Performance
- One of ten papers in Special Edition of Concurrency:
Practice and Experience dedicated to selected papers presented at the ninth
Compilers for Parallel Computers workshop held in Edinburgh, UK in June, 2001
(CPC 2001).
- G. Fursin, M.F.P. O'Boyle, O. Temam, G. Watts
- ICSA Division of Informatics, University of Edinburgh JCMB,
King's Buildings Mayfield Rd, Edinburgh EH9 3JZ, UK; Paris South University,
France
- fgg@dcs.ed.ac.uk
- Received Sept 20 2001, Revised May 29 2002, Accepted March 14
2003
- C689: Managing distributed shared arrays in a
bulk-synchronous parallel programming environment
- One of ten papers in Special Edition of Concurrency:
Practice and Experience dedicated to selected papers presented at the ninth
Compilers for Parallel Computers workshop held in Edinburgh, UK in June, 2001
(CPC 2001).
- Christoph W. Kessler
- Programming Environments Laboratory (PELAB), Department of
Computer Science, Linköping University, S-581 83 Linköping, Sweden
- chrke@ida.liu.se
- Received Sept 20 2001, Revised May 29 2002, Accepted July 1st
2002
- C690: Introduction to Parallel Computer Compiler Special
Issue
- Editorial for papers in Special Edition of Concurrency:
Practice and Experience dedicated to selected papers presented at the ninth
Compilers for Parallel Computers workshop held in Edinburgh, UK in June, 2001
(CPC 2001).
- Michael O'Boyle
- ICSA Division of Informatics, University of Edinburgh JCMB,
King's Buildings Mayfield Rd, Edinburgh EH9 3JZ, UK
- mob@inf.ed.ac.uk
- Received March 14 2003, Accepted March 14 2003
- C691: A Static Mapping Heuristic for Mapping Parallel
Applications to Heterogeneous Computing Systems
- Abstract: To minimize the execution time of a parallel
application running on a heterogeneous distributed computing system, an
appropriate mapping scheme is needed to allocate the application tasks to the
processors. The general problem of mapping tasks to machines is a well-known
NP-hard problem, and several heuristics have been proposed to approximate its
optimal solution. In this paper we propose a static graph-based mapping
algorithm, called Heterogeneous Multi-phase Mapping (HMM), that permits a
suboptimal mapping of a parallel application onto a heterogeneous distributed
computing system by using a local search technique together with a tabu
search meta-heuristic. HMM allocates parallel tasks by exploiting the
information embedded in the parallelism forms used to implement an application.
We compare HMM with three different leading techniques and with an exhaustive
mapping algorithm. We also give an example of mapping a practical application,
where HMM demonstrated its usefulness. Experimental results show that HMM
performs well, demonstrating the applicability of our approach.
- Ranieri Baraglia, Renato Ferrini, Pierluigi Ritrovato
- Istituto di Scienza e Tecnologie dell'Informazione A. Faedo,
Consiglio Nazionale delle Ricerche Area della Ricerca di Pisa Via Moruzzi 1
56126 Pisa, Italy; Centro di Ricerca in Matematica Pura ed Applicata,
Università degli Studi di Salerno Ponte Don Melillo 84084 Fisciano
(Salerno), Italy.
-
ranieri.baraglia@cnuce.cnr.it
- Received March 24 2003
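- The local-search-plus-tabu-search idea described in this abstract can be
illustrated with a minimal sketch (this is not the authors' HMM algorithm; the
cost model, function names and parameters below are hypothetical):

```python
import random

def mapping_cost(mapping, task_cost, comm, speed, bandwidth):
    """Makespan proxy: compute cost scaled by processor speed, plus
    communication cost for task pairs placed on different processors."""
    compute = sum(task_cost[t] / speed[p] for t, p in enumerate(mapping))
    talk = sum(c / bandwidth for (a, b), c in comm.items()
               if mapping[a] != mapping[b])
    return compute + talk

def tabu_map(task_cost, comm, speed, bandwidth, iters=200, tenure=8, seed=0):
    """Local search with a tabu list: repeatedly move one task to another
    processor; undoing a recent move is forbidden for `tenure` iterations."""
    rng = random.Random(seed)
    n_tasks, n_procs = len(task_cost), len(speed)
    mapping = [rng.randrange(n_procs) for _ in range(n_tasks)]
    best = mapping[:]
    best_cost = mapping_cost(mapping, task_cost, comm, speed, bandwidth)
    tabu = {}  # (task, processor) -> iteration until which the move is tabu
    for it in range(iters):
        t = rng.randrange(n_tasks)
        candidates = [p for p in range(n_procs)
                      if p != mapping[t] and tabu.get((t, p), -1) < it]
        if not candidates:
            continue
        old = mapping[t]
        # greedy choice of the best non-tabu target processor for task t
        p = min(candidates, key=lambda q: mapping_cost(
            mapping[:t] + [q] + mapping[t + 1:],
            task_cost, comm, speed, bandwidth))
        mapping[t] = p
        tabu[(t, old)] = it + tenure  # forbid moving t straight back
        cost = mapping_cost(mapping, task_cost, comm, speed, bandwidth)
        if cost < best_cost:
            best, best_cost = mapping[:], cost
    return best, best_cost
```

Non-improving moves are accepted (only the best mapping seen is kept), which
is what lets a tabu search climb out of local minima that pure local search
would get stuck in.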
- C692: Concurrent Urban-Legends
- Abstract: This discussion addresses a number of urban
legends about concurrency in an attempt to separate the myth from the fact. To
dispel certain legends, we argue that:
- Concurrent != Parallel
- Coroutining != Concurrency
- Synchronization != Mutual Exclusion
- Dekker !< Peterson
- Concurrency != Library
- Inheritance Anomaly != Major Concurrency Problem
- Signalling != Hints
- Spurious Wakeup != Efficiency
- Hyman != Peer-Review Failure
Identifying and understanding the fundamental concepts
underlying concurrency is essential to the field. Equally important is not to
confuse sequential and concurrent concepts. Finally, approaches based solely on
efficiency are insufficient to justify a weak or difficult to use concurrent
concept or construct.
- Peter A. Buhr and Ashif S. Harji
- University of Waterloo 200 University Ave. West Waterloo,
Ontario, CANADA, N2L 3G1
-
pabuhr@plg2.math.uwaterloo.ca
- Received March 24 2003
- C693: A Cache-Efficient Implementation of the Lattice
Boltzmann Method for the Two-Dimensional Diffusion Equation
- Abstract:The lattice Boltzmann method is an important
technique for the numerical solution of partial differential equations because
it has nearly ideal scalability on parallel computers for many applications.
However, to achieve the scalability and speed potential of the lattice
Boltzmann technique, the issues of data reusability in cache-based computer
architectures must be addressed. Utilizing the two-dimensional diffusion
equation, T_t = μ(T_xx + T_yy), this paper
examines cache optimization for the lattice Boltzmann method in both serial and
parallel implementations. In this study, speedups due to cache optimization were
found to be 1.9 to 2.5 for the serial implementation and 3.6 to 3.8 for the
parallel case in which the domain decomposition was optimized for stride-one
access. In the parallel non-cached implementation the method of domain
decomposition (horizontal or vertical) used for parallelization did not
significantly affect the compute time. In contrast the cache-based
implementation of the lattice Boltzmann method was significantly faster when
the domain decomposition was optimized for stride-one access. Additionally, the
cache-optimized lattice Boltzmann method in which the domain decomposition was
optimized for stride-one access displayed superlinear scalability on all
problem sizes as the number of processors was increased.
- A. C. Velivelli and K. M. Bryden
- Department of Mechanical Engineering Iowa State University, Ames,
IA 50011
-
kmbryden@iastate.edu
- Received March 29 2003
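- The lattice Boltzmann update for the diffusion equation T_t = μ(T_xx + T_yy)
consists of a local collision step followed by streaming along each lattice
direction. Below is a minimal D2Q4 sketch with periodic boundaries, an
illustration of the general scheme rather than the paper's cache-optimized
implementation (the relaxation time tau controls the diffusivity):

```python
import numpy as np

def lbm_diffusion_step(f, tau):
    """One collision + streaming step of a D2Q4 lattice Boltzmann scheme
    for the 2-D diffusion equation. f has shape (4, ny, nx); the four
    directions are +x, -x, +y, -y with equal weights 1/4."""
    T = f.sum(axis=0)                  # macroscopic temperature
    feq = 0.25 * T                     # equilibrium at zero velocity
    f += (feq - f) / tau               # BGK relaxation (collision)
    # streaming: rolling along the last axis touches contiguous memory
    f[0] = np.roll(f[0],  1, axis=1)   # +x
    f[1] = np.roll(f[1], -1, axis=1)   # -x
    f[2] = np.roll(f[2],  1, axis=0)   # +y
    f[3] = np.roll(f[3], -1, axis=0)   # -y
    return f
```

Because streaming in x moves contiguous rows of memory, the data layout
determines whether the inner loops enjoy stride-one access; the paper's point
is that the domain decomposition should be chosen to preserve exactly this
property.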
- C694: Finding stale-value errors in concurrent programs
- Abstract:Concurrent programs can suffer from many types of
errors, not just the well-studied problems of deadlocks and simple race
conditions on variables. This paper addresses a kind of race condition that
arises from reading a variable whose value is possibly out-of-date. The paper
introduces a simple technique for detecting such stale values, and reports on
the encouraging experience with a compile-time checker that uses the
technique.
- Michael Burrows and K. Rustan M. Leino
- Compaq Systems Research Center (Currently Microsoft
Research)
-
leino@microsoft.com
- Received May 1 2003
Submitted in 2001 (top)
- Here are instructions for
Grid Computing Environment Special Issue (CLOSED) due Mid July 2001 and here
are submitted papers.
- Here are instructions for Java
Grande/ISCOPE Special Issue (CLOSED) due August 31 2001 and
here are papers
- C496: Open Consensus
- Abstract: This paper presents the abstraction of open
consensus and argues for its use as an effective component for building
reliable agreement protocols in practical asynchronous systems where processes
and links can crash and recover. The specification of open consensus has a
decoupled, on-demand and re-entrant behaviour that makes its use very
efficient, especially in terms of forced logs, which are known to be major
sources of overhead in distributed systems. We illustrate the use of open
consensus as a basic building block to develop a modular, yet efficient, total
order broadcast protocol. Finally, we describe our Java implementation of our
open consensus abstraction and we convey our efficiency claims through some
practical performance measures.
- Romain Boichat, Svend Frølund, Rachid Guerraoui
- Swiss Federal Institute of Technology, CH 1015, Lausanne and
Hewlett-Packard Laboratories, 1501 Page Mill Rd, Palo Alto
- Email: Romain.Boichat@epfl.ch
- Received January 17 2001, Comments to Authors May 8 2001,
Accepted May 19 2001
- C497: Java for High-Performance Computing
- Abstract:There has been an increasing research interest in
extending the use of Java towards high-performance demanding applications such
as scalable web servers, multimedia applications, and large-scale scientific
applications. However, given the low performance of current Java
implementations, these application domains pose new challenges to both the
application designer and systems developer. In this paper we describe and
classify several important proposals and environments that tackle Java's
performance bottlenecks in order to make the language an effective option for
high-performance computing. We further survey most significant performance
issues while exposing the potential benefits and limitations of current
solutions in such a way that a framework for future research efforts can be
established. We show that most of the proposed solutions can be classified
according to some combination of the three basic parameters: the model adopted
for inter-process communication, language extensions, and the implementation
strategy. In addition, we examine other relevant issues, such as
interoperability, portability, and garbage collection.
- M. Lobosco, C. Amorim, and O. Loques
- Instituto de Computação Universidade Federal
Fluminense Niterói, Rio de Janeiro, Brazil
- Email: loques@ic.uff.br
- Received January 15 2001, Comments to authors May 28 2001,
Accepted July 24 2001
- C499: Data Movement and Control Substrate for Parallel
Adaptive Applications
- Abstract:In this paper, we present the Data Movement and
Control Substrate (DMCS), that implements one-sided low-latency communication
primitives for parallel adaptive and irregular computations. DMCS is built on
top of low-level, vendor-specific communication subsystems such as LAPI for IBM
SP machines and widely available libraries like MPI for clusters of
workstations and PCs. DMCS adds a small overhead to the communication
operations provided by the lower communication system. In return DMCS provides
a flexible and easy to understand application program interface for one-sided
communication operations including put-ops and remote procedure calls.
Furthermore, DMCS is designed so that it can be easily ported and maintained by
non-experts.
- Jeffrey Dobbelaere, Kevin Barker, Nikos Chrisochoides, and Demian
Naves, Keshav Pingali
- Computer Science Department College of William & Mary
Williamsburg, VA 23185.
- Computer Science Department Cornell University Ithaca, NY
14853-3801
- Email: nikos@plato.cs.wm.edu
- Received 11 March 2001, Comments to Authors July 24 2001,
Accepted October 28 2001
- C500: (ACES Special Issue) Parallel visualization of
gigabyte datasets in GeoFEM
- Abstract:Parallel visualization of large datasets in
GeoFEM is described. Our visualization subsystem supports concurrent
visualization with computation, and outputs a simplified small graphic
primitive set, which can be displayed by a basic viewer on clients. This
subsystem provides many kinds of parallel visualization algorithms for the
users to visualize their data, ranging from scalar and vector to tensor fields.
Scalar field
topology analysis, flow field topology and semi-automatic parameter design are
employed to improve the quality of visualization results. We also present a
simplification algorithm to reduce the number of output graphic primitives,
accounting for both shape attribute and color attribute. The experimental
results show the effectiveness of our subsystem.
- Issei Fujishiro, Yuriko Takeshima, Li Chen, Hiroko Nakamura , and
Yasuko Suzuki
- Ochanomizu University, Tokyo, Japan
- Research Organization for Information Science & Technology,
Tokyo, Japan
- State Key Lab. of CAD&CG, Zhejiang University, Hangzhou,
310027, P.R. China.
- Email: chen@tokyo.rist.or.jp
- Received 26 February 2001, Comments to Authors August 5 2001,
Accepted November 9 2001
- C501: (ACES Special Issue)Parallelization of a Large-Scale
Computational Earthquake Simulation Program
- Abstract:Here we detail both the methods and preliminary
results of first efforts to parallelize two General Earthquake Model
(GEM)-related codes: 1) a relatively simple data mining procedure based on a
Genetic Algorithm; and 2) the Virtual California simulation of GEM . These
preliminary results, using a simple, heterogeneous processor system, existing
freeware, and with an extremely low cost of both manpower and hardware dollars,
motivate us to pursue more ambitious work with considerably larger-scale computer
earthquake simulations of southern California. The GEM computational problem,
which is essentially a Monte Carlo simulation, is well suited to optimization
on parallel computers, and we outline how we are proceeding in implementing
this new software architecture.
- K.F. Tiampo , J.B. Rundle , S. Gross and S. McGinnis
- Colorado Center for Chaos & Complexity, CIRES, University of
Colorado, Boulder, CO, 80309, USA
- Email: kristy@caldera.colorado.edu
- Received 23 February 2001, Comments to Authors August 5 2001,
Accepted November 3 2001
- C502: (ACES Special Issue)Parallel Iterative Solvers for
Unstructured Grids using Directive/MPI Hybrid Programming Model for GeoFEM
Platform on SMP Cluster Architectures
- Abstract: In this study, an efficient parallel iterative
method for unstructured grids has been developed for SMP (shared-memory
symmetric multiprocessor) cluster architectures on the GeoFEM platform. A
3-level hybrid parallel programming model has been applied: 1. message passing
for inter-SMP-node communication, 2. loop directives for intra-SMP-node
parallelization and 3. vectorization for each PE (processing element). Simple
3D elastic linear problems with more than 10^8 DOFs have been solved by 3x3
block ICCG(0) with additive Schwarz domain decomposition and PDJDS/CM-RCM
(Parallel Descending-order Jagged Diagonal Storage / Cyclic Multicolor-Reverse
Cuthill-McKee) re-ordering on 16 SMP nodes of a Hitachi SR8000, and 20 GFLOPS
performance has been obtained. The PDJDS/CM-RCM reordering method provides
excellent vector and parallel performance on SMP nodes. This re-ordering is
essential for parallelization of forward/backward substitution in IC/ILU
factorization with global data dependency. The developed method was also tested
on a NEC SX-4 and attained 969 MFLOPS (48.5% of peak performance) using a
single processor. The additive Schwarz domain decomposition method provided
robustness to GeoFEM's parallel iterative solvers with localized
preconditioning.
- Kengo Nakajima and Hiroshi Okuda
- Department of Computational Earth Sciences, Research Organization
for Information Science and Technology (RIST), Tokyo, Japan
- Department of Quantum Engineering and Systems Science, The
University of Tokyo, Tokyo, Japan
- Email: nakajima@tokyo.rist.or.jp
- Received 27 February 2001, Comments to Authors August 5 2001,
Accepted October 26 2001
- C503: (ACES Special Issue)Finite Element Modeling of
Multibody Contact and Its Application to Active Faults
- Abstract:Earthquakes have been recognized as resulting
from a stick-slip frictional instability along the faults between deformable
rocks. An arbitrarily shaped contact element strategy, named the node-to-point
contact element strategy, is proposed to handle the frictional contact between
deformable bodies with stick and finite frictional slip and extended to
simulate the active faults in the crust with a more general nonlinear friction
law. Also introduced is an efficient contact search algorithm for contact
problems among multiple small and finite deformation bodies. Moreover, the
efficiency of the parallel sparse solver for the nonlinear friction contact
problem is investigated. Finally, a model for the plate movement in the
northeast zone of Japan under gravitation is taken as an example to be analyzed
with different friction behaviors.
- H.L. Xing and A. Makinouchi
- Materials Fabrication Lab, The Institute of Physical and Chemical
Research (RIKEN) 2-1 Hirosawa, Wako, Saitama, 351-0198, Japan
- Email: xing@postman.riken.go.jp
- Received 28 February 2001, Comments to Authors August 5 2001,
Accepted October 26 2001
- C504: (ACES Special Issue) Effective Adaptation Technique
for Hexahedral Mesh
- Abstract: This paper describes simple and effective
adaptive techniques for the finite element method. The proposed refinement
method is based on a modified octree division. This modified octree-based
method can be applied to an arbitrary hexahedral unstructured mesh. Hex-R is
the name of our implementation, and it refines the mesh in cooperation with a
finite element viewer, GPPView. The refined mesh retains the topology of the
original coarse mesh; therefore one can easily use the proposed method for
multigrid generation. Finally, the construction of an adaptive method using
Hex-R is described.
- Yoshitaka Wada
- Department of Computational Earth Sciences, Research Organization
for Information Science and Technology (RIST), Tokyo, Japan
- Email: wada@tokyo.rist.or.jp
- Received March 2 2001, Comments to Authors August 5 2001,
Accepted October 26 2001
- C505: (ACES Special Issue)Thermal Convection Analysis in a
Rotating Shell by a Parallel FEM - Development of a Thermal-Hydraulic Subsystem
of GeoFEM
- Abstract: We carry out a numerical simulation of thermally
driven convection in a rotating spherical shell modeled on the Earth's outer
core, using the Thermal-Hydraulic subsystem of GeoFEM, which provides a
parallel FEM platform. This simulation aims at understanding the origin of the
geomagnetic field and the dynamics of the fluid in the Earth's outer core. We
solve the three-dimensional, time-dependent behaviour of a Boussinesq fluid in
a rotating spherical shell under the effects of self-gravity and the Coriolis
force. In this simulation, tri-linear hexahedral elements are applied for the
spatial discretisation, and the 2nd-order Adams-Bashforth scheme is applied for
time integration of the temperature and velocity. To satisfy mass conservation,
a parallel iterative solver of GeoFEM is applied for solving the pressure and
the correction of the velocity fields. To verify our simulation code, results
of the simulation are compared with the same analysis by the spectral method.
In these results, the outline of the convection is almost the same in the two
cases; that is, three pairs of convection columns are formed, and these columns
propagate westward in a quasi-steady state. However, the kinetic energy
averaged over the shell is about 93% of that in the spectral-method case, and
the drift frequency of the columns in the GeoFEM case is larger than in the
spectral-method case.
- Hiroaki Matsui and Hiroshi Okuda
- Department of Research for Computational Earth Science, Research
Organization for Information Science & Technology (RIST), Tokyo
- Department of Quantum Engineering and System Science, The
University of Tokyo, Tokyo, Japan
- Email: matsui@tokyo.rist.or.jp
- Received March 3 2001, Comments to Authors August 5 2001,
Accepted October 26 2001
- C506: (ACES Special Issue) Performance Optimization of
GeoFEM on Various Computer Architecture
- Abstract: Research carried out in the GeoFEM team gives us the
prospect that GeoFEM can achieve good parallel performance on various computer
architectures. In this work we focus on single-processor performance, and
investigate a common data structure and coding style that let GeoFEM run at
high performance across architectures.
Test coding for the solver part of the structure and fluid
analysis codes: the data structure and coding style were evaluated on scalar,
vector and pseudo-vector architectures. A new data structure and a new
direct-access scheme were introduced in the fluid analysis solver. As a result,
21% of peak performance was obtained on the pseudo-vector architecture, and
scalar execution was as fast as pseudo-vector execution; 25% of peak
performance was obtained on the vector architecture. A new direct-access
scheme was also introduced in the structure code solver, yielding 27% of peak
performance on the pseudo-vector architecture and 34% of peak performance on
the vector architecture.
Test code for the matrix assembly part of the structure analysis
code: coding to remove dependencies was completed and performance was
evaluated on vector and scalar machines. 736.8 MFlops was obtained for the
matrix assembly process on the SX-4; 900.7 MFlops was obtained for the whole
test code on the SX-4, and 2.06 GFlops on the VPP5000 (peak: 9.6 GFlops).
124 MFlops was obtained for the matrix assembly process on an Alpha system
(Alpha 21164, 533 MHz).
- Kazuo Minami
- Department of Computational Earth Sciences, Research Organization
for Information Science and Technology (RIST), Tokyo, Japan
- Email: minami@tokyo.rist.or.jp
- Received March 5 2001, Comments to Authors August 5 2001,
Accepted October 27 2001
- C507: (ACES Special Issue)Parallel Multilevel Iterative
Linear Solvers with Unstructured Adaptive Grids for Simulations in Earth
Science
- Abstract: Preconditioned iterative solvers, such as the IC
(Incomplete Cholesky) or ILU (Incomplete LU) factorization methods, are among
the most powerful choices for large-scale scientific computation. But in these
methods the number of iterations until convergence increases as the problem
size becomes larger. In multigrid solvers, the rate of convergence is
independent of problem size and the number of iterations remains fairly
constant. Multigrid is also a good preconditioner for Krylov iterative
solvers. In this study, a multigrid-preconditioned conjugate gradient
iterative method (MGCG) on parallel computers has been developed and applied
to the Poisson equation in the region between dual sphere surfaces, on
semi-unstructured prismatic grids generated adaptively. Moreover, this
procedure has been applied to grids with local refinement. Computational
results on a Hitachi SR2201 using up to 128 processors show good scalability
of the developed method.
- Kengo Nakajima
- Department of Computational Earth Sciences, Research Organization
for Information Science and Technology (RIST), Tokyo, Japan
- Email: nakajima@tokyo.rist.or.jp
- Received 19 March 2001, Comments to Authors August 5 2001,
Accepted October 27 2001
- C508: Comparing Windows NT, Linux, and QNX as the Basis for
Cluster Systems
- Abstract:Clusters use commodity hardware and software
components to provide an environment for parallel processing. A major issue in
the development of a cluster system is the choice of the operating system that
will run on each node. We compare three alternatives: Windows NT, Linux, and
QNX, a real-time microkernel. The comparison is based on expressive
power, performance, and ease-of-use metrics. The result is that none of these
systems has a clear advantage over the others in all the metrics, but that each
has its strong and weak points. Thus any choice of a base system will involve
some technical compromises, but not major ones.
- Avi Kavas, Dror G. Feitelson
- School of Computer Science and Engineering The Hebrew University,
91904 Jerusalem, Israel
- Email: feit@cs.huji.ac.il
- Received 11 February 2001, Comments to authors May 28 2001,
Accepted July 24 2001
- C509: (ACES Special Issue)Parallel Simulation System for
Earthquake Generation: Fault Analysis Modules and Parallel Coupling
Analysis
- Abstract:Solid earth simulations have recently been
developed to address issues such as natural disasters, global environmental
destruction and the conservation of natural resources. The simulation of solid
earth phenomena involves the analysis of complex structures including strata,
faults, and heterogeneous material properties. Simulation of the generation and
cycle of earthquakes is a particularly important solid earth phenomenon, but
such a simulation requires analysis of complex fault dynamics. GeoFEM (Iizuka et
al. , 1999) is a parallel finite element analysis system intended for problems
of solid earth field phenomena. This paper shows recent research of GeoFEM for
simulation of earthquake generation and cycles.
Key words: Solid earth simulations, Generation and
cycle of earthquakes, GeoFEM, Parallel finite element analysis.
- Mikio Iizuka, Daigo Sekita, Hisashi Suito, Mamoru Hyodo, Kazuro
Hirahara, David Place, Peter Mora, Osamu Hazama, and Hiroshi Okuda
- Research Organization for Information Science & Technology; Mitsubishi
Research Institute, Inc.; Earth and Planetary Sciences, Nagoya University;
QUAKES, Department of Earth Sciences, The University of Queensland; Yokohama
National University; Department of Quantum Engineering and Systems Science,
The University of Tokyo
- Email: iizuka@tokyo.rist.or.jp
- Received 23 April 2001, Comments to Authors August 5 2001,
Accepted October 27 2001
- C514: Parallel Dynamic Scheduling in a Cluster of
Computers
- Abstract:The parallelization of complex planning and
control problems arising in diverse application areas in the industrial,
services, and commercial environments not only allows the determination of
control variables in the required times but also improves the performance of
the control procedure as more processors are involved in the execution of the
parallel program. In this paper we describe a scheduling application in a water
supply network to demonstrate the benefits of parallel processing. The
procedure we propose combines dynamic programming with genetic algorithms and
time series prediction in order to solve problems in which decisions are made
in stages, and the states an control belong to a continuous space. Taking into
account the computational complexity of these applications and the time
contraints that are usually imposed, the procedure has been implemented by a
parallel program in a cluster of computers, an inexpe sive and widely extended
platform that can make parallelism a practical means of tackling complex
problems in many different environments.
- M. Damas, M. Salmeron, J. Ortega, G. Olivares, H. Pomares
- Department of Computer Architecture and Computer Technology,
University of Granada. Facultad de Ciencias. Campus Fuentenueva s/n. E-18071,
Granada, Spain
- Email: mdamas@atc.ugr.es
- Received 13 November 2000, Comments to authors June 2 2001,
Accepted July 24 2001
- C515: A Parallel Algorithm for Static Slicing of Concurrent
Programs
- Abstract:Slicing of concurrent programs is a
compute-intensive task. To speed up the slicing process, we develop a parallel
algorithm. For this purpose we use a Concurrent Control Flow Graph (CCFG) as
the intermediate representation. We use a network of communicating processes to
develop our parallel algorithm. We have implemented our parallel algorithm and
the experimental results appear promising.
- D. Goswami, R. Mall
- Department of Computer Science and Engineering, Indian Institute
of Technology, Kharagpur INDIA
- Email: diganta@cse.iitkgp.ernet.in
- Received 10 May 2001; comments to Authors August 5 2001
- C516: An Analysis of VI Architecture Primitives in Support
of Parallel and Distributed Communication
- Abstract:We present the results of a detailed study of the
Virtual Interface (VI) paradigm as a communication foundation for a distributed
computing environment. Using Active Messages and the Split-C global memory
model, we analyze the inherent costs of using VI primitives to implement these
high-level communication abstractions. We demonstrate a minimum mapping cost
(i.e. the host processing required to map one abstraction to a lower
abstraction) of 5.4 microsec for both Active Messages and Split-C using 4-way
550 MHz Pentium III SMPs and the Myrinet network. We break down this cost to
use of individual VI primitives in supporting flow control, buffer management
and event processing and identify the completion queue as the source of the
highest overhead. Bulk transfer performance plateaus at 44 MB/sec for both
implementations due to the addition of fragmentation requirements. Based on
this analysis, we present the implications for the VI successor, Infiniband.
- Andrew Begel, Philip Buonadonna, David E. Culler, David Gay
- University of California, Berkeley
- Email: philipb@cs.berkeley.edu
- Received 9 May 2001; Comments to Authors August 19 2001, Accepted
September 15 2001
- C517: Automatic Determination of Matrix-Blocks
- Abstract:Many sparse matrices have a natural block
structure, for instance arising from the discretisation of a physical domain.
We give an algorithm for finding this block structure from the matrix sparsity
pattern. This algorithm can be used for instance in iterative solver libraries
in cases where the user does not or can not pass this block structure
information to the software. The block structure can then be used as a basis
for domain-decomposition based preconditioners.
- Victor Eijkhout
- Department of Computer Science; University of Tennessee,
Knoxville, TN 37996
- Email: eijkhout@cs.utk.edu
- Received 5 June 2001, Comments to Authors October 28 2001
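- A simple instance of the idea, recovering row blocks by grouping consecutive
rows with identical sparsity patterns, can be sketched as follows (a minimal
illustration under that assumption, not the paper's algorithm):

```python
def find_blocks(pattern):
    """Group consecutive rows that share an identical sparsity pattern.
    `pattern` is a list of sets: pattern[i] holds the column indices of
    the nonzeros in row i. Returns a list of (start_row, size) blocks."""
    blocks = []
    start = 0
    for i in range(1, len(pattern)):
        if pattern[i] != pattern[start]:
            blocks.append((start, i - start))  # close the current block
            start = i
    if pattern:
        blocks.append((start, len(pattern) - start))
    return blocks
```

The detected row blocks can then seed a domain-decomposition preconditioner,
which is the use case the abstract has in mind when the user cannot supply the
block structure explicitly.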
- C518: A framework for high-performance matrix
multiplication based on hierarchical abstractions, algorithms and optimized
low-level kernels
- Abstract:Despite extensive research, optimal performance
has not easily been available previously for matrix multiplication (especially
for large matrices) on most architectures because of the lack of a structured
approach and the limitations imposed by matrix storage formats. A simple but
effective framework is presented here that lays the foundation for building
high-performance matrix-multiplication codes in a structured, portable and
efficient manner. The resulting codes are validated on three different
representative RISC and CISC architectures on which they significantly
outperform highly optimized libraries such as ATLAS and other competing
methodologies reported in the literature. The main component of the proposed
approach is a hierarchical storage format that efficiently generalizes the
applicability of the memory hierarchy friendly Morton ordering to
arbitrary-sized matrices. The storage format supports polyalgorithms, which are
shown here to be essential for obtaining the best possible performance for a
range of problem sizes. Several algorithmic advances are made in this paper,
including an oscillating iterative algorithm for matrix multiplication and a
variable recursion cutoff criterion for Strassen's algorithm. The authors
expose the need to standardize linear algebra kernel interfaces, distinct from
the BLAS, for writing portable high-performance code. These kernel routines
operate on small blocks that fit in the L1 cache. The performance advantages
of the proposed framework can be effectively delivered to new and existing
applications through the use of object-oriented or compiler-based
approaches.
- Vinod Valsalam and Anthony Skjellum
- High Performance Computing Laboratory, Department of Computer
Science, Mississippi State University, MS 39762.
- Email: tony@hpcl.cs.msstate.edu
- Received June 1 2001, Comments to Authors October 28 2001,
Accepted November 21 2001
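The Morton ordering that C518 generalizes maps a two-dimensional block index to a one-dimensional offset by interleaving the bits of the row and column indices, keeping nearby blocks close in memory. A minimal sketch of that index computation (illustrative only, not the authors' code):

```java
public class MortonOrder {
    // Interleave the bits of row r and column c to produce the
    // Morton (Z-order) index; assumes non-negative 16-bit coordinates.
    static long mortonIndex(int r, int c) {
        long z = 0;
        for (int b = 0; b < 16; b++) {
            z |= (long) ((c >> b) & 1) << (2 * b);      // column bits at even positions
            z |= (long) ((r >> b) & 1) << (2 * b + 1);  // row bits at odd positions
        }
        return z;
    }

    public static void main(String[] args) {
        // Blocks along the Z-curve: (0,0)->0, (0,1)->1, (1,0)->2, (1,1)->3
        System.out.println(mortonIndex(1, 1)); // prints 3
    }
}
```

Recursive blocking along this curve is what makes the layout friendly to every level of the memory hierarchy at once.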
- C519: Parallel Implementation of the Fluid Particle Model
for Simulating Complex Fluids in the Mesoscale
- Abstract:Dissipative particle dynamics (DPD) and its
generalization - the fluid particle model (FPM) - represent the fluid
particle approach for simulating fluid-like behavior in the mesoscale.
Unlike particles in the molecular dynamics (MD) method, a fluid
particle can be viewed as a droplet consisting of liquid
molecules. In FPM, fluid particles interact by both central and
non-central, short-range forces with conservative, dissipative and Brownian
character. In comparison to MD, the FPM method in 3-D requires two to three
times more memory and three times more communication overhead. The
computational load per step per particle is comparable to MD, owing to the
shorter interaction range allowed between fluid particles than between MD atoms. The
classical linked-cells technique and decomposing the computational box into
strips allow for rapid modifications of the code and for implementing non-cubic
computational boxes. We show that the efficiency of the FPM code depends
strongly on the number of particles simulated, geometry of the box, and the
computer architecture. We give a few examples from long FPM simulations
involving up to 8 million fluid particles and 32 processors. Results from FPM
simulations in 3-D of the phase separation in binary fluid and dispersion of
colloidal slab are presented. The scaling law for symmetric quenching in phase
separation has been properly reconstructed. We also show that the
microstructure of dispersed fluid depends strongly on the contrast between
kinematic viscosities of this fluid phase and the bulk phase. This FPM code can
be applied for simulating mesoscopic flow dynamics in capillary pipes or
critical flow phenomena in narrow blood vessels.
- Krzysztof Boryczko, Witold Dzwinel, David A. Yuen
- AGH Institute of Computer Science, al. Mickiewicza 30, 30-059,
Kraków, Poland, Minnesota Supercomputer Institute, University of
Minnesota, Minneapolis, Minnesota 55415-1227, USA
- Email: dwitek@msi.umn.edu
- Received June 25 2001, Comments to Authors October 28 2001,
Accepted November 10 2001
- C520: (ACES Special Issue)Interaction between Seismogenic
Body and Force-Supplying Body and Precursor in Far Field
- Abstract:The seismogenic body and the force-supplying wall rock
are modeled, respectively, by a rock sample and a pressure-transferring block
in fracture tests. Their interaction is used to study precursors in the far
field and ultra-far field. Observations of stress and strain are made on both.
The results show that mutation or disturbance precursors of strain appear in
the force-measuring sensor or on the pressure-transferring block before rock
fracture, in addition to the anomalous variations appearing at all observation
points on the sample. The observation points on the pressure-transferring
block are 25-90 cm away from the sample, which is 3-10 times the fracture
size. This suggests that precursor anomalies may appear at places far from the
seismogenic area. The mechanism producing far-field precursors is studied in a
preliminary way.
- Xu Zhaoyong, Mei Shirong, Yang Runhai, Wang Jinglai, Zhao
Jinming, Wang Bin, Hua Peizhong
- Seismological Bureau of Yunnan Province, Kunming, 650041, P. R.
of China, Center for Analysis and Prediction, CSB, Beijing, 100036, P. R. of
China
- Email: xuzhaoyong@netease.com
- Received Jan 12 2001, Withdrawn August 7 2001
- C521: A Java Extension Framework for Web-based
Simulation
- Abstract:We have designed a framework for web-based
simulation applications with a Java front-end. In this paper, we discuss the
architectural design and related considerations of this framework. We present
details of a multi-layer architecture, key components and their functions, and
the abstraction of the common features in the web-based simulation into this
framework. This framework is implemented in Java technology with
object-oriented design. This framework can incorporate an arbitrary legacy
simulation application in various domains by using configurable and extensible
interfaces. Our framework can be used to develop Java web-based simulation
services and to solve simulation problems.
- Tao Tang and Chu R. Wie
- State University of New York at Buffalo, Dept. of Electrical
Engineering, Bonner Hall, Buffalo, NY 14260
- Email: wie@eng.buffalo.edu
- Received 22 July 2001, Comments to Authors October 28 2001
- C522: WebSimMicro: Web-based Simulation of Microelectronic
Devices
- Abstract:We have developed a web-based Java front-end for
microelectronic device simulation applications. We used a framework that we
previously proposed for extending legacy simulation applications to the
web. We call the result WebSimMicro, Web-based Simulation of
Microelectronic devices. In this paper, we discuss the architecture of
WebSimMicro and the design approach using the proposed framework. This
WebSimMicro project contains a series of semiconductor device simulation
applets, including applets for metal-oxide-semiconductor transistors and
bipolar transistors which can be accessed on our website.
- Tao Tang and Chu R. Wie
- State University of New York at Buffalo, Dept. of Electrical
Engineering, Bonner Hall, Buffalo, NY 14260
- Email: wie@eng.buffalo.edu
- Received 22 July 2001, Comments to Authors October 28 2001
- C523: Towards a Common Development Framework for
Distributed Applications
- Abstract:The last decade has witnessed a convergence in
several concerns related to the development of distributed applications,
ranging from concurrency issues to architectural platforms. However, these
concerns are often addressed within the specificities of a particular
application area or a specific requirement such as performance. This paper is a
contribution towards defining a common development framework specifically for
distributed applications. It defines the salient characteristics of such
applications and discusses emerging concepts and techniques from a number of
disciplines namely concurrency theory, real-time systems, distributed systems
and parallel processing. It concludes on their possible role in an integrated
development framework that would be suitable for distributed
applications.
Author's Notes:
- The paper is more structured as a survey/tutorial than a
traditional research paper.
- Its main value is to broaden parallel processing researchers'
thinking into considering a much wider perspective which encompasses
distributed applications in general.
- It is not exhaustive nor complete (it can't be, given the
broad scope) but it has some value in explaining a lot of the jargon being used
these days.
- It is quite long, but structured in a way that makes it easy
to skip sections for readers who are familiar with a particular research
area.
- Fethi A. Rabhi
- School of Information Systems, University of New South Wales,
Sydney 2052, Australia.
- Email: f.rabhi@unsw.edu.au
- Received 23 July 2001, Comments to Authors October 28 2001
- C524: Lesser Bear: a lightweight process library for SMP
computers - scheduling mechanism without a lock operation
- Abstract:We have designed and implemented a lightweight
process (thread) library called Lesser Bear for SMP computers.
Lesser Bear has thread-level parallelism and high portability. Lesser Bear
executes threads in parallel by creating UNIX processes as virtual processors
and a memory-mapped file as a huge shared-memory space. To schedule threads in
parallel, the shared-memory space has been divided into working spaces for
each virtual processor, and a ready queue has been distributed. However, the
previous version of Lesser Bear sometimes requires a lock operation for
dequeueing. We therefore proposed a scheduling mechanism that does not require
a lock operation. To achieve this, each divided space forms a link topology
through the queue, and we use a lock-free algorithm for the queue operation.
This mechanism is applied to Lesser Bear and evaluated by experimental
results.
- Hisashi Oguma and Yasuichi Nakayama
- c/o Prof. Y. Nakayama, Department of Computer Science, The
University of Electro-Communications, 1-5-1 Chofugaoka, Chofu, Tokyo 182-8585
JAPAN
- oguma-h@igo.cs.uec.ac.jp
- Received 30 July 2001, Comments to Authors October 28 2001,
Accepted November 22 2001
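Lock-free queue operations of the kind C524 relies on are typically built from compare-and-set (CAS) retry loops instead of lock acquisition. The following is a minimal sketch of a non-blocking FIFO queue in that spirit (a generic CAS-based design, not Lesser Bear's actual structure, which links the divided spaces through the queue):

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal non-blocking FIFO queue: every update is a CAS retry loop,
// so no thread ever holds a lock while enqueueing or dequeueing.
public class LockFreeQueue<T> {
    private static final class Node<T> {
        final T value;
        final AtomicReference<Node<T>> next = new AtomicReference<>();
        Node(T value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> head;
    private final AtomicReference<Node<T>> tail;

    public LockFreeQueue() {
        Node<T> dummy = new Node<>(null);   // sentinel node
        head = new AtomicReference<>(dummy);
        tail = new AtomicReference<>(dummy);
    }

    public void enqueue(T value) {
        Node<T> node = new Node<>(value);
        while (true) {
            Node<T> last = tail.get();
            Node<T> next = last.next.get();
            if (next == null) {
                // Try to link the new node after the current tail.
                if (last.next.compareAndSet(null, node)) {
                    tail.compareAndSet(last, node);  // swing tail forward
                    return;
                }
            } else {
                tail.compareAndSet(last, next);      // help a lagging enqueue
            }
        }
    }

    public T dequeue() {
        while (true) {
            Node<T> first = head.get();
            Node<T> next = first.next.get();
            if (next == null) return null;           // queue is empty
            // Advance head past the sentinel; no lock is ever taken.
            if (head.compareAndSet(first, next)) return next.value;
        }
    }
}
```

A failed CAS simply means another virtual processor made progress, so the loop retries; this is what removes the lock from the dequeue path.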
- C525: Object-Oriented Distributed Computing Based on Remote
Class Reference
- Abstract:Java RMI and Jini provide effective mechanisms
for implementing a distributed computing system. Recently many numerical
libraries have been developed that take advantage of Java as an object-oriented
and portable language. The widely-used client-server method limits the extent
to which the benefits of the object-oriented approach can be exploited because
of the difficulties arising when a remote object is the argument or return
value of a remote or local method. In this paper this problem is solved by
introducing a data object that stores the data structure of the remote object
and related access methods. By using this data object, the client can easily
instantiate a remote object, and use it as the argument or return value of
either a local or remote method.
- Yan Huang, David W. Walker, and Omer F. Rana
- Department of Computer Science, Cardiff University, PO Box 916,
Cardiff, CF24 3XF, United Kingdom
- yan.huang@cs.cf.ac.uk
- Received August 2 2001, Comments to Authors November 11 2001,
Accepted January 20 2002
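The data-object idea in C525, capturing a remote object's state and access methods in a value that can cross the RMI boundary, can be sketched as a serializable snapshot class that travels by value and is usable to re-instantiate the object on either side. The names below are illustrative, not the paper's API:

```java
import java.io.Serializable;

// Illustrative "data object": a serializable snapshot of a remote
// object's state, passable by value as an argument or return value
// of either a local or a remote method.
public class MatrixData implements Serializable {
    final int rows, cols;
    final double[] values;               // row-major storage

    public MatrixData(int rows, int cols, double[] values) {
        this.rows = rows;
        this.cols = cols;
        this.values = values;
    }

    // Access method carried along with the data structure.
    public double get(int r, int c) {
        return values[r * cols + c];
    }
}
```

Because the snapshot is a plain serializable value, RMI can marshal it freely, sidestepping the difficulty of passing a remote reference as an argument or return value.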
- C526: A stochastic load balancing algorithm for
i-Computing
- Abstract:This paper presents a stochastic dynamic load
balancing algorithm for Internet Computing, a new type of distributed
computing involving heterogeneous workstations from different organizations on
the Internet. To model a practical environment, we assume the system comprises
heterogeneous, untrusted and non-dedicated workstations connected by a
non-dedicated network. Our algorithm uses the product of the average processing
time and the queue length of system jobs as the load index. Dynamic
communication delay is included in the execution cost calculation. The transfer
policy and the location policy are combined in a stochastic algorithm. State
information exchange is done via information feedback and mutual updating.
Simulations demonstrate that our algorithm outperforms conventional approaches
over a wide range of system parameters. These results are reconfirmed by
empirical experiments after we have implemented the algorithms on the DJM
global virtual machine.
- Yuk-Yin WONG, Kwong-Sak LEUNG, Kin-Hong LEE
- Computing Services Centre, City University of Hong Kong, Tat Chee
Avenue, KLN., Hong Kong; Department of Computer Science and Engineering Chinese
University of Hong Kong, Shatin, N.T., Hong Kong
- clevin.wong@cityu.edu.hk
- Received August 14 2001, Comments to Authors February 9 2002,
Accepted April 5 2002
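The load index used by C526 is simply the product of a workstation's average processing time and its queue length of system jobs. A hypothetical selection routine based on that index might look like the following (names are illustrative, not from the paper):

```java
// Sketch of the load index above: average processing time multiplied
// by the number of queued jobs; lower values mean a lighter load.
public class LoadIndex {
    static double loadIndex(double avgProcessingTimeMs, int queueLength) {
        return avgProcessingTimeMs * queueLength;
    }

    // Pick the least-loaded workstation among the candidates.
    static int leastLoaded(double[] avgTimes, int[] queueLengths) {
        int best = 0;
        for (int i = 1; i < avgTimes.length; i++) {
            if (loadIndex(avgTimes[i], queueLengths[i])
                    < loadIndex(avgTimes[best], queueLengths[best])) {
                best = i;
            }
        }
        return best;
    }
}
```

Combining time and queue length in one product lets a fast workstation with a long queue and a slow workstation with a short queue be compared on a single scale.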
- C527: Performance Comparison of CORBA and Message Queuing
under Diverse Network Constraints
- Abstract:Middleware provides standardised mechanisms that
can be used to simplify the construction of reliable and flexible distributed
applications over a network. Applying middleware to C3I problems can provide
simple integration of disparate information systems in the military
environment. Often, various middleware technologies are used as the
communication infrastructure and as a practical means of easing the
network-programming problem. CORBA and message queuing are undoubtedly two
major middleware
paradigms that are fundamentally distinct from one another. In order to exploit
middleware in the military environment, we need to understand the system
characteristics of these middleware technologies. A robust middleware
infrastructure is a necessary underpinning for effective distributed
collaboration across the range of stringent military networks. This paper
presents a performance comparison between these two technologies, under typical
test environments. We attempt to identify problems that arise from different
network conditions in the interaction between client and server. Our
experimental results provide guidance in assessing the feasibility of using
different middleware paradigms in various military computing environments.
- T. Andrew Au
- Defence Science and Technology Organisation, Australia.
- Andrew.Au@dsto.defence.gov.au
- Received October 14 2001(electronic version), Comments to Authors
February 9 2002
- C528: Managing Application Complexity in the SAMRAI
Object-Oriented Framework
- Abstract:A major challenge facing software libraries for
scientific computing is the ability to provide adequate flexibility to meet
sophisticated, diverse, and evolving application requirements. Object-oriented
design techniques are valuable tools for capturing characteristics of complex
applications in a software architecture. In this paper, we describe certain
prominent object-oriented features of the SAMRAI software library that have
proven to be useful in application development. SAMRAI is used in a variety of
applications and has demonstrated a substantial amount of code and design
re-use in those applications. This flexibility and extensibility is illustrated
with three different application codes. We emphasize two important features of
our design. First, we describe the composition of complex numerical algorithms
from smaller components which are usable in different applications. Second, we
discuss the extension of existing framework components to satisfy new
application needs.
- Richard D. Hornung, Scott R. Kohn
- Center for Applied Scientific Computing, Lawrence Livermore
National Laboratory
- hornung@llnl.gov
- Received October 16 2001 (electronic version), Accepted for
Special Issue
- C529: Grid Services for Earthquake Science
- Abstract:We describe an information system architecture
for the ACES (Asia-Pacific Cooperation for Earthquake Simulation) community. It
addresses several key features of the field - simulations at multiple scales
that need to be coupled together; real-time and archival observational data,
which needs to be analyzed for patterns and linked to the simulations; a
variety of important algorithms including partial differential equation
solvers, particle dynamics, signal processing and data analysis; a natural
three dimensional space (plus time) setting for both visualization and
observations; the linkage of field to real-time events both as an aid to crisis
management and to scientific discovery. We also address the need to support
education and research for a field whose computational sophistication is
increasing rapidly and spans a broad range. The information system assumes that
all significant data is defined by an XML layer which could be virtual but
whose existence ensures that all data is object-based and can be accessed and
searched in this form. The various capabilities needed by ACES are defined as
Grid Services, which are conformant with emerging standards and implemented
with different levels of fidelity and performance appropriate for the
application. Grid Services can be composed in a hierarchical fashion to address
complex problems. The real-time needs of the field are addressed by high
performance implementation of data transfer and simulation services; further
the environment is linked to real-time collaboration to support interactions
between scientists in geographically distant locations.
- Geoffrey Fox, Sung-Hoon Ko, Marlon Pierce, Ozgur Balsoy, Jake Kim,
Sangmi Lee, Kangseok Kim, Sangyoon Oh, Xi Rao, Mustafa Varank, Hasan Bulut,
Gurhan Gunduz, Xiaohong Qiu, Shrideep Pallickara, Ahmet Uyar, Choonhan Youn
- Florida State University, Indiana University, Syracuse
University
- Marlon.Pierce@wpafb.af.mil
- Received October 3 2001, Comments to Authors October 20 2001,
Accepted October 27 2001
- C530: The Virtual Service Grid: An Architecture for
Delivering High-End Network Services
- Abstract:This paper presents the design of a novel system
architecture, Virtual Service Grid (VSG), for delivering high performance
network services. The VSG is based on the concept of the Virtual Service which
provides location, replication, and fault transparency to clients accessing
remotely deployed high-end services. One of the novel features of the Virtual
Service is the ability to self-scale in response to client demand. The VSG
exploits network- and service-information to make adaptive dynamic replica
selection, creation, and deletion decisions. We describe the VSG architecture,
middleware, and replica management policies. We have deployed the VSG on a
wide-area Internet testbed to evaluate its performance. The results indicate
that the VSG can deliver efficient performance for a wide range of client
workloads, both in terms of reduced response time, and in utilization of system
resources.
- Jon B. Weissman and Byoung-Dai Lee
- Department of Computer Science and Engineering, University of
Minnesota,
- jon@cs.umn.edu
- Received October 16 2001 (electronic version),Comments to Authors
Dec 31 2001, Accepted January 20 2002
- C575: Advanced Concurrency Control in Java
- Abstract:Developing concurrent applications is not a
trivial task. As programs grow larger and become more complex, advanced
concurrency control mechanisms are needed to ensure that application
consistency is not compromised. Managing mutual exclusion on a per-object basis
is not sufficient to guarantee isolation of sets of semantically-related
actions. In this paper, we consider "atomic blocks", a simple and lightweight
concurrency control paradigm that enables arbitrary blocks of code to access
multiple shared objects in isolation. We evaluate various strategies for
implementing atomic blocks in Java, in such a way that concurrency control is
transparent to the programmer, isolation is preserved, and concurrency is
maximized. We discuss these concurrency control strategies and evaluate them in
terms of complexity and performance.
- Pascal A. Felber, Michael K. Reiter
- Bell Laboratories 600 Mountain Ave, Murray Hill, NJ 07974
- pascal@research.bell-labs.com
- Received October 1 2001, Comments to authors Dec 31 2001,
accepted January 20 2002
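One of the simpler strategies for implementing the atomic blocks of C575 is to acquire the monitors of all shared objects in a canonical global order before running the block, which preserves isolation across multiple objects and avoids deadlock. A sketch of that one idea follows (the paper evaluates several richer strategies; the ordering by identity hash is an illustrative choice):

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of an "atomic block" over multiple shared objects: lock every
// object in one global canonical order (here, identity hash code) so two
// concurrent blocks can never deadlock, run the block, then release.
public class AtomicBlock {
    public static void runAtomically(Runnable block, Object... shared) {
        Object[] ordered = shared.clone();
        Arrays.sort(ordered, Comparator.comparingInt(System::identityHashCode));
        lockAll(ordered, 0, block);
    }

    // Nest synchronized regions so all monitors are held while the
    // block runs and are released in reverse order afterwards.
    private static void lockAll(Object[] objs, int i, Runnable block) {
        if (i == objs.length) {
            block.run();
            return;
        }
        synchronized (objs[i]) {
            lockAll(objs, i + 1, block);
        }
    }
}
```

Note that identity hash codes can collide, so a production variant would need a tie-breaker; the sketch only illustrates the ordered-acquisition principle.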
- C576: A Quality of Service-based Framework for Creating
Distributed Heterogeneous Software Components
- Abstract:Component-based software development offers a
promising solution for taming the complexity found in today's distributed
applications. Today's and future software systems will certainly require
combining heterogeneous software components that are geographically dispersed.
For the successful deployment of such a software system, it is necessary that
its realization, based on assembling heterogeneous components, not only meets
the functional requirements, but also satisfies the non-functional criteria
such as the desired QoS (quality of service). In this paper, a framework based
on the notions of a meta-component model, a generative domain model and QoS
parameters is described. A formal specification based on Two-level Grammar is
used to represent these notions in a tightly integrated way so that QoS becomes
a part of the generative domain model. A simple case study is described in the
context of this framework.
- Rajeev R. Raje, Mikhail Auguston, Barrett R. Bryant, Andrew M.
Olson, Carol Burt
- IUPUI, Naval Postgraduate School Monterey, University of Alabama
at Birmingham, 2AB Inc. Alabama
- rraje@cs.iupui.edu
- Received October 8 2001, Comments to Authors March 18 2002,
Accepted June 10 2002
- C577: A comparison of concurrent programming and
cooperative multithreading
- Abstract:This paper presents a comparison of the
cooperative multithreading model with the general concurrent programming model.
It focuses on the execution time performance of a range of standard concurrent
programming applications. The overall results are mixed. In some cases,
programs written in the cooperative multithreading model outperform those
written in the general concurrent programming model. The contributions of this
paper are twofold. First, it presents a thorough analysis of the performances
of applications in the different models, i.e., to explain the criteria that
determine when a program in one model will outperform an equivalent program in
the other. Second, it examines the tradeoffs in writing programs in the
different programming styles. In some cases, better performance comes at the
cost of more complicated code.
- Aaron W. Keen, Takashi Ishihara, Justin T. Maris, Tiejun Li,
Eugene F. Fodor, and Ronald A. Olsson
- Department of Computer Science, University of California, Davis,
CA 95616 USA
- olsson@cs.ucdavis.edu
- Received October 10 2001, Comments to authors Dec 31 2001,
accepted February 9 2002
- C578: GridSim: A Toolkit for the Modeling and Simulation of
Global Grids
- Abstract:Clusters, Grids, and Peer-to-Peer (P2P) networks
have emerged as popular paradigms for next generation parallel and distributed
computing. They enable aggregation of distributed resources for solving
large-scale problems in science, engineering, and commerce. In Grid and P2P
computing environments, the resources are usually geographically distributed in
multiple administrative domains, managed and owned by different organizations
with different policies, and interconnected by wide-area networks or the
Internet. This introduces a number of resource management and application
scheduling challenges in the domain of security, resource and policy
heterogeneity, fault tolerance, continuously changing resource conditions, and
politics. The resource management and scheduling systems for Grid computing
need to manage resources and application execution according to the
requirements of either resource consumers or owners, while continuously
adapting to changes in resource availability. The management of resources and
scheduling of applications in such large-scale distributed systems is a
complex undertaking. In order to prove the effectiveness of resource brokers
and associated scheduling algorithms, their performance needs to be evaluated
under different scenarios such as a varying number of resources and users with
different requirements. In a Grid environment, it is hard, and even
impossible, to perform scheduler performance evaluation in a repeatable and
controllable manner, as resources and users are distributed across multiple
organizations with their own policies. To overcome this limitation, we have
proposed and developed a
Java-based discrete-event Grid simulation toolkit called GridSim. The toolkit
supports modeling and simulation of heterogeneous Grid resources (both time and
space-shared), users and application models. It provides primitives for
creation of application tasks, mapping of tasks to resources and their
management. To demonstrate the suitability of the GridSim toolkit, we have
simulated a Nimrod-G like grid resource broker and evaluated the performance of
deadline and budget constrained cost- and time minimization scheduling
algorithms.
- Manzur Murshed, Rajkumar Buyya, and David Abramson
- Gippsland School of Computing and IT , Monash University,
Gippsland Campus Churchill, Vic 3842, Australia; School of Computer Science and
Software Engineering Monash University, C5.41, Caulfield Campus Melbourne, VIC
3145, Australia
- rajkumar@csse.monash.edu.au
- Received November 6 2001, Comments to authors Dec 31 2001,
Accepted Feb 9 2002
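The deadline- and budget-constrained cost-minimization scheduling that C578 evaluates can be illustrated in its simplest static form: among the resources that can finish a job within the deadline, choose the cheapest. A hypothetical sketch (GridSim itself models far more, including time- and space-shared resources and dynamic conditions; all names here are illustrative):

```java
// Sketch of deadline-constrained cost minimization: filter resources
// by whether they meet the deadline, then pick the cheapest survivor.
public class CostMinScheduler {
    static int pick(double[] timePerJob, double[] costPerJob, double deadline) {
        int best = -1;
        for (int i = 0; i < timePerJob.length; i++) {
            if (timePerJob[i] <= deadline
                    && (best < 0 || costPerJob[i] < costPerJob[best])) {
                best = i;
            }
        }
        return best;   // -1 if no resource can meet the deadline
    }
}
```

A time-minimization variant simply swaps the roles of the two arrays: filter by budget, then pick the fastest survivor.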
- C579: Experience with Memory Management in Open Linda
Systems
- Abstract:Coordination systems, in particular Linda, have
established themselves as important tools for the development of applications
for open systems such as the Internet. This paper shows how to tackle a
forgotten, but crucial problem in open coordination systems: memory management.
As with any system which intends to be of wide use, coordination systems have
to address the problems of memory exhaustion as memory is a finite resource.
This paper first explores the separation between coordination and computation
in order to make it clear that the problem of memory exhaustion in coordination
systems cannot be solved using garbage collection schemes implemented in the
computation language; a garbage collection scheme must exist in the
coordination environment as well. As Linda is arguably the most successful
coordination system, this paper will focus on the Linda family of systems. It
is expected that the solution in Linda can be adapted to other coordination
systems.
- Ronaldo Menezes
- Florida Institute of Technology, Department of Computer Sciences,
150 West University Blvd, Melbourne, FL 32901
- rmenezes@cs.fit.edu
- Received November 21 2001, Comments to Authors March 26 2002,
Accepted August 6 2002
- C580: MPI-CHECK: a Tool for Checking Fortran 90 MPI
Programs
- Abstract: MPI is commonly used to write parallel programs
for distributed memory parallel computers. MPI-CHECK is a tool developed to aid
in the debugging of MPI programs that are written in free or fixed format
Fortran 90 and Fortran 77. MPI-CHECK provides automatic compile-time and
run-time checking of MPI programs. MPI-CHECK automatically detects the
following problems in the use of MPI routines: (i) mismatch in argument type,
kind, rank or number; (ii) messages which exceed the bounds of the
source/destination array; (iii) negative message lengths; (iv) illegal MPI
calls before MPI_INITIALIZE or after MPI_FINALIZE; (v) inconsistencies between
the declared type of a message and its associated DATATYPE argument; and (vi)
actual arguments which violate the INTENT attribute.
- Glenn Luecke, Hua Chen, James Coyle, Jim Hoekstra, Marina Kraeva,
Yan Zou
- Iowa State University
- grl@iastate.edu
- Received December 4 2001, Comments to Authors March 26 2002,
Accepted April 18 2002
- C581: The Design and Implementation of a Chinese Financial
Invoice Recognition System
- Abstract:This paper presents the design and implementation of a
financial invoice recognition system based on the features of the Chinese
invoice. By using the linear whole block moving method in each vertical
segment, a new fast algorithm is put forth to detect and rectify the slant
image. To distinguish the different form types (the foundation necessary for
locating the form fields, filtering the form lines, etc.), several
representative form features are discussed and an invoice-type features library
is built by using a semi-automatic machine learning method. On the basis of the
recognized invoice type, the real invoice form is re-oriented against the
corresponding blank form according to the invoice-type features, solving the
problem of adhesion between characters and form lines, as well as the problem
of character segmentation and recognition. Based on the Chinese financial
invoice image features, a mutual rectification mechanism founded on the
recognition results of financial Chinese characters and Arabic numerals is put
forward to
raise the recognition rate. Finally, the experimental results and conclusions
are presented.
- MING Delie , LIU Jian , TIAN Jinwen
- State Key Laboratory for Image Processing and Intelligent Control
Institute for Pattern Recognition and Artificial Intelligence, Huazhong
University of Science and Technology, Wuhan, Hubei province, 430074, P.R.China
Tel: 86-27-87556301
- mingdelie@263.net
- Received 17 December 2001, Comments to Authors March 26 2002
- C582: The Virtual Laboratory: Enabling Molecular Modeling
for Drug Design on the World Wide Grid
- Abstract:Computational Grids are emerging as a new
paradigm for sharing and aggregation of geographically distributed resources
for solving large-scale computational and data intensive problems in science,
engineering, and commerce. However, application development, resource
management and scheduling in these environments is a complex undertaking. In
this paper, we illustrate the development of a virtual laboratory environment
by leveraging existing Grid technologies to enable molecular modeling for drug
design on geographically distributed resources. It involves screening millions
of compounds in the chemical database (CDB) against a protein target to
identify those with potential use for drug design. We have used the Nimrod-G
parameter specification language for transforming a molecular docking
application as a parameter sweep application for executing on distributed
systems. We have developed new tools for data management to organize and
provide access to chemical databases as a network service in a scalable and
distributed manner. They enable transparent access to molecule records in the
CDB of interest from remote clients. The Nimrod-G resource broker along with
molecule CDB service and access management infrastructure is used for
scheduling and on-demand processing of docking jobs on the World-Wide Grid
(WWG) resources. The results demonstrate the ease of use and power of the
Nimrod-G and virtual laboratory tools for grid computing.
- Rajkumar Buyya, Kim Branson, Jon Giddy, and David Abramson
- School of Computer Science and Software Engg. Monash University,
Caulfield Campus Melbourne, Australia; Structural Biology, Walter and Eliza
Hall Institute, Royal Parade, Parkville, Melbourne
- rajkumar@csse.monash.edu.au
- Received 19 December 2001, Comments to Authors March 5 2002,
Accepted March 18 2002
- C583: Automatic Refactoring by Simulation of Multiple
Inheritance in Java
- Abstract:This paper shows a technique that enables the
generation of proxy classes in an automatic manner. The model and
implementation extends production programming to support fast and automatic
prototyping of proxies and interfaces able to simulate multiple inheritance and
refactor legacy systems, even when faced with missing source code. Advantages
of the automatic synthesis of a proxy class include: compile-time type
checking, speed of execution, automatic disambiguation (name space collision
resolution) and ease of maintenance. The approach generates Java source that
does method forwarding and creates interfaces as a means to achieve
polymorphism. Disambiguation can be automatic, semi-automatic or manual. The
forwarding code (i.e., proxy) evolves into an adapter as the delegates change
their specification. This protects client classes from change. The interface
and proxy code are generated automatically via reflection. This type of
simulation of multiple inheritance was previously available only to Java
programmers who performed manual delegation or who made use of dynamic proxies.
The technique has been applied at a major aerospace corporation.
- Douglas Lyon
- Computer Engineering Dept. Fairfield University, Fairfield CT
06430
- lyon@docjava.com
- Received 24 December 2001, Comments to Authors March 26 2002,
Accepted April 5 2002
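The dynamic-proxy alternative that C583 contrasts with generated source code is available in the standard library as java.lang.reflect.Proxy: every call on the proxy object is routed through an InvocationHandler, which can forward it to a delegate. A minimal sketch of such method forwarding:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

// Minimal method-forwarding via a dynamic proxy: every call on the
// returned object is reflected and re-invoked on the delegate.
public class Forwarder {
    @SuppressWarnings("unchecked")
    public static <T> T forwardingProxy(Class<T> iface, Object delegate) {
        InvocationHandler handler = (proxy, method, args) ->
                method.invoke(delegate, args);   // forward to the delegate
        return (T) Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[]{iface}, handler);
    }
}
```

Unlike the generated-source approach of the paper, this forwarding is resolved reflectively at run time, so it trades the compile-time type checking and execution speed the abstract highlights for zero code generation.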
- C584: Workload Decomposition Strategies for Hierarchical
Distributed-Shared Memory Parallel Systems and their Implementation with
Integration of High Level Parallel Languages
- Abstract:In this paper we address the issue of workload
decomposition in programming hierarchical distributed-shared memory parallel
systems. The workload decomposition we have devised consists in a two-stage
procedure: a higher-level decomposition among the computational nodes, and a
lower-level one among the processors of each computational node. By focussing
on porting of a case study PIC application, we have implemented the described
work decomposition without large programming effort by using and integrating
the high-levellanguages High Performance Fortran and OpenMP.
- Sergio Briguglio, Beniamino Di Martino, and Gregorio Vlad
- Associazione Euratom-ENEA sulla Fusione, C.R. Frascati, C.P. 65
-1-00044 Frascati, Rome, Italy; Dip. Ingegneria dell'Informazione, Second
University of Naples, Italy
- briguglio@frascati.enea.it
- Received 18 November 2001, Comments to Authors March 26 2002,
Accepted May 29 2002
- C585: Optimizing Operating System Performance for CC-NUMA
Architecture
- Abstract:CC-NUMA (Cache-Coherent Non-Uniform Memory
Access) architecture is an attractive solution to scalable servers. The
performance of a CC-NUMA system heavily depends on the number of accesses to
remote memory through an interconnection network. To reduce the number of
remote accesses, an operating system needs to exploit the potential locality of
the architecture. This paper describes the design and implementation of a
UNIX-based operating system supporting the CC-NUMA architecture. The operating
system implements various enhancements by revising kernel algorithms and data
structures. This paper also analyzes the performance of the enhanced operating
system by running commercial benchmarks on a real CC-NUMA system. The
performance analysis shows that the operating system can achieve better
performance and scalability for CC-NUMA by kernel data striping, localization,
and load balancing.
- Moon-Seok Chang
- IBM, MS 905-7C020, 11400 Burnet Rd., Austin, TX 78758, USA
- cmoon@us.ibm.com
- Received December 15 2001, Comments to Authors March 26 2002
- C586: Implementation of the EARTH programming model on SMP
clusters: a multi-threaded language and runtime system
- Abstract:We designed and implemented an EARTH (Efficient
Architecture for Running THreads) runtime system for a
multi-processor/multi-node cluster. For portability, we built this runtime
system on top of Pthreads under Linux. This implementation enables the
overlapping of communication and computation on a cluster of Symmetric
Multi-Processors (SMP), and lets the interruptions generated by the arrival of
new data drive the system, rather than relying on network polling. We describe
how our implementation of a multi-threading model on a
multi-processor/multi-node system arranges the execution and the
synchronization activities to make the best use of the resources available, and
how the interaction between the local processing and the network activities is
organized. We also describe the EARTH programming model and its associated
programming language, Threaded-C (release 2.0), used to write programs for this
machine model. This programming model supports irregular fine-grain
parallelism through a two-level hierarchy of threads and fibers. We introduce
mutually exclusive fibers that enable the execution of explicitly threaded
Threaded-C code on a multi-processor shared-memory system. One of our new
synchronization mechanisms, atomic mailboxes, implements a non-deterministic
merge operator.
- G. Tremblay, C.J. Morrone, J.N. Amaral, G.R. Gao
- Dept. d'informatique, Univ. du Quebec a Montreal, Montreal, QC,
Canada; Computer Architecture and Parallel Systems Laboratory (CAPSL), Dept. of
Electrical and Computer Engineering, Univ. of Delaware, Newark, DE, USA; Dept.
of Computing Science, Univ. of Alberta, Edmonton, AB, Canada
- amaral@cs.ualberta.ca
- Received January 5 2002, Comments to Authors March 26 2002;
Accepted July 11 2002
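The atomic mailbox mentioned at the end of the abstract can be approximated in plain Java as a concurrent queue. This is a sketch of the idea, not the EARTH runtime: several producer threads deposit messages atomically, and a consumer receives them in whatever order they arrive, which is exactly a non-deterministic merge of the producers' streams. All names here are illustrative.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class MailboxDemo {
    // Each producer deposits one message into the shared mailbox; the
    // caller then drains the merged stream. Inter-producer ordering is
    // not determined, but no message is lost or duplicated.
    static Set<String> merge(int producers) throws InterruptedException {
        BlockingQueue<String> mailbox = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(producers);
        for (int p = 0; p < producers; p++) {
            final int id = p;
            pool.submit(() -> mailbox.add("msg-" + id)); // atomic deposit
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        Set<String> received = new HashSet<>();
        for (int i = 0; i < producers; i++) received.add(mailbox.take());
        return received;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(merge(3)); // all three messages, in some order
    }
}
```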
- C587: The LINPACK Benchmark: Past, Present, and Future
- Abstract:In this paper we will clear up some of the
confusion and mystery surrounding the LINPACK Benchmark and some of its
variations. We will cover the original LINPACK 100 benchmark, the TOP500 list, and the
HPL program that can be used for the TOP500 measurement. We will examine what is
measured and describe how to interpret the results of the programs'
execution.
- Jack J. Dongarra, Piotr Luszczek and Antoine Petitet
- Innovative Computing Lab, Computer Science Dept, University of
Tennessee, 1122 Volunteer Blvd, Knoxville TN, 37996-3450; Sun Microsystems,
Inc., Paris, France
- dongarra@cs.utk.edu
- Received December 24 2001, Comments to Authors March 26 2002,
Accepted July 2 2002
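The arithmetic behind the reported numbers is simple. By the standard LINPACK convention (stated here from general knowledge of the benchmark, not quoted from the paper), solving an n-by-n dense system is credited 2n^3/3 + 2n^2 floating-point operations, and the reported rate is that count divided by the wall-clock time:

```java
public class LinpackRate {
    // Standard LINPACK operation count for an n-by-n dense solve,
    // converted to a rate in Gflop/s.
    static double gflops(long n, double seconds) {
        double flops = (2.0 / 3.0) * n * n * n + 2.0 * (double) n * n;
        return flops / seconds / 1e9;
    }

    public static void main(String[] args) {
        // Hypothetical run: n = 10000 solved in 70 seconds.
        System.out.printf("%.2f Gflop/s%n", gflops(10000, 70.0));
    }
}
```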
Submitted in 2000 or earlier (top)
- Home Page for Java Grande 99
jg99papers/papersforjgsicpande.html
- Home Page for Java Grande 2000 jg2000/jg2000index.html
- C467: VGDS: A Distributed Data Structure Framework for
Scientific Computation
- Abstract: This paper gives an overview of the VGDS
(Virtual Global Data Structure) project. The VGDS effort focuses on developing
an integrated, distributed environment that allows fast prototyping of a
diverse set of simulation problems in irregular scientific and engineering
domains, focusing on computations with irregular and adaptive structures. The
framework defines two base libraries, unstructured mesh and adaptive tree, that
capture the major data structures involved in irregular scientific computation.
The framework defines multiple layers of class libraries which work together to
provide data-parallel representations to application developers while
encapsulating parallel implementation details in the lower layers of the framework.
The layered approach enables easy extension of the base libraries to a variety
of application-specific data structures. Experimental results on a network of
workstations are reported.
- Pangfeng Liu, Jan-Jan Wu
- mailto:wuj@iis.sinica.edu.tw or
mailto:pangfeng@cs.ccu.edu.tw
- Submitted April 21, 2000
- Comments To Authors October 1 2000
- C468: cJVM: A Cluster JVM Architecture for Single System
Image
- Abstract:cJVM is a Java Virtual Machine (JVM) which
provides a single system image of a traditional JVM while executing in a
distributed fashion on the nodes of a cluster. cJVM virtualizes the cluster,
supporting any pure Java application without requiring that applications be
tailored specifically for it. The aim of cJVM is to obtain improved scalability
for a class of Java Server Applications by distributing the application's work
among the cluster's computing resources. cJVM is based on a novel object model
which distinguishes between an application's view of an object (e.g., every
object is a unique data structure) and its implementation (e.g., objects may
have consistent replications on different nodes). This enables us to exploit
knowledge of the usage of individual objects to improve performance (e.g.,
using object replication to increase locality of access to objects).
We have completed a prototype which runs pure Java
applications on a cluster of NT workstations connected via a Myrinet fast
switch. The prototype provides a single system image to applications,
distributing the application's threads and objects over the cluster. We have
used cJVM to run, without change, a real Java Server Application containing over
10K lines of code, and achieved high scalability for it on a cluster. We also achieved
linear speedup for another application with a large number of independent
threads. This paper discusses cJVM's architecture and implementation. It
focuses on achieving a single system image for a traditional JVM on a cluster
while describing in short how we aim at obtaining scalability.
- Yariv Aridor, Michael Factor and Avi Teperman
- mailto:teperman@il.ibm.com
- Submitted May 1, 2000
- Comments To Authors October 1 2000; Accepted November 23
2000
- C474: Parallel solution of rotating flows in cavities
- Abstract: In this paper, we investigate the parallel
solution of rotating internal flow problems, using the Navier-Stokes
equations as proposed in [16] and [15]. A Runge-Kutta time-stepping scheme was
applied to the equations, and both sequential and message-passing
implementations were developed, the latter using MPI, and were tested on an
SGI Origin200 distributed, global shared memory parallel computer. The results
show that our approach to parallelizing the sequential implementation requires
little effort whilst providing good results, even for medium-sized problems, on
this particular computer.
- Rudnei Dias da Cunha and Alvaro Luiz de Bortoli
- mailto:rudnei@mat.ufrgs.br
- Submitted May 14, 2000
- Comments To Authors October 1 2000 and Accepted October 9
2000
- C475: Analysis and Measurement of the Effect of Kernel
Locks in SMP Systems
- Abstract:This paper reports the use of case studies to
evaluate the performance degradation caused by the kernel-level lock. We define
the lock ratio as a ratio of the execution time for critical sections to the
total execution time of a parallel program. The kernel-level lock ratio
determines how effectively programs work on Symmetric MultiProcessor (SMP)
systems. We have measured the lock ratios and the performance of three types of
parallel programs on SMP systems with Linux 2.0: matrix multiplication,
parallel make, and WWW server programs. Experimental results show that the
higher the lock ratio of a parallel program, the worse its performance becomes.
Keywords: SMP Systems, Operating Systems, Parallel Programs, Performance
Evaluation, Kernel Lock
- Akihiro Kaieda and Yasuichi Nakayama; Atsuhiro Tanaka, Takashi
Horikawa, Toshiyasu Kurasugi and Issei Kino
- mailto:yasu@cs.uec.ac.jp
- Submitted May 24, 2000
- Accepted October 1 2000
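The paper's central metric, as defined in the abstract, reduces to a single quotient. A minimal sketch, with the example figures invented for illustration:

```java
public class LockRatio {
    // Lock ratio = time spent in kernel critical sections divided by
    // the total execution time of the parallel program.
    static double lockRatio(double criticalSeconds, double totalSeconds) {
        return criticalSeconds / totalSeconds;
    }

    public static void main(String[] args) {
        // Hypothetical run: 1.5 s of a 10 s run spent holding kernel locks.
        System.out.println(lockRatio(1.5, 10.0)); // 0.15
    }
}
```

The abstract's finding is that programs with a larger value of this ratio scale worse on SMP hardware.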
- C476: Effective Multicast Programming in Large Scale
Distributed Systems:The DACE Approach
- Abstract: Many distributed applications have a strong
requirement for efficient dissemination of large amounts of information to
widely spread consumers in large networks. These include applications in
e-commerce and telecommunication. Publish/subscribe is considered one of the
most important interaction styles with which to model communication at large
scale. Producers publish information for a topic and consumers subscribe to the
topics they wish to be informed of. The decoupling of producers and consumers in
time and in space makes the publish/subscribe paradigm very attractive for
large scale distribution, especially in environments like the Internet. This
paper describes the architecture and implementation of DACE (Distributed
Asynchronous Computing Environment), a framework for publish/subscribe
communication based on an object-oriented programming abstraction in the form
of Distributed Asynchronous Collections (DACs). DACs capture the different
variations of publish/subscribe without blurring their respective advantages.
The architecture we present is tolerant to network partitions and crash
failures. The underlying model is based on the notion of Topic Membership: a weak
membership for the parties involved in a topic. We present how Topic Membership
enables the realization of a robust and efficient reliable multicast for large
scale. The protocol ensures that, inside a topic, even a subscriber that is
temporarily partitioned away eventually receives a published message.
- Romain Boichat, Patrick Th. Eugster, Rachid Guerraoui, Joe
Sventek
- Swiss Federal Institute of Technology, Lausanne and Agilent
Laboratories Scotland, Edinburgh
- Email: Patrick.Eugster@lsesun6.epfl.ch
- Submitted July 17, 2000
- Comments to Author November 23 2000
- Accepted February 3 2001
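The producer/consumer decoupling the abstract builds on can be shown in a few lines. This is a minimal, synchronous, single-process sketch of the publish/subscribe style, not the DACE API: producers publish to a topic without knowing its subscribers, and subscribers register interest without knowing the producers. All names are invented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class PubSubDemo {
    // topic name -> list of subscriber callbacks
    private final Map<String, List<Consumer<String>>> topics = new HashMap<>();

    public void subscribe(String topic, Consumer<String> subscriber) {
        topics.computeIfAbsent(topic, t -> new ArrayList<>()).add(subscriber);
    }

    public void publish(String topic, String message) {
        // Deliver to every subscriber of the topic; a real system such as
        // DACE would do this asynchronously and across the network.
        for (Consumer<String> s : topics.getOrDefault(topic, List.of())) {
            s.accept(message);
        }
    }

    public static void main(String[] args) {
        PubSubDemo bus = new PubSubDemo();
        bus.subscribe("quotes", m -> System.out.println("got: " + m));
        bus.publish("quotes", "IBM@100"); // prints "got: IBM@100"
        bus.publish("news", "unseen");    // no subscriber, silently dropped
    }
}
```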
- C486: Parallel Versions of Stone's Strongly Implicit
(SIP) Algorithm
- Abstract:In this paper, we describe various methods of
deriving a parallel version of Stone's Strongly Implicit Procedure (SIP)
for solving sparse linear equations arising from finite difference
approximation to partial differential equations (PDEs). Sequential
versions of this algorithm have been very successful in solving semi-conductor,
heat conduction and flow simulation problems and an efficient parallel version
would enable much larger simulations to be run. An initial investigation of
various parallelising strategies was undertaken using a version of High
Performance Fortran (HPF) and the best methods were reprogrammed using the MPI
message passing libraries for increased efficiency. Early attempts concentrated
on developing a parallel version of the characteristic wavefront computation
pattern of the existing sequential SIP code. However, a red-black ordering of
grid points, similar to that used in parallel versions of the Gauss-Seidel
algorithm, is shown to be far more efficient. The results of both the wavefront
and red-black MPI based algorithms are reported for various size problems and
number of processors on a sixteen node IBM SP2.
- J.S. Reeve, A.D. Scurr and J.H. Merlin
- Department of Electronics and Computer Science, University of
Southampton
- Email:jsr@ecs.soton.ac.uk
- Submitted August 24, 2000
- Comments to Authors February 4 2001, Accepted March 12 2001
- C487: Real-time Multi-spectral Image Fusion
- Abstract:This paper describes a novel real-time
multi-spectral imaging capability for surveillance applications. The capability
combines a new high-performance multi-spectral camera system with a distributed
algorithm that computes a spectral-screening Principal Component Transform
(PCT). The camera system uses a novel filter wheel design together with a
high-bandwidth CCD camera to allow image cubes to be delivered at 110 frames
per second with spectral resolution between 400 and 1000 nm. The filters used
in a particular application are selected to highlight a particular object based
on its spectral signature. The distributed algorithm allows image streams from
a dispersed collection of cameras to be disseminated, viewed, and interpreted
by a distributed group of analysts in real-time. It operates on networks of
commercial-off-the-shelf multiprocessors connected with high-performance (e.g.
gigabit) networking, taking advantage of multi-threading where appropriate. The
algorithm uses a concurrent formulation of the PCT to de-correlate and compress
a multi-spectral image cube. Spectral screening is used to give features that
occur infrequently (e.g. mechanized vehicles in a forest) equal importance to
those that occur frequently (e.g. trees in the forest). A human-centered
color-mapping scheme is used to maximize the impact of spectral contrast on the
human visual system. To demonstrate the efficacy of the multi-spectral system,
plant-life scenes with both real and artificial foliage are used. These scenes
demonstrate the system's ability to distinguish elements of a scene, based on
spectral contrast, that cannot be distinguished with the naked eye. The
capability is evaluated in terms of visual performance, scalability, and
real-time throughput. Our previous work on predictive analytical modeling is
extended to answer practical design questions such as "For a specified
cost, what system should be constructed and what performance will it
attain?"
- Tiranee Achalakul, Stephen Taylor
- Department of Computer Science, Syracuse University
- Email:tachalak@syr.edu
- Submitted September 14, 2000
- Comments to Authors March 13 2001
- C488: Efficient Communication Using Message Prediction for
Cluster of Multiprocessors
- Abstract: With the increasing uniprocessor and SMP
computation power available today, interprocessor communication has become an
important factor that limits the performance of clusters of workstations. Many
factors, including communication hardware overhead, communication software
overhead, and the user environment overhead (multithreading, multiuser), affect
the performance of the communication subsystems in such systems. A significant
portion of the software communication overhead is due to message
copying. Ideally, it is desirable to have a true zero-copy protocol, where the
message is moved directly from the send buffer in its user space to the receive
buffer in the destination without any intermediate buffering. However, because
message-passing applications at the send side do not know the final
receive buffer addresses, early-arriving messages have to be buffered in a
temporary area. In this paper, we show that there is a message reception
communication locality in message-passing applications. We have utilized this
communication locality and devised different message predictors at the receiver
sides of communications. In essence, these message predictors can be efficiently
used to drain the network and cache the incoming messages even if the
corresponding receive calls have not been posted yet. The performance of these
predictors, in terms of hit ratio, on some parallel applications is quite
promising and suggests that prediction has the potential to eliminate most of
the remaining message copies. We also show that the proposed predictors are not
sensitive to the starting message reception call, and that they perform
better than (or at least equal to) our previously proposed predictors in [3].
- Ahmad Afsahi, Nikitas J. Dimopoulos
- Queen's University and University of Victoria, Canada
- Email:ahmad@ee.queensu.ca
- Submitted September 14, 2000
- Comments to Authors March 13 2001, Accepted November 9 2001
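One simple member of the predictor family the abstract describes can be sketched as follows. This is an illustration of the idea, not the authors' exact scheme: remember, for each message tag, which tag arrived after it last time, and predict that successor again. On a hit, the runtime could pre-post the matching receive buffer and avoid the extra message copy. All names are invented.

```java
import java.util.HashMap;
import java.util.Map;

public class MessagePredictor {
    // For each observed tag, the tag that arrived immediately after it
    // the last time it was seen.
    private final Map<String, String> lastSuccessor = new HashMap<>();
    private String previous = null;

    // Predicted next tag after `tag`, or null if no history yet.
    public String predictAfter(String tag) {
        return lastSuccessor.get(tag);
    }

    // Record an actual arrival, updating the successor table.
    public void observe(String tag) {
        if (previous != null) lastSuccessor.put(previous, tag);
        previous = tag;
    }

    public static void main(String[] args) {
        MessagePredictor p = new MessagePredictor();
        // A periodic reception pattern, as in many message-passing codes.
        for (String t : new String[] {"A", "B", "A", "B", "A"}) p.observe(t);
        System.out.println(p.predictAfter("A")); // B last followed A
    }
}
```

On repetitive communication patterns such a predictor hits almost always, which is the "message reception communication locality" the paper exploits.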
- C489: Implementing a Dynamic Processor Allocation Policy
for Multiprogrammed Parallel Applications in the Solaris TM Operating
System
- Abstract: Parallel applications typically do not perform
well in a multiprogrammed environment that uses time-sharing to allocate
processor resources to the applications' parallel threads. Coscheduling related
parallel threads, or statically partitioning the system, often can reduce the
applications' execution times, but at the expense of reducing the overall
system utilization. To address this problem, there has been increasing interest
in dynamically allocating processors to applications based on their resource
demands and the dynamically varying system load. The Loop-Level Process Control
(LLPC) policy [16] dynamically adjusts the number of threads an application is
allowed to execute based on the application's available parallelism and the
overall system load. This study demonstrates the feasibility of incorporating
the LLPC strategy into an existing commercial operating system and
parallelizing compiler and provides further evidence of the performance
improvement that is possible using this dynamic allocation strategy. In this
implementation, applications are automatically parallelized and enhanced with
the appropriate LLPC hooks so that each application interacts with the modified
version of the Solaris operating system. The parallelism of the applications
is then dynamically adjusted automatically when they are executed in a
multiprogrammed environment so that all applications obtain a fair share of the
total processing resources.
- Kelvin K. Yue and David J. Lilja
- Sun Microsystems and University of Minnesota
- Email:lilja@ece.umn.edu
- Submitted October 10, 2000, Accepted February 10 2001
- C490: Lesser Bear: a Light-weight Process Library for
SMP computers
- Abstract: We have designed and implemented a light-weight
process (thread) library called Lesser Bear for SMP computers. Lesser Bear has
high portability and thread-level parallelism. Creating UNIX processes as
virtual processors and a memory-mapped file as a huge shared-memory space
enables Lesser Bear to execute threads in parallel. Lesser Bear requires
exclusive operation between peer virtual processors, and treats a shared-memory
space as a critical section for synchronization of threads. Therefore, thread
functions of the previous Lesser Bear are serialized. In this paper, we present
a scheduling mechanism to execute thread functions in parallel. In the design
of the proposed mechanism, we divide the entire shared-memory space into
partial spaces for virtual processors, and prepare two queues (Protect Queue
and Waiver Queue) for each partial space. We adopt an algorithm in which lock
operations are not necessary for enqueueing. This algorithm allows us to
propose a scheduling mechanism that can reduce the scheduling overhead. The
mechanism is applied to Lesser Bear and evaluated by experimental results.
- Hisashi Oguma and Yasuichi Nakayama
- The University of Electro-Communications, Tokyo
- Email:oguma-h@igo.cs.uec.ac.jp
- Submitted October 24, 2000
- Comments to Authors March 13 2001, Accepted April 23 2001
- C492: PoLAPACK: Parallel Factorization Routines with
Algorithmic Blocking
- Abstract: LU, QR, and Cholesky factorizations are the most
widely used methods for solving dense linear systems of equations, and have
been extensively studied and implemented on vector and parallel computers. Most
of these factorization routines are implemented with block-partitioned
algorithms in order to perform matrix-matrix operations, that is, to obtain the
highest performance by maximizing reuse of data in the upper levels of memory,
such as cache. Since parallel computers have different performance ratios of
computation and communication, the optimal computational block sizes differ
from one machine to another in generating the maximum performance of an
algorithm. Therefore, the data matrix should be distributed with the
machine-specific optimal block size before the computation. Too small or too
large a block size makes getting good performance on a machine nearly
impossible. In such a case, obtaining better performance may require a complete
redistribution of the data matrix. In this paper, we present parallel LU, QR,
and Cholesky factorization routines with "algorithmic blocking" on a
2-dimensional block-cyclic data distribution. With algorithmic blocking, it is
possible to obtain near-optimal performance irrespective of the physical block
size. The routines are implemented on the Intel Paragon and the SGI/Cray T3E
and compared with the corresponding ScaLAPACK factorization routines.
- Jaeyoung Choi
- School of Computing Soongsil University 1-1, Sangdo-Dong,
Dongjak-Ku Seoul 156-743, KOREA
- Email:choi@comp.soongsil.ac.kr
- Submitted November 14, 2000
- Comments to Authors February 4 2001, Accepted March 12 2001
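The 2-dimensional block-cyclic distribution the abstract refers to has a compact owner formula (the standard ScaLAPACK-style mapping, stated here from general knowledge rather than taken from the paper): with block size nb on a Pr-by-Pc process grid, the block containing element (i, j) lives on process row (i/nb) mod Pr and process column (j/nb) mod Pc.

```java
public class BlockCyclic {
    // Returns {processRow, processCol} owning global element (i, j)
    // under a 2-D block-cyclic distribution with block size nb on a
    // pr-by-pc process grid.
    static int[] owner(int i, int j, int nb, int pr, int pc) {
        return new int[] { (i / nb) % pr, (j / nb) % pc };
    }

    public static void main(String[] args) {
        // Element (130, 70), block size 32, on a 2x3 process grid:
        // row block 4 -> process row 0, column block 2 -> process column 2.
        int[] o = owner(130, 70, 32, 2, 3);
        System.out.println(o[0] + "," + o[1]);
    }
}
```

The paper's point is that performance should not hinge on the nb chosen at distribution time; algorithmic blocking lets the computation use its own block size on top of whatever physical nb the data was laid out with.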
- C493: Space-Time Tradeoffs for Parallel 3D Reconstruction
Algorithms for Virus Structure Determination
- Abstract:The 3D electron density determination of viruses,
from experimental data provided by electron microscopy, is a data intensive
computation that requires the use of clusters of PCs or parallel computers. In
this paper we report on three parallel algorithms used for 3D recon- struction
of asymmetric objects from their 2d projections. We discuss their
computational, communication, I/O, and space requirements and present some
performance data. The algorithms are general and can be used for 3D
reconstruction of asymmetric objects for applications other than structural
biology.
- Dan C. Marinescu, Yongchang Ji, and Robert E. Lynch
- Department of Computer Sciences Purdue University West Lafayette,
IN 47907
- Email:[dcm, ji,
rel]@cs.purdue.edu
- Submitted November 17, 2000
- Comments to Authors March 13 2001 Accepted April 8 2001
- C494: A New Parallel Domain Decomposition Method for the
Adaptive Finite Element Solution of Elliptic Partial Differential Equations
- Abstract:We present a new domain decomposition algorithm
for the parallel finite element solution of elliptic partial differential
equations. As with most parallel domain decomposition methods each processor is
assigned one or more subdomains and an iteration is devised which allows the
processors to solve their own subproblem(s) concurrently. The novel feature of
this algorithm however is that each of these subproblems is defined over the
entire domain -- although the vast majority of the degrees of freedom for each
subproblem are associated with a single subdomain (owned by the corresponding
processor). This ensures that a global mechanism is contained within each of
the subproblems tackled and so no separate coarse grid solve is required in
order to achieve rapid convergence of the overall iteration. Furthermore, by
following the paradigm introduced in [5], it is demonstrated that this domain
decomposition solver may be coupled easily with a conventional mesh refinement
code, thus allowing the accuracy, reliability and efficiency of mesh adaptivity
to be utilized in a well load-balanced manner. Finally, numerical evidence is
presented which suggests that this technique has significant potential, both in
terms of the rapid convergence properties and the efficiency of the parallel
implementation.
- Randolph E. Bank and Peter K. Jimack
- Dept. of Mathematics, University of California at San Diego, La
Jolla, CA 92093, USA.School of Computer Studies, University of Leeds, Leeds LS2
9JT, UK.
- Email: pkj@comp.leeds.ac.uk
- Received 19 May 2000, Accepted December 7 2000
- C495: Development and Performance Analysis of Real-World
Applications for Distributed and Parallel Architectures
- Abstract:Several large real-world applications have been
developed for distributed and parallel architectures. We examine two different
program development approaches: First, the usage of a high-level programming
paradigm which reduces the time to create a parallel program dramatically but
sometimes at the cost of reduced performance. A source-to-source compiler
has been employed to automatically compile programs written in a
high-level programming paradigm into message-passing codes. Second,
manual program development by using a low-level programming paradigm
such as message passing enables the programmer to fully exploit a given
architecture at the cost of a time-consuming and error-prone effort.
Performance tools play a central role to support the performance-oriented
development of applications for distributed and parallel architectures. SCALA,
a portable instrumentation, measurement, and post-execution performance
analysis system for distributed and parallel programs, has been used to
analyse and to guide the application development by selectively instrumenting
and measuring the code versions, by comparing performance information of
several program executions, by computing a variety of important performance
metrics, by detecting performance bottlenecks, and by relating performance
information back to the input program. We show several experiments of SCALA
when applied to real-world applications. These experiments are conducted for a
NEC Cenju-4 distributed memory machine and a cluster of heterogeneous
workstations and networks.
- T. Fahringer, P.Blaha, A. Hossinger, J. Luitz, E. Mehofer, H.
Moritsch, B. Scholz
- Institute for Software Science, University of Vienna
Liechtensteinstrasse 22, A-1092, Vienna, Austria. Also Vienna University of
Technology and Dept. of Business Administration at University of Vienna
- Email: tf@par.univie.ac.at
- Received August 1 1999, Revised November 20 2000, Accepted
December 18 2000
- C498: Simulating asynchronous hardware on multiprocessor
platforms: the case of AMULET1
- Abstract:Synchronous VLSI design is approaching a critical
point, with clock distribution becoming an increasingly costly and complicated
issue and power consumption rapidly emerging as a major concern. Hence,
recently, there has been a resurgence of interest in asynchronous digital
design techniques which promise to liberate digital design from the inherent
problems of synchronous systems. This activity has revealed a need for
modelling and simulation techniques suitable for the asynchronous design style.
The concurrent process algebra Communicating Sequential Processes (CSP) and its
executable counterpart, occam, are increasingly advocated as particularly
suitable for this purpose. This paper focuses on issues related to the
execution of CSP/occam models of asynchronous hardware on multiprocessor
machines, including I/O, monitoring and debugging, partition, mapping and load
balancing. These issues are addressed in the context of occarm, an occam
simulation model of the AMULET1 asynchronous microprocessor; however, the
solutions devised are more general and may be applied to other systems
too.
- Georgios K. Theodoropoulos
- School of Computer Science, The University of Birmingham,
Birmingham B15 2TT, U.K.
- Email: gkt@cs.bham.ac.uk
- Received Oct 18 2000, Accepted February 22 2001