Concurrency and Computation:Practice and Experience

Here are instructions for Grid Computing Environment special Issue due Mid July 2001.
Here are instructions for Java Grande/ISCOPE Special Issue due August 31 2001

This homepage supports referees.

Standard Referee Form

Journal Vision

WILEY Journal Home Page

Editorial Board

Useful CiteSeer Citation Finder from NEC Research

Below you will find active abstracts.

Please send email to Geoffrey Fox, fox@csit.fsu.edu, if you wish to referee any article.

We will send you link to full text online.

Articles under Consideration for Journal

C467: VGDS: A Distributed Data Structure Framework for Scientific Computation
- Abstract: This paper gives an overview of the VGDS (Virtual Global Data Structure) project. The VGDS effort focuses on developing an integrated, distributed environment that allows fast prototyping of a diverse set of simulation problems in irregular scientific and engineering domains, focusing on computations with irregular and adaptive structures. The framework defines two base libraries: unstructured mesh and adaptive tree, that capture major data structures involved inirregular scientic computation. The framework defines multiple layers of class libraries which work together to provide data-parallel representations to application developers while encapsulate parallel implementation details into lower layers of the framework. The layered approach enables easy extension of the base libraries to a variety of application-specific data structures. Experimental results on a network of workstations is reported.
- Pangfeng Liu, Jan-Jan Wu
- mailto:wuj@iis.sinica.edu.tw mailto:pangfeng@cs.ccu.edu.tw
- Submitted April 21, 2000
- Comments To Authors October 1 2000
C468: cJVM: A Cluster JVM Architecture for Single System Image
- Abstract:cJVM is a Java Virtual Machine (JVM) which provides a single system image of a traditional JVM while executing in a distributed fashion on the nodes of a cluster. cJVM virtualizes the cluster, supporting any pure Java application without requiring that applications be tailored specifically for it. The aim of cJVM is to obtain improved scalability for a class of Java Server Applications by distributing the application's work among the cluster's computing resources. cJVM is based on a novel object model which distinguishes between an application's view of an object (e.g., every object is a unique data structure) and its implementation (e.g., objects may have consistent replications on different nodes). This enables us to exploit knowledge on the usage of individual objects to improve performance (e.g., using object replications to increase locality of access to objects). Currently, we have already completed a prototype which runs pure Java applications on a cluster of NT workstations connected via a Myrinet fast switch. The prototype provides a single system image to applications, distributing the application's threads and objects over the cluster. We have used cJVM to run without change a real Java Server Application containing over 10K loc and achieve high scalability for it on a cluster. We also achieved linear speedup for another application with a large number of independent threads. This paper discusses cJVM's architecture and implementation. It focuses on achieving a single system image for a traditional JVM on a cluster while describing in short how we aim at obtaining scalability.
- Yariv Aridor, Michael Factor and Avi Teperman
- mailto:teperman@il.ibm.com
- Submitted May 1, 2000
- Comments To Authors October 1 2000; Accepted November 23 2000
C474: Parallel solution of rotating flows in cavities
- Abstract:In this paper, we investigate the parallel solution to the rotating internal flow problems, using the Navier-Stokes equations as proposed in [16] and [15]. A Runge-Kutta time-stepping scheme was applied to the equations and both sequential and message- passing implementations were developed, the latter using MPI , and were tested on an SGI Origin200 distributed, global shared memory parallel computer. The results show that our approach to parallelize the sequential implementation requires little effort whilst providing good results even for medium-sized problems, on this particular computer.
- Rudnei Dias da Cunha and Alvaro Luiz de Bortoli
- mailto:rudnei@mat.ufrgs.br
- Submitted May 14, 2000
- Comments To Authors October 1 2000 and Accepted October 9 2000
C475: Analysis and Measurement of the Effect of Kernel Locks in SMP Systems
- Abstract:This paper reports the use of case studies to evaluate the performance degradation caused by the kernel-level lock. We define the lock ratio as a ratio of the execution time for critical sections to the total execution time of a parallel program. The kernel-level lock ratio determines how effective programs work on Symmetric MultiProcessor systems. We have measured the lock ratios and the performance of three types of parallel programs on SMP systems with Linux 2.0: matrix multiplication, parallel make, and WWW server programs. Experimental results show that the higher the lock ratio of parallel programs, the worse their performance becomes. keywords: SMP Systems, Operating Systems, Parallel Programs, Per- formance Evaluation, Kernel Lock
- Akihiro Kaieda and Yasuichi Nakayama; Atsuhiro Tanaka, Takashi Horikawa, Toshiyasu Kurasugi and Issei Kino
- mailto:yasu@cs.uec.ac.jp
- Submitted May 24, 2000
- Accepted October 1 2000
C476: Effective Multicast Programming in Large Scale Distributed Systems:The DACE Approach
- Abstract:Many distributed applications have a strong requirement for efficient dissemination of large amounts of information to widely spread consumers in large networks.These include applications in e-commerce and telecommunication.Publish/subscribe is considered one of the most important interaction styles to model communication at large scale.Producers publish information for a topic and consumers subscribe to the topics they wish to be informed of.The decoupling of producers and consumers in time and in space makes the publish/subscribe paradigm very attractive for large scale distribution,especially in environments like the Internet. This paper describes the architecture and implementation of DACE (Distributed Asynchronous Computing Environment),a framework for publish/subscribe communication based on an object- oriented programming abstraction in the form of Distributed Asynchronous Collection (DAC).DACs capture the different variations of publish/subscribe,without blurring their respective advantages. The architecture we present is tolerant to network partitions and crash failures.The underlying model is based on the notion of Topic Membership:a weak membership for the parties involved in a topic.We present how Topic Membership enables the realization of a robust and efficient reliable multicast for large scale.The protocol ensures that,inside a topic,even a subscriber that is temporarily partitioned away eventually receives a published message.
- Romain Boichat, Patrick Th. Eugster, Rachid Guerraoui, Joe Sventek
- Swiss Federal Institute of Technology, Lausanne and Agilent Laboratories Scotland, Edinburgh
- Email:Patrick.Eugster@lsesun6.epfl.ch
- Submitted July17, 2000
- Comments to Author November 23 2000
- Accepted February 3 2001
C486: Parallel Versions of Stone’s Strongly Implicit (SIP) Algorithm
- Abstract:In this paper, we describe various methods of deriving a parallel version of Stone’s Strongly Implicit Procedure (SIP) for solving sparse linear equations arising from finite difference approximation to partial differential equations (PDE’s). Sequential versions of this algorithm have been very successful in solving semi-conductor, heat conduction and flow simulation problems and an efficient parallel version would enable much larger simulations to be run. An initial investigation of various parallelising strategies was undertaken using a version of High Performance Fortran (HPF) and the best methods were reprogrammed using the MPI message passing libraries for increased efficiency. Early attempts concentrated on developing a parallel version of the characteristic wavefront computation pattern of the existing sequential SIP code. However, a red-black ordering of grid points, similar to that used in parallel versions of the Gauss-Seidel algorithm, is shown to be far more efficient. The results of both the wavefront and red-black MPI based algorithms are reported for various size problems and number of processors on a sixteen node IBM SP2.
- J.S. Reeve, A.D. Scurr and J.H. Merlin
- Department of Electronics and Computer Science, University of Southampton
- Email:jsr@ecs.soton.ac.uk
- Submitted August 24, 2000
- Comments to Authors February 4 2001, Accepted March 12 2001
C487: Real-time Multi-spectral Image Fusion
- Abstract:This paper describes a novel real-time multi-spectral imaging capability for surveillance applications. The capability combines a new high-performance multi-spectral camera system with a distributed algorithm that computes a spectral-screening Principal Component Transform (PCT). The camera system uses a novel filter wheel design together with a high-bandwidth CCD camera to allow image cubes to be delivered at 110 frames per second with spectral resolution between 400 and 1000 nm. The filters used in a particular application are selected to highlight a particular object based on its spectral signature. The distributed algorithm allows image streams from a dispersed collection of cameras to be disseminated, viewed, and interpreted by a distributed group of analysts in real-time. It operates on networks of commercial-off-the-shelf multiprocessors connected with high-performance (e.g. gigabit) networking, taking advantage of multi-threading where appropriate. The algorithm uses a concurrent formulation of the PCT to de-correlate and compress a multi-spectral image cube. Spectral screening is used to give features that occur infrequently (e.g. mechanized vehicles in a forest) equal importance to those that occur frequently (e.g. trees in the forest). A human-centered color-mapping scheme is used to maximize the impact of spectral contrast on the human visual system. To demonstrate the efficacy of the multi-spectral system, plant-life scenes with both real and artificial foliage are used. These scenes demonstrate the systems ability to distinguish elements of a scene, based on spectral contrast, that cannot be distinguished with the naked eye. The capability is evaluated in terms of visual performance, scalability, and real-time throughput. Our previous work on predictive analytical modeling is extended to answer practical design questions such as “For a specified cost, what system can should be constructed and what performance will it attain?”
- Tiranee Achalakul, Stephen Taylor
- Department Computer Science, Syracuse University
- Email:tachalak@syr.edu
- Submitted September 14, 2000
- Comments to Authors March 13 2001
C488: Efficient Communication Using Message Prediction for Cluster of Multiprocessors
- Abstract:With the increasing uniprocessor and SMP computation power available today,interprocessor communication has become an important factor that limits the performance of cluster of workstations.Many factors including communication hardware overhead,communication software overhead,and the user environment overhead (multithreading,multiuser) affect the performance of the communication subsystems in such systems. A significant portion of the software communication overhead belongs to a number of message copying.Ideally,it is desirable to have a true zero-copy protocol where the message is moved directly from the send buffer in its user space to the receive buffer in the destination without any intermediate buffering.However due to the fact that message-passing applications at the send side do not know the final receive buffer addresses,early arrival messages have to be buffered at a temporary area.In this paper,we show that there is a message reception communication locality in message-passing applications.We have utilized this communication locality and devised different message predictors at the receiver sides of communications.In essence,these message predictors can be efficiently used to drain the network and cache the incoming messages even if the corresponding receive calls have not been posted yet.The performance of these predictors,in terms of hit ratio,on some parallel applications are quite promising and suggest that prediction has the potential to eliminate most of the remaining message copies .We also show that the proposed predictors do not have sensitivity to the starting message reception call,and that they perform better than (or at least equal to) our previously proposed predictors in [3 ].
- Ahmad Afsahi, Nikitas J.Dimopoulos
- Queens University and University of Victoria, Canada
- Email:ahmad@ee.queensu.ca
- Submitted September 14, 2000
- Comments to Authors March 13 2001
C489: Implementing a Dynamic Processor Allocation Policy for Multiprogrammed Parallel Applications in the Solaris TM Operating System
- Abstract:Parallel applications typically do not perform well in a multiprogrammed envi- ronment that uses time-sharing to allocate processor resources to the applications' parallel threads. Coscheduling related parallel threads, or statically partitioning the system, often can reduce the applications' execution times, but at the expense of reducing the overall system utilization. To address this problem, there has been increasing interest in dynamically allocating processors to applications based on their resource demands and the dynamically varying system load. The Loop-Level Process Control (LLPC) policy [16] dynamically adjusts the number of threads an application is allowed to execute based on the application's available parallelism and the overall system load. This study demonstrates the feasibility of incorporating the LLPC strategy into an existing commercial operating system and parallelizing compiler and provides further evidence of the performance improvement that is possible using this dynamic allocation strategy. In this implementation, applications are automatically parallelized and enhanced with the appropriate LLPC hooks so that each application interacts with the modied version of the Solaris operating system. The parallelism of the applications are then dynamically adjusted automatically when they are executed in a multiprogrammed environment so that all applications obtain a fair share of the total processing resources.
- Kelvin K. Yue and David J. Lilja
- Sun Microsystems and University of Minnesota
- Email:lilja@ece.umn.edu
- Submitted October 10, 2000, Accepted February 10 2001
C490: Lesser Bear — a Light-weight Process Library for SMP computers —
- Abstract:We have designed and implemented a light-weight process (thread)li- brary called “Lesser Bear ”for SMP computers.Lesser Bear has high portability and thread-level parallelism.Creating UNIX processes as virtual processors and a memory-mapped .le as a huge shared-memory space enables Lesser Bear to execute threads in parallel. Lesser Bear requires exclusive operation between peer virtual processors,and treats a shared-memory space as a critical section for synchronization of threads. Therefore,thread functions of the previous Lesser Bear are serialized. In this paper,we present a scheduling mechanism to execute thread functions in parallel.In the design of the proposed mechanism,we divide the entire shared- memory space into partial spaces for virtual processors,and prepare two queues (Protect Queue and Waiver Queue)for each partial space.We adopt an algorithm in which ock operations are not necessary for enqueueing.This algorithm allows us to propose a scheduling mechanism that can reduce the scheduling overhead.The mechanism is applied to Lesser Bear and evaluated by experimental results.
- Hisashi Oguma and Yasuichi Nakayama
- The University of Electro-Communications, Tokyo
- Email:oguma-h@igo.cs.uec.ac.jp
- Submitted October 24, 2000
- Comments to Authors March 13 2001, Accepted April 23 2001
C492: PoLAPACK: Parallel Factorization Routines with Algorithmic Blocking
- Abstract:LU, QR, and Cholesky factorizations are the most widely used methods for solving dense linear systems of equations, and have been extensively studied and implemented onvector and parallel computers. Most of these factorization routines are implemented with block- partitioned algorithms in order to perform matrix-matrix operations, that is, to obtain the highest performance by maximizing reuse of data in the upper levels of memory, such as cache. Since parallel computers have dierent performance ratios of computation and com- munication, the optimal computational block sizes are dierent from one another to generate the maximumperformance of an algorithm. Therefore, the data matrix should be distributed with the machine specific optimal block size before the computation. Too small or large a block size makes getting good performance on a machine nearly impossible. In such a case, getting a better performance may require a complete redistribution of the data matrix. In this paper, we present parallel LU, QR, and Cholesky factorization routines with an "algorithmic blocking" on 2-dimensional block cyclic data distribution. With the algorithmic blocking, it is possible to obtain the near optimal performance irrespective of the physical block size. The routines are implemented on the Intel Paragon and the SGI/Cray T3E and compared with the corresponding ScaLAPACK factorization routines.
- Jaeyoung Choi
- School of Computing Soongsil University 1-1, Sangdo-Dong, Dongjak-Ku Seoul 156-743, KOREA
- Email:choi@comp.soongsil.ac.kr
- Submitted November 14, 2000
- Comments to Authors February 4 2001, accepted March 12 2001
C493: Space-Time Tradeoffs for Parallel 3D Reconstruction Algorithms for Virus Structure Determination
- Abstract:The 3D electron density determination of viruses, from experimental data provided by electron microscopy, is a data intensive computation that requires the use of clusters of PCs or parallel computers. In this paper we report on three parallel algorithms used for 3D recon- struction of asymmetric objects from their 2d projections. We discuss their computational, communication, I/O, and space requirements and present some performance data. The algorithms are general and can be used for 3D reconstruction of asymmetric objects for applications other than structural biology.
- Dan C. Marinescu, Yongchang Ji, and Robert E. Lynch
- Department of Computer Sciences Purdue University West Lafayette, IN 47907
- Email:[dcm, ji, rel]@cs.purdue.edu
- Submitted November 17, 2000
- Comments to Authors March 13 2001 Accepted April 8 2001
C494: A New Parallel Domain Decomposition Method for the Adaptive Finite Element Solution of Elliptic Partial Differential Equations
- Abstract:We present a new domain decomposition algorithm for the parallel finite element solution of elliptic partial differential equations. As with most parallel domain decomposition methods each processor is assigned one or more subdomains and an iteration is devised which allows the processors to solve their own subproblem(s) concurrently. The novel feature of this algorithm however is that each of these subproblems is defined over the entire domain -- although the vast majority of the degrees of freedom for each subproblem are associated with a single subdomain (owned by the corresponding processor). This ensures that a global mechanism is contained within each of the subproblems tackled and so no separate coarse grid solve is required in order to achieve rapid convergence of the overall iteration. Furthermore, by following the paradigm introduced in [5], it is demonstrated that this domain decomposition solver may be coupled easily with a conventional mesh refinement code, thus allowing the accuracy, reliability and efficiency of mesh adaptivity to be utilized in a well load-balanced manner. Finally, numerical evidence is presented which suggests that this technique has significant potential, both in terms of the rapid convergence properties and the efficiency of the parallel implementation.
- Randolph E. Bank and Peter K. Jimack
- Dept. of Mathematics, University of California at San Diego, La Jolla, CA 92093, USA.School of Computer Studies, University of Leeds, Leeds LS2 9JT, UK.
- Email: pkj@comp.leeds.ac.uk
- Recieved 19 May 2000, Accepted December 7 2000,
C495: Development and Performance Analysis of Real-World Applications for Distributed and Parallel Architectures
- Abstract:Several large real-world applications have been developed for distributed and parallel architectures. We examine two different program development approaches: First, the usage of a high-level programming paradigm which reduces the time to create a parallel program dramatically but sometimes at the cost of a reduced performance. A source-to-source compiler, has been employed to automatically compile programs – written in a high-level programming paradigm – into message passing codes. Second, manual program development by using a low-level programming paradigm – such as message passing – enables the programmer to fully exploit a given architecture at the cost of a time-consuming and error-prone effort. Performance tools play a central role to support the performance-oriented development of applications for distributed and parallel architectures. SCALA – a portable instrumentation, measurement, and post-execution performance analysis system for distributed and parallel programs – has been used to analyse and to guide the application development by selectively instrumenting and measuring the code versions, by comparing performance information of several program executions, by computing a variety of important performance metrics, by detecting performance bottlenecks, and by relating performance information back to the input program. We show several experiments of SCALA when applied to real-world applications. These experiments are conducted for a NEC Cenju-4 distributed memory machine and a cluster of heterogeneous workstations and networks.
- T. Fahringer, P.Blaha, A. Hossinger, J. Luitz, E. Mehofer, H. Moritsch, B. Scholz
- Institute for Software Science, University of Vienna Liechtensteinstrasse 22, A-1092, Vienna, Austria. Also Vienna University of Technology and Dept. of Business administration at University of Vienna
- Email: tf@par.univie.ac.at
- Recieved August 1 1999, Revised November 20 2000, Accepted December 18 2000,
C496: Open Consensus
- Abstract:This paper presents the abstraction of open consensus and argues for its use as an effective component for building reliable agreement protocols in practical asynchronous systems where processes and links can crash and recover.The specification of open consensus has a decoupled on-demand and re-entrant behaviour that make its use very efficient,especially in terms of forced logs, which are known to be major sources of overhead in distributed systems.We illustrate the use of open consensus as a basic building block to develop a modular, yet efficient, total order broadcast protocol.Finally, we describe our Java implementation of our open consensus abstraction and we convey our efficiency claims through some practical performance measures.
- Romain Boichat, Svend Frølund, Rachid Guerraoui
- Swiss Federal Institute of Technology, CH 1015, Lausanne and Hewlett-Packard Laboratories, 1501 Page Mill Rd, Palo Alto
- Email: Romain.Boichat@epfl.ch
- Received January 17 2001, Comments to Authors May 8 2001, Accepted May 19 2001
C497: Java for High-Performance Computing
- Abstract:There has been an increasing research interest in extending the use of Java towards high-performance demanding applications such as scalable web servers, multimedia applications, and large-scale scientific applications. However, given the low performance of current Java implementations, these application domains pose new challenges to both the application designer and systems developer. In this paper we describe and classify several important proposals and environments that tackle Java's performance bottlenecks in order to make the language an effective option for high-performance computing. We further survey most significant performance issues while exposing the potential benefits and limitations of current solutions in such a way that a framework for future research efforts can be established. We show that most of the proposed solutions can be classified according to some combination of the three basic parameters: the model adopted for inter-process communication, language extensions, and the implementation strategy. In addition, we examine other relevant issues, such as interoperability, portability, and garbage collection.
- M. Lobosco, C. Amorim, and O. Loques
- Instituto de Computação Universidade Federal Fluminense Niterói, Rio de Janeiro, Brazil
- Email: loques@ic.uff.br
- Received January 15 2001, Comments to authors May 28 2001
C498: Simulating asynchronous hardware on multiprocessor platforms: the case of AMULET1
- Abstract:Synchronous VLSI design is approaching a critical point, with clock distribution becoming an increasingly costly and complicated issue and power consumption rapidly emerging as a major concern. Hence, recently, there has been a resurgence of interest in asynchronous digital design techniques which promise to liberate digital design from the inherent problems of synchronous systems. This activity has revealed a need for modelling and simulation techniques suitable for the asynchronous design style. The concurrent process algebra Communicating Sequential Processes (CSP) and its executable counterpart, occam, are increasingly advocated as particularly suitable for this purpose. This paper focuses on issues related to the execution of CSP/occam models of asynchronous hardware on multiprocessor machines, including I/O, monitoring and debugging, partition, mapping and load balancing. These issues are addressed in the context of occarm, an occam simulation model of the AMULET1 asynchronous microprocessor, however the solutions devised are more general and may be applied to other systems too.
- Georgios K. Theodoropoulos
- School of Computer Science, The University of Birmingham, Birmingham B15 2TT, U.K.
- Email: gkt@cs.bham.ac.uk
- Received Oct 18 2000, Accepted February 22 2001
C499: Data Movement and Control Substrate for Parallel Adaptive Applications
- Abstract:In this paper, we present the Data Movement and Control Substrate (DMCS), that implements one-sided low-latency communication primitives for parallel adaptive and irregular computations. DMCS is built on top of low-level, vendor-specific communication subsystems such as LAPI for IBM SP machines and widely available libraries like MPI for clusters of workstations and PCs. DMCS adds a small overhead to the communication operations provided by the lower communication system. In return DMCS provides a flexible and easy to understand application program interface for one-sided communication operations including put-ops and remote procedure calls. Furthermore, DMCS is designed so that it can be easily ported and maintained by non-experts.
- Jeffrey Dobbelaere, Kevin Barker, Nikos Chrisochoides, and Demian Naves, Keshav Pingali
- Computer Science Department College of William & Mary Williamsburg, VA 23185.
- Computer Science Department Cornell University Ithaca, NY 14853-3801
- Email: nikos@plato.cs.wm.edu
- Received 11 March 2001
C500: (ACES Special Issue)Parallel visualization of gigabyte datasets in GeoFEM
- Abstract:Parallel visualization of large datasets in GeoFEM is described. Our visualization subsystem supports concurrent visualization with computation, and outputs a simplified small graphic primitive set, which can be displayed by a basic viewer on clients. This subsystem provides many kinds of parallel visualization algorithms for the users to visualize their data from scalar, vector to tensor. Scalar field topology analysis, flow field topology and semi-automatic parameter design are employed to improve the quality of visualization results. We also present a simplification algorithm to reduce the number of output graphic primitives, accounting for both shape attribute and color attribute. The experimental results show the effectiveness of our subsystem.
- Issei Fujishiro, Yuriko Takeshima, Li Chen, Hiroko Nakamura , and Yasuko Suzuki
- Ochanomizu University, Tokyo, Japan
- Research Organization for Information Science & Technology, Tokyo, Japan
- State Key Lab. of CAD&CG, Zhejiang University, Hangzhou, 310027, P.R. China.
- Email: chen@tokyo.rist.or.jp
- Received 26 February 2001
C501: (ACES Special Issue)Parallelization of a Large-Scale Computational Earthquake Simulation Program
- Abstract:Here we detail both the methods and preliminary results of first efforts to parallelize two General Earthquake Model (GEM)-related codes: 1) a relatively simple data mining procedure based on a Genetic Algorithm; and 2) the Virtual California simulation of GEM . These preliminary results, using a simple, heterogeneous processor system, existing freeware, and with an extremely low cost of both manpower and hardware dollars, motivate us to more ambitious work with considerably larger-scale computer earthquake simulations of southern California. The GEM computational problem, which is essentially a Monte Carlo simulation, is well suited to optimization on parallel computers, and we outline how we are proceeding in implementing this new software architecture.
- K.F. Tiampo , J.B. Rundle , S. Gross and S. McGinnis
- Colorado Center for Chaos & Complexity, CIRES, University of Colorado, Boulder, CO, 80309, USA
- Email: kristy@caldera.colorado.edu
- Received 23 February 2001
C502: (ACES Special Issue)Parallel Iterative Solvers for Unstructured Grids using Directive/MPI Hybrid Programming Model for GeoFEM Platform on SMP Cluster Architectures
- Abstract:In this study, efficient parallel iterative method for unstructured grid has been developed for SMP (shared memory symmetric multiprocessor) cluster architectures on GeoFEM platform. 3-level hybrid parallel programming model has been applied : 1. message passing for inter SMP node communication, 2. loop directive for intra SMP node parallelization and 3. vectorization for each PE (processing element). Simple 3D elastic linear problems with more than 10 8 DOFs have been solved by 3x3 block ICCG(0) with additive Schwartz domain decomposition and PDJDS/CM-RCM ((Parallel Descending-order Jagged Diagonal Storage/ Cyclic Multicolor-Reverse Cuthil Mckee)) re-ordering on 16 SMP nodes of Hitachi SR8000 and 20 GFLOPS performance has been obtained. PDJDS/CM-RCM reordering method provides excellent vector and parallel performance in SMP nodes. This re-ordering is essential for parallelization of forward/backward substitution in IC/ILU factorization with global data dependency. Developed method was also tested on NEC SX-4 and attained 969 MFLOPS (48.5% of peak performance) using single processor. Additive Schwartz domain decomposition method provided robustness to the GeoFEM's parallel iterative solvers with localized preconditioning.
- Kengo Nakajima and Hiroshi Okuda
- Department of Computational Earth Sciences, Research Organization for Information Science and Technology (RIST), Tokyo, Japan
- Department of Quantum Engineering and Systems Science, The University of Tokyo, Tokyo, Japan
- Email: nakajima@tokyo.rist.or.jp
- Received 27 February 2001
C503: (ACES Special Issue)Finite Element Modeling of Multibody Contact and Its Application to Active Faults
- Abstract:Earthquakes have been recognized as resulting from a stick-slip frictional instability along the faults between deformable rocks. An arbitrarily shaped contact element strategy, named as node-to-point contact element strategy is proposed to handle the friction contact between deformable bodies with stick and finite frictional slip and extended to simulate the active faults in the crust with a more general nonlinear friction law. Also introduced is an efficient contact search algorithm for contact problems among multiple small and finite deformation bodies. Moreover, the efficiency of the parallel sparse solver for the nonlinear friction contact problem is investigated. Finally, a model for the plate movement in the northeast zone of Japan under gravitation is taken as an example to be analyzed with different friction behaviors.
- H.L. Xing and A. Makinouchi
- Materials Fabrication Lab, The Institute of Physical and Chemical Research (RIKEN) 2-1 Hirosawa, Wako, Saitama, 351-0198, Japan
- Email: xing@postman.riken.go.jp
- Received 28 February 2001
C504: (ACES Special Issue)Effective Adaptation Technique for Hexahedral Mesh
- Abstract:This paper describes simple and effective adaptive techniques for the finite element method.The proposed refinement method is based on a modified octree divsion. This modified octree-based method can be applied to an arbitary hexahedral unstructured mesh. Hex-R is the the name of our implentation and it refines the mesh in cooperation with a finite element viewer: GPPView. The refined mesh retains the topology of the original coarse mesh. Therefore one can easily use the proposed method for multigrid generation. Finally the construction of an adaptive method using Hex-R is mentioned.
- Yoshitaka Wada
- Department of Computational Earth Sciences, Research Organization for Information Science and Technology (RIST), Tokyo, Japan
- Email: wada@tokyo.rist.or.jp
- Received March 2 2001
C505: (ACES Special Issue)Thermal Convection Analysis in a Rotating Shell by a Parallel FEM - Development of a Thermal-Hydraulic Subsystem of GeoFEM
- Abstract:We carry out a numerical simulation of thermally driven convection in a rotating spherical shell modeled on the Earth ’s outer core using Thermal-Hydraulic subsystem of GeoFEM,which gives a parallel FEM platform.This simulation is for understanding of the origin of the geomagnetic field and dynamics of a fluid in the Earth’s outer core.We solve a three-dimensional and time dependent process of a Boussinesq fluid in a rotating spherical shell under effects of self gravity and the Coriolis force.In this simulation,tri-linear hexahedron element is applied for spatial distribution,and 2nd-order Adams-Bashforth scheme are applied for time integration of the temperature and velocity.To satisfy the mass conservation,a parallel iterative solver of GeoFEM is applied for solving pressure and correction of the velocity fields.To verify our simulation code,results of the simulation are compared with the same analysis by the spectral method.In these results,outline of convection is almost same in these cases;that is,three pairs of convection columns are formed,and these columns propagate to the westward in quasi-steady state.However, magnitude of kinetic energy averaged over the shell is about 93%of that in the case by the spectral method,and drift frequency of the columns in the GeoFEM case is larger than that in the case by the spectral method.
- Hiroaki Matsui and Hiroshi Okuda
- Department of Research for Computational Earth Science, Research Organization for Information Science & Technology (RIST), Tokyo
- Department of Quantum Engineering and System Science, The University of Tokyo, Tokyo, Japan
- Email: matsui@tokyo.rist.or.jp
- Received March 3 2001
C506: (ACES Special Issue)Performance Optimization of GeoFEM on Various Computer Architecture
- Abstract:We have a prospect that Geofem get good parallel performance on various computer architecture by the research which has been made in GeoFEM team. In this research, we focus a target to performance with a single processor, and common data structure / coding manner to make GeoFEM running high performance on various computer architecture was researched.
  Test coding for Solver Part of the structure and the fluid analysis code Data structure and coding manner was evaluated on scalar/vector/pseudo vector architecture. A new data structure and a new direct access manner was introduced in the fluid analysis solver. As a result, 21% performance for peak performance was obtained on pseudo vector architecture. And We got as good performance as the pseudo-vector execution performance in scalar execution. 25% performance for peak performance was obtained on vector architecture. A new direct access manner was introduced in the structure code solver. As a result, 27% performance for peak performance was obtained on pseudo vector architecture. 34% performance for peak performance was obtained on vector architecture.
  
  Test Code for Matrix Assemble Part of the structure analysis code Coding of removing dependency was finished and performance was evaluated on vector/scalar machine. 736.8MFlops was obtained at matrix assembling process on SX-4. 900.7MFlops was obtained at whole test Code on SX-4, and 2.06GFlops was obtained on VPP5000 (peak : 9.6GFlops). 124MFlops was obtained at matrix assembling process on Alpha system (Alpha 21164 533MHz).
- Kazuo Minami
- Department of Computational Earth Sciences, Research Organization for Information Science and Technology (RIST), Tokyo, Japan
- Email: minami@tokyo.rist.or.jp
- Received March 5 2001
C507: (ACES Special Issue)Parallel Multilevel Iterative Linear Solvers with Unstructured Adaptive Grids for Simulations in Earth Science
- Abstract:Preconditioned iterative solver is one of the most powerful choice such as IC (Incomplete Cholesky) or ILU (Incomplete LU) factorization method for large-scale scientific computation. But in these methods, iteration number until convergence increases as the problem size becomes larger. In multigrid solvers, the rate of convergence is independent of problem size and the number of iterations remains fairly constant. Multigrid is also a good preconditioner for Krylov iterative solvers. In this study, multigrid preconditioned conjugate gradient iterative method (MGCG) on parallel computers has been developed and applied to the Poisson equation in the region between dual sphere surfaces on semi-unstructured prismatic grids generated adaptively. Moreover this procedure has been applied to the grids with local refinement. Computational results on Hitachi SR2201 using up to 128 processors show good scalability of the developed method.
- Kengo Nakajima
- Department of Computational Earth Sciences, Research Organization for Information Science and Technology (RIST), Tokyo, Japan
- Email: nakajima@tokyo.rist.or.jp
- Received 19 March 2001
C508: Comparing Windows NT, Linux, and QNX as the Basis for Cluster Systems
- Abstract:Clusters use commodity hardware and software components to provide an environment for parallel processing. A major issue in the development of a cluster system is the choice of the operating system that will run on each node. We compare three alternatives: Windows NT, Linux, and QNX — a real-time microkernel. The comparison is based on expressive power, performance, and ease-of-use metrics. The result is that none of these systems has a clear advantage over the others in all the metrics, but that each has its strong and weak points. Thus any choice of a base system will involve some technical compromises, but not major ones.
- Avi Kavas, Dror G. Feitelson
- School of Computer Science and Engineering The Hebrew University, 91904 Jerusalem, Israel
- Email: feit@cs.huji.ac.il
- Received 11 February 2001, Comments to authors May 28 2001
C509: (ACES Special Issue)Parallel Simulation System for Earthquake Generation: Fault Analysis Modules and Parallel Coupling Analysis
- Abstract:Solid earth simulations have recently been developed to address issues such as natural disasters, global environmental destruction and the conservation of natural resources. The simulation of solid earth phenomena involves the analysis of complex structures including strata, faults, and heterogeneous material properties. Simulation of the generation and cycle of earthquakes is particularly important solid earth phenomena, but such a simulation requires analysis of a complex fault dynamics. GeoFEM (Iizuka et al. , 1999) is a parallel finite element analysis system intended for problems of solid earth field phenomena. This paper shows recent research of GeoFEM for simulation of earthquake generation and cycles.
  Key words: Solid earth simulations, Generation and cycle of earthquakes, GeoFEM, Parallel finite element analysis.
- Mikio Iizuka, Daigo Sekita, Hisashi Suito, Mamoru Hyodo, Karzuro Hirahara, David Place, Peter Mora, Osamu Hazama, and Hiroshi Okuda
- Organization for Information Science & Technology. Mitsubishi Reserch Institute,Inc. Earth and Planetary Sciencs, Nagoya University QUAKES, Department of Earth Sciences, The University of Queensland. Yokohama National University. Department of Quantum Engineering and Systems Science, The University ofTokyo
- Email: iizuka@tokyo.rist.or.jp
- Received 23 April 2001
C514: Parallel Dynamic Scheduling in a Cluster of Computers
- Abstract:The parallelization of complex planning and control problems arising in diverse application areas in the industrial, services, and commercial environments not only allows the determination of control variables in the required times but also improves the performance of the control procedure as more processors are involved in the execution of the parallel program. In this paper we describe a scheduling application in a water supply network to demonstrate the benefits of parallel processing. The procedure we propose combines dynamic programming with genetic algorithms and time series prediction in order to solve problems in which decisions are made in stages, and the states an control belong to a continuous space. Taking into account the computational complexity of these applications and the time contraints that are usually imposed, the procedure has been implemented by a parallel program in a cluster of computers, an inexpe sive and widely extended platform that can make parallelism a practical means of tackling complex problems in many different environments.
- M. Damas, M. Salmeron, J. Ortega, G. Olivares, H. Pomares
- Department of Computer Architecture and Computer Technology, University of Granada. Facultad de Ciencias. Campus Fuentenueva s/n. E-18071, Granada, Spain
- Email: mdamas@atc.ugr.es
- Received 13 November 2000, Comments to Authors June 2 2001
C515: A Parallel Algorithm for Static Slicing of Concurrent Programs
- Abstract:Slicing of concurrent programs is a compute-intensive task. To speed up the slicing process, we develop a parallel algorithm. For this purpose we use a Concurrent Control Flow Graph (CCFG) as the intermediate representation. We use a network of communicating processes to develop our parallel algorithm. We have implemented our parallel algorithm and the experimental results appear promising.
- D. Goswami, R. Mall
- Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur INDIA
- Email: diganta@cse.iitkgp.ernet.in
- Received 10 May 2001
C516: An Analysis of VI Architecture Primitives in Support of Parallel and Distributed Communication
- Abstract:We present the results of a detailed study of the Virtual Interface (VI) paradigm as a communication foundation for a distributed computing environment. Using Active Messages and the Split-C global memory model, we analyze the inherent costs of using VI primitives to implement these high-level communication abstractions. We demonstrate a minimum mapping cost (i.e. the host processing required to map one abstraction to a lower abstraction) of 5.4 microsec for both Active Messages and Split-C using 4-way 550 MHz Pentium III SMPs and the Myrinet network. We break down this cost to use of individual VI primitives in supporting flow control, buffer management and event processing and identify the completion queue as the source of the highest overhead. Bulk transfer performance plateaus at 44 MB/sec for both implementations due to the addition of fragmentation requirements. Based on this analysis, we present the implications for the VI successor, Infiniband.
- Andrew Begel, Philip Buonadonna, David E. Culler, David Gay
- University of California, Berkeley
- Email: philipb@cs.berkeley.edu
- Received 9 May 2001
C517: Automatic Determination of Matrix-Blocks
- Abstract:Many sparse matrices have a natural block structure, for instance arising from the discretisation of a physical domain. We give an algorithm for finding this block structure from the matrix sparsity pattern. This algorithm can be used for instance in iterative solver libraries in cases where the user does not or can not pass this block structure information to the software. The block structure can then be used as a basis for domain-decomposition based preconditioners.
- Victor Eijkhout
- Department of Computer Science; University of Tennessee, Knoxville, TN 37996
- Email: eijkhout@cs.utk.edu
- Received 5 June 2001
C518: A framework for high-performance matrix multiplication based on hierarchical abstractions, algorithms and optimized low-level kernels
- Abstract:Despite extensive research, optimal performance has not easily been available previously for matrix multiplication (especially for large matrices) on most architectures because of the lack of a structured approach and the limitations imposed by matrix storage formats. A simple but effective framework is presented here that lays the foundation for building high-performance matrix-multiplication codes in a structured, portable and efficient manner. The resulting codes are validated on three dfferent representative RISC and CISC architectures on which they significantly outperform highly optimized libraries such as ATLAS and other competing methodologies reported in the literature. The main component of the proposed approach is a hierarchical storage format that efficiently generalizes the applicability of the memory hierarchy friendly Morton ordering to arbitrary-sized matrices. The storage format supports polyalgorithms, which are shown here to be essential for obtaining the best possible performance for a range of problem sizes. Several algorithmic advances are made in this paper, including an oscillating iterative algorithm for matrix multiplication and a variable recursion cutoff criterion for Strassen's algorithm. The authors expose the need to standardize linear algebra kernel interfaces, distinct from the BLAS, for writing portable high-performance code. These kernel routines operate on small blocks that fit in the L1 cache. The performance advantages of the proposed framework can be effectively delivered to new and existing applications through the use of object-oriented or compiler-based approaches.
- Vinod Valsalam and Anthony Skjellum
- High Performance Computing Laboratory, Department of Computer Science, Mississippi State University, MS 39762.
- Email: tony@hpcl.cs.msstate.edu
- Received June 1 2001
C519: Parallel Implementation of the Fluid Particle Model for Simulating Complex Fluids in the Mesoscale
- Abstract:Dissipative particle dynamics (DPD) and its generalization - the fluid particle model (FPM) - represent the “fluid particle” approach for simulating fluid-like behavior in the mesoscale. Unlike particles from molecular dynamics (MD) method, the “fluid particle” can be viewed as a “droplet” consisting of liquid molecules. In FPM, “fluid particles” interact by both central and non-central, short-range forces with conservative, dissipative and Brownian character. In comparison to MD, FPM method in 3-D requires two to three times more memory load and three times more communication overhead. Computational load per step per particle is comparable to MD due to the shorter interaction range allowed between “fluid particles” than between MD atoms. The classical linked-cells technique and decomposing the computational box into strips allow for rapid modifications of the code and for implementing non-cubic computational boxes. We show that the efficiency of the FPM code depends strongly on the number of particles simulated, geometry of the box, and the computer architecture. We give a few examples from long FPM simulations involving up to 8 million fluid particles and 32 processors. Results from FPM simulations in 3-D of the phase separation in binary fluid and dispersion of colloidal slab are presented. Scaling law for symmetric quenching in phase separation has been properly reconstructed. We show also that the microstructure of dispersed fluid depends strongly on the contrast between kinematic viscosities of this fluid phase and the bulk phase. This FPM code can be applied for simulating mesoscopic flow dynamics in capillary pipes or critical flow phenomena in narrow blood vessels.
- Krzysztof Boryczko, Witold Dzwinel, David A.Yuen
- AGH Institute of Computer Science, al. Mickiewicza 30, 30-059, Kraków, Poland, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, Minnesota 55415-1227,USA
- Email: dwitek@msi.umn.edu
- Received June 25 2001
C520: (ACES Special Issue)Interaction between Seismogenic Body and Force-Supplying Body and Precursor in Far Field
- Abstract:Seismogenic body and wall rock of force-supply are analogized respectively by rocks sample and block of transferring pressure in test of fracture. Their interactions are used to study precursor in far field and ultra-far field. Observation studies of stress and strain are made on them. The results show that there appears mutation or disturbance precursor of strain in the force-measuring sensor or on block of transferring pressure before rock fracture beside anomaly variation appearing in all observing points on sample. The observing points on lock of transferring pressure are 25-90 cm away from sample, which are 3-10 times of fracture sizes. It is explained that precursor anomaly is possible appearing in far places from seismogenic area. The reason forming far field precursor is primarily studied.
- Xu Zhaoyong, Mei Shirong, Yang Runhai, Wang Jinglai, Zhao Jinming, Wang Bin, Hua Peizhong
- Seismological Bureau of Yunnan Province, Kunming, 650041, P. R. of China, Center for Analysis and Prediction, CSB, Beijing, 100036, P. R. of China
- Email: xuzhaoyong@netease.com
- Received Jan 12 2001