QUARTERLY R&D STATUS REPORT (September 1996)
Wei Li
Briefly describe activity for each task/subtask.
Rochester's goal in this project is to assess the utility of the new runtime system for distributed shared-memory architectures. During the past year, we made substantial progress toward our goal of building a parallelizing compiler and runtime system for distributed shared-memory machines. We have been studying the design of the new Compiler Runtime Systems in the context of data transformations and optimizations for distributed shared-memory machines. Our work falls into three categories: locality optimizations, scheduling for networks of workstations, and parallel and distributed data mining. In addition, we are building a Java compiler.
(1) For locality optimizations, we have developed compile-time optimizations that reduce false sharing. Many parallel computers provide shared memory with coherent caches, and false sharing is a serious problem for many applications that run on such multiprocessors. The problem has been studied both in the context of generating parallel code from sequential programs and in the context of programs with user-level parallelism. We consider explicitly parallel programs with user-level parallelism that conform to the SPMD (Single Program Multiple Data) model. Several ways to eliminate false sharing in programs with task parallelism have been proposed, but most of the solutions to date require programmer intervention. We propose methods that can be applied automatically to effectively eliminate false sharing in some classes of applications. We use two techniques: array remapping (static and dynamic) and software caching, implemented in a source-to-source optimizer for C programs. We tested the effectiveness of the transformations on a set of six applications and conclude that a significant improvement in running time can be achieved: for all but the smallest data sizes, an improvement of at least a factor of two is observed, and for some applications execution times are reduced by as much as a factor of ten. This result has been published in the 10th Annual International Conference on High Performance Computers, June 5-7, 1996, Ottawa, Canada.
In another paper, on data reuse for distributed shared-memory machines, we compared experimentally the inherent data reuse in standard benchmarks, the data locality behavior observed during program execution, and the reuse patterns captured by static compiler models. Inherent data reuse provides a machine-independent upper bound on exploitable reuse. We define three levels of reuse: {\em program reuse}, {\em loop nest reuse}, and {\em innermost loop reuse}. We show that there is more reuse to be exploited if we consider optimizations across loop nests. We observe that static models capture only reuse within the innermost loop or a single loop nest, and most static models provide similar information about innermost-loop reuse. We therefore need more global compile-time reuse analysis and algorithms to exploit global locality. This result has also been published in the 10th Annual International Conference on High Performance Computers, June 5-7, 1996, Ottawa, Canada.
(2) For networks of workstations, our recent work showed that a compile-time and run-time modeling and decision process, combined with a run-time load balancing library, enables us to choose the best strategy. Since a large number of loop scheduling techniques have been proposed in the literature, this analysis helps a parallelizing compiler customize the dynamic load balancing strategy for a given application. A paper on this result has been accepted for publication in the 5th IEEE International Symposium on High-Performance Distributed Computing, Syracuse, New York, August 1996.
(3) Data mining is an emerging research area whose goal is to extract significant patterns or interesting rules from large databases. High-level inference from large volumes of routine business data can provide valuable information to businesses, such as customer buying patterns, shelving criteria in supermarkets, and stock trends. Many algorithms have been proposed for mining association rules, but research so far has focused mainly on sequential algorithms. In this work we present parallel algorithms for data mining of association rules, and study the degree of parallelism, synchronization, and data locality issues on the SGI Power Challenge shared-memory multiprocessor. We further present a set of optimizations for the sequential and parallel algorithms. Experiments show that our proposed optimizations yield a significant performance improvement. We also achieved good speedup for the parallel algorithm, but we observe a need for parallel I/O techniques for further performance gains. A paper on this result has been accepted for publication in Supercomputing'96, Pittsburgh, Pennsylvania, November 1996.
In another piece of work, we examine the problem of compiler analysis for database programs. Most parallel databases exploit two types of parallelism: {\em intra-query} parallelism and {\em inter-transaction} concurrency. Between these two cases lies another type of parallelism: {\em inter-query} parallelism within a transaction or application. Exploiting inter-query parallelism requires either compiler support to automatically parallelize existing {\em embedded} query programs, or programming support to write explicitly parallel query programs. We have developed compiler analysis to automatically detect parallelism in embedded query programs, and have proposed compiler dependence test algorithms for detecting both {\em internal} and {\em external} dependences. A paper on this result has been accepted for publication in the Eighth IEEE Symposium on Parallel and Distributed Processing, New Orleans, Louisiana, October 1996.
(4) In this work, we present a Java compiler architecture that uses a unique combination of front- and back-ends to deliver great flexibility. Our compiler is designed to use the same optimization passes regardless of which front-end/back-end pair is used. More details can be found in "Briki: a flexible Java compiler" by M. Cierniak and W. Li, TR 621, Computer Science Dept., U. Rochester, May 1996.
As a companion to the Java compiler work, we have built a system for network-based visualization of profile information generated by Java applets/bytecode. The system, called {\bf NetProf}, is composed of several components, each interesting in its own right: a bytecode-to-Java source code translator; a profiler that includes a static pass to insert profiling code; a dynamic runtime library that records relevant events; and a visualization mechanism that presents the profile information in an easy-to-use manner. All of this can be done over the Internet using a client-server approach, independent of the underlying architecture/machine and requiring no human intervention. More details can be found in "NetProf: Network-based high-level profiling of Java bytecode" by S. Parthasarathy, M. Cierniak, and W. Li, TR 622, Computer Science Dept., U. Rochester, May 1996.
We have been implementing the new optimizations we have proposed. We expect to target the compiler to the new runtime system.
We have developed a prototype Java compiler called Briki, and a network visualization system called NetProf.
No change.
We have published the following papers:
We will continue to study the proposed Common Data Descriptor Draft to see its utility in data transformations and optimizations for distributed shared-memory machines.
We will also study the address translation mechanisms to see their utility in data transformations and optimizations for distributed shared-memory machines.
We will implement our new results in our compiler, and retarget the compiler to the new runtime systems as it is being developed.
Planned: $48,000 (1/1/96--8/31/96), $150,000 (8/15/94--8/31/96)
Actual: $47,595 (1/1/96--8/31/96), $143,406 (8/15/94--8/31/96)
No. Current funding will run out by September 30, 1996. We will need the remaining $75,000 to complete the project for the final period.
We will need the $75,000 originally budgeted for the third year (8/15/96--8/14/97). This would give us a cumulative budget of $225,000.