QUARTERLY R&D STATUS REPORT (September 1996)
Wei Li
Briefly describe activity for each task/subtask.
Rochester's goal in this project is to assess the utility of the new runtime system for distributed shared-memory architectures. During the past year, we made substantial progress toward our goal of building a parallelizing compiler and runtime system for distributed shared-memory machines. We have been studying the design of the new Compiler Runtime Systems in the context of data transformations and optimizations for distributed shared-memory machines. Our work falls into three categories: locality optimizations, scheduling for networks of workstations, and parallel and distributed data mining. In addition, we are building a Java compiler.
(1) For locality optimizations, we have developed compile-time optimizations that reduce false sharing. Many parallel computers provide shared memory with coherent caches, and false sharing is a serious problem for many applications that run on such multiprocessors. The problem has been studied both in the context of generating parallel code from sequential programs and in the context of programs with user-level parallelism. We consider explicitly parallel programs with user-level parallelism that conform to the SPMD (Single Program Multiple Data) model. Several ways to eliminate false sharing in programs with task parallelism have been proposed, but most of the solutions to date require programmer intervention. We propose methods that can be applied automatically to effectively eliminate false sharing in some classes of applications. We use two techniques: array remapping (static and dynamic) and software caching, implemented in a source-to-source optimizer for C programs. We tested the effectiveness of the transformations on a set of six applications and conclude that a significant improvement in running time can be achieved: for all but the smallest data sizes, an improvement of at least a factor of two is observed, and for some applications execution times are reduced by as much as a factor of ten. This result has been published in the 10th Annual International Conference on High Performance Computers, June 5-7, 1996, Ottawa, Canada.
In another paper, on data reuse for distributed shared-memory machines, we compared experimentally the inherent data reuse in standard benchmarks, the data locality behavior observed during program execution, and the reuse patterns captured by static compiler models. Inherent data reuse provides a machine-independent upper bound on exploitable reuse. We define three levels of reuse: {\em program reuse}, {\em loop nest reuse}, and {\em innermost loop reuse}. We show that there is more reuse to be exploited if we consider optimizations across loop nests. We observe that static models capture only reuse within the innermost loop or a single loop nest, and most static models provide similar information about innermost-loop reuse. We therefore need more global compile-time reuse analysis and algorithms to exploit global locality. This result has also been published in the 10th Annual International Conference on High Performance Computers, June 5-7, 1996, Ottawa, Canada.
(2) For networks of workstations, our recent work showed that a compile-time and run-time modeling and decision process, combined with a run-time load balancing library, enables us to choose the best strategy. Since a large number of loop scheduling techniques have been proposed in the literature, this analysis helps a parallelizing compiler customize the dynamic load balancing strategy for a given application. A paper on this result has been accepted for publication in the 5th IEEE International Symposium on High-Performance Distributed Computing, Syracuse, New York, August 1996.
(3) Data mining is an emerging research area whose goal is to extract significant patterns or interesting rules from large databases. High-level inference from large volumes of routine business data can provide valuable information to businesses, such as customer buying patterns, shelving criteria in supermarkets, and stock trends. Many algorithms have been proposed for mining association rules, but research so far has focused mainly on sequential algorithms. In this work we present parallel algorithms for data mining of association rules, and study the degree of parallelism, synchronization, and data locality issues on the SGI Power Challenge shared-memory multiprocessor. We further present a set of optimizations for the sequential and parallel algorithms. Experiments show that our proposed optimizations yield a significant performance improvement. We also achieved good speedup for the parallel algorithm, but we observe a need for parallel I/O techniques for further performance gains. A paper on this result has been accepted for publication in Supercomputing'96, Pittsburgh, Pennsylvania, November 1996.
In another piece of work, we examine the problem of compiler analysis for database programs. Most parallel databases exploit two types of parallelism: {\em intra-query} parallelism and {\em inter-transaction} concurrency. Between these two cases lies another type of parallelism: {\em inter-query} parallelism within a transaction or application. Exploiting inter-query parallelism requires either compiler support to automatically parallelize existing {\em embedded} query programs, or programming support to write explicitly parallel query programs. We have developed compiler analysis to automatically detect parallelism in embedded query programs, and have proposed compiler dependence test algorithms for detecting both {\em internal} and {\em external} dependences. A paper on this result has been accepted for publication in the Eighth IEEE Symposium on Parallel and Distributed Processing, New Orleans, Louisiana, October 1996.
(4) In this work, we present a Java compiler architecture that uses a unique combination of front- and back-ends to deliver great flexibility. Our compiler is designed to use the same optimization passes regardless of which front-end/back-end pair is used. More details can be found in "Briki: a flexible Java compiler" by M. Cierniak and W. Li, TR 621, Computer Science Dept., U. Rochester, May 1996.
As a companion to the Java compiler work, we have built a system for network-based visualization of profile information generated by Java applets/bytecode. The system, called {\bf NetProf}, is composed of several components, each interesting in its own right: a bytecode-to-Java source code translator; a profiler that includes a static pass to insert profiling code; a dynamic runtime library that records relevant events; and a visualization mechanism that presents the profile information in an easy-to-use manner. All of this can be done over the Internet using a client-server approach, independent of the underlying architecture/machine and requiring no human intervention. More details can be found in "NetProf: Network-based high-level profiling of Java bytecode" by S. Parthasarathy, M. Cierniak, and W. Li, TR 622, Computer Science Dept., U. Rochester, May 1996.
We have been implementing the new optimizations we have proposed. We expect to target the compiler to the new runtime system.
We have developed a prototype Java compiler called Briki, and a network visualization system called NetProf.
No change.
We have published the following papers:
We will continue to study the proposed Common Data Descriptor Draft to see its utility in data transformations and optimizations for distributed shared-memory machines.
We will also study the address translation mechanisms to see their utility in data transformations and optimizations for distributed shared-memory machines.
We will implement our new results in our compiler, and retarget the compiler to the new runtime systems as it is being developed.
Planned: $48,000 (1/1/96--8/31/96), $150,000 (8/15/94--8/31/96)
Actual: $47,595 (1/1/96--8/31/96), $143,406 (8/15/94--8/31/96)
No. Current funding will run out by September 30, 1996. We will need the remaining $75,000 to complete the project for the final period.
We will need the $75,000 originally budgeted for the third year (8/15/96--8/14/97). This would give us a cumulative budget of $225,000.