Note: section numbers such as 2.1syr refer to text in the original document and are hyperlinked to it.
Having collected large volumes of routine data, many organizations are increasingly turning to extracting useful information from such databases. Such high-level inference processes may reveal patterns in large databases that are of particular interest to defense analysis.
Data mining is an emerging research area whose goal is to extract significant patterns or interesting rules from such large databases. Data mining is in fact a broad area which combines research in machine learning, statistics and databases. It can be broadly classified into three main categories: Classification -- finding rules that partition the database into disjoint classes; Sequences -- extracting commonly occurring sequences in temporal data; and Associations -- finding the sets of most commonly occurring groupings of items.
At Rochester, we have worked on the problem of mining association rules in parallel over basket data, on both distributed shared-memory multiprocessors and networks of machines. Basket data usually consists of a record per customer transaction, containing the transaction date and the items bought. Many real-world applications can be formulated in such a manner. Our preliminary experimental results on an SGI Power Challenge shared-memory multiprocessor and a DEC Memory Channel cluster are very encouraging.
The next step is to consider scenarios where the database server is accessible via the Internet. The machine-independent nature of Java makes it ideally suited to writing both the server and the client code. The server could be a collection of machines: a single tightly-coupled shared- or distributed-memory multiprocessor, a network of workstations, or any combination of these with heterogeneous processing elements and heterogeneous interconnection networks. The client side could be a heterogeneous collection of machines as well. We need compiler and run-time support to manage both the client and the server. The compiler should be able to insert code to exploit the local network configurations on both ends, while the run-time system monitors dynamic behavior and executes different actions according to the transient information. We have implemented a client-server data mining algorithm in Java. The preliminary results will be discussed in more detail in Section 8.5.
Data mining algorithms tend to have program behaviors quite different from traditional Grand Challenge problems in terms of both data access patterns and control flow. Unlike the simple array structures used in scientific code, the data structures used in mining algorithms include hash trees, hash tables, and linked lists. Consider mining for associations: the first step consists of finding all the sets of items (called itemsets) that occur in the database with a certain user-specified frequency (called minimum support). Such itemsets are called large itemsets. An itemset of k items is called a k-itemset. The general structure of algorithms for mining associations is as follows. The initial pass over the database counts the support for all 1-itemsets. The large 1-itemsets are used to generate candidate 2-itemsets. The database is scanned again to obtain occurrence counts for the candidates, and the large 2-itemsets are selected for the next pass. This iterative process is repeated for k = 3, 4, ... until there are no more large k-itemsets to be found. Effective parallelization and networked computing require new research in algorithms, compilers and runtime systems.
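The level-wise structure just described can be sketched as follows. This is a minimal, in-memory illustration with invented names (AprioriSketch, mine, genCandidates); it scans candidates linearly, whereas the real algorithms use the hash trees mentioned above to speed up candidate counting.

```java
import java.util.*;

// Minimal sketch of level-wise association mining: count candidate
// k-itemsets, keep the large ones, join them into (k+1)-candidates, repeat.
public class AprioriSketch {

    // Count the support of each candidate itemset with one database scan.
    static Map<Set<String>, Integer> countSupport(List<Set<String>> db,
                                                  Set<Set<String>> candidates) {
        Map<Set<String>, Integer> counts = new HashMap<>();
        for (Set<String> txn : db)
            for (Set<String> cand : candidates)
                if (txn.containsAll(cand))
                    counts.merge(cand, 1, Integer::sum);
        return counts;
    }

    // Join large k-itemsets pairwise to form candidate (k+1)-itemsets.
    static Set<Set<String>> genCandidates(Set<Set<String>> large) {
        Set<Set<String>> next = new HashSet<>();
        for (Set<String> a : large)
            for (Set<String> b : large) {
                Set<String> u = new TreeSet<>(a);
                u.addAll(b);
                if (u.size() == a.size() + 1) next.add(u);
            }
        return next;
    }

    // Return all itemsets occurring in at least minSupport transactions.
    static Set<Set<String>> mine(List<Set<String>> db, int minSupport) {
        // Initial pass: candidate 1-itemsets are all items in the database.
        Set<Set<String>> candidates = new HashSet<>();
        for (Set<String> txn : db)
            for (String item : txn)
                candidates.add(new TreeSet<>(Collections.singleton(item)));

        Set<Set<String>> allLarge = new HashSet<>();
        while (!candidates.isEmpty()) {
            Map<Set<String>, Integer> counts = countSupport(db, candidates);
            Set<Set<String>> large = new HashSet<>();
            for (Map.Entry<Set<String>, Integer> e : counts.entrySet())
                if (e.getValue() >= minSupport) large.add(e.getKey());
            allLarge.addAll(large);
            candidates = genCandidates(large);   // next pass: k+1
        }
        return allLarge;
    }

    public static void main(String[] args) {
        List<Set<String>> db = Arrays.asList(
            new TreeSet<>(Arrays.asList("bread", "milk")),
            new TreeSet<>(Arrays.asList("bread", "milk", "beer")),
            new TreeSet<>(Arrays.asList("milk", "beer")));
        System.out.println(mine(db, 2));
    }
}
```

With minimum support 2, the three-transaction example yields the large itemsets {bread}, {milk}, {beer}, {bread, milk}, and {milk, beer}; the candidate {bread, milk, beer} is pruned because only one transaction contains it.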
The current Java compiler takes a hybrid approach of compilation and interpretation. At Rochester, we would like to exploit more options to generate efficient code for High Performance Java applications. We consider three scenarios:
At Rochester, we are implementing a compiler infrastructure with multiple front- and back-ends which will treat all the scenarios mentioned above uniformly (cf. Section 8.4). We expect to include traditional optimizations such as cache optimization, register allocation, etc.
(8.2.1roc, 8.2.2roc, 8.2.3ind)
Java supports threads for concurrency. This is useful when perceived response time is important, e.g., in the HotJava browser, where you can scroll a page while downloading an image. Java currently does not support threads running on separate processors. Our main objective in this section is to support parallel threads at no expense to the Java programmer. We would prefer not to change, modify or limit the language as perceived by the user.
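The concurrency model in question can be shown in a few lines. In this sketch (names are our own), a worker thread runs alongside the main thread within one JVM, standing in for a background task such as an image download; on current Java implementations both threads still share one processor.

```java
// Minimal sketch of Java's thread model: a worker thread runs
// concurrently with the main thread inside a single JVM.
public class ThreadDemo {
    // Run a summation in a separate thread and wait for its result.
    static int compute() throws InterruptedException {
        final int[] sum = new int[1];
        Thread worker = new Thread(new Runnable() {
            public void run() {                  // the "background" task,
                for (int i = 1; i <= 100; i++)   // e.g. an image download
                    sum[0] += i;
            }
        });
        worker.start();   // worker runs concurrently with the caller
        worker.join();    // the caller blocks until the worker finishes
        return sum[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(compute());  // prints 5050
    }
}
```

Supporting parallel threads transparently would mean scheduling such threads onto separate processors without any change to this source.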
There seem to be two possible scenarios:
At Rochester, we are particularly interested in implementing a parallel thread package for Java with special compiler support on distributed shared-memory (DSM) machines. On scalable multiprocessors, communication is much more expensive than computation. Many different protocols have been proposed for behavior-driven locality management. None of them work best for all types of data sharing. Performance would be increased if the system could adapt to the program's memory access patterns. An example could be the use of multithreading to alleviate memory latency (especially on remote accesses) aggravated by delays introduced by the memory coherency protocol. We propose to analyze programs at compile time and annotate them with data access summary information. Exploiting this information will require the ability to switch among multiple data coherence protocols at run time.
As part of our effort to better understand the compilation issues, at Rochester we are developing a Java compiler which could be used to optimize Java programs.
The compiler would use a common intermediate representation and a collection of front- and back-ends to facilitate re-use of components and independence of optimization passes.
There are two possible front-ends: one would read Java source, the other one would read bytecode.
A set of back-ends would include Java source generation, bytecode generation, and "native" code generation (either in form of some target HLL like C, or actual native code for some architectures).
    Java     ----\              /=====> Java
                  ====> IR =====------> bytecode
    bytecode ====/              \-----> C (or other "native")
Our current development concentrates on the bytecode front-end and the Java back-end (shown above with double lines). This choice was made because using bytecode as input is more general: if we have access to source, we also have access to its bytecode form -- the converse is not true. Using Java as output is a big advantage during the development phase.
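The single-IR, multiple-back-end design can be illustrated with a drastically simplified sketch. Everything here (MiniIR, the node encoding, emitJava) is invented for illustration and is not taken from the Rochester compiler; the point is only that once constructs like loops are explicit in the IR, a back-end is a straightforward recursive renderer.

```java
import java.util.Arrays;
import java.util.List;

// Miniature of an IR-centred compiler: one intermediate representation,
// with back-ends that render it (here, a Java-source back-end).
public class MiniIR {
    // A drastically simplified IR node: an operator plus children.
    static class Node {
        final String op;       // e.g. "while", "lt", "var", "incr"
        final String text;     // payload for leaves
        final List<Node> kids;
        Node(String op, String text, Node... kids) {
            this.op = op; this.text = text; this.kids = Arrays.asList(kids);
        }
    }

    // Java-source back-end: renders the IR as Java text.
    static String emitJava(Node n) {
        switch (n.op) {
            case "var":   return n.text;
            case "lt":    return "(" + emitJava(n.kids.get(0)) + " < "
                                     + emitJava(n.kids.get(1)) + ")";
            case "while": return "while" + emitJava(n.kids.get(0)) + " { "
                                         + emitJava(n.kids.get(1)) + " }";
            case "incr":  return n.text + " = (" + n.text + " + 1);";
            default: throw new IllegalArgumentException(n.op);
        }
    }

    public static void main(String[] args) {
        // IR for a recovered counting loop: while (v0 < this.n) v0++
        Node loop = new Node("while", null,
            new Node("lt", null, new Node("var", "v0"),
                                 new Node("var", "this.n")),
            new Node("incr", "v0"));
        System.out.println(emitJava(loop));
    }
}
```

A bytecode back-end or C back-end would be another such renderer over the same Node tree, which is what makes the optimization passes independent of the input and output languages.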
The implementation is not complete: only a subset of legal bytecode constructs is implemented, and the definition of our IR is not yet stable. However, preliminary results are promising. We are able to identify implicit high-level constructs such as loops with good success, and our high-level IR makes implementing code transformations relatively simple.
Code instrumentation is our first application of the compiler framework. We are developing a pass which inserts instrumentation calls. Currently we instrument loops, if-then-else statements, and methods. We plan to examine the performance of a range of Java applications. The objective is to determine whether there are performance issues unique to the Java language (the fact that it is interpreted) and/or to applications written in Java (it may be that only certain kinds of applications are worth writing in Java).
An Example.
We are given a bytecode representation of a method corresponding to the following disassembled code.

    Method void paintAll(java.awt.Graphics)
       0 iconst_0
       1 istore_2
       2 goto 19
       5 aload_0
       6 aload_0
       7 getfield #7 <Field PCRCexample.CellList [I>
      10 iload_2
      11 iaload
      12 aload_1
      13 invokevirtual #6 <Method PCRCexample.plotCell(ILjava/awt/Graphics;)V>
      16 iinc 2 1
      19 iload_2
      20 aload_0
      21 getfield #4 <Field PCRCexample.n I>
      24 if_icmplt 5
      27 aload_0
      28 aload_1
      29 invokevirtual #3 <Method PCRCexample.updateStatusLine(Ljava/awt/Graphics;)V>
      32 return
We can parse the class file and recover a high-level intermediate representation which corresponds to the following source (this was generated automatically by our Java-source back-end).
    public void paintAll ( java.awt.Graphics a1 )
    {
        int v0;
        v0 = 0;
        while ((v0 < this.n)) {
            this.plotCell(this.CellList[v0], a1);
            v0 = (v0 + 1);
        } //while
        this.updateStatusLine(a1);
        return;
    } // paintAll
Since our IR explicitly identifies high-level constructs (e.g., loops), it is easy to instrument the code and/or perform optimizing transformations.
We have implemented a preliminary version of network-based data mining at Rochester. Currently, the server is a passive entity in terms of the data mining computation. Its only job is to accept client connections and ship off the requested data. The mining process is carried out at the client end. Both the client and the server are currently sequential.
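The server side of such an arrangement is essentially a data pump. The sketch below is a hypothetical minimal version, not our actual implementation: the class name, the port, and the one-line request protocol (the client sends a database name, the server streams the file back) are all invented for illustration, and there is no access control.

```java
import java.io.*;
import java.net.*;

// Sketch of a passive data-mining server: it accepts a connection,
// reads the requested database name, and ships the data verbatim.
// All mining happens at the client.
public class DataServer {

    // Copy the database line-by-line to the client; no computation here.
    static void ship(BufferedReader db, PrintWriter out) throws IOException {
        String line;
        while ((line = db.readLine()) != null)
            out.println(line);
    }

    public static void main(String[] args) throws IOException {
        ServerSocket listener = new ServerSocket(9876);  // port is arbitrary
        while (true) {
            Socket client = listener.accept();           // passive: just wait
            BufferedReader req = new BufferedReader(
                new InputStreamReader(client.getInputStream()));
            String dbName = req.readLine();              // requested database
            PrintWriter out = new PrintWriter(client.getOutputStream(), true);
            ship(new BufferedReader(new FileReader(dbName)), out);
            client.close();
        }
    }
}
```

The client mirrors this: it opens a socket, sends the database name, and runs the association-mining passes over the lines it receives instead of over a local file.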
We now present some timing results vis-a-vis a C++ implementation of the algorithm that does not transmit any data over the network. Below we show the two synthetic databases from IBM that we used:
    Database       Num_Trans   Avg_Trans_Size   Total_Size
    -----------------------------------------------------------
    T5.I2.D100K    100,000     5                2.615 Mbytes
    T10.I4.D100K   100,000     10               4.31 Mbytes

These databases are accessible only to the server through a file-server. We now present the total execution time for the algorithm. Java_Time includes the server disk-read time, transmission time, and client computation time; C++_Time includes only the disk-read time and computation time.

    Database       Total_Size     Java_Time   C++_Time
    --------------------------------------------------
    T5.I2.D100K    2.615 Mbytes   367.7s      15.3s
    T10.I4.D100K   4.31 Mbytes    1847.2s     78.9s

The break-up of the execution time is as follows:

    Database       Total_Size     Java:Reading  Computation   C++:Reading  Computation
    ----------------------------------------------------------------------------------
    T5.I2.D100K    2.615 Mbytes   292.6s        75.1s         8.8s         6.5s
    T10.I4.D100K   4.31 Mbytes    478.4s        1368.8s       13.8s        65.1s

In Java, the net disk-read and shipping rate for the data is about 9 KB/sec, versus a 0.31 MB/sec disk-read rate in C++.
From the experiments, we make two observations:
We will address these issues in our future research.