Data Mining

(adapted with modification from Pfister, Culler and other sources):

To start, let us look at the concept of data mining: Information processing is rapidly becoming a major market for parallel systems. We see businesses acquiring data about customers and products and devoting a great deal of computational power to automatically extracting useful information or "knowledge" from this data. Examples from a customer database might include determining the buying patterns of demographic groups or segmenting customers according to relationships in their buying patterns. This process is what we call data mining.

Data mining differs from standard database queries in that its goal is to identify implicit trends and segmentations in data rather than simply look up the data requested by a direct, explicit query. For example, finding all customers who have bought dog food in the last week is not data mining; however, segmenting customers according to relationships in their age group, their monthly incomes, and their preferences in dog food, in cars, and in kitchen utensils is.

Another example of data mining is finding out whether people who buy golf clubs are also likely to buy something else soon. If they are, targeted advertising materials can be stuffed in the box with the clubs. Polo shirts, maybe? Airline tickets? Possibly ... The point is that you don't know what you'll find until you've scanned mountains of data.

Some notation:

As we just mentioned, a particular type of data mining is mining for associations. Here the goal is to discover relationships (associations) in the available information (for example, among different customers and their transactions) and to generate rules for inferring customer behavior.

Take the golf clubs example above: again, our goal is to determine associations between sets of items that tend to be purchased together. Say we are given a database in which the records correspond to customer purchase transactions. Each transaction has a transaction identifier and a set of attributes, which in this case are the items purchased. Our first goal in mining for associations is to examine the database and determine which sets of k items, say, occur together in more than a certain number (for example, n) of the transactions; here n is simply a threshold that we choose. Once the large itemsets of size k are found, together with their frequency of occurrence in the database, determining the association rules among them is quite easy. The problem we consider therefore focuses on discovering the large itemsets of size k and their frequencies. The database may be read from an input file.
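
As a concrete illustration of the counting step, here is a minimal Python sketch; the names (count_support, the toy transactions, the threshold value) are illustrative assumptions, not part of the assignment. It counts how many transactions contain a given candidate itemset and compares that count against the chosen threshold n.

    def count_support(transactions, itemset):
        # Count how many transactions contain every item of the candidate itemset.
        items = set(itemset)
        return sum(1 for t in transactions if items.issubset(t))

    # Toy database: each transaction is a set of purchased items.
    transactions = [
        {"A", "B", "D"},
        {"A", "C", "D"},
        {"B", "C", "D", "E"},
        {"A", "B", "C", "D"},
    ]

    n = 2  # the chosen threshold
    candidate = ("A", "D")
    if count_support(transactions, candidate) > n:
        print(candidate, "is a large itemset")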

Here is the basic insight used in association mining: if an itemset of size k is large, then all subsets of that itemset must also be large. For illustration, consider a database consisting of 5 items -- A, B, C, D, and E -- of which one or more may be present in a particular transaction. The items within a transaction are lexicographically sorted. Consider L2, the list of large itemsets of size 2. The list might be {AB, AC, AD, BC, BD, CD, DE}. The itemsets within L2 are also lexicographically sorted. Given this L2, the list of itemsets that are candidates for membership in L3 is obtained by performing a join operation on the itemsets in L2 -- that is, taking pairs of itemsets in L2 that share a common first item (say, AB and AC) and combining them into a lexicographically sorted itemset of size three (here ABC). The resulting candidate list C3 in this case is {ABC, ABD, ACD, BCD}. Of these itemsets in C3, some may actually occur with enough frequency to be placed in L3, and so on. In general, the join operation to obtain Ck from Lk-1 finds pairs of itemsets in Lk-1 whose first k-2 items are the same and combines them to create new itemsets for Ck. Itemsets of size k-1 that have common (k-2)-sized prefixes are said to form an equivalence class (e.g. {AB, AC, AD}, {BC, BD}, {CD}, and {DE} in this example for k=3). Only itemsets in the same (k-2)-prefix equivalence class need to be considered together to form Ck from Lk-1, which greatly reduces the number of pairwise itemset comparisons needed to determine Ck.
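
To make the join step concrete, here is a small Python sketch of forming Ck from Lk-1 using the common (k-2)-prefix rule just described; the function and variable names are illustrative assumptions.

    from collections import defaultdict

    def generate_candidates(L_prev, k):
        # Form Ck from Lk-1: join pairs of (k-1)-itemsets whose first k-2 items agree.
        # Itemsets are lexicographically sorted tuples, as in the discussion above.
        classes = defaultdict(list)          # (k-2)-prefix equivalence classes
        for itemset in L_prev:
            classes[itemset[:k - 2]].append(itemset)

        candidates = []
        for members in classes.values():
            members.sort()
            # Only itemsets in the same equivalence class need to be joined.
            for i in range(len(members)):
                for j in range(i + 1, len(members)):
                    candidates.append(members[i] + members[j][-1:])
        return candidates

    L2 = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"),
          ("B", "D"), ("C", "D"), ("D", "E")]
    print(generate_candidates(L2, 3))   # ABC, ABD, ACD, BCD, as in the example above

A full Apriori-style algorithm would also prune any candidate that has a (k-1)-subset not present in Lk-1, using the insight stated at the start of this paragraph; that pruning step is omitted here for brevity.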

Here is more information ...

The Sequential Algorithm

  1. A simple sequential method for association mining is to first traverse the data set and record the frequencies of all itemsets of size one, thus determining L1. From L1, we can construct the candidate list C2 and then traverse the data set again to find which entries of C2 occur frequently enough to be placed in L2. From L2, we can construct C3 and then traverse the data set to determine L3, and so on until we have found Lk. Although this method is simple, it requires reading all transactions in the database from disk (or some other storage) k times, which is expensive. (A minimal sketch of this level-wise loop is given after this list.)

  2. An alternative sequential algorithm seeks to reduce the amount of work done to compute the candidate list Ck from the lists of large itemsets Lk-1 and, especially, to reduce the number of times data must be read from disk to determine the frequencies of the itemsets in Ck. Equivalence classes can be used to achieve both goals together. The idea is to transform the way in which data is stored in the database. Instead of storing transactions in the form {Tx, A, B, D, ...} -- where Tx is the transaction identifier and A, B, D are items in the transaction -- we keep in the database records of the form {ISx, T1, T2, T3, ...}, where ISx is an itemset and T1, T2, and so on are the transactions that contain that itemset. That is, a database record is maintained per itemset rather than per transaction. If the large itemsets of size k-1 (i.e. the elements of Lk-1) that are in the same k-2 equivalence class are identified, then computing the candidate list Ck requires examining only pairs of these itemsets. Since each itemset has its list of transactions attached, the size (occurrence count) of each resulting itemset in Ck can be computed at the same time as the Ck itemset itself is constructed from a pair of Lk-1 itemsets, simply by intersecting the transaction lists of that pair (see the second sketch after this list).
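
Here is a minimal, self-contained Python sketch of the first method, the level-wise loop; itemsets are sorted tuples, the toy data and names are illustrative assumptions, and each value of k costs one more full pass over the transactions.

    from collections import defaultdict

    def large_itemsets(transactions, n, max_k):
        # Level-wise mining: one full pass over the transactions per itemset size.
        counts = defaultdict(int)
        for t in transactions:                 # pass 1: count single items for L1
            for item in t:
                counts[(item,)] += 1
        L = {1: {s for s, c in counts.items() if c > n}}

        for k in range(2, max_k + 1):
            # Ck: join pairs of Lk-1 itemsets that share a (k-2)-prefix.
            Ck = {a + b[-1:] for a in L[k - 1] for b in L[k - 1]
                  if a[:-1] == b[:-1] and a[-1] < b[-1]}
            counts = defaultdict(int)
            for t in transactions:             # another full pass to count Ck
                for c in Ck:
                    if set(c).issubset(t):
                        counts[c] += 1
            L[k] = {c for c, cnt in counts.items() if cnt > n}
            if not L[k]:
                break
        return L

    transactions = [{"A", "B", "D"}, {"A", "C", "D"},
                    {"B", "C", "D", "E"}, {"A", "B", "C", "D"}]
    print(large_itemsets(transactions, n=2, max_k=3))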
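
And here is a sketch of the transformed, per-itemset representation used by the second method: each (k-1)-itemset carries the set of transactions that contain it, and forming a Ck candidate from a pair of itemsets in the same equivalence class also yields its occurrence count by intersecting the two transaction lists. The record layout and names are again illustrative assumptions.

    def join_equivalence_class(members):
        # members: (k-1)-itemsets of one (k-2)-prefix equivalence class, each mapped
        # to the set of transaction ids that contain it (the transformed format).
        itemsets = sorted(members)
        out = {}
        for i in range(len(itemsets)):
            for j in range(i + 1, len(itemsets)):
                a, b = itemsets[i], itemsets[j]
                # One intersection builds the candidate's transaction list, and its
                # length is the candidate's occurrence count; no pass over the raw
                # transaction database is needed.
                out[a + b[-1:]] = members[a] & members[b]
        return out

    # The (A)-prefix equivalence class of L2 from the example, in transformed form.
    A_class = {
        ("A", "B"): {1, 4},
        ("A", "C"): {2, 4},
        ("A", "D"): {1, 2, 4},
    }
    print(join_equivalence_class(A_class))   # e.g. ('A','B','D') occurs in {1, 4}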

More hints on parallelization ...

Parallelization: decomposition and assignment

The two sequential methods above differ not only as sequential algorithms but also in how they parallelize, with the second method having advantages in this respect as well. To parallelize the first method, we could divide the database among processors. At each step, a processor traverses only its local portion of the database to determine partial occurrence counts for the candidate itemsets, incurring no communication or nonlocal disk accesses in this phase. The partial counts are then merged into global counts to determine which of the candidates are large. Thus, in parallel, this method requires not only multiple passes over the database but also interprocessor communication and synchronization at the end of every pass.
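
A minimal sketch of this partial-count-then-merge structure, using Python's multiprocessing module as a stand-in for the processors (the chunking, the candidate list, and all names here are assumptions for illustration, not the required implementation):

    from collections import Counter
    from multiprocessing import Pool

    def partial_counts(args):
        # Count candidate occurrences in one processor's local chunk of the database.
        chunk, candidates = args
        counts = Counter()
        for t in chunk:
            for cand in candidates:
                if set(cand).issubset(t):
                    counts[cand] += 1
        return counts

    if __name__ == "__main__":
        transactions = [{"A", "B", "D"}, {"A", "C", "D"},
                        {"B", "C", "D", "E"}, {"A", "B", "C", "D"}]
        candidates = [("A", "D"), ("B", "D"), ("C", "D")]

        nproc = 2
        chunks = [transactions[i::nproc] for i in range(nproc)]   # divide the database

        with Pool(nproc) as pool:
            partials = pool.map(partial_counts, [(ch, candidates) for ch in chunks])

        # The communication/synchronization point: merge partial counts into global counts.
        print(sum(partials, Counter()))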

In the second method, the equivalence classes that helped the sequential method reduce disk accesses are very useful for parallelization as well. Since the computation on each one-equivalence class is independent of the computation on every other, we can simply divide the one-equivalence classes among processes, which can thereafter proceed independently for the rest of the program without communication or synchronization. The itemset lists (in the transformed format) corresponding to an equivalence class can be stored on the local disk of the process to which the equivalence class is assigned, so no need for remote disk access remains after this point. As in the sequential algorithm, each process can complete the work on one of its assigned equivalence classes before proceeding to the next one, so each itemset record needs to be read from the local disk only once as part of its equivalence class.
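
The following sketch (names, toy data, and threshold are assumptions) shows the work a single process would do on one of its assigned one-equivalence classes, entirely from the class's own itemset records and with no communication: it repeatedly joins same-prefix itemsets and keeps those whose intersected transaction lists exceed the threshold.

    def mine_class(members, n):
        # members: the itemsets of one one-equivalence class (sorted tuples) mapped
        # to the set of transaction ids containing them. Runs with no communication.
        results = {}
        current = dict(members)
        while len(current) > 1:
            nxt = {}
            keys = sorted(current)
            for i in range(len(keys)):
                for j in range(i + 1, len(keys)):
                    a, b = keys[i], keys[j]
                    if a[:-1] != b[:-1]:
                        continue                      # join only same-prefix pairs
                    tids = current[a] & current[b]
                    if len(tids) > n:                 # large enough to keep
                        nxt[a + b[-1:]] = tids
            results.update(nxt)
            current = nxt
        return results

    A_class = {("A", "B"): {1, 4}, ("A", "C"): {2, 4}, ("A", "D"): {1, 2, 4}}
    print(mine_class(A_class, n=1))   # {('A','B','D'): {1, 4}, ('A','C','D'): {2, 4}}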

The challenge is ensuring a load-balanced assignment of the equivalence classes to processes. A simple metric for load balance is to assign equivalence classes based on the number of initial entries in them. However, as the computation unfolds to compute itemsets of size k, the amount of work is determined more closely by the number of large itemsets that are generated at each step. Heuristic measures that estimate this, or some other more appropriate work metric, can be used as well. Otherwise, we may have to resort to dynamic tasking, which can compromise much of the simplicity of this method.
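
One simple way to realize the "number of initial entries" heuristic is a greedy assignment: take the equivalence classes in order of decreasing estimated weight and repeatedly give the next class to the currently least-loaded process. A sketch, with all names and weights illustrative:

    import heapq

    def assign_classes(class_sizes, nproc):
        # Greedy load balancing: largest class first, always to the least-loaded process.
        # class_sizes maps an equivalence-class prefix to its estimated work (here,
        # the number of initial entries in the class).
        heap = [(0, rank) for rank in range(nproc)]    # (current load, process rank)
        heapq.heapify(heap)
        assignment = {rank: [] for rank in range(nproc)}
        for prefix, size in sorted(class_sizes.items(), key=lambda kv: -kv[1]):
            load, rank = heapq.heappop(heap)
            assignment[rank].append(prefix)
            heapq.heappush(heap, (load + size, rank))
        return assignment

    # Estimated work per one-equivalence class (number of initial entries).
    class_sizes = {"A": 3, "B": 2, "C": 1, "D": 1}
    print(assign_classes(class_sizes, nproc=2))   # e.g. {0: ['A', 'D'], 1: ['B', 'C']}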

Steps needed in this approach:

So, given this decomposition and assignment, let us state a few more notes on spatial and temporal locality, synchronization, and mapping:

  1. The organization of the computation and the lexicographic sorting of the data cause the traversals of the itemset lists to be simple front-to-back sweeps, which exhibit good predictability and spatial locality.

  2. As mentioned above, proceeding over one equivalence class at a time is much like blocking, although how successful it is depends on whether the data for that equivalence class fits in main memory. As computation for an equivalence class proceeds, the number of large itemsets becomes smaller ...

  3. The major forms of synchronization are the reductions of partial occurrence counts into global counts in the first step of the algorithm (computing the large size 2 itemsets) and a barrier after this to begin the transformation phase. The reduction is required only for itemsets of size 2 since thereafter every process continues independently to compute the large itemsets of size k in its assigned equivalence classes. Further synchronization may be needed if dynamic task management is used for load balancing.

  4. The communication to transform the database is all-to-all: a process may "send" different itemsets of size 2 and their partial transaction lists to all other processes and may "receive" or read such lists from all of them. It is difficult to map all-to-all communication in a contention-free manner onto network topologies (like meshes or rings) that are not very richly interconnected. Endpoint contention can be reduced by communication scheduling techniques such as having each processor i, at step j, exchange data with processor i xor j, so that no processor or node is overloaded (a short sketch of this schedule follows this list).
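
For concreteness, here is a short sketch of that exchange schedule, assuming the number of processes P is a power of two: at step j (j = 1, ..., P-1), process i exchanges data with process i xor j, so every step is a perfect pairing and no process is the target of more than one partner at a time.

    def xor_exchange_schedule(P):
        # All-to-all exchange schedule for P processes (assumes P is a power of two).
        # At step j, process i is paired with process i ^ j; each step is a perfect
        # matching, so no endpoint is overloaded.
        assert P > 0 and P & (P - 1) == 0, "assumes P is a power of two"
        return [[(i, i ^ j) for i in range(P)] for j in range(1, P)]

    for j, step in enumerate(xor_exchange_schedule(4), start=1):
        print("step", j, step)
    # step 1 [(0, 1), (1, 0), (2, 3), (3, 2)]
    # step 2 [(0, 2), (1, 3), (2, 0), (3, 1)]
    # step 3 [(0, 3), (1, 2), (2, 1), (3, 0)]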

What you need to do for this project

After you read and understand the above discussion and algorithms,


saleh@npac.syr.edu