(adapted with modification from Pfister, Culler and other sources):
To start, let us look at the concept of data mining: Information processing is rapidly becoming a major market for parallel systems. We see businesses acquiring data about customers and products and devoting a great deal of computational power to automatically extracting useful information or "knowledge" from this data. Examples from a customer database might include determining the buying patterns of demographic groups or segmenting customers according to relationships in their buying patterns. This process is what we call data mining.
Data mining differs from standard database queries in that its goal is to identify implicit trends and segmentations in data rather than simply look up the data requested by a direct, explicit query. For example, finding all customers who have bought dog food in the last week is not data mining; however, segmenting customers according to relationships in their age group, their monthly incomes, and their preferences in dog food, in cars, and in kitchen utensils is.
Another example of data mining is finding whether people who buy golf clubs are also likely to buy something else soon. If they are, targeted advertising materials can be stuffed in the box with the clubs. Polo shirts, maybe? Airline tickets? Possibly ... The point is that you don't know what you'll find until you've scanned mountains of data.
Some notation:
As we just mentioned, a particular type of data mining is mining for associations. Here the goal is to discover relationships (associations) in the available information, related, for example, to different customers and their transactions, and to generate rules for inferring customer behavior.
Take the golf clubs example above: our goal here is to determine associations between sets of items that tend to be purchased together. Suppose we are given a database in which each record corresponds to a customer purchase transaction. Each transaction has a transaction identifier and a set of attributes, which in this case are the items purchased. Our first goal in mining for associations is to examine the database and determine which sets of k items, say, occur together in more than a certain number n of the transactions, where n is a threshold we choose. Once the large itemsets of size k are found, together with their frequencies of occurrence in the database, determining the association rules among them is quite easy. The problem we consider therefore focuses on discovering the large itemsets of size k and their frequencies. The database may be read from an input file.
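To make this concrete, here is a minimal C++ sketch of the counting step just described, under the assumption that each transaction is stored as a lexicographically sorted vector of integer item IDs; the names Transaction, count_support, and min_support are illustrative and not part of the original description. A real implementation would stream transactions from the input file rather than hold them in memory, but the structure of the pass is the same:

#include <algorithm>
#include <iostream>
#include <map>
#include <vector>

using Itemset = std::vector<int>;               // lexicographically sorted item IDs
struct Transaction { int id; Itemset items; };  // transaction identifier + purchased items

// Count, for each candidate itemset, how many transactions contain it.
std::map<Itemset, int> count_support(const std::vector<Transaction>& db,
                                     const std::vector<Itemset>& candidates) {
    std::map<Itemset, int> counts;
    for (const Transaction& t : db)
        for (const Itemset& c : candidates)
            if (std::includes(t.items.begin(), t.items.end(), c.begin(), c.end()))
                ++counts[c];
    return counts;
}

int main() {
    // Toy database: items A=0, B=1, C=2, D=3.
    std::vector<Transaction> db = {
        {1, {0, 1, 2}}, {2, {0, 2}}, {3, {1, 2, 3}}, {4, {0, 1, 2, 3}}
    };
    std::vector<Itemset> candidates = {{0, 1}, {0, 2}, {1, 2}, {2, 3}};
    int min_support = 2;  // the chosen threshold "n" from the text

    for (const auto& [c, count] : count_support(db, candidates))
        if (count >= min_support)
            std::cout << "large itemset of size " << c.size()
                      << " with support " << count << "\n";
    return 0;
}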
Here are the basic insights used in association mining: if an itemset of size k is large, then all subsets of that itemset must also be large. For illustration, consider a database consisting of five items -- A, B, C, D, and E -- of which one or more may be present in a particular transaction. The items within a transaction are lexicographically sorted. Consider L2, the list of large itemsets of size 2. The list might be {AB, AC, AD, BC, BD, CD, DE}. The itemsets within L2 are also lexicographically sorted. Given this L2, the list of itemsets that are candidates for membership in L3 is obtained by performing a join operation on the itemsets in L2 -- that is, taking pairs of itemsets in L2 that share a common first item (say, AB and AC) and combining them into a lexicographically sorted itemset of size three (here ABC). The resulting candidate list C3 in this case is {ABC, ABD, ACD, BCD}. Of these itemsets in C3, some may actually occur with enough frequency to be placed in L3, and so on. In general, the join operation to obtain Ck from Lk-1 finds pairs of itemsets in Lk-1 whose first k-2 items are the same and combines them to create new itemsets for Ck. Itemsets of size k-1 that have common (k-2)-sized prefixes are said to form an equivalence class (e.g. {AB, AC, AD}, {BC, BD}, {CD}, and {DE} in this example for k=3). Only itemsets in the same (k-2)-equivalence class need to be considered together to form Ck from Lk-1, which greatly reduces the number of pairwise itemset comparisons we need to do to determine Ck.
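As an illustration of just the join step (a full candidate-generation routine would also prune candidates having a non-large (k-1)-subset), here is a sketch in C++ using integer item IDs with A=0, B=1, and so on; generate_candidates is a hypothetical name:

#include <algorithm>
#include <iostream>
#include <vector>

using Itemset = std::vector<int>;

// Join pairs of size-(k-1) itemsets that share the same (k-2)-item prefix,
// i.e. that belong to the same equivalence class. L_prev must be sorted.
std::vector<Itemset> generate_candidates(const std::vector<Itemset>& L_prev) {
    std::vector<Itemset> C;
    for (std::size_t i = 0; i < L_prev.size(); ++i) {
        for (std::size_t j = i + 1; j < L_prev.size(); ++j) {
            // Same class <=> all but the last item are equal.
            if (!std::equal(L_prev[i].begin(), L_prev[i].end() - 1, L_prev[j].begin()))
                break;  // L_prev is sorted, so no later itemset is in this class either
            Itemset cand = L_prev[i];          // shared prefix + last item of L_prev[i]
            cand.push_back(L_prev[j].back());  // append last item of L_prev[j]
            C.push_back(cand);                 // result is still lexicographically sorted
        }
    }
    return C;
}

int main() {
    // L2 = {AB, AC, AD, BC, BD, CD, DE} with A=0, B=1, C=2, D=3, E=4.
    std::vector<Itemset> L2 = {{0,1}, {0,2}, {0,3}, {1,2}, {1,3}, {2,3}, {3,4}};
    for (const Itemset& c : generate_candidates(L2)) {
        for (int item : c) std::cout << char('A' + item);
        std::cout << ' ';  // prints: ABC ABD ACD BCD
    }
    std::cout << '\n';
    return 0;
}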
Here is more information ...
The Sequential Algorithm
More hints on parallelization ...
Parallelization: decomposition and assignment
The two sequential methods described above differ in how they are parallelized as well, with the second method having advantages in this respect too. To parallelize the first method, we could first divide the database among processors. At each step, a processor traverses only its local portion of the database to determine partial occurrence counts for the candidate itemsets, incurring no communication or nonlocal disk accesses in this phase. The partial counts are then merged into global counts to determine which of the candidates are large. Thus, in parallel, this method requires not only multiple passes over the database but also interprocessor communication and synchronization at the end of every pass.
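The discussion does not tie the first method to any particular programming model; the following sketch uses MPI to show the shape of one pass, with toy data and an illustrative threshold standing in for the real database partition and support count:

#include <mpi.h>
#include <algorithm>
#include <cstdio>
#include <vector>

using Itemset = std::vector<int>;

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // Each process owns a partition of the transaction database (toy data here;
    // in the project it would come from this process's portion of the input file).
    std::vector<Itemset> local_db = {{0, 1, 2}, {0, 2, 3}, {1, 2, 3}};
    std::vector<Itemset> candidates = {{0, 2}, {1, 2}, {2, 3}};
    int min_support = 2 * nprocs;  // illustrative global threshold

    // Phase 1: purely local counting: no communication, no nonlocal disk access.
    std::vector<int> partial(candidates.size(), 0);
    for (const Itemset& t : local_db)
        for (std::size_t c = 0; c < candidates.size(); ++c)
            if (std::includes(t.begin(), t.end(),
                              candidates[c].begin(), candidates[c].end()))
                ++partial[c];

    // Phase 2: merge the partial counts into global counts on every process.
    // This reduction is the communication/synchronization point at the end of the pass.
    std::vector<int> global(candidates.size(), 0);
    MPI_Allreduce(partial.data(), global.data(), (int)candidates.size(),
                  MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        for (std::size_t c = 0; c < candidates.size(); ++c)
            if (global[c] >= min_support)
                std::printf("candidate %zu is large (support %d)\n", c, global[c]);

    MPI_Finalize();
    return 0;
}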
In the second method, the equivalence classes that helped the sequential algorithm reduce disk accesses are very useful for parallelization as well. Since the computation on each one-equivalence class is independent of the computation on every other, we can simply divide the one-equivalence classes among processes, which can thereafter proceed independently for the rest of the program without communication or synchronization. The itemset lists (in the transformed format) corresponding to an equivalence class can be stored on the local disk of the process to which that equivalence class is assigned, so no need for remote disk access remains after this point. As in the sequential algorithm, each process can complete the work on one of its assigned equivalence classes before proceeding to the next one, so each itemset record needs to be read from the local disk only once as part of its equivalence class.
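Assuming the transformed format is a vertical layout in which each itemset carries the sorted list of IDs of the transactions containing it (this is an assumption for illustration, and all names below are illustrative), the work on a single equivalence class might look like the following sketch; the support of a joined itemset is simply the size of the intersection of its parents' transaction-ID lists:

#include <algorithm>
#include <iostream>
#include <iterator>
#include <map>
#include <string>
#include <vector>

using TidList = std::vector<int>;  // sorted IDs of the transactions containing an itemset

// Process one equivalence class independently of all others: join every pair of
// itemsets in the class; the joined itemset's transactions are the intersection
// of its two parents' transaction lists. Keep results that meet the threshold.
std::map<std::string, TidList>
process_class(const std::map<std::string, TidList>& cls, std::size_t min_support) {
    std::map<std::string, TidList> next;
    for (auto a = cls.begin(); a != cls.end(); ++a) {
        for (auto b = std::next(a); b != cls.end(); ++b) {
            TidList common;
            std::set_intersection(a->second.begin(), a->second.end(),
                                  b->second.begin(), b->second.end(),
                                  std::back_inserter(common));
            if (common.size() >= min_support)
                next[a->first + b->first.back()] = std::move(common);  // "AB" + 'C' -> "ABC"
        }
    }
    return next;
}

int main() {
    // The one-equivalence class for prefix A: itemsets AB, AC, AD with their tid lists.
    std::map<std::string, TidList> classA = {
        {"AB", {1, 4, 7, 9}}, {"AC", {1, 2, 4, 9}}, {"AD", {4, 7, 9}}
    };
    for (const auto& [itemset, tids] : process_class(classA, 3))
        std::cout << itemset << " has support " << tids.size() << "\n";  // ABC and ABD
    return 0;
}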
The challenge is ensuring a load-balanced assignment of the equivalence classes to processes. A simple approach is to assign equivalence classes based on the number of initial entries in each class. However, as the computation unfolds to compute itemsets of size k, the amount of work is determined more closely by the number of large itemsets generated at each step. Heuristic measures that estimate this, or some other more appropriate work metric, can be used as well. Otherwise, we may have to resort to dynamic task assignment, which can compromise much of the simplicity of this method.
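One simple static scheme along these lines is a greedy assignment: order the equivalence classes by an estimated weight (here just the number of initial entries, though any better heuristic could be substituted) and give each class to the currently least-loaded process. A sketch, with illustrative names:

#include <algorithm>
#include <functional>
#include <iostream>
#include <queue>
#include <utility>
#include <vector>

// Returns owner[i] = process to which equivalence class i is assigned.
std::vector<int> assign_classes(const std::vector<long>& class_weight, int nprocs) {
    std::vector<int> order(class_weight.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
    // Consider the heaviest classes first so the greedy choice balances better.
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return class_weight[a] > class_weight[b]; });

    // Min-heap of (current load, process id): always pick the least-loaded process.
    using Load = std::pair<long, int>;
    std::priority_queue<Load, std::vector<Load>, std::greater<Load>> heap;
    for (int p = 0; p < nprocs; ++p) heap.push({0, p});

    std::vector<int> owner(class_weight.size());
    for (int c : order) {
        auto [load, p] = heap.top();
        heap.pop();
        owner[c] = p;
        heap.push({load + class_weight[c], p});
    }
    return owner;
}

int main() {
    // Example weights (number of initial entries) for the k=3 classes in the text:
    // {AB, AC, AD}, {BC, BD}, {CD}, {DE}.
    std::vector<long> weights = {3, 2, 1, 1};
    for (int owner : assign_classes(weights, 2)) std::cout << owner << ' ';
    std::cout << '\n';
    return 0;
}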
Steps needed in this approach:
The communication step of the transformation phase is usually the most expensive step in the algorithm. It is much like transposing a matrix, except that the amounts of data exchanged between different pairs of processes differ.
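In message-passing terms this is an all-to-all exchange with per-pair message sizes, which MPI expresses with MPI_Alltoallv. The following hedged sketch first exchanges the counts, then the data; the payload here is a dummy placeholder for the actual transformed itemset lists:

#include <mpi.h>
#include <numeric>
#include <vector>

// Exchange variably sized chunks of ints: send_counts[p] is how many ints this
// process must ship to process p (e.g. the transformed itemset-list entries for
// equivalence classes that process p owns). Returns everything received.
std::vector<int> exchange(const std::vector<int>& send_buf,
                          const std::vector<int>& send_counts, MPI_Comm comm) {
    int nprocs;
    MPI_Comm_size(comm, &nprocs);

    // First tell every process how much it will receive from each peer.
    std::vector<int> recv_counts(nprocs);
    MPI_Alltoall(send_counts.data(), 1, MPI_INT, recv_counts.data(), 1, MPI_INT, comm);

    // Displacements: where each peer's chunk starts in the send/receive buffers.
    std::vector<int> sdispls(nprocs, 0), rdispls(nprocs, 0);
    std::partial_sum(send_counts.begin(), send_counts.end() - 1, sdispls.begin() + 1);
    std::partial_sum(recv_counts.begin(), recv_counts.end() - 1, rdispls.begin() + 1);

    // The "transpose with unequal block sizes": one all-to-all with per-pair sizes.
    std::vector<int> recv_buf(rdispls.back() + recv_counts.back());
    MPI_Alltoallv(send_buf.data(), send_counts.data(), sdispls.data(), MPI_INT,
                  recv_buf.data(), recv_counts.data(), rdispls.data(), MPI_INT, comm);
    return recv_buf;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // Dummy payload: ship (rank + p) placeholder ints to each process p.
    std::vector<int> send_counts(nprocs), send_buf;
    for (int p = 0; p < nprocs; ++p) {
        send_counts[p] = rank + p;
        send_buf.insert(send_buf.end(), send_counts[p], rank);
    }
    std::vector<int> received = exchange(send_buf, send_counts, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}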
So, given this decomposition and assignment, let us make a few more notes on spatial and temporal locality, synchronization, and mapping:
What you need to do for this project
After you read and understand the above discussion and algorithms,