IBM Research

Quest Synthetic Data Generation Code

(0) Downloading and Compiling Tips:



(1) Associations and Sequential Patterns:

Code:

assoc.gen.tar.Z (26,286 bytes)

Downloading and Compiling Tips

Usage:

   gen lit|tax|seq [options]
   gen lit|tax|seq -help     For more detailed list of options
lit: large (frequent) itemsets without taxonomies
tax: large (frequent) itemsets with taxonomies
seq: sequential patterns

Output Format:

There are two possible output formats for the data file, depending on whether the "-ascii" option is specified.
Binary
Each record consists of <CustID, TransID, NumItems, List-Of-Items>. Each value is a 4-byte integer.

Ascii
Each line contains a CustID, TransID, and Item. Each of these takes up 10 bytes, for a total of 33 bytes per line.
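
The two layouts above can be read back with a few lines of Python. This is only a sketch, not part of the Quest distribution: the function names are hypothetical, the byte order of the binary file is not stated here and is exposed as a parameter (little-endian is assumed by default), the integers are assumed to be signed, and the ASCII fields are assumed to be separated by whitespace.

```python
import io
import struct


def read_binary_records(stream, byte_order="<"):
    """Yield (cust_id, trans_id, items) tuples from a binary data file.

    byte_order is an assumption: "<" for little-endian, ">" for big-endian.
    """
    header = struct.Struct(byte_order + "3i")  # CustID, TransID, NumItems
    while True:
        raw = stream.read(header.size)
        if not raw:
            return  # clean end of file
        if len(raw) < header.size:
            raise ValueError("truncated record header")
        cust_id, trans_id, num_items = header.unpack(raw)
        body = stream.read(4 * num_items)
        if len(body) < 4 * num_items:
            raise ValueError("truncated item list")
        items = struct.unpack(f"{byte_order}{num_items}i", body)
        yield cust_id, trans_id, list(items)


def read_ascii_records(lines):
    """Yield (cust_id, trans_id, item) tuples, one per non-empty line."""
    for line in lines:
        if line.strip():
            cust_id, trans_id, item = (int(field) for field in line.split())
            yield cust_id, trans_id, item


if __name__ == "__main__":
    # One in-memory record: customer 1, transaction 1, two items (10 and 20).
    sample = struct.pack("<3i2i", 1, 1, 2, 10, 20)
    print(list(read_binary_records(io.BytesIO(sample))))  # [(1, 1, [10, 20])]
```

If the item counts come out implausibly large, the file was probably written with the other byte order; try byte_order=">".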

Apart from the data file, this program also generates a pattern file. The pattern file has three parts:



(2) Classification:

Code:

classification.gen.tar.Z (24,646 bytes)

Downloading and Compiling Tips

Methodology:

The synthetic data is for a person database in which each person has the nine attributes described below.

     Attribute    Value
     ~~~~~~~~~    ~~~~~
     Salary       uniformly distributed from 20000 to 150000
     Commission   if Salary >= 75000, Commission = 0
                  else uniformly distributed from 10000 to 75000
     Age          uniformly distributed from 20 to 80
     Education    uniformly chosen from 0 to 4
     Car          make of the car, uniformly chosen from 1 to 20
     ZipCode      uniformly chosen from 9 available zipcodes
     HouseValue   uniformly distributed from 0.5*k*100000 to
                  1.5*k*100000, where 0 <= k <= 9 and depends on
                  the ZipCode
     YearsOwned   uniformly distributed from 1 to 30
     Loan         uniformly distributed from 0 to 500000
Attributes Education, Car, and ZipCode are categorical, and the rest are numeric. The attribute values are randomly generated. There is also a derived attribute, called Equity, defined as follows:
          if YearsOwned < 20
                Equity = 0
          else
                Equity = 0.1 * ( YearsOwned - 20 )
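
As an illustration of the attribute table and the Equity rule above, the following Python sketch draws one person record. It is not the generator shipped in the tar file. The function name, the dictionary layout, and in particular the mapping from ZipCode to k are assumptions (the text only says that k depends on the ZipCode), and numeric attributes are assumed to be real-valued. Equity is computed exactly as written above.

```python
import random


def generate_person(rng, zip_to_k=None):
    """Draw one person record following the attribute table."""
    if zip_to_k is None:
        # Hypothetical mapping: 9 zipcodes labelled 1..9, with k equal to the label.
        zip_to_k = {z: z for z in range(1, 10)}

    salary = rng.uniform(20000, 150000)
    commission = 0.0 if salary >= 75000 else rng.uniform(10000, 75000)
    zipcode = rng.choice(sorted(zip_to_k))
    k = zip_to_k[zipcode]
    years_owned = rng.uniform(1, 30)

    person = {
        "Salary": salary,
        "Commission": commission,
        "Age": rng.uniform(20, 80),
        "Education": rng.randint(0, 4),   # categorical, 0..4 inclusive
        "Car": rng.randint(1, 20),        # categorical, 1..20 inclusive
        "ZipCode": zipcode,               # categorical
        "HouseValue": rng.uniform(0.5 * k * 100000, 1.5 * k * 100000),
        "YearsOwned": years_owned,
        "Loan": rng.uniform(0, 500000),
    }
    # Derived attribute, following the rule as stated in the text.
    person["Equity"] = 0.0 if years_owned < 20 else 0.1 * (years_owned - 20)
    return person


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so runs are repeatable
    print(generate_person(rng))
```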
We developed a series of classification functions of increasing complexity that used the above attributes to classify people into different groups. Tuples in the training set were assigned the group label by first generating the tuple and then applying the classification function on the tuple to determine the group to which the tuple belongs.

It is rarely the case that the boundaries between the groups are very sharp. To model fuzzy boundaries, the data generation program takes a perturbation factor p as an additional argument. After determining the values of the different attributes of a tuple and assigning it a group label, the values of the non-categorical attributes are perturbed. If the value of an attribute A for a tuple t is v and the range of values of A is a, then the value of A for t after perturbation becomes v + r*p*a, where r is a uniform random variable between -0.5 and +0.5.
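
The perturbation rule v + r*p*a can be sketched in Python as follows. For example, Salary has range a = 150000 - 20000 = 130000, so with p = 0.05 a salary moves by at most 0.5 * 0.05 * 130000 = 3250 in either direction. The function name and the ranges dictionary are hypothetical, and whether the real program clamps perturbed values back into the attribute's range is not stated here, so this sketch does not clamp.

```python
import random


def perturb(record, ranges, p, rng):
    """Return a copy of record with each attribute in ranges perturbed.

    ranges maps an attribute name to its (low, high) bounds; a = high - low.
    """
    out = dict(record)
    for name, (low, high) in ranges.items():
        r = rng.uniform(-0.5, 0.5)  # one independent draw per attribute
        out[name] = record[name] + r * p * (high - low)
    return out


if __name__ == "__main__":
    # Bounds taken from the attribute table; only numeric attributes are listed.
    numeric_ranges = {
        "Salary": (20000, 150000),
        "Age": (20, 80),
        "Loan": (0, 500000),
    }
    person = {"Salary": 50000.0, "Age": 35.0, "Loan": 120000.0, "Car": 7}
    print(perturb(person, numeric_ranges, p=0.05, rng=random.Random(1)))
```

Categorical attributes such as Car are left untouched because they do not appear in the ranges dictionary.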

Usage:

   pred [options]
   pred -help     For more detailed list of options

Output Format:

Each line contains a record of 54 bytes with 9 attributes.




If you have any questions, comments, or suggestions, please send mail to the QUEST group.
