Data mining samples based on Twister

Abstract

Large-scale data are collected in many application fields. How to extract useful knowledge from raw data to support decision making is an active research topic. Many data-intensive data mining technologies have been developed, and Map/Reduce is widely regarded as one of the most efficient. Twister, software developed by PTI, is based on iterative Map/Reduce and markedly improves computation speed on data-intensive problems. This project studies the implementation of several data mining methods on Twister, including SVM, Apriori, and entropy-based correlation analysis. The programs will be developed in Java on the Linux platform, and the methods will be used to analyze bioinformatics or biomedical examples.

Intellectual Merit

Commonly used data mining methods have been widely studied and are now mature, but most of them are serial and cannot be applied to large-scale data. Parallelizing them is difficult: the structure of each algorithm must be studied and redesigned so that it can run in parallel. By studying how these methods are programmed, the project aims to improve computational efficiency while preserving the precision of the data analysis.
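As an illustration of the kind of restructuring involved, the following minimal sketch splits the support-counting step of Apriori into a map step (count candidate itemsets within one data partition) and a reduce step (sum the partial counts). It is plain Java and does not use the Twister API; the class name, method names, and sample data are hypothetical and chosen only for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch: map/reduce-style support counting for Apriori candidates.
public class SupportCountSketch {

    // Helper that builds an itemset from item names.
    static Set<String> itemset(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // Map step: within one partition, count the transactions that contain each candidate.
    static Map<Set<String>, Integer> mapPartition(List<Set<String>> partition,
                                                  List<Set<String>> candidates) {
        Map<Set<String>, Integer> local = new HashMap<>();
        for (Set<String> transaction : partition) {
            for (Set<String> candidate : candidates) {
                if (transaction.containsAll(candidate)) {
                    local.merge(candidate, 1, Integer::sum);
                }
            }
        }
        return local;
    }

    // Reduce step: sum the partial counts produced by all map tasks.
    static Map<Set<String>, Integer> reduce(List<Map<Set<String>, Integer>> partials) {
        Map<Set<String>, Integer> total = new HashMap<>();
        for (Map<Set<String>, Integer> partial : partials) {
            partial.forEach((candidate, count) -> total.merge(candidate, count, Integer::sum));
        }
        return total;
    }

    // Driver: run the map step on every partition (here via a parallel stream), then reduce.
    static Map<Set<String>, Integer> countSupport(List<List<Set<String>>> partitions,
                                                  List<Set<String>> candidates) {
        List<Map<Set<String>, Integer>> partials = partitions.parallelStream()
                .map(partition -> mapPartition(partition, candidates))
                .collect(Collectors.toList());
        return reduce(partials);
    }

    public static void main(String[] args) {
        // Two partitions of made-up transactions.
        List<List<Set<String>>> partitions = Arrays.asList(
                Arrays.asList(itemset("a", "b", "c"), itemset("a", "c")),
                Arrays.asList(itemset("b", "c"), itemset("a", "b", "c")));
        List<Set<String>> candidates = Arrays.asList(itemset("a", "c"), itemset("a", "b"));

        Map<Set<String>, Integer> support = countSupport(partitions, candidates);
        for (Set<String> candidate : candidates) {
            System.out.println(candidate + " -> " + support.getOrDefault(candidate, 0));
        }
    }
}
```

Because each map task reads only its own partition and the reduce step is a plain sum, the result does not depend on how the transactions are split. Apriori repeats this counting step for each candidate size, which is the iterative pattern that an iterative Map/Reduce runtime is meant to serve. Candidates that appear in no transaction are simply absent from the result map.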

Broader Impact

The results of the project can be used in many large-scale data mining fields. They will help users find useful knowledge in large-scale raw data and will contribute to the development of large-scale data analysis research.

Use of FutureGrid

FutureGrid will be used to build a Twister development environment.

Scale Of Use

A few VMs for an experiment.

Publications


FG-138
Zhanquan Sun
Indiana University Bloomington
Active

FutureGrid Experts

Bingjing Zhang
Yang Ruan

Timeline

3 years 10 weeks ago