Data mining samples based on Twister

Abstract

Large-scale data are collected in many application fields. How to extract useful knowledge from raw data to support decision making is a popular research topic. Many data-intensive data mining technologies have been developed, and Map/Reduce is considered one of the most efficient. Twister, software developed by PTI, is based on iterative Map/Reduce, and it markedly improves computation speed on data-intensive problems. This project studies the implementation on Twister of several data mining methods, such as SVM, Apriori, and entropy-based correlation analysis. The programs will be developed in Java on the Linux platform, and the methods will be used to analyze bioinformatics or biomedical examples.

Intellectual Merit

Commonly used data mining methods have been widely studied and are mature. Most of them are serial, however, and cannot be used to analyze large-scale data. Parallelizing data mining methods is a hard task: the structure of each algorithm must be studied and redesigned so that it can run in parallel. By studying how to program these methods in parallel, the project aims to improve computation efficiency while preserving the precision of the data analysis.
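
As an illustration of this kind of restructuring, the sketch below splits the support-counting step of Apriori into a map phase (count candidate itemsets within one data partition) and a reduce phase (sum the partial counts). It is a minimal sketch in plain Java and does not use the Twister API; the class name SupportCountSketch, the method names mapPartition and reducePartials, and the sample transactions are all hypothetical and chosen only for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration only: plain Java, not the Twister API.
class SupportCountSketch {

    // Map phase: within one partition, count how many transactions
    // contain each candidate itemset.
    static Map<Set<String>, Integer> mapPartition(List<Set<String>> partition,
                                                  List<Set<String>> candidates) {
        Map<Set<String>, Integer> local = new HashMap<>();
        for (Set<String> transaction : partition) {
            for (Set<String> candidate : candidates) {
                if (transaction.containsAll(candidate)) {
                    local.merge(candidate, 1, Integer::sum);
                }
            }
        }
        return local;
    }

    // Reduce phase: sum the partial counts produced by all partitions.
    static Map<Set<String>, Integer> reducePartials(List<Map<Set<String>, Integer>> partials) {
        Map<Set<String>, Integer> total = new HashMap<>();
        for (Map<Set<String>, Integer> partial : partials) {
            partial.forEach((itemset, count) -> total.merge(itemset, count, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up sample data: four transactions split into two partitions.
        List<List<Set<String>>> partitions = List.of(
                List.of(Set.of("a", "b"), Set.of("a", "c")),
                List.of(Set.of("a", "b", "c"), Set.of("b")));
        List<Set<String>> candidates = List.of(Set.of("a"), Set.of("a", "b"));

        // Partitions are independent, so the map phase can run in parallel.
        List<Map<Set<String>, Integer>> partials = partitions.parallelStream()
                .map(p -> mapPartition(p, candidates))
                .collect(Collectors.toList());

        Map<Set<String>, Integer> support = reducePartials(partials);
        System.out.println(support.get(Set.of("a")));      // prints 3
        System.out.println(support.get(Set.of("a", "b"))); // prints 2
    }
}
```

Because each partition is counted independently and the reduce phase only adds integers, the result is the same as a serial scan over all transactions. In Apriori this step repeats once per itemset size, which is the kind of loop an iterative Map/Reduce runtime is meant to serve.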

Broader Impact

The research results of the project can be used in many large-scale data mining fields. They will help users find useful knowledge in large-scale raw data and will contribute to the development of large-scale data analysis research.

Use of FutureGrid

FutureGrid will be used to build the Twister development environment.

Scale Of Use

A few VMs for an experiment.

Publications

Project Number: FG-138

Project Lead: Zhanquan Sun

Institution: Indiana University Bloomington

Project Status: Active

FutureGrid Experts

Bingjing Zhang

Yang Ruan

Keywords

cloud computing, data mining, map/reduce, svm, twister

Timeline

Updated: 3 years 10 weeks ago