e-Science 2008 4th IEEE International Conference on e-Science

Exhibits, Demos & Posters

SALSA project: Parallel Data Mining of GIS, Web, Medical, Physics, Chemical, and Biology data

Authors

  • Geoffrey Fox, Computer Science, Informatics and Physics Department, Indiana University

  • Xiaohong Qiu, UITS, Indiana University

  • Seung-Hee Bae, Computer Science Department, Indiana University

  • Jong Youl Choi, Computer Science Department, Indiana University

  • Jaliya Ekanayake, Computer Science Department, Indiana University

  • Yang Ruan, Computer Science Department, Indiana University

Abstract

The multicore revolution promises potentially hundreds of cores in desktop computers. The ever-increasing number of cores per chip will be accompanied by a pervasive data deluge whose size will probably grow even faster than CPU core counts over the next few years. This makes parallel data analysis and data mining applications with good multicore, cluster and grid performance increasingly important. The SALSA project at the Community Grid Lab of Indiana University aims to change the way parallel software is written for real applications that advance scientific discovery and improve the quality of people’s lives.

SALSA stands for Service Aggregated Linked Sequential Activities. Its application composition model generalizes Hoare's well-known CSP (Communicating Sequential Processes), describing low-level approaches to fine-grain parallelism as “Linked Sequential Activities”. The term “activities” denotes the kernel components for building services, which may be threads, processes or even other services. The critical communication optimization problem (inside chips, clusters, and Grids) is addressed by “linkage”, which offers different ways of synchronizing parallel activities on heterogeneous platforms (shared memory and distributed memory). For details, please visit www.infomall.org/salsa.

This presentation will include demonstrations in two categories: applications that use a suite of parallel data mining capabilities we have developed, and performance benchmarks of various programming models (Microsoft CCR, DSS and Dryad; CGL-MapReduce; Hadoop; MPI) and programming languages (C#, C, C++ and Java) on classes of multicore hardware with up to 128 cores. The list of presentations is as follows:

Deterministic Annealing Clustering (DAC) of Indiana Census Data

The Indiana Census 2000 data set includes 200,000 records, each with 55 dimensions (fields). As the temperature (distance scale) is gradually decreased, DAC discovers more clusters (partitions into subgroups) and avoids the poor results of local minima. The demo shows how clustering can be used to find the locations of different population types (e.g. by age and ethnicity) and displays the results in Microsoft Virtual Earth.
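The annealing idea can be illustrated with a minimal serial sketch. It is not the project's parallel implementation: the function name `da_cluster`, the fixed cluster count `k`, the geometric cooling schedule and the small per-level jitter are all assumptions made for illustration. Each point is softly assigned to every center with weight proportional to exp(-distance²/T); at high temperature all centers sit at the data mean, and they separate as T is lowered.

```python
import math
import random

def da_cluster(points, k, t_start, t_end, cooling=0.9, seed=0):
    """Hypothetical helper: soft clustering with a slowly lowered temperature."""
    rng = random.Random(seed)
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    centers = [list(mean) for _ in range(k)]  # all centers start at the data mean
    t = t_start
    while t > t_end:
        # Tiny jitter (an assumption of this sketch) lets coincident centers split
        # once the temperature drops below the critical value.
        centers = [[c + rng.uniform(-1e-3, 1e-3) for c in ctr] for ctr in centers]
        for _ in range(20):  # fixed number of update sweeps per temperature
            sums = [[0.0] * dim for _ in range(k)]
            mass = [0.0] * k
            for p in points:
                d2 = [sum((p[d] - c[d]) ** 2 for d in range(dim)) for c in centers]
                d_min = min(d2)  # subtract the minimum for numerical stability
                w = [math.exp(-(x - d_min) / t) for x in d2]
                z = sum(w)
                for j in range(k):
                    r = w[j] / z  # soft membership of p in cluster j
                    mass[j] += r
                    for d in range(dim):
                        sums[j][d] += r * p[d]
            # Each center becomes the membership-weighted mean of the points.
            centers = [
                [sums[j][d] / mass[j] for d in range(dim)] if mass[j] > 0 else centers[j]
                for j in range(k)
            ]
        t *= cooling
    return centers
```

For example, calling `da_cluster(points, k=2, t_start=200.0, t_end=0.5)` on two well-separated groups of 2D points should return one center near each group's mean.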

Metric Space Mapping (MDS) for visualization and analysis of Chemical Compound and Biology Data

The purpose of MDS is dimension reduction: mapping data from a high-dimensional space into a low-dimensional space so that it can be visualized. We will demonstrate a highly scalable parallel implementation of SMACOF, an MDS algorithm based on an EM-like iterative method, written in C# using MPI. Using this MPI_SMACOF implementation, we map two real data sets, chemical compound data in 155-dimensional space and biological data in 4786-dimensional space, into 3-dimensional space.
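As a rough illustration of the iterative method, the following serial sketch applies the standard unweighted SMACOF update (the Guttman transform) to a dissimilarity matrix. It is not the MPI_SMACOF code: the function names, the random initialization and the fixed iteration count are assumptions, and no parallelism is shown.

```python
import math
import random

def pairwise(X):
    """Euclidean distance matrix of the rows of X."""
    n = len(X)
    return [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]

def stress(delta, X):
    """Raw stress: squared mismatch between target and embedded distances."""
    D = pairwise(X)
    n = len(X)
    return sum((delta[i][j] - D[i][j]) ** 2 for i in range(n) for j in range(i + 1, n))

def smacof(delta, dim=3, iters=100, seed=0):
    """Hypothetical helper: unweighted SMACOF on dissimilarity matrix delta."""
    rng = random.Random(seed)
    n = len(delta)
    X = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]
    history = [stress(delta, X)]
    for _ in range(iters):
        D = pairwise(X)
        new_X = [[0.0] * dim for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j or D[i][j] == 0.0:
                    continue  # coincident points contribute nothing
                ratio = delta[i][j] / D[i][j]
                for d in range(dim):
                    new_X[i][d] += ratio * (X[i][d] - X[j][d])
            for d in range(dim):
                new_X[i][d] /= n  # Guttman transform with unit weights
        X = new_X
        history.append(stress(delta, X))
    return X, history
```

Each update is guaranteed not to increase the stress, so `history` is a non-increasing sequence; the method can still settle in a local minimum.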

MDS Visualization of Medical Geographic Data for Risk of Childhood Obesity

The obesity epidemic is a well-documented public health problem in the United States. The dramatic increase in childhood obesity raises concern about the health of these young people as they approach adulthood. Environmental conditions have been identified as intervening factors through their impact on physical activity and eating habits. MDS provides a visualization method, easily integrated with other clustering models, for analyzing the structure of medical geographic data and the relationship between obesity and environment.

Pairwise Data Clustering (PDC) of Biology Gene Family Data

Identifying dissimilarity is critical to discovering hidden structure in large sets of gene (or genome) sequences. PDC is particularly suited to processing gene data because it preserves distance ratios (it is structure-preserving) when reducing from high- to low-dimensional Euclidean space. When used with the Deterministic Annealing approach, PDC achieves the robustness of maximum entropy inference. We parallelize the algorithm in C# to run on multicore clusters using CCR and MPI. Our preliminary MDS results show 3D distributions of 9000 Alu genome sequences.

Large scale data analysis of High Energy Physics (HEP) from Caltech

We will present the design of, and results from, a highly scalable parallel implementation using CGL-MapReduce and compare it with a Hadoop implementation. The challenges we faced include converting a set of data analysis scripts written using ROOT (C++) into a MapReduce implementation. The data (1 terabyte) and the analysis scripts used in this evaluation were obtained from the particle physics experiments of the Center for Advanced Computing Research at Caltech.

Deterministic Annealing Clustering of Web Data

The amount of Internet data generated by people’s collective intelligence, especially ratings and social bookmarks, keeps growing. Analyzing such data can reveal hidden knowledge, but its sheer size remains a challenge for many machine learning algorithms. We have explored candidate algorithms for this analysis and their computational efficiency using multicore/multiprocess technologies. In particular, we demonstrate the analysis of social bookmarking data and Netflix movie rating data using the parallelized deterministic annealing clustering algorithm.

Performance Benchmark of CCR, DSS, MPI

We examine the issues involved in building a multi-paradigm runtime that supports three execution styles (dynamic threading, MPI, and coarse-grain functional parallelism). Microsoft CCR supports efficient thread management for handlers (continuations) spawned in response to messages being posted to ports. Microsoft DSS is a lightweight, service-oriented application model that sits on top of CCR for creating coarse-grain applications in a distributed environment. The MPI benchmarks cover MPICH2 and Nemesis, mpiJava, and MPJE. Tests are implemented in C#, C++, and Java and run on multicore systems with a variety of operating systems (Windows XP, Windows Vista, RedHat and Fedora).
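The port-and-handler style described above can be mimicked, loosely, in a few lines. The sketch below is only an analogy built on Python's standard thread pool; it does not use or reproduce the CCR API, and the `Port` class and `run_demo` function are hypothetical names introduced here.

```python
from concurrent.futures import ThreadPoolExecutor

class Port:
    """Hypothetical stand-in for a message port: each posted message
    schedules the registered handler on a shared thread pool."""

    def __init__(self, pool, handler):
        self._pool = pool
        self._handler = handler
        self._futures = []

    def post(self, message):
        # Posting returns immediately; the handler runs on a pool thread.
        self._futures.append(self._pool.submit(self._handler, message))

    def results(self):
        # Collect handler results in the order the messages were posted.
        return [f.result() for f in self._futures]

def run_demo():
    with ThreadPoolExecutor(max_workers=4) as pool:
        port = Port(pool, lambda n: n * n)
        for n in range(5):
            port.post(n)
        return port.results()
```

The design point this illustrates is that the caller never manages threads directly: it posts messages, and the runtime decides where and when each handler executes.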

Performance Benchmark of CGL-MapReduce, MPI, and Hadoop

To evaluate different programming models for large-scale parallel computation, we have conducted a set of performance measurements. These include MDS using CGL-MapReduce on generated data of 4096 4D points and on 26000 genome elements; K-means implemented using MPI, Hadoop and CGL-MapReduce on 40 million 2D data points; and 8192 by 8192 matrix multiplication implemented using CGL-MapReduce and Hadoop.
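To show why K-means fits the MapReduce model, here is a minimal single-process sketch of one way to phrase it. It does not use the CGL-MapReduce or Hadoop APIs; the function names and the in-memory "shuffle" are assumptions for illustration. The map step assigns each point to its nearest centroid, and the reduce step averages the points collected for each centroid.

```python
from collections import defaultdict

def kmeans_map(point, centroids):
    """Map: emit (index of nearest centroid, (point, 1))."""
    nearest = min(
        range(len(centroids)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(point, centroids[k])),
    )
    return nearest, (point, 1)

def kmeans_reduce(values):
    """Reduce: average all points that were assigned to one centroid."""
    dim = len(values[0][0])
    total = [0.0] * dim
    count = 0
    for point, n in values:
        for d in range(dim):
            total[d] += point[d]
        count += n
    return [t / count for t in total]

def kmeans_iteration(points, centroids):
    """One MapReduce round: map every point, group by key, then reduce."""
    groups = defaultdict(list)
    for p in points:
        key, value = kmeans_map(p, centroids)
        groups[key].append(value)  # stands in for the shuffle phase
    updated = [list(c) for c in centroids]  # centroids with no points stay put
    for key, values in groups.items():
        updated[key] = kmeans_reduce(values)
    return updated

def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        centroids = kmeans_iteration(points, centroids)
    return centroids
```

Because the algorithm repeats the same map and reduce steps many times, the per-iteration overhead of the runtime matters, which is one reason iterative workloads like this are useful for comparing programming models.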
