ILS-Z604 Big Data Analytics for Web and Text - SP14 Group #2

Project Information

Discipline
Computer Science (401) 
Orientation
Education 
Abstract

FutureGrid resources (Eucalyptus Hadoop/MapReduce) will facilitate the completion of a data science project for ILS-Z604 at Indiana University. We propose to apply topic modeling algorithms to a collection of documents drawn from a HathiTrust Research Center dataset. While we will likely begin by performing word frequency counts on the data, our ultimate goal is to identify patterns and semantic meaning in the corpus. Depending on the scope of the assignments and the available resources, we may also create visualizations to help communicate topical relationships.
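As an illustration of the word frequency step described in the abstract, here is a minimal sketch in plain Python that follows the map and reduce pattern. It is not the project's actual Hadoop job: the function names, the tokenization rule, and the two-sentence sample corpus are all assumptions made for the example.

```python
import re
from collections import defaultdict
from itertools import chain

# Assumption: a token is a run of ASCII letters; real OCR text would need more care.
TOKEN_RE = re.compile(r"[a-z]+")


def map_words(document):
    """Map step: emit a (word, 1) pair for every token in one document."""
    for word in TOKEN_RE.findall(document.lower()):
        yield word, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_frequencies(documents):
    """Run the map step over every document, then reduce all pairs together."""
    return reduce_counts(chain.from_iterable(map_words(d) for d in documents))


# Hypothetical sample corpus, not taken from the HathiTrust data.
sample = ["The cat sat.", "The dog sat down."]
counts = word_frequencies(sample)
print(sorted(counts.items()))
```

On a cluster, the map and reduce functions would run as separate distributed tasks; here they are chained in a single process only to show the data flow.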

Intellectual Merit

FutureGrid resources will further the course learning outcomes:
  • The most important and current grounding philosophies, theories, and models for data science
  • How to view real-world problems through the lens of these theories and models and solve these problems using the data science perspective (case studies)
  • Basic data processing and statistical analysis methods
  • Basic machine learning, data retrieval, ranking, and recommendation algorithms
  • Basics of R, Lucene, Hadoop, and NoSQL (MongoDB), in lab sessions

Broader Impacts

By completing the course-based research project, the team may identify opportunities for identifying and correcting errors resulting from the optical character recognition process, and apply a working solution to a large corpus (250,000 volumes, ~500GB). Results and methods will be documented in a final paper or in-class presentation.

Project Contact

Project Lead
Trevor Edelblute (tedelblu) 
Project Manager
Trevor Edelblute (tedelblu) 
Project Members
Siyuan Guo  

Resource Requirements

Hardware System
  • india (IBM iDataPlex at IU)
 
Use of FutureGrid

We will use Eucalyptus on FutureGrid to run Hadoop/MapReduce jobs.

Scale of Use

We will use the system intermittently for the remainder of the spring 2014 semester.

Project Timeline

Submitted
02/08/2014 - 14:54