ILS-Z604 Big Data Analytics for Web and Text - SP14 Group #2

Abstract

FutureGrid resources (Eucalyptus, Hadoop/MapReduce) will support the completion of a data science project for ILS-Z604 at Indiana University. We propose to apply topic modeling algorithms to a collection of documents in a dataset from the HathiTrust Research Center. We will likely begin with word frequency counts on the data, but our ultimate goal is to identify patterns and semantic meaning in the corpus. Depending on the scope of the assignments and the available resources, we may also create visualizations to help communicate topical relationships.
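The word frequency step mentioned above can be sketched in the map/shuffle/reduce style that Hadoop uses. This is only an illustrative, in-memory sketch under stated assumptions: the function names (`map_words`, `shuffle`, `reduce_counts`, `word_frequencies`) and the `sample_pages` text are hypothetical, the tokenizer is a simple lowercase ASCII-letter pattern, and none of it reflects the project's actual jobs or the HathiTrust data format.

```python
import re
from collections import defaultdict
from itertools import chain

# Assumption: tokens are runs of ASCII letters after lowercasing.
TOKEN_RE = re.compile(r"[a-z]+")


def map_words(document):
    """Map step: emit a (word, 1) pair for every token in one document."""
    for word in TOKEN_RE.findall(document.lower()):
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group the emitted counts by word, as the framework would."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_frequencies(documents):
    """Run map, shuffle, and reduce over an iterable of document strings."""
    pairs = chain.from_iterable(map_words(doc) for doc in documents)
    return reduce_counts(shuffle(pairs))


# Hypothetical stand-in for pages of OCR text.
sample_pages = ["The whale, the sea.", "The sea was calm."]
frequencies = word_frequencies(sample_pages)
```

On a real cluster the map and reduce functions would run as separate tasks over splits of the corpus, and the framework would handle the grouping step. The in-memory `shuffle` here only stands in for that behavior on a small sample.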

Intellectual Merit

FutureGrid resources will further the course learning outcomes:
--The most important and current grounding philosophies, theories, and models for data science
--How to view real-world problems through the lens of these theories and models and solve these problems using the data science perspective (case studies)
--Basic data processing and statistical analysis methods
--Basic machine learning, data retrieval, ranking, and recommendation algorithms
--Basics of R, Lucene, Hadoop and NoSQL (MongoDB), in lab sessions

Broader Impact

By completing the course-based research project, the team may find opportunities to identify and correct errors resulting from the optical character recognition process, and to apply a working solution to a large corpus (250,000 volumes, ~500GB). Results and methods will be documented in a final paper or in-class presentation.

Use of FutureGrid

We will use Eucalyptus on FutureGrid to run Hadoop/MapReduce jobs.

Scale Of Use

We will use the system intermittently for the remainder of the spring 2014 semester.

Publications


Results


FG-411
Trevor Edelblute
Indiana University
Active

Project Members

Siyuan Guo

Timeline

8 weeks 52 min ago