ILS-Z604 Big Data Analytics for Web and Text - SP14 Group #2
Project Information
- Discipline
- Computer Science (401)
- Orientation
- Education
FutureGrid resources (Eucalyptus, Hadoop/MapReduce) will facilitate the completion of a data science project for ILS-Z604 at Indiana University. We propose to apply topic modeling algorithms to a collection of documents in a dataset from the HathiTrust Research Center. While we will likely begin by performing word frequency counts on the data, our ultimate aim is to identify patterns and semantic meaning in the corpus. Depending on the scope of the assignments and available resources, we may also create visualizations to help communicate topical relationships.
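As a rough illustration of the word frequency step mentioned above, the sketch below mimics the map and reduce phases of a MapReduce word count in plain Python. It is not the project's actual job: the function names, the lowercase ASCII tokenizer, and the in-memory input are assumptions made for illustration, and a real Hadoop run would distribute the two phases across nodes.

```python
import re
from collections import Counter
from itertools import chain

# Assumed tokenizer: lowercase ASCII letter runs only (hypothetical choice;
# OCR'd text would likely need more careful handling).
TOKEN = re.compile(r"[a-z]+")


def map_phase(document):
    """Emit a (word, 1) pair for every token in one document."""
    for word in TOKEN.findall(document.lower()):
        yield word, 1


def reduce_phase(pairs):
    """Sum the emitted counts for each distinct word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals


def word_frequencies(documents):
    """Run the map phase over all documents, then reduce the combined pairs."""
    return reduce_phase(chain.from_iterable(map_phase(d) for d in documents))


if __name__ == "__main__":
    sample = ["The cat sat.", "the Cat ran"]
    for word, count in word_frequencies(sample).most_common():
        print(word, count)
```

The same mapper and reducer logic could be adapted to read from standard input for use with Hadoop Streaming, but that wiring is left out here.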
Intellectual Merit
FutureGrid resources will further the course learning outcomes:
- The most important and current grounding philosophies, theories, and models for data science
- How to view real-world problems through the lens of these theories and models, and how to solve them from a data science perspective (case studies)
- Basic data processing and statistical analysis methods
- Basic machine learning, data retrieval, ranking, and recommendation algorithms
- Basics of R, Lucene, Hadoop, and NoSQL (MongoDB), in lab sessions
Broader Impacts
By completing the course-based research project, the team may find opportunities to identify and correct errors introduced by the optical character recognition (OCR) process, and to apply a working solution to a large corpus (250,000 volumes, ~500 GB). Results and methods will be documented in a final paper or in-class presentation.
Project Contact
- Project Lead
- Trevor Edelblute (tedelblu)
- Project Manager
- Trevor Edelblute (tedelblu)
- Project Members
- Siyuan Guo
Resource Requirements
- Hardware System
- india (IBM iDataPlex at IU)
We will use Eucalyptus on FutureGrid to run Hadoop/MapReduce jobs.
Scale of Use
We will use the system intermittently for the remainder of the spring 2014 semester.
Project Timeline
- Submitted
- 02/08/2014 - 14:54