Word Sense Disambiguation for Web 2.0 Data

Project Information

Discipline
Computer Science (401) 
Orientation
Research 
Abstract

In this work we plan to create an architecture in which a variety of parallel similarity and parallel clustering algorithms can be developed, tested, and run against Web 2.0 data. These algorithms will be used to analyze emerging semantics and word senses within the data.
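As a rough illustration of the similarity-then-clustering idea, the sketch below groups occurrences of an ambiguous word by the words around it. It is a minimal, sequential example and not the project's architecture: the function names, the sample posts, the window size, and the 0.25 threshold are all assumptions chosen for illustration, and the parallel execution the project targets is omitted.

```python
from collections import Counter
from math import sqrt


def context_vector(tokens, target, window=3):
    """Count the tokens within `window` positions of each `target` occurrence."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), i + window + 1
            counts.update(
                t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
            )
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_contexts(vectors, threshold=0.25):
    """Greedy single-pass clustering: join the first cluster whose seed
    (its first member) is at least `threshold` similar, else start a new one."""
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Hypothetical user-generated snippets containing the ambiguous word "apple".
posts = [
    "new apple phone released today",
    "apple phone battery review",
    "baked an apple pie recipe",
    "apple pie with cinnamon recipe",
]
vectors = [context_vector(p.split(), "apple") for p in posts]
clusters = cluster_contexts(vectors)
print(clusters)
```

Each resulting cluster is a list of post indices that share a candidate sense of the target word. The pairwise similarity computations are independent of one another, which is the kind of work that could be distributed across several VMs.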

Intellectual Merit

User-generated data on the Web is but one example of where researchers are seeing the challenges of "big data." This phenomenon arises when datasets are generated and updated at scales that make them difficult to store, manage, and visualize, among other challenges. This project will allow students and researchers to investigate the challenges of big data from a computer science and engineering perspective. The specific goal is to investigate a natural language processing problem (word sense disambiguation), producing results for that problem while also informing the broader big data context. The project is supported by two faculty members and a Ph.D. student in computer science. Insight gained from this project will benefit the following research communities: natural language processing, information modeling, and cloud and grid computing.

Broader Impacts

The broader impact of this project is to provide a Ph.D. student with a dissertation topic that can then be expanded into future course material for students at Indiana University. The project ties in well with the mission of Indiana University's School of Informatics and Computing: teaching and researching computing and information technology topics while integrating them with scientific and human issues. The results of this project will allow other institutions to use the methodologies and framework to perform the same experiments.

Project Contact

Project Lead
Jonathan Klinginsmith (jklingin) 
Project Manager
Jonathan Klinginsmith (jklingin) 

Resource Requirements

Hardware Systems
  • india (IBM iDataPlex at IU)
  • xray (Cray XT5m at IU)
 
Use of FutureGrid

Investigate emerging semantics in natural language Web 2.0 data.

Scale of Use

Approximately ten VMs for running experiments. We will reuse these VMs many times over the course of a couple of months to test a variety of algorithms.

Project Timeline

Submitted
11/02/2010 - 10:40