Word Sense Disambiguation for Web 2.0 Data
Abstract
In this work, we plan to create an architecture for developing and testing a variety of parallel similarity and parallel clustering algorithms on Web 2.0 data. These algorithms will be used to analyze emerging semantics and word senses within the data.
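To make the similarity-then-clustering idea concrete, the following is a minimal sketch and not the project's actual algorithm. It assumes each post is already tokenized and stop-word filtered, represents each use of a target word by a bag of nearby words, compares uses with cosine similarity, and groups them with a simple greedy clustering pass. The function names, the window size, and the 0.3 threshold are all illustrative assumptions.

```python
import math
from collections import Counter


def context_vector(tokens, target, window=3):
    """Count the words within `window` positions of each occurrence of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # Add every neighbor in the window except the target occurrence itself.
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_senses(vectors, threshold=0.3):
    """Greedy single-pass clustering (an assumed stand-in for a real algorithm).

    Each vector joins the first cluster whose seed vector is at least
    `threshold` similar; otherwise it starts a new cluster.
    """
    clusters = []  # list of (seed_vector, member_indices)
    for idx, vec in enumerate(vectors):
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append((vec, [idx]))
    return [members for _, members in clusters]


# Toy, hand-written posts (not real Web 2.0 data).
posts = [
    "deposit money bank loan".split(),
    "bank loan money transfer".split(),
    "river bank fishing boat".split(),
    "fishing river bank mud".split(),
]
vectors = [context_vector(p, "bank") for p in posts]
groups = cluster_senses(vectors)
```

On this toy input, the financial and riverside uses of "bank" fall into separate groups. The sketch is sequential; because each pairwise similarity is independent of the others, that step is the natural candidate for parallel execution.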
Intellectual Merit
User-generated data on the Web is but one example of where researchers encounter the challenges of "big data." This phenomenon can be described as the problem of datasets being generated and updated at scales where they become difficult to store, manage, and visualize, among other challenges. This project will allow students and researchers to investigate the challenges of big data from a computer science and engineering perspective. The specific goal of this project is to investigate a natural language processing problem (word sense disambiguation), which will yield results for that problem and also inform the greater context of the big data paradigm. The project is supported by two faculty members and a Ph.D. student in computer science. Insight gained from this project will benefit the following research communities: natural language processing, information modeling, and cloud and grid computing.
Broader Impact
The broader impact of this project is to provide a Ph.D. student with a dissertation topic that can then be expanded into future teaching for students at Indiana University. The project ties in well with the mission of Indiana's School of Informatics and Computing: teaching and researching computing and information technology topics while integrating these topics into scientific and human issues. The results of this project will allow other institutions to use the methodologies and framework to perform the same experiments.
Use of FutureGrid
Investigate emerging semantics in natural language Web 2.0 data.
Scale Of Use
We will use around ten VMs to run experiments, reusing them many times over the course of a couple of months to test a variety of algorithms.
Publications
- [fg-2292] Klinginsmith, J., M. Mahoui, and Y. M. Wu, "Towards Reproducible eScience in the Cloud", November (http://www.ds.unipi.gr/cloudcom2011/program/accepted-papers.html).
Results
In this work, we demonstrated the following:
- The separation of scalable computing environments into two distinct layers: (1) the infrastructure layer and (2) the software layer.
- A demonstration, through this separation of concerns, that the installation and configuration operations performed within the software layer can be reused in separate clouds.
- The creation of two distinct types of computational clusters using the framework.
- Two fully reproducible eScience experiments built on top of the framework.
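The two-layer separation listed above can be illustrated with a small sketch. This is not the framework from the paper: the provisioning functions, host names, addresses, and step strings below are all hypothetical, and no real cloud API is called. The point is only that the software layer takes a list of hosts and never needs to know which cloud produced them.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Host:
    name: str
    address: str


# Infrastructure layer: one provisioning function per cloud.
# Both are hypothetical stand-ins that fabricate host records.
def provision_cloud_a(count: int) -> List[Host]:
    return [Host(f"a-node{i}", f"10.0.0.{i + 1}") for i in range(count)]


def provision_cloud_b(count: int) -> List[Host]:
    return [Host(f"b-node{i}", f"192.168.1.{i + 1}") for i in range(count)]


# Software layer: cloud-independent installation and configuration steps.
def software_plan(hosts: List[Host]) -> List[str]:
    """Return the ordered steps for a cluster: first host is master, rest are workers."""
    roles = ["master"] + ["worker"] * (len(hosts) - 1)
    steps = []
    for role, host in zip(roles, hosts):
        steps.append(f"{host.name}: install packages")
        steps.append(f"{host.name}: configure as {role}")
    return steps


def build_cluster(provision: Callable[[int], List[Host]], count: int) -> List[str]:
    """Combine any infrastructure layer with the shared software layer."""
    return software_plan(provision(count))


plan_a = build_cluster(provision_cloud_a, 3)
plan_b = build_cluster(provision_cloud_b, 3)
```

Apart from the host names, `plan_a` and `plan_b` contain the same steps in the same order, which is the reuse property the separation of concerns is meant to provide.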