Using Hadoop to find popular words in open source Java code

Project Information

Discipline
Computer Science (401) 
Orientation
Research 
Abstract

Open source projects make a large amount of code freely available. Code with similar functionality may share common “words”, such as class names, variable names, and method names. Based on these shared features, we might help programmers find the code most closely related to their own for reference. Finding similar code can be difficult; finding common words is easier. As a first step toward that goal, this project finds the popular words used in open source Java projects. We use Hadoop for the word counting job because it is convenient and can easily scale to larger data sets.
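
The word counting idea can be illustrated with a minimal sketch in plain Java, without any Hadoop dependency. This is not the project's actual job code. The class name IdentifierWordCount, its method names, and the identifier pattern are all assumptions made for illustration. The map step emits every identifier-like token in a line of source, and the reduce step sums the occurrences of each token. In a real Hadoop job these two steps would live in Mapper and Reducer classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, Hadoop-free sketch of the map and reduce steps of word counting.
class IdentifierWordCount {
    // Assumed definition of a "word": a run of characters shaped like a Java identifier.
    private static final Pattern IDENTIFIER =
            Pattern.compile("[A-Za-z_$][A-Za-z0-9_$]*");

    // "Map" step: emit every identifier-like token found in one line of source.
    static List<String> map(String line) {
        List<String> words = new ArrayList<>();
        Matcher matcher = IDENTIFIER.matcher(line);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    // "Reduce" step: sum the occurrences of each emitted word.
    static Map<String, Integer> reduce(List<String> words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Runs map over every line, then reduces all emitted words together.
    static Map<String, Integer> countWords(List<String> lines) {
        List<String> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        return reduce(emitted);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "int count = 0;",
                "count = count + items.size();");
        // Prints {count=3, int=1, items=1, size=1}
        System.out.println(countWords(lines));
    }
}
```

The sketch counts keywords such as int alongside names, and it does not skip comments or string literals. A fuller version would probably filter those out before counting.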

Intellectual Merit

This is a course project, so it has no specific intellectual merit to report so far.

Broader Impacts

This project might help programmers find functionally related open source Java code, which would make their coding and debugging more convenient.

Project Contact

Project Lead
Lin Liu (linliu) 
Project Manager
Lin Liu (linliu) 

Resource Requirements

Hardware System
  • hotel (IBM iDataPlex at U Chicago)
 
Use of FutureGrid

FutureGrid provides myHadoop, which is a convenient way to run map and reduce jobs, so I will apply for around 20 nodes on FutureGrid as the project platform and run myHadoop on them.

Scale of Use

I need to apply for around 20 VMs for this purpose.

Project Timeline

Submitted
12/10/2013 - 21:53