SENIOR PROJECTS for 1999-2000

The following senior projects are available from Kemal Oflazer. Send mail (ko@cs.bilkent.edu.tr) if you are interested or need more information.

Segmentation of Turkish text into sentences:

Identifying sentence boundaries in a text is very important in many natural language processing and text processing applications. Yet it is quite a non-trivial problem, since the punctuation used to mark sentence boundaries has other uses (e.g., not every "." ends a sentence; see "e.g." earlier!). Your work in this project will involve reviewing recent work on this topic and applying some of those techniques, and perhaps new ones, to this problem for Turkish. New work will probably involve developing some machine learning techniques. At the end, you will have developed a tool which will take a tokenized Turkish text as input and produce the same text as output, with the addition of symbols marking sentence boundaries.
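To make the input and output of such a tool concrete, here is a minimal rule-based baseline in Python. It is only a sketch under stated assumptions, not the method the project is expected to arrive at: the boundary symbol `<SB>`, the tiny abbreviation list, and the convention that punctuation marks arrive as separate tokens are all hypothetical choices made for illustration.

```python
# Hypothetical abbreviation list; a real tool would need a much larger one
# (or would learn such cues from annotated data).
ABBREVIATIONS = {"dr", "prof", "vb", "vs"}

# Hypothetical symbol inserted after each detected sentence boundary.
BOUNDARY = "<SB>"


def mark_sentence_boundaries(tokens):
    """Return a copy of `tokens` with BOUNDARY inserted after sentence ends.

    Assumes punctuation marks are separate tokens. A "." is NOT treated as
    a boundary if it follows a known abbreviation or if the next token
    starts with a lowercase letter.
    """
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if tok in {"!", "?"}:
            out.append(BOUNDARY)
        elif tok == ".":
            prev = tokens[i - 1].lower() if i > 0 else ""
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            if prev in ABBREVIATIONS:
                continue  # period belongs to an abbreviation
            if nxt is not None and nxt[:1].islower():
                continue  # sentence apparently continues
            out.append(BOUNDARY)
    return out


tokens = "Dr . Ahmet geldi . Sonra gitti .".split()
marked = mark_sentence_boundaries(tokens)
```

Hand-written rules like these break down quickly on real text, which is one reason a learned classifier over such cues (previous token, next token, capitalization) is the more likely direction for the project.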

Segmentation of sentences into chunks:

Analysis of very long sentences (for example, more than 30 words) is very time consuming and error prone, and there are often hundreds or thousands of possible analyses, of which only one is "correct". It is often possible to pre-partition a sentence into manageable chunks without using a grammar, relying instead on local heuristic knowledge. For instance, in the sentence

Eger Italya, Türk adaletinin kararlarindan endise duyuyor idiyse, Öcalan'i geçen Kasim ayinda yakalamisti, onu o zaman o yargilayabilirdi.

each colored block of text is a reasonable chunk which is more or less self-contained, with well-defined relationships to the other chunks. Once partitioned, analyzing the 3 chunks is a much easier task than analyzing the complete sentence with a complete grammar. This project involves reading about earlier work on the subject, developing a catalog of features for identifying chunk boundaries, developing machine learning/classification approaches to identify chunk boundaries, and evaluating how correct the results are.
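As a rough illustration of the "local heuristic" idea and of the evaluation step, the Python sketch below splits a sentence at every comma and scores predicted boundary positions against a reference set. Both function names are hypothetical, and the comma rule is only a naive baseline chosen for illustration: on the example sentence it yields four pieces rather than three chunks, so it over-segments.

```python
SENTENCE = ("Eger Italya, Türk adaletinin kararlarindan endise duyuyor idiyse, "
            "Öcalan'i geçen Kasim ayinda yakalamisti, "
            "onu o zaman o yargilayabilirdi.")


def comma_chunks(sentence):
    """Naive baseline: treat every comma as a chunk boundary."""
    pieces = [piece.strip() for piece in sentence.split(",")]
    return [piece for piece in pieces if piece]


def boundary_scores(predicted, gold):
    """Precision, recall and F1 of predicted boundary positions.

    `predicted` and `gold` are collections of token indices after which
    a chunk boundary is placed.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


pieces = comma_chunks(SENTENCE)
```

A learned approach would replace the comma rule with a classifier over the catalog of boundary features, while the same scoring function could still be used to compare its output against hand-marked boundaries.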

Developing a corpus annotation tool:

This project involves developing the user interface and functionality of a program for building a Turkish Treebank, a text database. The development is expected to be done in Java for maximum portability. The program will be able to load raw text and a formatted database, and allow the user to annotate or delete portions of the associated information. For more information about the text database, please refer to Design for a Turkish Treebank (Oflazer, Hakkani-Tür, Tür, in Proceedings of the Workshop on Linguistically Interpreted Corpora, EACL'99, Bergen, Norway, June 1999).

Developing a finite state noun phrase extractor for Turkish text:

Identification of noun phrases is an important step in analyzing sentences. This project involves developing a highly accurate finite state parser for identifying noun phrases in Turkish text. You will use state-of-the-art software to develop large-scale finite state machines which will analyze sentences and extract noun phrases with high accuracy.
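The general idea of a finite state noun phrase pattern can be sketched in a few lines of Python. This is a toy illustration, not the software or grammar the project will use: it assumes the input is already part-of-speech tagged, the tag names (DET, ADJ, NOUN) are hypothetical, and the single regular pattern "optional adjectives, optional determiner, optional adjectives, one or more nouns" is a drastic simplification of Turkish noun phrase structure.

```python
import re

# Hypothetical tag set, mapped to one character per tag so that positions
# in the code string line up with positions in the token list.
TAG_CODES = {"DET": "D", "ADJ": "A", "NOUN": "N"}

# Simplified regular pattern for a noun phrase: adjectives, an optional
# determiner, more adjectives, then one or more nouns.
NP_PATTERN = re.compile(r"A*D?A*N+")


def extract_noun_phrases(tagged):
    """Return noun phrases from a list of (word, tag) pairs."""
    codes = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged)
    phrases = []
    for match in NP_PATTERN.finditer(codes):
        words = [word for word, _ in tagged[match.start():match.end()]]
        phrases.append(" ".join(words))
    return phrases


tagged = [("uzun", "ADJ"), ("adam", "NOUN"), ("eski", "ADJ"),
          ("bir", "DET"), ("kitap", "NOUN"), ("okudu", "VERB")]
noun_phrases = extract_noun_phrases(tagged)
```

With the tagged input above, the pattern picks out "uzun adam" and "eski bir kitap". A realistic extractor would have to work from morphological analyses rather than three coarse tags, which is where large-scale finite state machinery comes in.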

Kemal Oflazer