Computational Morphological Resources Management System
Main Author: | |
---|---|
Format: | Final Year Project Report |
Language: | English |
Published: | Universiti Malaysia Sarawak (UNIMAS), 2014 |
Subjects: | |
Online Access: | http://ir.unimas.my/id/eprint/39301/1/JOVIANNA%20%2824%20pgs%29.pdf http://ir.unimas.my/id/eprint/39301/4/JOVIANNA%20%28fulltext%29.pdf http://ir.unimas.my/id/eprint/39301/ |
Summary: | In Natural Language Processing (NLP), a morphological analyser is one of the most basic processing tools, because it allows the structure of a word to be studied. To analyse word structure, the analyser needs morphological resources as a crucial input. Currently, morphological resources are acquired manually, which consumes a lot of time and energy. We therefore propose the Computational Morphological Resources Management System (CMRMS), a management system that eases the linguist's work during pre-processing and allows the linguist to induce morphological information from the resulting wordlist. To overcome the time and energy problem, an automated approach was developed that combines manual pre-processing with an automatic file management system to obtain a wordlist and segmented data. CMRMS has three main modules: tokenization, conversion and segmentation tools. The tokenization module tokenizes text file data obtained from hardcopy data, softcopy data and existing data into individual words. The conversion module converts two types of softcopy data: PDF files and HTML files. Lastly, the segmentation tools module provides two segmentation tools, Linguistica and Morfessor, to analyse the tokenized data. To test the functionality of CMRMS, three types of testing were carried out: system, component and integration testing. Each gave good results, showing that CMRMS is able to produce the required output. The system helps linguists manage their time more efficiently, since they no longer have to carry out pre-processing manually. Using CMRMS, they can obtain the wordlist easily. In addition, the produced wordlist can be re-used as input for other segmentation processes. |
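The tokenization step described in the summary, turning a text file into a word-by-word list, can be illustrated with a minimal sketch. This is not the CMRMS implementation, which the record does not describe in detail. The function names `tokenize` and `build_wordlist`, the lowercasing, the regular expression and the one-word-per-line output are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical helper, not taken from CMRMS: split raw text into lowercase words.
def tokenize(text):
    # A word is a run of letters, optionally joined by internal
    # apostrophes or hyphens (e.g. "it's", "re-used").
    return re.findall(r"[^\W\d_]+(?:['-][^\W\d_]+)*", text.lower())

# Hypothetical helper: count tokens and return (word, count) pairs,
# most frequent first, ties broken alphabetically.
def build_wordlist(text):
    counts = Counter(tokenize(text))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

sample = "The linguist tokenizes the text. The text is re-used."
wordlist = build_wordlist(sample)

# Assumed output format: one distinct word per line.
for word, _count in wordlist:
    print(word)
```

A list of distinct words like this is the kind of wordlist that could then be handed to a segmentation tool. The input format that Linguistica or Morfessor actually expects should be checked against each tool's own documentation.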