Generic named-entity recognition for indigenous languages of Sarawak (Nersil)

The aim of this research is to create the first Named Entity Recognition (NER) system for the Sarawak Indigenous Languages (SILs), hereinafter called NERSIL. The main goal of NERSIL is to achieve good accuracy in identifying and classifying named entities (NEs). The NEs...

Bibliographic Details
Main Author: Yong, Soo Fong
Format: Thesis
Language: English
Published: Universiti Malaysia Sarawak, (UNIMAS) 2013
Subjects: T Technology (General)
Online Access: http://ir.unimas.my/id/eprint/8340/3/Generic%20Named-Entity%20Recognition%20For%20Indigenous%20Languages%20of%20Sarawak%20%28NERSIL%29%20%28full%29.pdf
http://ir.unimas.my/id/eprint/8340/
id my.unimas.ir.8340
record_format eprints
spelling my.unimas.ir.8340 2023-05-25T09:43:08Z http://ir.unimas.my/id/eprint/8340/ Generic named-entity recognition for indigenous languages of Sarawak (Nersil) Yong, Soo Fong T Technology (General) Universiti Malaysia Sarawak, (UNIMAS) 2013 Thesis NonPeerReviewed text en http://ir.unimas.my/id/eprint/8340/3/Generic%20Named-Entity%20Recognition%20For%20Indigenous%20Languages%20of%20Sarawak%20%28NERSIL%29%20%28full%29.pdf Yong, Soo Fong (2013) Generic named-entity recognition for indigenous languages of Sarawak (Nersil). Masters thesis, Universiti Malaysia Sarawak, (UNIMAS).
institution Universiti Malaysia Sarawak
building Centre for Academic Information Services (CAIS)
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Malaysia Sarawak
content_source UNIMAS Institutional Repository
url_provider http://ir.unimas.my/
language English
topic T Technology (General)
spellingShingle T Technology (General)
Yong, Soo Fong
Generic named-entity recognition for indigenous languages of Sarawak (Nersil)
description The aim of this research is to create the first Named Entity Recognition (NER) system for the Sarawak Indigenous Languages (SILs), hereinafter called NERSIL. The main goal of NERSIL is to achieve good accuracy in identifying and classifying named entities (NEs). The NEs considered in this research are Person, Location, Organisation, Date, Time, Monetary and Percentage. Generally, all these NEs carry important information about the text itself. Thus, they are targets for extraction. NER approaches can be categorised broadly as rule-based, machine learning-based, and hybrid. The rule-based approach relies on hand-crafted linguistic grammars. The machine learning-based approach needs a large amount of annotated training data, which is unavailable for SILs. The hybrid approach combines the rule-based and machine learning-based approaches. NERSIL requires special attention because the existing NER approaches cannot be applied to SILs directly. In this thesis, an NER system that is built by extending and modifying the existing NER approaches is presented. There are three main processes: the non-modified ANNIE (A Nearly-New IE system) NER, the adaptation of ANNIE to SILs, and finally the context investigation. Firstly, the input texts are submitted to an English NER, in this case ANNIE, on the assumption that some NEs that appear in English texts will also occur in SIL texts. At that stage, the rules for unrecognised NEs are distinguished from the rules for recognised NEs. Next, new rules for the unrecognised NEs are written and new gazetteers for SILs are built in order to identify more NEs. However, the first two processes are not enough to provide good accuracy in recognising all NEs. Thus, context investigation is needed. Context investigation includes frequency analysis, triggered-word filtering, and concordance analysis. The context of an NE (the left or right side of the NE) will be investigated. Finally, an NER system designed for SILs will be an advancement of world knowledge. In addition, the design can be improved by incorporating machine translation and WordNet, and by adding more noise filtering (e.g. context filtering and morphological filtering). With more research and future studies, this NER system will reach a high level of performance comparable to that of English NER.
format Thesis
author Yong, Soo Fong
author_facet Yong, Soo Fong
author_sort Yong, Soo Fong
title Generic named-entity recognition for indigenous languages of Sarawak (Nersil)
title_sort generic named-entity recognition for indigenous languages of sarawak (nersil)
publisher Universiti Malaysia Sarawak, (UNIMAS)
publishDate 2013
url http://ir.unimas.my/id/eprint/8340/3/Generic%20Named-Entity%20Recognition%20For%20Indigenous%20Languages%20of%20Sarawak%20%28NERSIL%29%20%28full%29.pdf
http://ir.unimas.my/id/eprint/8340/
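
The context investigation named in the description (concordance analysis, frequency analysis, triggered-word filtering) can be illustrated with a small Python sketch. This is not the thesis's implementation, which extends ANNIE; it is a toy reading of the idea under stated assumptions. The gazetteer, the sample text, the window size, the threshold of 2, and all function names are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical mini-gazetteer of known locations (not taken from the thesis).
GAZETTEER = {"Kuching", "Sibu"}

def tokenize(text):
    """Split text into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text)

def concordance(tokens, gazetteer, window=2):
    """Concordance analysis: collect the left and right context of each known NE."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok in gazetteer:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, tok, right))
    return hits

def left_trigger_counts(hits):
    """Frequency analysis: count the word immediately to the left of each NE."""
    return Counter(left[-1].lower() for left, _, _ in hits if left)

def propose_candidates(tokens, triggers, gazetteer):
    """Triggered-word filtering: capitalised tokens that follow a trigger word
    and are not already in the gazetteer become candidate NEs."""
    return [
        tok
        for prev, tok in zip(tokens, tokens[1:])
        if prev.lower() in triggers and tok[:1].isupper() and tok not in gazetteer
    ]

# Invented toy text, not real corpus data; "di" stands in for a locative trigger word.
tokens = tokenize("di Kuching. di Sibu. di Bintulu.")
hits = concordance(tokens, GAZETTEER)
counts = left_trigger_counts(hits)
# Assumed threshold: a left-context word seen at least twice counts as a trigger.
triggers = {word for word, n in counts.items() if n >= 2}
candidates = propose_candidates(tokens, triggers, GAZETTEER)
print(candidates)
```

In this sketch the known entities supply the contexts, the frequent left-context words become triggers, and the triggers in turn surface entities the gazetteer missed. A real system would need hand-checked triggers and filtering to limit false positives.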