Domain-specific stop words in Malaysian Parliamentary Debates 1959 – 2018
Main Authors: | , , , , |
---|---|
Format: | Article |
Language: | English |
Published: | Penerbit Universiti Kebangsaan Malaysia, 2021 |
Online Access: | http://journalarticle.ukm.my/17253/1/45125-156850-1-PB.pdf http://journalarticle.ukm.my/17253/ https://ejournal.ukm.my/gema/issue/view/1397 |
Summary: | Stop word removal is an essential step in natural language processing and text analysis.
Existing Malay stop word lists are based on standard Malay and on Malay translations of
Quranic/Arabic texts. No domain-specific stop word list exists, which makes the existing lists
a poor fit for processing Malay parliamentary discourse. In this paper, we propose a
semantic approach to identifying and removing function words in Malay, in conventional Malay
spelling, and in English from a time-series corpus, the Malaysian Hansard Corpus (MHC), in
order to extract a domain-specific Malay stop word list. The study combined the
Z-method of most frequently occurring words, words that appear only once, and the
classic method. The evaluated dataset spans Parliament 1 (1959) to
Parliament 13 (2018). The study then categorised the stop words into groups of
domain-related words. The resulting list comprises 587 stop words. New stop words that
emerged from the MHC include parliamentary words such as ‘Berhormat’ (a form of address for
members of Parliament), ‘Pertua’ (a form of address for the Speaker of the House), ‘ketawa’
(laugh) and ‘tepuk’ (clap). Besides typical English stop words such as ‘and’ and ‘the’, the list
contains ‘hon’ble’ (short for ‘Honourable’) and ‘honourable’. It also includes stop
words in conventional Malay spelling, such as ‘untok’ (for), ‘lebeh’ (more), and ‘kapada’ (to). The
proposed stop word list can be used to support further natural language processing and text
analysis. |
---|
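The combination described in the summary can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It reads the "Z-method" as a frequency ranking of tokens and the "classic method" as reuse of an existing stop word list; both readings are assumptions, since the summary does not define either term. The function names, the `top_n` cut-off, and the sample sentences are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphabetic tokens, allowing an internal
    # apostrophe so that forms like "hon'ble" stay in one piece.
    return re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())


def candidate_stop_words(documents, classic_list, top_n=100):
    """Union of three candidate sources (illustrative sketch only):
    the top_n most frequent tokens, tokens that occur exactly once,
    and an existing ("classic") stop word list."""
    counts = Counter(tok for doc in documents for tok in tokenize(doc))
    most_frequent = {word for word, _ in counts.most_common(top_n)}
    singletons = {word for word, n in counts.items() if n == 1}
    return most_frequent | singletons | set(classic_list)


# Hypothetical toy input; the real corpus is the Malaysian Hansard Corpus.
speeches = [
    "Yang Berhormat Tuan Pertua, Yang Berhormat",
    "Tuan Pertua, Yang Berhormat Menteri untok rakyat",
]
stop_words = candidate_stop_words(speeches, classic_list={"dan", "the"}, top_n=2)
```

In this toy input, 'tuan' and 'pertua' each occur twice, so they fall outside both the top two tokens and the singletons. A candidate set like this would still need the manual, semantic categorisation step that the summary describes before being used as a stop word list.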