Domain-specific stop words in Malaysian Parliamentary Debates 1959 – 2018

Bibliographic Details
Main Authors: Anis Nadiah Che Abdul Rahman, Imran Ho Abdullah, Intan Safinaz Zainudin, Sabrina Tiun, Azhar Jaludin
Format: Article
Language: English
Published: Penerbit Universiti Kebangsaan Malaysia 2021
Online Access: http://journalarticle.ukm.my/17253/1/45125-156850-1-PB.pdf
http://journalarticle.ukm.my/17253/
https://ejournal.ukm.my/gema/issue/view/1397
id my-ukm.journal.17253
record_format eprints
citation Anis Nadiah Che Abdul Rahman, Imran Ho Abdullah, Intan Safinaz Zainudin, Sabrina Tiun and Azhar Jaludin (2021) Domain-specific stop words in Malaysian Parliamentary Debates 1959 – 2018. GEMA: Online Journal of Language Studies, 21 (2). pp. 1-27. ISSN 1675-8021. Peer-reviewed article (application/pdf, en), Penerbit Universiti Kebangsaan Malaysia, 2021-05. Record timestamp 2021-08-03T05:57:49Z. https://ejournal.ukm.my/gema/issue/view/1397
institution Universiti Kebangsaan Malaysia
building Tun Sri Lanang Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Kebangsaan Malaysia
content_source UKM Journal Article Repository
url_provider http://journalarticle.ukm.my/
language English
description Stop word removal is an essential step in Natural Language Processing and text analysis. Existing work on Malay stop words is based on standard Malay and on Quranic/Arabic translations into Malay. As a result, there is no domain-specific stop word list suited to processing Malay parliamentary discourse. This paper proposes a semantic approach to identifying and removing functional words in Malay, in conventional Malay spelling, and in English when analysing a time-series corpus, the Malaysian Hansard Corpus (MHC), in order to extract a domain-specific Malay stop word list. The study combined three methods: the Z-method of most frequently occurring words, words that appear only once, and the classic method. The corpus data evaluated span Parliament 1 (1959) to Parliament 13 (2018). The study then categorised the stop word list according to domain-specific related words. The resulting list comprises 587 stop words. New stop words that emerged from the MHC include parliamentary words such as ‘Berhormat’ (salutation to Members of Parliament), ‘Pertua’ (salutation to the Speaker of the House), ‘ketawa’ (laugh) and ‘tepuk’ (clap). Besides typical English stop words such as ‘and’ and ‘the’, the list contains words such as ‘hon’ble’ (short for ‘Honourable’) and ‘honourable’. It also includes stop words in conventional Malay spelling such as ‘untok’ (for), ‘lebeh’ (more), and ‘kapada’ (to). The proposed set of stop words can be used to support further natural language processing and text analysis.
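The combination of methods named in the description can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give frequency cut-offs, the seed list, or the details of the semantic filtering step, so the function names, the `top_n` parameter, the toy sentences, and the reading of the "classic method" as seeding from an existing stop word list are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of letters, keeping internal
    apostrophes so that a form like "hon'ble" stays one token."""
    return re.findall(r"[^\W\d_]+(?:['’][^\W\d_]+)*", text.lower())


def candidate_stop_words(documents, top_n, classic_list):
    """Return three candidate sets (hypothetical helper, not from the paper):
    the top_n most frequent words, words that occur exactly once, and an
    existing ("classic") stop word list supplied by the caller."""
    counts = Counter(tok for doc in documents for tok in tokenize(doc))
    most_frequent = {word for word, _ in counts.most_common(top_n)}
    appear_once = {word for word, n in counts.items() if n == 1}
    classic = {word.lower() for word in classic_list}
    return most_frequent, appear_once, classic


def remove_stop_words(text, stop_words):
    """Drop every token that is in the stop word set."""
    return [tok for tok in tokenize(text) if tok not in stop_words]


# Toy sentences invented for this sketch; they are not taken from the MHC.
documents = [
    "Yang Berhormat rakyat minta penjelasan",
    "Yang Berhormat rakyat sila duduk",
    "Yang Berhormat ketawa",
]
most_frequent, appear_once, classic = candidate_stop_words(
    documents, top_n=2, classic_list=["dan", "the"]
)
stop_words = most_frequent | appear_once | classic
kept = remove_stop_words("Yang Berhormat rakyat dan ketawa", stop_words)
print(kept)
```

On a sample this small almost every word occurs once, so nearly everything is flagged; on a corpus the size of the MHC the candidate sets would still need the manual, semantic review that the description mentions before a final list is fixed.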
format Article
author Anis Nadiah Che Abdul Rahman
Imran Ho Abdullah
Intan Safinaz Zainudin
Sabrina Tiun
Azhar Jaludin