Bilingual sentiment analysis on Malaysian social media using VADER and normalisation heuristics

This research addresses a number of important issues involved in performing Sentiment Analysis (SA) on Malaysian Social Media (SM), including an analysis of bilingual or mixed language, choice of sentiment lexicon, normalisation heuristics, and the use of public datasets. This work is the first to q...

Bibliographic Details
Main Authors: James Mountstephens; Tan, Mathieson Zui Quen; Lai, Po Hung
Format: Article
Language: English
Published: Journal of Theoretical and Applied Information Technology 2023
Subjects: QA76.75-76.765 Computer software; T10.5-11.9 Communication of technical information
Online Access: https://eprints.ums.edu.my/id/eprint/39226/1/ABSTRACT.pdf
https://eprints.ums.edu.my/id/eprint/39226/2/FULL%20TEXT.pdf
https://eprints.ums.edu.my/id/eprint/39226/
id my.ums.eprints.39226
record_format eprints
spelling my.ums.eprints.39226 2024-07-19T08:13:59Z https://eprints.ums.edu.my/id/eprint/39226/ Article NonPeerReviewed
citation James Mountstephens and Tan, Mathieson Zui Quen and Lai, Po Hung (2023) Bilingual sentiment analysis on Malaysian social media using VADER and normalisation heuristics. Journal of Theoretical and Applied Information Technology, 101 (12). pp. 5037-5050. ISSN 1992-8645
institution Universiti Malaysia Sabah
building UMS Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Malaysia Sabah
content_source UMS Institutional Repository
url_provider http://eprints.ums.edu.my/
language English
topic QA76.75-76.765 Computer software
T10.5-11.9 Communication of technical information
description This research addresses a number of important issues involved in performing Sentiment Analysis (SA) on Malaysian Social Media (SM), including an analysis of bilingual or mixed language, choice of sentiment lexicon, normalisation heuristics, and the use of public datasets. This work is the first to quantify the level of language mixing in informal Malaysian text. Analysis of the 2M-tweet Malaya dataset revealed a significant level of English sentiment content in Malaysian social media (13.5%), demonstrating the necessity of a bilingual approach to Malaysian Sentiment Analysis. Significant patterns in noisy Malaysian SM text were identified and heuristics for normalising them were devised. The popular and effective English lexicon-based SA system VADER (Valence Aware Dictionary and sEntiment Reasoner) was translated to Malay using automatic and manual methods, with the combination of English and Malay VADER yielding a bilingual SA system. A subset of the Malaya dataset was both corrected and extended from two to three classes in order to properly test the bilingual SA system. Bilingual VADER with normalisation heuristics achieved an impressive level of performance on a three-class problem (accuracy=0.71, mean F1=0.72) compared with Malay VADER alone and several popular machine learning-based algorithms.
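The description above outlines a pipeline of text normalisation followed by lexicon lookup in two languages. A minimal Python sketch of that idea follows, under stated assumptions: the lexicon entries, valence scores, shorthand table, and the function names squeeze, normalise, compound, and classify are all hypothetical and are not taken from the paper. The squashing formula x / sqrt(x^2 + 15) and the +/-0.05 class thresholds follow the conventions of the original English VADER; whether the authors kept them for the three-class task is not stated in this record.

```python
import math
import re

# Hypothetical toy lexicons with illustrative valences only.
# The real English and Malay VADER lexicons hold thousands of entries.
EN_LEXICON = {"good": 1.9, "love": 3.2, "bad": -2.5, "hate": -2.7}
MS_LEXICON = {"bagus": 1.9, "suka": 1.5, "teruk": -2.3, "benci": -2.7}
LEXICON = {**EN_LEXICON, **MS_LEXICON}

# Hypothetical shorthand expansions, standing in for the paper's heuristics.
SHORTHAND = {"bgs": "bagus", "gud": "good"}


def squeeze(token):
    """Undo letter elongation: 'goood' -> 'good', 'baguuus' -> 'bagus'."""
    # First shrink runs of three or more identical letters to two.
    two = re.sub(r"(.)\1{2,}", r"\1\1", token)
    if two in LEXICON:
        return two
    # Otherwise try single letters, but only keep that form if it is a known word.
    one = re.sub(r"(.)\1+", r"\1", two)
    return one if one in LEXICON else two


def normalise(text):
    """Lowercase, tokenise on letters, undo elongation, expand shorthand."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [squeeze(t) for t in tokens]
    return [SHORTHAND.get(t, t) for t in tokens]


def compound(tokens, alpha=15.0):
    """Sum valences over both lexicons and squash the total into (-1, 1)."""
    total = sum(LEXICON.get(t, 0.0) for t in tokens)
    return total / math.sqrt(total * total + alpha)


def classify(text, threshold=0.05):
    """Map the compound score to one of three classes."""
    score = compound(normalise(text))
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"
```

With these toy lexicons, classify("This movie is goood, baguuus!") returns "positive", because both elongated words normalise to lexicon entries, one English and one Malay. The sketch omits VADER's negation, booster, and punctuation rules.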
format Article
author James Mountstephens
Tan, Mathieson Zui Quen
Lai, Po Hung
title Bilingual sentiment analysis on Malaysian social media using VADER and normalisation heuristics
publisher Journal of Theoretical and Applied Information Technology
publishDate 2023