Benchmarking performance of document level classification and topic modeling

Text classification for a low-resource language is a non-trivial and challenging problem. This paper discusses Urdu news classification and Urdu document similarity. Urdu is one of the most widely spoken languages in Asia. The implementation of computational methodologies for text cla...

Bibliographic Details
Main Authors: Bhatti, Muhammad Shahid, Ullah, Azmat, Latip, Rohaya, Sohail, Abid, Riaz, Anum, Hassan, Rohail
Format: Article
Published: Tech Science Press (ISSN: 1546-2218; ESSN: 1546-2226) 2021
Online Access: http://psasir.upm.edu.my/id/eprint/96189/
https://www.techscience.com/cmc/v71n1/45375
id my.upm.eprints.96189
record_format eprints
spelling my.upm.eprints.96189 2023-01-31T03:13:04Z http://psasir.upm.edu.my/id/eprint/96189/ Article PeerReviewed Bhatti, Muhammad Shahid and Ullah, Azmat and Latip, Rohaya and Sohail, Abid and Riaz, Anum and Hassan, Rohail (2021) Benchmarking performance of document level classification and topic modeling. CMC-Computers Materials & Continua, 71 (1). pp. 1-15. ISSN 1546-2218; ESSN: 1546-2226. Tech Science Press. https://www.techscience.com/cmc/v71n1/45375 doi:10.32604/cmc.2022.020083
institution Universiti Putra Malaysia
building UPM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Putra Malaysia
content_source UPM Institutional Repository
url_provider http://psasir.upm.edu.my/
description Text classification for a low-resource language is a non-trivial and challenging problem. This paper discusses Urdu news classification and Urdu document similarity. Urdu is one of the most widely spoken languages in Asia. The use of computational methods for text classification has increased over time. However, Urdu has received little research attention and lacks readily available datasets, which is the primary reason that research on it is limited and that the latest methods have rarely been applied to it. To overcome these obstacles, a medium-sized dataset with six categories was collected from authentic Pakistani news sources. Urdu is a rich but complex language, and its complex features make text processing more challenging than for other languages. A term frequency-inverse document frequency (TFIDF) term weighting scheme is used for feature extraction, chi-2 for selecting essential features, and linear discriminant analysis (LDA) for dimensionality reduction. The TFIDF matrix and the cosine similarity measure are used to identify similar documents in a collection, and a FastText model is applied to find the semantic meaning of words in a document. A training-test split is used for evaluation, with 70% of the data for training and 30% for testing. State-of-the-art machine learning and deep dense neural network approaches are applied to Urdu news classification: Multinomial Naïve Bayes, XGBoost, Bagging, and a deep dense neural network were trained. Bagging and the deep dense neural network outperformed the other algorithms. The experimental results show that the deep dense neural network achieves a 92.0% mean F1 score and Bagging a 95.0% F1 score.
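The document-similarity step mentioned in the description (a TFIDF matrix compared with cosine similarity) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the smoothed IDF variant, and the toy whitespace-tokenized documents are all assumptions made for illustration, and real Urdu text would need proper tokenization first.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF is an assumption; the description does not name a variant.
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_index, vectors):
    """Index of the document most similar to the one at query_index."""
    scores = [
        (cosine_similarity(vectors[query_index], vec), i)
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    return max(scores)[1]


# Hypothetical placeholder documents, not the paper's dataset.
docs = [
    "cricket team wins match".split(),
    "team loses cricket match".split(),
    "stock market falls sharply".split(),
]
vectors = tfidf_vectors(docs)
nearest = most_similar(0, vectors)
```

In this toy collection the first two documents share three terms and the third shares none, so the nearest neighbour of the first document is the second. A library implementation would normally replace these helpers in practice.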
format Article
author Bhatti, Muhammad Shahid
Ullah, Azmat
Latip, Rohaya
Sohail, Abid
Riaz, Anum
Hassan, Rohail
title Benchmarking performance of document level classification and topic modeling
publisher Tech Science Press (ISSN: 1546-2218; ESSN: 1546-2226)
publishDate 2021
url http://psasir.upm.edu.my/id/eprint/96189/
https://www.techscience.com/cmc/v71n1/45375