Staff View: Benchmarking performance of document level classification and topic modeling

Benchmarking performance of document level classification and topic modeling

Text classification of low-resource languages is always a non-trivial and challenging problem. This paper discusses the process of Urdu news classification and Urdu document similarity. Urdu is one of the most widely spoken languages in Asia. The implementation of computational methodologies for text cla...

Bibliographic Details
Main Authors:	Bhatti, Muhammad Shahid; Ullah, Azmat; Latip, Rohaya; Sohail, Abid; Riaz, Anum; Hassan, Rohail
Format:	Article
Published:	Tech Science Press (ISSN: 1546-2218; ESSN: 1546-2226) 2021
Online Access:	http://psasir.upm.edu.my/id/eprint/96189/ https://www.techscience.com/cmc/v71n1/45375

id	my.upm.eprints.96189
record_format	eprints
spelling	my.upm.eprints.96189 2023-01-31T03:13:04Z http://psasir.upm.edu.my/id/eprint/96189/ Benchmarking performance of document level classification and topic modeling Bhatti, Muhammad Shahid Ullah, Azmat Latip, Rohaya Sohail, Abid Riaz, Anum Hassan, Rohail 1546-2218; ESSN: 1546-2226 2021 Article PeerReviewed Bhatti, Muhammad Shahid and Ullah, Azmat and Latip, Rohaya and Sohail, Abid and Riaz, Anum and Hassan, Rohail (2021) Benchmarking performance of document level classification and topic modeling. CMC-Computers Materials & Continua, 71 (1). pp. 1-15. ISSN Tech Science Press https://www.techscience.com/cmc/v71n1/45375 10.32604/cmc.2022.020083
institution	Universiti Putra Malaysia
building	UPM Library
collection	Institutional Repository
continent	Asia
country	Malaysia
content_provider	Universiti Putra Malaysia
content_source	UPM Institutional Repository
url_provider	http://psasir.upm.edu.my/
description	Text classification of low-resource languages is always a non-trivial and challenging problem. This paper discusses the process of Urdu news classification and Urdu document similarity. Urdu is one of the most widely spoken languages in Asia. The use of computational methodologies for text classification has increased over time. However, the Urdu language has received little research attention and lacks readily available datasets, which is the primary reason that research on Urdu, and the application of the latest methodologies to it, remains limited. To overcome these obstacles, a medium-sized dataset with six categories is collected from authentic Pakistani news sources. Urdu is a rich but complex language, and its complex features make text processing more challenging than for other languages. A term frequency-inverse document frequency (TFIDF) based term weighting scheme is used for extracting features, chi-squared (chi-2) for selecting essential features, and linear discriminant analysis (LDA) for dimensionality reduction. A TFIDF matrix and the cosine similarity measure are used to identify similar documents in a collection, and a FastText model is applied to find the semantic meaning of words in a document. A training-test split evaluation methodology is used for this experimentation, with 70% of the data for training and 30% for testing. State-of-the-art machine learning and deep dense neural network approaches are used for Urdu news classification. Finally, we trained Multinomial Naïve Bayes, XGBoost, Bagging, and a deep dense neural network. Bagging and the deep dense neural network outperformed the other algorithms. The experimental results show that the deep dense neural network achieves a 92.0% mean F1 score, and Bagging a 95.0% F1 score.
format	Article
author	Bhatti, Muhammad Shahid Ullah, Azmat Latip, Rohaya Sohail, Abid Riaz, Anum Hassan, Rohail
spellingShingle	Bhatti, Muhammad Shahid Ullah, Azmat Latip, Rohaya Sohail, Abid Riaz, Anum Hassan, Rohail Benchmarking performance of document level classification and topic modeling
author_facet	Bhatti, Muhammad Shahid Ullah, Azmat Latip, Rohaya Sohail, Abid Riaz, Anum Hassan, Rohail
author_sort	Bhatti, Muhammad Shahid
title	Benchmarking performance of document level classification and topic modeling
title_short	Benchmarking performance of document level classification and topic modeling
title_full	Benchmarking performance of document level classification and topic modeling
title_fullStr	Benchmarking performance of document level classification and topic modeling
title_full_unstemmed	Benchmarking performance of document level classification and topic modeling
title_sort	benchmarking performance of document level classification and topic modeling
publisher	1546-2218; ESSN: 1546-2226
publishDate	2021
url	http://psasir.upm.edu.my/id/eprint/96189/ https://www.techscience.com/cmc/v71n1/45375
_version_	1756685781631500288
score	13.211869
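
The description above mentions a TFIDF matrix combined with cosine similarity for identifying similar documents. The following is a minimal, standard-library sketch of that general idea, not the authors' implementation. The whitespace tokenizer, the smoothed IDF formula, the function names, and the English placeholder documents are all assumptions made for illustration; the paper's Urdu preprocessing and exact weighting are not given in this record.

```python
import math
from collections import Counter

# Hypothetical placeholder documents (the record does not include any data).
SAMPLE_DOCS = [
    "cricket team wins match",
    "cricket match delayed rain",
    "budget tax economy report",
]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per whitespace-tokenized document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed smoothed IDF variant: log((1 + N) / (1 + df)) + 1.
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(docs, query_index):
    """Return (index, score) of the document closest to docs[query_index]."""
    vectors = tfidf_vectors(docs)
    scores = [
        (i, cosine_similarity(vectors[query_index], vec))
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    return max(scores, key=lambda pair: pair[1])


if __name__ == "__main__":
    index, score = most_similar(SAMPLE_DOCS, 0)
    print(f"Closest to document 0: document {index} (cosine {score:.3f})")
```

Documents that share no terms score exactly zero under this measure, which is one reason the description pairs it with a FastText model for word-level semantics.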
