Bilingual Extractive Text Summarization Model using Textual Pattern Constraints

In the era of digital information, an auto-generated summary can help readers to easily find important and relevant information. Most of the studies and benchmark data sets in the field of text summarization are in English. Hence, there is a need to study the potential of Malay language in this fiel...

Bibliographic Details
Main Authors: Suraya Alias, Mohd Shamrie Sainin, Siti Khaotijah Mohammad
Format: Article
Language: English
Published: GEMA Online 2020
Subjects: T Technology (General)
Online Access: https://eprints.ums.edu.my/id/eprint/26542/1/Bilingual%20Extractive%20Text%20Summarization%20Model%20using%20Textual%20Pattern%20Constraints%20.pdf
https://eprints.ums.edu.my/id/eprint/26542/2/Bilingual%20Extractive%20Text%20Summarization%20Model%20using%20Textual%20Pattern%20Constraints%201.pdf
https://eprints.ums.edu.my/id/eprint/26542/
http://doi.org/10.17576/gema-2020-2003-05
id my.ums.eprints.26542
record_format eprints
spelling my.ums.eprints.26542 2020-12-21T08:49:13Z https://eprints.ums.edu.my/id/eprint/26542/ T Technology (General) GEMA Online 2020 Article PeerReviewed Suraya Alias and Mohd Shamrie Sainin and Siti Khaotijah Mohammad (2020) Bilingual Extractive Text Summarization Model using Textual Pattern Constraints. Journal of Language Studies, 20 (3). pp. 70-95. ISSN 2550-2131 http://doi.org/10.17576/gema-2020-2003-05
institution Universiti Malaysia Sabah
building UMS Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Malaysia Sabah
content_source UMS Institutional Repository
url_provider http://eprints.ums.edu.my/
language English
topic T Technology (General)
description In the era of digital information, an automatically generated summary can help readers quickly find important and relevant information. Most studies and benchmark data sets in text summarization are in English, so there is a need to study the potential of the Malay language in this field. This study also highlights the problems of identifying and generating important information in extractive summaries. These problems arise because existing text representation models have known weaknesses: the BOW model represents semantics inaccurately, while the N-gram model produces very high-dimensional word vectors. In this study, a bilingual text summarization model named MYTextSumBASIC was developed to automatically generate extractive summaries in Malay and English. The MYTextSumBASIC summarizer applies a text representation model known as FASP using three Textual Pattern Constraints, namely word item constraints, adjacent word constraints and sequence size constraints. The MYTextSumBASIC framework has three main phases: the development of the Malay language corpus, the development of the MYTextSumBASIC model using FASP, and summary evaluation. In the summary evaluation phase, on a Malay language data set of 100 news articles, the summaries produced by MYTextSumBASIC outperformed those generated by the Baseline (Lead) and OTS summarizers, with highest average scores of 0.5849 for recall (R), 0.5736 for precision (P) and 0.5772 for F-score (Fm). In manual evaluation by linguists, MYTextSumBASIC scored 4.1 for readability and 3.87 for summary content on summaries generated using a random data set. Further experiments on the DUC 2002 English benchmark data set of 102 news articles also showed that the MYTextSumBASIC model outperformed the best and lowest systems in the comparison, with mean recall values of 0.43896 for ROUGE-1 and 0.19918 for ROUGE-2. These findings indicate that the FASP text representation, together with the textual pattern constraints used by the model, can be applied to bilingual text with performance competitive with other text summarization models.
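The R, P and F-score figures quoted in the description are overlap measures between a system summary and a reference summary, reading R as recall. As a rough illustration only, the Python sketch below computes them from clipped n-gram counts in the style of ROUGE-N. The function names `ngrams` and `overlap_scores`, the whitespace tokenizer and the example sentences are assumptions made for this sketch; they are not taken from the article or from the official ROUGE toolkit, which applies further preprocessing.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_scores(candidate, reference, n=1):
    """Recall, precision and F-score from clipped n-gram overlap.

    Hypothetical helper: lowercases and splits on whitespace, which is
    a simplification of what a real evaluation toolkit would do.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Clipped overlap: an n-gram counts at most as often as it occurs in both.
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return recall, precision, f_score


# Made-up example sentences, not data from the study.
reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
r1, p1, f1 = overlap_scores(candidate, reference, n=1)  # unigram overlap
r2, p2, f2 = overlap_scores(candidate, reference, n=2)  # bigram overlap
print(f"n=1: R={r1:.4f} P={p1:.4f} F={f1:.4f}")
print(f"n=2: R={r2:.4f} P={p2:.4f} F={f2:.4f}")
```

With these two sentences, five of the six reference unigrams and three of the five reference bigrams are matched, so the bigram scores are lower than the unigram scores, the same direction as the ROUGE-1 and ROUGE-2 values reported in the description.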
format Article
author Suraya Alias
Mohd Shamrie Sainin
Siti Khaotijah Mohammad
title Bilingual Extractive Text Summarization Model using Textual Pattern Constraints
publisher GEMA Online
publishDate 2020