Staff View: Constructing an academic Thai plagiarism corpus for benchmarking plagiarism detection systems

Constructing an academic Thai plagiarism corpus for benchmarking plagiarism detection systems

Plagiarism is a major problem in the academic world. It not only undermines the credibility of educational institutions but also disrupts the process of creating knowledge in the academic community. To lessen this problem, many plagiarism detection systems have been developed to detect p...

Bibliographic Details
Main Authors:	Supawat Taerungruang, Wirote Aroonmanakun
Format:	Article
Language:	English
Published:	Penerbit Universiti Kebangsaan Malaysia 2018
Online Access:	http://journalarticle.ukm.my/17615/1/23578-82995-1-PB.pdf http://journalarticle.ukm.my/17615/ https://ejournal.ukm.my/gema/issue/view/1098

id	my-ukm.journal.17615
record_format	eprints
spelling	my-ukm.journal.17615 2021-11-22T06:21:11Z http://journalarticle.ukm.my/17615/ Constructing an academic Thai plagiarism corpus for benchmarking plagiarism detection systems Supawat Taerungruang, Wirote Aroonmanakun Plagiarism is a major problem in the academic world. It not only undermines the credibility of educational institutions but also disrupts the process of creating knowledge in the academic community. To lessen this problem, many plagiarism detection systems have been developed to detect plagiarized texts in academic works. In this paper, we describe the design and construction of an academic Thai plagiarism corpus. This corpus is necessary for training and testing plagiarism detection systems for Thai. To make the corpus a comprehensive representation of plagiarism, the data are divided into types based on the degree of the linguistic mechanisms used in plagiarism. The data in the corpus come from two sources: texts manually created by participants and texts automatically generated by a program. After the corpus is created, its validity is verified using three measurements: the similarity between suspicious texts at the character level, the similarity between suspicious texts at the word level, and a comparison of the different types of data in the corpus based on the measured similarity. The results of the analyses indicate that the corpus created by the proposed methods is effective for training and testing plagiarism detection systems. Penerbit Universiti Kebangsaan Malaysia 2018-08 Article PeerReviewed application/pdf en http://journalarticle.ukm.my/17615/1/23578-82995-1-PB.pdf Supawat Taerungruang and Wirote Aroonmanakun (2018) Constructing an academic Thai plagiarism corpus for benchmarking plagiarism detection systems. GEMA Online Journal of Language Studies, 18 (3). pp. 186-202. ISSN 1675-8021 https://ejournal.ukm.my/gema/issue/view/1098
institution	Universiti Kebangsaan Malaysia
building	Tun Sri Lanang Library
collection	Institutional Repository
continent	Asia
country	Malaysia
content_provider	Universiti Kebangsaan Malaysia
content_source	UKM Journal Article Repository
url_provider	http://journalarticle.ukm.my/
language	English
description	Plagiarism is a major problem in the academic world. It not only undermines the credibility of educational institutions but also disrupts the process of creating knowledge in the academic community. To lessen this problem, many plagiarism detection systems have been developed to detect plagiarized texts in academic works. In this paper, we describe the design and construction of an academic Thai plagiarism corpus. This corpus is necessary for training and testing plagiarism detection systems for Thai. To make the corpus a comprehensive representation of plagiarism, the data are divided into types based on the degree of the linguistic mechanisms used in plagiarism. The data in the corpus come from two sources: texts manually created by participants and texts automatically generated by a program. After the corpus is created, its validity is verified using three measurements: the similarity between suspicious texts at the character level, the similarity between suspicious texts at the word level, and a comparison of the different types of data in the corpus based on the measured similarity. The results of the analyses indicate that the corpus created by the proposed methods is effective for training and testing plagiarism detection systems.
format	Article
author	Supawat Taerungruang, Wirote Aroonmanakun
spellingShingle	Supawat Taerungruang, Wirote Aroonmanakun, Constructing an academic Thai plagiarism corpus for benchmarking plagiarism detection systems
author_facet	Supawat Taerungruang, Wirote Aroonmanakun
author_sort	Supawat Taerungruang
title	Constructing an academic Thai plagiarism corpus for benchmarking plagiarism detection systems
title_short	Constructing an academic Thai plagiarism corpus for benchmarking plagiarism detection systems
title_full	Constructing an academic Thai plagiarism corpus for benchmarking plagiarism detection systems
title_fullStr	Constructing an academic Thai plagiarism corpus for benchmarking plagiarism detection systems
title_full_unstemmed	Constructing an academic Thai plagiarism corpus for benchmarking plagiarism detection systems
title_sort	constructing an academic thai plagiarism corpus for benchmarking plagiarism detection systems
publisher	Penerbit Universiti Kebangsaan Malaysia
publishDate	2018
url	http://journalarticle.ukm.my/17615/1/23578-82995-1-PB.pdf http://journalarticle.ukm.my/17615/ https://ejournal.ukm.my/gema/issue/view/1098
_version_	1718927136527482880
score	13.160551
