Web based cross language semantic plagiarism detection

Cross-language and semantic plagiarism have recently been on the rise, and many plagiarism detection tools cannot detect such cases. In this research, we propose a new framework that combines summarisation with cross-language and semantic plagiarism detection. We consider Bahasa Melayu as...

Bibliographic Details
Main Author: Chow, Kok Kent
Format: Thesis
Published: 2013
Subjects: QA76 Computer software
Online Access: http://eprints.utm.my/id/eprint/42237/
http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:77839
id my.utm.42237
record_format eprints
spelling my.utm.42237 2020-08-19T08:11:49Z http://eprints.utm.my/id/eprint/42237/ Web based cross language semantic plagiarism detection Chow, Kok Kent QA76 Computer software 2013 Thesis NonPeerReviewed Chow, Kok Kent (2013) Web based cross language semantic plagiarism detection. Masters thesis, Universiti Teknologi Malaysia, Faculty of Computing. http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:77839
institution Universiti Teknologi Malaysia
building UTM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Teknologi Malaysia
content_source UTM Institutional Repository
url_provider http://eprints.utm.my/
topic QA76 Computer software
spellingShingle QA76 Computer software
Chow, Kok Kent
Web based cross language semantic plagiarism detection
description Cross-language and semantic plagiarism have recently been on the rise, and many plagiarism detection tools cannot detect such cases. In this research, we propose a new framework that combines summarisation with cross-language and semantic plagiarism detection. We take Bahasa Melayu as the input language of the submitted document and English as the language of the possibly plagiarised documents. In this framework, we shorten the query document using a fuzzy swarm-based summarisation approach. With this approach, sentences are chosen according to their importance level, which is determined by five predefined sentence features integrated with fuzzy logic. This technique was chosen for its effectiveness in previous research. The summarised input documents are translated into English using the Google Translate Application Programming Interface (API) before the words are stemmed and the stop words are removed. The tokenized documents are sent to the Google AJAX Search API to find similar documents on the World Wide Web. We combine the Stanford Parser and WordNet to determine the level of semantic similarity between the suspected documents and the candidate source documents. The Stanford Parser assigns each term in a sentence its corresponding role, such as noun, verb or adjective. Based on these roles, we represent each sentence in predicate form, and similarity is measured on those predicates using information content values from the WordNet taxonomy. The testing dataset is built from two sets of Malay documents produced according to different plagiarism practices.
The results show that our proposed semantic-based similarity measurement achieves higher precision, recall and F-measure than the conventional Longest Common Subsequence (LCS) approach, which determines similarity between sentences from their longest common subsequence, read from left to right, regardless of whether the matched terms are consecutive.
format Thesis
author Chow, Kok Kent
author_facet Chow, Kok Kent
author_sort Chow, Kok Kent
title Web based cross language semantic plagiarism detection
title_short Web based cross language semantic plagiarism detection
title_full Web based cross language semantic plagiarism detection
title_fullStr Web based cross language semantic plagiarism detection
title_full_unstemmed Web based cross language semantic plagiarism detection
title_sort web based cross language semantic plagiarism detection
publishDate 2013
url http://eprints.utm.my/id/eprint/42237/
http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:77839
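
The description in this record names Longest Common Subsequence (LCS) as the conventional baseline for sentence similarity. As a rough illustration only, the sketch below computes a token-level LCS between two sentences and turns it into a score between 0 and 1. The function names, the whitespace tokenisation, and the normalisation 2 * LCS / (len1 + len2) are assumptions made for this example; the record does not say how the thesis tokenises sentences or normalises the LCS length.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists.

    Matched tokens must appear in the same left-to-right order in both
    lists, but they do not have to be adjacent.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(sentence_1, sentence_2):
    """Hypothetical normalised score in [0, 1] (assumed normalisation)."""
    tokens_1 = sentence_1.lower().split()  # assumed: plain whitespace tokens
    tokens_2 = sentence_2.lower().split()
    if not tokens_1 or not tokens_2:
        return 0.0
    common = lcs_length(tokens_1, tokens_2)
    return 2 * common / (len(tokens_1) + len(tokens_2))


if __name__ == "__main__":
    print(lcs_similarity("the cat sat on the mat", "the cat lay on a mat"))
```

Because the score depends only on shared surface tokens, a paraphrase that swaps in synonyms scores low even when the meaning is unchanged, which is the kind of case the semantic, WordNet-based measure described in the record is meant to address.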