Lexical paraphrase extraction with multiple semantic information

Natural language processing (NLP) refers to the interaction that happens between humans and computers where computers try to understand and make sense of human languages. However, human beings tend to express similar meanings using sentences with different structures or different surface wordings. D...

Bibliographic Details
Main Author: Ho, Chuk Fong
Format: Thesis
Language: English
Published: 2012
Online Access: http://psasir.upm.edu.my/id/eprint/30924/1/FSKTM%202012%202R.pdf
http://psasir.upm.edu.my/id/eprint/30924/
id my.upm.eprints.30924
record_format eprints
institution Universiti Putra Malaysia
building UPM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Putra Malaysia
content_source UPM Institutional Repository
url_provider http://psasir.upm.edu.my/
language English
description Natural language processing (NLP) is concerned with the interaction between humans and computers, in which computers try to understand and make sense of human languages. However, people tend to express similar meanings using sentences with different structures or different surface wordings. This phenomenon, called variability, makes NLP a difficult task. Because paraphrases are different words, phrases or sentences that express the same or almost the same meaning, a variety of paraphrase extraction methods have been proposed in the belief that paraphrases can be used to capture this variability. In general, paraphrase extraction methods can be categorized as corpus-based or knowledge-based. A corpus-based method depends on syntax information (rules that govern the arrangement of words to form phrases in sentences), while a knowledge-based method depends on semantic information (semantics being the study of meaning). However, previous studies have shown that depending on syntax information alone can result in antonyms and barely related or unrelated words being mistakenly extracted as paraphrases. Semantics, on the other hand, is a complex study of meanings. Therefore, extracting paraphrases based only on shallow semantic information or on a single instance of it, such as synonyms or semantic relations, would be ineffective, as it is no different from solving a complex problem with incomplete information. The main purpose of this thesis is to propose a new model, called Multilayer Semantic-based Validation Paraphrase Extraction (MSVPE), which relies on different types of semantic information. In particular, MSVPE collects paraphrase candidates from instances of lexical resources. Then, it validates the candidates using a word similarity method, a sentence similarity method and a domain matching technique, which correspond to the use of semantic relations, definitions and domains respectively.
However, existing sentence similarity methods and word similarity methods have some flaws. In particular, sentence similarity methods determine the semantic similarity between sentences from an incorrect interpretation of the meaning of each sentence and from incomplete information. Word similarity methods, on the other hand, derive the semantic similarity between words from multiple features that have not been processed and combined properly. Consequently, the similarity judgments they produce are not reliable. To address these problems, the thesis also proposes: 1) a new sentence similarity method (SSMv1) that compares the actual meaning of each sentence, 2) another sentence similarity method (SSMv2) that takes multiple pieces of information into consideration, and 3) a new word similarity method (WSM) that makes use of optimally processed and combined features. To evaluate MSVPE, SSMv1, SSMv2 and WSM, four experiments were conducted on three data sets. SSMv1, SSMv2 and WSM were tested on two standard data sets, consisting of 30 pairs of definitions and 65 pairs of nouns respectively, which range from highly synonymous to semantically unrelated and have been widely used for evaluation purposes. In contrast, MSVPE was tested on a data set created in this study, which consists of 85 words and 56 sentences. Experimental results showed that MSVPE extracts paraphrases more effectively than the two benchmarks based solely on syntax information. This is probably because semantic information is more closely related to meaning than syntax information is. Results further showed that MSVPE with multiple instances of semantic information outperforms MSVPE with only a single instance of semantic information. Although the different types of semantic information vary in effectiveness, they are complementary.
Experimental results also showed that SSMv1, SSMv2 and WSM significantly outperform all of their benchmarks, indicating that they better simulate human inference. The reason is that SSMv1 correctly interprets the meaning of each sentence, while SSMv2 makes use of complementary information. WSM, on the other hand, combines an optimized transformation of different types of features with an optimized combination of them, which is described as the nearest replica of human thinking behavior.
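The layered validation described in the abstract (candidates collected from lexical resources, then checked by word similarity, sentence similarity and domain matching) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy lexicon, the candidate list, the Jaccard-overlap scoring functions, the thresholds and all names are hypothetical stand-ins, not the resources or the WSM, SSMv1 and SSMv2 methods of the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Toy lexical-resource entry (hypothetical, for illustration only)."""
    definition: str      # used by the sentence similarity layer
    domain: str          # used by the domain matching layer
    related: frozenset   # words linked by semantic relations


# Hypothetical lexicon; the thesis's actual lexical resources are not named here.
LEXICON = {
    "car": Entry("a road vehicle with an engine and four wheels",
                 "transport", frozenset({"automobile", "vehicle"})),
    "automobile": Entry("a road vehicle with an engine and four wheels",
                        "transport", frozenset({"car", "vehicle"})),
    "train": Entry("a series of connected railway carriages",
                   "transport", frozenset({"vehicle"})),
    "cart": Entry("a small open wagon pulled by a horse",
                  "farming", frozenset({"wagon"})),
}

# Hypothetical output of the candidate collection step.
CANDIDATES = {"car": ["automobile", "train", "cart"]}


def jaccard(a, b):
    """Set overlap in [0, 1]; a simple stand-in for a real similarity measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def word_similarity(w1, w2):
    """Layer 1 stand-in: overlap of each word's semantic-relation neighbourhood."""
    return jaccard(LEXICON[w1].related | {w1}, LEXICON[w2].related | {w2})


def sentence_similarity(s1, s2):
    """Layer 2 stand-in: token overlap between two definitions."""
    return jaccard(set(s1.split()), set(s2.split()))


def same_domain(w1, w2):
    """Layer 3 stand-in: exact match of domain labels."""
    return LEXICON[w1].domain == LEXICON[w2].domain


def extract_paraphrases(word, word_threshold=0.5, sentence_threshold=0.5):
    """Keep only the candidates that pass all three validation layers."""
    accepted = []
    for cand in CANDIDATES.get(word, []):
        if word_similarity(word, cand) < word_threshold:
            continue
        if sentence_similarity(LEXICON[word].definition,
                               LEXICON[cand].definition) < sentence_threshold:
            continue
        if not same_domain(word, cand):
            continue
        accepted.append(cand)
    return accepted


if __name__ == "__main__":
    print(extract_paraphrases("car"))
```

In this toy setup, lowering both thresholds to zero leaves only the domain layer active, which is one way to see how a single layer of semantic information filters less strictly than the three layers combined.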
format Thesis
author Ho, Chuk Fong
title Lexical paraphrase extraction with multiple semantic information
publishDate 2012
url http://psasir.upm.edu.my/id/eprint/30924/1/FSKTM%202012%202R.pdf
http://psasir.upm.edu.my/id/eprint/30924/
spelling my.upm.eprints.30924 2015-02-05T09:50:19Z http://psasir.upm.edu.my/id/eprint/30924/ Lexical paraphrase extraction with multiple semantic information Ho, Chuk Fong 2012-08 Thesis NonPeerReviewed application/pdf en http://psasir.upm.edu.my/id/eprint/30924/1/FSKTM%202012%202R.pdf Ho, Chuk Fong (2012) Lexical paraphrase extraction with multiple semantic information. PhD thesis, Universiti Putra Malaysia.