A comparative study in classification techniques for unsupervised record linkage model

Problem statement: Record linkage is a technique used to detect and match duplicate records generated during data integration. A variety of record linkage algorithms with different steps have been developed to detect such duplicate records. To find out whether two rec...

Bibliographic Details
Main Authors: Ektefa, Mohammadreza; Sidi, Fatimah; Ibrahim, Hamidah; A. Jabar, Marzanah; Memar, Sara
Format: Article
Language: English
Published: Science Publications 2011
Online Access: http://psasir.upm.edu.my/id/eprint/22451/1/jcssp.2011.341.347.pdf
http://psasir.upm.edu.my/id/eprint/22451/
http://thescipub.com/html/10.3844/jcssp.2011.341.347
id my.upm.eprints.22451
record_format eprints
spelling my.upm.eprints.22451 2016-06-08T08:32:21Z Science Publications 2011 Article PeerReviewed application/pdf en
Ektefa, Mohammadreza and Sidi, Fatimah and Ibrahim, Hamidah and A. Jabar, Marzanah and Memar, Sara (2011) A comparative study in classification techniques for unsupervised record linkage model. Journal of Computer Science, 7 (3). pp. 341-347. ISSN 1549-3636; ESSN: 1552-6607
doi:10.3844/jcssp.2011.341.347
institution Universiti Putra Malaysia
building UPM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Putra Malaysia
content_source UPM Institutional Repository
url_provider http://psasir.upm.edu.my/
language English
description Problem statement: Record linkage is a technique used to detect and match duplicate records generated during data integration. A variety of record linkage algorithms with different steps have been developed to detect such duplicates. To decide whether two records are duplicates, different studies use supervised or unsupervised classification techniques. To use supervised classification algorithms without spending a lot of time labeling data manually, previous studies proposed a two-step method that selects the training data automatically. However, the effectiveness of the different classification techniques must be taken into account for a record linkage system to classify records more accurately. Approach: To determine and compare the effectiveness of different supervised classification techniques used in an unsupervised manner, several prominent classification methods are applied to duplicate record detection. Support Vector Machines, Naïve Bayes, Decision Tree and Bayesian Networks are used to detect and classify duplicate records in two real-world datasets, Cora and Restaurant. Results: The experimental results show that Support Vector Machines performs best on the Restaurant dataset, with an F-measure of 96.27%, while Naïve Bayes performs best on the Cora dataset, with an F-measure of 89.7%. Conclusion/Recommendation: The results of detecting duplicate records with different classification techniques vary with the dataset used. Moreover, Support Vector Machines and Naïve Bayes outperform the other methods in our experiments.
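The two-step idea in the description (select training pairs automatically, then train a supervised classifier on them) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the similarity thresholds (0.9 and 0.3) and the use of mean field similarity for seed selection are all hypothetical, and a simple nearest-centroid rule stands in for the classifiers compared in the article (Support Vector Machines, Naïve Bayes, Decision Tree, Bayesian Networks) to keep the example dependency-free. The F-measure helper uses the standard definition, the harmonic mean of precision and recall.

```python
from difflib import SequenceMatcher


def similarity_vector(record_a, record_b):
    """Per-field string similarity in [0, 1] for two records (tuples of strings)."""
    return [
        SequenceMatcher(None, x.lower(), y.lower()).ratio()
        for x, y in zip(record_a, record_b)
    ]


def select_seeds(vectors, high=0.9, low=0.3):
    """Step 1: pick pairs with extreme mean similarity as automatic training data.

    The thresholds are illustrative assumptions, not values from the article.
    """
    matches, non_matches = [], []
    for v in vectors:
        mean = sum(v) / len(v)
        if mean >= high:
            matches.append(v)
        elif mean <= low:
            non_matches.append(v)
    return matches, non_matches


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def classify(vector, match_centroid, non_match_centroid):
    """Step 2: label a pair as a match if it lies nearer the match centroid.

    Nearest-centroid is a stand-in classifier chosen for brevity.
    """
    def squared_distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    return (squared_distance(vector, match_centroid)
            <= squared_distance(vector, non_match_centroid))


def f_measure(predicted, actual):
    """Harmonic mean of precision and recall over boolean match labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In use, one would build a similarity vector for every candidate record pair, call select_seeds to obtain the automatically labeled pairs, compute one centroid per class, and then classify the remaining pairs. With labeled data such as Cora or Restaurant, f_measure compares the predicted labels with the true ones.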
format Article
author Ektefa, Mohammadreza
Sidi, Fatimah
Ibrahim, Hamidah
A. Jabar, Marzanah
Memar, Sara
title A comparative study in classification techniques for unsupervised record linkage model
publisher Science Publications
publishDate 2011
url http://psasir.upm.edu.my/id/eprint/22451/1/jcssp.2011.341.347.pdf
http://psasir.upm.edu.my/id/eprint/22451/
http://thescipub.com/html/10.3844/jcssp.2011.341.347