Exploring Canonical Data Model for Text Clustering (S/O 12828)

The abundance of text data has been witnessed with the growth of the web and other text repositories. There is an important need to provide improved mechanisms to effectively represent and retrieve text data. This paper advocates the construction of canonical data models for mapping contents of multi do...

Bibliographic Details
Main Authors: Kamaruddin, Siti Sakira, Yusof, Yuhanis
Format: Monograph
Language: English
Published: UUM
Subjects: T Technology (General)
Online Access: https://repo.uum.edu.my/id/eprint/31505/1/12828.pdf
https://repo.uum.edu.my/id/eprint/31505/
id my.uum.repo.31505
record_format eprints
spelling my.uum.repo.31505 2024-11-18T08:46:44Z https://repo.uum.edu.my/id/eprint/31505/ Exploring Canonical Data Model for Text Clustering (S/O 12828) Kamaruddin, Siti Sakira Yusof, Yuhanis T Technology (General) UUM Monograph NonPeerReviewed application/pdf en https://repo.uum.edu.my/id/eprint/31505/1/12828.pdf Kamaruddin, Siti Sakira and Yusof, Yuhanis Exploring Canonical Data Model for Text Clustering (S/O 12828). Project Report. UUM. (Submitted)
institution Universiti Utara Malaysia
building UUM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Utara Malaysia
content_source UUM Institutional Repository
url_provider http://repo.uum.edu.my/
language English
topic T Technology (General)
description The growth of the web and other text repositories has produced an abundance of text data, creating a pressing need for improved mechanisms to represent and retrieve it effectively. This paper advocates constructing canonical data models that map the contents of multiple documents into a few general models representing the corpus. However, constructing a canonical data model for text requires non-trivial text mining techniques before the construction itself can begin. Furthermore, constructing canonical data models for all terms in a set of documents would be costly and would not reduce the sparsity problem associated with text document processing. To address this problem, we propose a two-tier dimensionality reduction step that adopts commonly used feature extraction and feature selection methods. The reduced features are then used to construct a canonical data model. A canonical data model for text documents can serve as a general reference model for text comparison in a wide variety of text mining tasks, such as text clustering, text classification, text summarization, and text deviation detection. Experimental results show that the proposed approach produces better results than methods that do not use a canonical data model.
format Monograph
author Kamaruddin, Siti Sakira
Yusof, Yuhanis
title Exploring Canonical Data Model for Text Clustering (S/O 12828)
publisher UUM
url https://repo.uum.edu.my/id/eprint/31505/1/12828.pdf
https://repo.uum.edu.my/id/eprint/31505/
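The pipeline outlined in the description (feature extraction, then feature selection, then construction of a canonical model used as a reference for text comparison) can be illustrated with a minimal Python sketch. The record does not say which extraction or selection methods the authors used, nor what form their canonical data model takes, so everything here is an assumption made for illustration: stop-word filtering stands in for feature extraction, a document-frequency threshold stands in for feature selection, and a per-group mean vector stands in for the canonical model. All function names, the stop-word list, and the toy documents are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not taken from the report).
STOP_WORDS = {"a", "and", "by", "for", "from", "of", "the", "to"}


def extract_features(text):
    """Tier 1 (stand-in for feature extraction): lowercase tokens minus stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def select_features(token_lists, min_df=2):
    """Tier 2 (stand-in for feature selection): keep terms found in >= min_df documents."""
    df = Counter(term for tokens in token_lists for term in set(tokens))
    return sorted(term for term, count in df.items() if count >= min_df)


def vectorise(tokens, vocabulary):
    """Term-frequency vector restricted to the selected vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def canonical_model(vectors):
    """Stand-in canonical model: component-wise mean of a group's vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_model(vector, models):
    """Label of the canonical model most similar to the given vector."""
    return max(models, key=lambda label: cosine(vector, models[label]))


# Hypothetical toy corpus, already split into two groups for illustration.
groups = {
    "clustering": [
        "text clustering groups similar documents",
        "clustering of text documents by similarity",
    ],
    "retrieval": [
        "query retrieval ranks web pages",
        "web search retrieval for a query",
    ],
}

token_lists = {
    label: [extract_features(doc) for doc in docs] for label, docs in groups.items()
}
vocabulary = select_features(
    [tokens for lists in token_lists.values() for tokens in lists]
)
models = {
    label: canonical_model([vectorise(tokens, vocabulary) for tokens in lists])
    for label, lists in token_lists.items()
}

# Compare an unseen document against each canonical model.
query = vectorise(extract_features("clustering text documents from the web"), vocabulary)
best = nearest_model(query, models)
print(vocabulary)
print(best)
```

On this toy data the two tiers shrink the vocabulary to six terms, and the unseen document is assigned to the "clustering" model. The sketch only shows how a reduced feature set and a small number of reference models fit together; it makes no claim about the results reported in the monograph.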