USING LATENT SEMANTIC INDEXING FOR DOCUMENT CLUSTERING

Documents with varied contents are easily obtained from URLs associated with their titles. However, a title may not describe the content of its document and may serve only to attract readers to buy and read it. Clustering documents by category is therefore important to...

Bibliographic Details
Main Author: MUFLIKHAH, LAILIL
Format: Thesis
Language: English
Published: 2010
Online Access: http://utpedia.utp.edu.my/2901/1/Thesis_Lailil%28G00639%29.pdf
http://utpedia.utp.edu.my/2901/
id my-utp-utpedia.2901
record_format eprints
spelling my-utp-utpedia.2901 2017-01-25T09:43:22Z http://utpedia.utp.edu.my/2901/ 2010 Thesis NonPeerReviewed application/pdf en http://utpedia.utp.edu.my/2901/1/Thesis_Lailil%28G00639%29.pdf MUFLIKHAH, LAILIL (2010) USING LATENT SEMANTIC INDEXING FOR DOCUMENT CLUSTERING. Masters thesis, UNIVERSITI TEKNOLOGI PETRONAS.
institution Universiti Teknologi Petronas
building UTP Resource Centre
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Teknologi Petronas
content_source UTP Electronic and Digitized Intellectual Asset
url_provider http://utpedia.utp.edu.my/
language English
description Documents with varied contents are easily obtained from URLs associated with their titles. However, a title may not describe the content of its document and may serve only to attract readers to buy and read it. Clustering documents by category is therefore important in helping users retrieve the information they need. Document clustering is a data mining task: using a similarity measure over document characteristics, documents can be grouped by category or topic. Representing every substantial word in the vector space model makes the document representation high-dimensional. This is one of the problems in document clustering, and it degrades cluster quality as measured by F-measure, entropy, and accuracy. In the categorical domain, much research has been conducted to reduce the dimension of the term-document matrix representation, including keyword-based approaches. However, these approaches yield low accuracy on document collections with varying class sizes. This research therefore aims to improve the quality and accuracy of document clustering by using a method from information retrieval, Latent Semantic Indexing (LSI), to reduce the dimension of the term-document matrix used for document representation. In this work, LSI is used to produce patterns of terms so that documents can be mapped into a concept space. Based on this new representation, the documents are then clustered with the Fuzzy c-Means algorithm, in which cosine similarity is embedded as the distance measure. The results are compared with existing algorithms used as benchmarks. They show that the proposed method produces high-quality clusters and is superior to other fuzzy clustering algorithms for the categorical domain, i.e. FCCM, FSKWIC, and Fuzzy CoDoK, with an accuracy rate of over 90%.
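The pipeline summarized in the description (term-document matrix, LSI projection into a concept space, then Fuzzy c-Means with a cosine-based distance) can be sketched roughly as follows. This is only an illustrative sketch and not the thesis implementation: the toy matrix, the function names lsi_project and fuzzy_c_means_cosine, the choice of k = 2 concepts, the fuzzifier m = 2.0, and the use of normalized weighted means as cluster centers are all assumptions made for the example.

```python
import numpy as np


def lsi_project(term_doc, k):
    """Map documents into a k-dimensional concept space via truncated SVD.

    term_doc has shape (n_terms, n_docs); each column is one document.
    Returns an array of shape (n_docs, k).
    """
    _, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Document coordinates: top-k right singular vectors scaled by singular values.
    return vt[:k, :].T * s[:k]


def _unit_rows(x):
    """Scale each row to unit length (guarding against zero rows)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)


def fuzzy_c_means_cosine(docs, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Fuzzy c-means using cosine distance (1 - cosine similarity).

    Assumed variant: rows are unit-normalized, and each center is the
    normalized membership-weighted mean of the rows.
    Returns (memberships of shape (n_docs, c), centers of shape (c, k)).
    """
    rng = np.random.default_rng(seed)
    x = _unit_rows(docs)
    # Random initial memberships; each row sums to 1.
    u = rng.random((x.shape[0], c))
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        weights = u ** m
        centers = _unit_rows(weights.T @ x)
        # Cosine distance between every document and every center.
        dist = np.clip(1.0 - x @ centers.T, 1e-12, None)
        # Standard FCM membership update with dist as the dissimilarity.
        inv = dist ** (-1.0 / (m - 1.0))
        new_u = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(new_u - u).max() < tol
        u = new_u
        if converged:
            break
    return u, centers


# Hypothetical toy term-document matrix: rows are terms, columns are documents.
# Documents 0-2 mostly use terms 0-2; documents 3-5 mostly use terms 3-5.
term_doc = np.array([
    [2, 1, 2, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [1, 1, 2, 0, 0, 1],
    [0, 0, 0, 2, 1, 2],
    [0, 0, 0, 1, 2, 1],
    [0, 1, 0, 1, 1, 2],
], dtype=float)

concept_docs = lsi_project(term_doc, k=2)
memberships, centers = fuzzy_c_means_cosine(concept_docs, c=2)
labels = memberships.argmax(axis=1)
print(labels)
```

Each row of memberships gives one document's degree of membership in every cluster, so a hard assignment is obtained by taking the largest entry per row. Which cluster receives index 0 depends on the random initialization.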
format Thesis
author MUFLIKHAH, LAILIL
title USING LATENT SEMANTIC INDEXING FOR DOCUMENT CLUSTERING
publishDate 2010
url http://utpedia.utp.edu.my/2901/1/Thesis_Lailil%28G00639%29.pdf
http://utpedia.utp.edu.my/2901/