Staff View: Statistical data preprocessing methods in distance functions to enhance k-means clustering algorithm

Statistical data preprocessing methods in distance functions to enhance k-means clustering algorithm

Clustering is an unsupervised classification method whose main aim is partitioning: objects in the same cluster are similar, while objects belonging to different clusters differ significantly with respect to their attributes. The K-Means algorithm is the most common technique, and a fast one, in partitiona...

Bibliographic Details
Main Author:	Dalatu, Paul Inuwa
Format:	Thesis
Language:	English
Published:	2018
Online Access:	http://psasir.upm.edu.my/id/eprint/68681/1/FS%202018%2026%20-%20IR.pdf http://psasir.upm.edu.my/id/eprint/68681/

id	my.upm.eprints.68681
record_format	eprints
spelling	my.upm.eprints.68681 2019-05-28T02:45:09Z http://psasir.upm.edu.my/id/eprint/68681/ Statistical data preprocessing methods in distance functions to enhance k-means clustering algorithm Dalatu, Paul Inuwa 2018-01 Thesis NonPeerReviewed text en http://psasir.upm.edu.my/id/eprint/68681/1/FS%202018%2026%20-%20IR.pdf Dalatu, Paul Inuwa (2018) Statistical data preprocessing methods in distance functions to enhance k-means clustering algorithm. PhD thesis, Universiti Putra Malaysia.
institution	Universiti Putra Malaysia
building	UPM Library
collection	Institutional Repository
continent	Asia
country	Malaysia
content_provider	Universiti Putra Malaysia
content_source	UPM Institutional Repository
url_provider	http://psasir.upm.edu.my/
language	English
description	Clustering is an unsupervised classification method whose main aim is partitioning: objects in the same cluster are similar, while objects belonging to different clusters differ significantly with respect to their attributes. The K-Means algorithm is the most common technique among partitional clustering algorithms, and a fast one, although with unnormalized datasets it can converge to a local optimum. We introduced two new normalization approaches to enhance the K-Means algorithm. They remedy a problem with the existing Min-Max (MM) and Decimal Scaling (DS) techniques, which have an overflow weakness. The suggested approaches are called the new approach to min-max (NAMM) and the new approach to decimal scaling (NADS). The Hybrid mean algorithm, which is based on spherical clusters, is also proposed to remedy the most significant limitation of the K-Means and K-Midranges algorithms. It combines the mean from the K-Means algorithm with the minimum and maximum from the K-Midranges algorithm and computes their average as the cluster mean of the Hybrid mean. The problem of using the range function in the Heterogeneous Euclidean-Overlap Metric (HEOM) is addressed by replacing the range with the interquartile range, in a method called Interquartile Range-Heterogeneous Metric (IQR-HEOM). Dividing by the range in HEOM allows outliers to have a large effect on the contribution of attributes. Hence, we proposed the interquartile range, which is more resistant to outliers in data pre-processing. The results show that the IQR-HEOM method is more efficient at rectifying the problem caused by using the range in HEOM. The Standardized Euclidean distance, which uses the standard deviation to down-weight the maximum points of the ith feature in the distance between clusters, has been criticized in the literature by many researchers because the method is prone to outliers and has a 0% breakdown point. 
Therefore, to remedy the problem, we introduced two statistical estimators, Qn and Sn, both of which have a 50% breakdown point, with efficiencies of 58% and 82% for Sn and Qn, respectively. The empirical evidence shows that the two suggested methods are more efficient than the existing methods.
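
The abstract's two distance proposals share one idea: each numeric attribute difference is divided by a scale estimate, and a non-robust scale (the range in HEOM, the standard deviation in the Standardized Euclidean distance) is swapped for a robust one (the interquartile range, or Qn/Sn). The sketch below is an illustration of that idea only, not the thesis's implementation. The function names and the `income` data are hypothetical, the nominal and missing-value terms of HEOM are left out, Sn is omitted, and `qn_scale` is a naive version that uses the asymptotic constant 2.2219 without finite-sample corrections.

```python
import math
from itertools import combinations
from statistics import quantiles, stdev


def range_scale(values):
    """Full range (max - min), the scale used for numeric attributes in HEOM."""
    return max(values) - min(values)


def iqr_scale(values):
    """Interquartile range Q3 - Q1, which ignores the extreme tails."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


def qn_scale(values):
    """Naive O(n^2) Qn: scaled k-th smallest pairwise absolute difference.

    Assumption: asymptotic consistency constant 2.2219, no
    finite-sample correction factor.
    """
    n = len(values)
    h = n // 2 + 1
    k = h * (h - 1) // 2  # k = C(h, 2)
    diffs = sorted(abs(a - b) for a, b in combinations(values, 2))
    return 2.2219 * diffs[k - 1]


def scaled_distance(x, y, scales):
    """Euclidean distance with each numeric attribute divided by its scale."""
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(x, y, scales)))


# Hypothetical attribute with one gross outlier (500).
income = [20, 22, 23, 25, 26, 28, 30, 500]

for name, estimator in [("range", range_scale), ("stdev", stdev),
                        ("IQR", iqr_scale), ("Qn", qn_scale)]:
    s = estimator(income)
    d = scaled_distance([20], [30], [s])
    print(f"{name:>5}: scale={s:8.3f}  distance(20, 30)={d:.3f}")
```

With the outlier present, the range and standard deviation become very large, so the difference between the ordinary values 20 and 30 nearly vanishes after scaling. The IQR and Qn scales stay close to the spread of the bulk of the data. Note that dividing by the IQR no longer bounds a numeric term by 1 as dividing by the range does; whether the thesis caps or otherwise adjusts that term is not stated in the abstract.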
format	Thesis
author	Dalatu, Paul Inuwa
spellingShingle	Dalatu, Paul Inuwa Statistical data preprocessing methods in distance functions to enhance k-means clustering algorithm
author_facet	Dalatu, Paul Inuwa
author_sort	Dalatu, Paul Inuwa
title	Statistical data preprocessing methods in distance functions to enhance k-means clustering algorithm
title_short	Statistical data preprocessing methods in distance functions to enhance k-means clustering algorithm
title_full	Statistical data preprocessing methods in distance functions to enhance k-means clustering algorithm
title_fullStr	Statistical data preprocessing methods in distance functions to enhance k-means clustering algorithm
title_full_unstemmed	Statistical data preprocessing methods in distance functions to enhance k-means clustering algorithm
title_sort	statistical data preprocessing methods in distance functions to enhance k-means clustering algorithm
publishDate	2018
url	http://psasir.upm.edu.my/id/eprint/68681/1/FS%202018%2026%20-%20IR.pdf http://psasir.upm.edu.my/id/eprint/68681/
_version_	1643839274524606464
score	13.160551

Similar Items