Staff View: Enhancement of parallel K-means algorithm for clustering big datasets

Enhancement of parallel K-means algorithm for clustering big datasets

Big Data encompasses huge amounts of complex data which is generated in different areas such as business, marketing, educational systems, IoT, and healthcare. For instance, in the healthcare domain, huge amounts of data are generated daily from different sources such as health monitoring and medical...

Bibliographic Details
Main Author:	Ashabi, Ardavan
Format:	Thesis
Language:	English
Published:	2022
Subjects:	T Technology (General)
Online Access:	http://eprints.utm.my/id/eprint/102827/1/ArdavanAshabiPRAZAK2022.pdf.pdf http://eprints.utm.my/id/eprint/102827/ http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:151605

id	my.utm.102827
record_format	eprints
spelling	my.utm.102827 2023-09-24T03:20:48Z http://eprints.utm.my/id/eprint/102827/ Enhancement of parallel K-means algorithm for clustering big datasets Ashabi, Ardavan T Technology (General) Big Data encompasses huge amounts of complex data which is generated in different areas such as business, marketing, educational systems, IoT, and healthcare. For instance, in the healthcare domain, huge amounts of data are generated daily from different sources such as health monitoring and medical diagnosis systems by health service providers. Data mining aims to extract meaningful and valuable patterns from a set of raw data to transform data into meaningful information for better decision-making. However, Big Data is very complex and voluminous, and traditional methods of Data Mining are not capable of processing and analyzing this data efficiently. Data clustering, one of the main methods of data mining, eases the extraction of information from each cluster separately. Since the 1960s, the K-means algorithm has been known as one of the most classical techniques of data clustering. Even though there has been an extensive literature on improving the efficiency of K-means for years now, traditional K-means still suffers from some weaknesses, especially in dealing with Big Data. Despite many attempts to optimize the K-means algorithm to handle Big Data using different techniques such as parallelization, the proposed methods are still not able to cluster Big Datasets efficiently due to a lack of improvement in some influential parameters such as the number of clusters and the initial cluster centroids. This study aims to understand the current limitations of the K-means algorithm and to overcome them in order to produce more efficient performance in clustering big datasets from the healthcare domain. To develop the optimized extension of the K-means algorithm, a systematic literature review (SLR) was conducted to investigate the current limitations of K-means over Big Data and the existing solutions to them. Based on the SLR, this study proposed an enhanced parallel version of the K-means clustering algorithm to reduce the execution time of the clustering process over big datasets with minimum negative impact on the clustering’s accuracy. Determining the optimum number of clusters, obtaining suitable initial centroids, and improving the process of parallelization were the three steps of the optimization process. To avoid any random results, the proposed hybrid solution defined the optimum number of clusters by using the elbow method. In addition, the proposed algorithm obtained the ideal initial centroids by utilizing a careful seed selection method, performed K-means with a fuzzy technique to increase the precision of the clustering, and parallelized the clustering process by using the Hadoop platform with optimized Map and Reduce functions to reduce the execution time of the process. The evaluation of the proposed algorithm revealed that the new method performed the clustering process over multiple big datasets with shorter execution time compared to the study’s benchmarks: Apache Mahout K-means, K-means++, and Fuzzy K-means. Also, the results of the three selected cluster validity indices - Silhouette, Dunn, and Davies-Bouldin - verified that there was no negative impact on the quality of the clusters. 2022 Thesis NonPeerReviewed application/pdf en http://eprints.utm.my/id/eprint/102827/1/ArdavanAshabiPRAZAK2022.pdf.pdf Ashabi, Ardavan (2022) Enhancement of parallel K-means algorithm for clustering big datasets. PhD thesis, Universiti Teknologi Malaysia. http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:151605
institution	Universiti Teknologi Malaysia
building	UTM Library
collection	Institutional Repository
continent	Asia
country	Malaysia
content_provider	Universiti Teknologi Malaysia
content_source	UTM Institutional Repository
url_provider	http://eprints.utm.my/
language	English
topic	T Technology (General)
spellingShingle	T Technology (General) Ashabi, Ardavan Enhancement of parallel K-means algorithm for clustering big datasets
description	Big Data encompasses huge amounts of complex data which is generated in different areas such as business, marketing, educational systems, IoT, and healthcare. For instance, in the healthcare domain, huge amounts of data are generated daily from different sources such as health monitoring and medical diagnosis systems by health service providers. Data mining aims to extract meaningful and valuable patterns from a set of raw data to transform data into meaningful information for better decision-making. However, Big Data is very complex and voluminous, and traditional methods of Data Mining are not capable of processing and analyzing this data efficiently. Data clustering, one of the main methods of data mining, eases the extraction of information from each cluster separately. Since the 1960s, the K-means algorithm has been known as one of the most classical techniques of data clustering. Even though there has been an extensive literature on improving the efficiency of K-means for years now, traditional K-means still suffers from some weaknesses, especially in dealing with Big Data. Despite many attempts to optimize the K-means algorithm to handle Big Data using different techniques such as parallelization, the proposed methods are still not able to cluster Big Datasets efficiently due to a lack of improvement in some influential parameters such as the number of clusters and the initial cluster centroids. This study aims to understand the current limitations of the K-means algorithm and to overcome them in order to produce more efficient performance in clustering big datasets from the healthcare domain. To develop the optimized extension of the K-means algorithm, a systematic literature review (SLR) was conducted to investigate the current limitations of K-means over Big Data and the existing solutions to them. Based on the SLR, this study proposed an enhanced parallel version of the K-means clustering algorithm to reduce the execution time of the clustering process over big datasets with minimum negative impact on the clustering’s accuracy. Determining the optimum number of clusters, obtaining suitable initial centroids, and improving the process of parallelization were the three steps of the optimization process. To avoid any random results, the proposed hybrid solution defined the optimum number of clusters by using the elbow method. In addition, the proposed algorithm obtained the ideal initial centroids by utilizing a careful seed selection method, performed K-means with a fuzzy technique to increase the precision of the clustering, and parallelized the clustering process by using the Hadoop platform with optimized Map and Reduce functions to reduce the execution time of the process. The evaluation of the proposed algorithm revealed that the new method performed the clustering process over multiple big datasets with shorter execution time compared to the study’s benchmarks: Apache Mahout K-means, K-means++, and Fuzzy K-means. Also, the results of the three selected cluster validity indices - Silhouette, Dunn, and Davies-Bouldin - verified that there was no negative impact on the quality of the clusters.
format	Thesis
author	Ashabi, Ardavan
author_facet	Ashabi, Ardavan
author_sort	Ashabi, Ardavan
title	Enhancement of parallel K-means algorithm for clustering big datasets
title_short	Enhancement of parallel K-means algorithm for clustering big datasets
title_full	Enhancement of parallel K-means algorithm for clustering big datasets
title_fullStr	Enhancement of parallel K-means algorithm for clustering big datasets
title_full_unstemmed	Enhancement of parallel K-means algorithm for clustering big datasets
title_sort	enhancement of parallel k-means algorithm for clustering big datasets
publishDate	2022
url	http://eprints.utm.my/id/eprint/102827/1/ArdavanAshabiPRAZAK2022.pdf.pdf http://eprints.utm.my/id/eprint/102827/ http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:151605
_version_	1778160787666239488
score	13.160551
