Enhancement of parallel K-means algorithm for clustering big datasets

Big Data encompasses huge amounts of complex data which is generated in different areas such as business, marketing, educational systems, IoT, and healthcare. For instance, in the healthcare domain, huge amounts of data are generated daily from different sources such as health monitoring and medical...

Bibliographic Details
Main Author: Ashabi, Ardavan
Format: Thesis
Language: English
Published: 2022
Subjects:
Online Access: http://eprints.utm.my/id/eprint/102827/1/ArdavanAshabiPRAZAK2022.pdf.pdf
http://eprints.utm.my/id/eprint/102827/
http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:151605
id my.utm.102827
record_format eprints
spelling my.utm.102827 2023-09-24T03:20:48Z http://eprints.utm.my/id/eprint/102827/ Enhancement of parallel K-means algorithm for clustering big datasets Ashabi, Ardavan T Technology (General) 2022 Thesis NonPeerReviewed application/pdf en http://eprints.utm.my/id/eprint/102827/1/ArdavanAshabiPRAZAK2022.pdf.pdf Ashabi, Ardavan (2022) Enhancement of parallel K-means algorithm for clustering big datasets. PhD thesis, Universiti Teknologi Malaysia. http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:151605
institution Universiti Teknologi Malaysia
building UTM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Teknologi Malaysia
content_source UTM Institutional Repository
url_provider http://eprints.utm.my/
language English
topic T Technology (General)
description Big Data encompasses huge amounts of complex data generated in areas such as business, marketing, educational systems, IoT, and healthcare. For instance, in the healthcare domain, huge amounts of data are generated daily from sources such as health monitoring and medical diagnosis systems operated by health service providers. Data mining aims to extract meaningful and valuable patterns from raw data, transforming it into information that supports better decision-making. However, Big Data is complex and voluminous, and traditional data mining methods are not capable of processing and analyzing it efficiently. Data clustering, one of the main methods of data mining, eases the extraction of information from each cluster separately. Since the 1960s, the K-means algorithm has been known as one of the most classical data clustering techniques. Even though there is an extremely rich literature on improving the efficiency of K-means, traditional K-means still suffers from weaknesses, especially in dealing with Big Data. Despite many attempts to optimize the K-means algorithm for Big Data using techniques such as parallelization, the proposed methods still cannot cluster big datasets efficiently because they do not improve influential parameters such as the number of clusters and the initial cluster centroids. This study aims to understand the current limitations of the K-means algorithm and to overcome them in order to cluster big datasets from the healthcare domain more efficiently. To develop the optimized extension of the K-means algorithm, a systematic literature review (SLR) was conducted to investigate the current limitations of K-means over Big Data and the existing solutions to them.
Based on the SLR, this study proposed an enhanced parallel version of the K-means clustering algorithm that reduces the execution time of clustering big datasets with minimal negative impact on clustering accuracy. The optimization process had three steps: determining the optimum number of clusters, obtaining suitable initial centroids, and improving the parallelization process. To avoid random results, the proposed hybrid solution determined the optimum number of clusters using the elbow method. In addition, the proposed algorithm obtained the ideal initial centroids using a careful seed selection method, performed K-means with a fuzzy technique to increase clustering precision, and parallelized the clustering process on the Hadoop platform with optimized Map and Reduce functions to reduce execution time. The evaluation revealed that the new method clustered multiple big datasets with shorter execution time than the study's benchmarks: Apache Mahout K-means, K-means++, and Fuzzy K-means. Also, the results of the three selected cluster validity indices (Silhouette, Dunn, and Davies-Bouldin) verified that there was no negative impact on the quality of the clusters.
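For orientation, the three ingredients named in the abstract (careful seeding, Map and Reduce steps, and an elbow-style cost) can be sketched in a few lines of plain Python. This is an illustrative sketch only and is not the thesis's algorithm: it uses hard K-means assignment rather than the fuzzy technique, in-process list chunks stand in for Hadoop input splits, and every function name (`careful_seeds`, `map_step`, `reduce_step`, `kmeans`, `inertia`) is hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def careful_seeds(points, k, rng):
    """k-means++-style seeding (assumed here as the 'careful seed selection').

    Each new seed is drawn with probability proportional to its squared
    distance from the nearest seed chosen so far. Assumes `points` holds
    at least k distinct points.
    """
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        weights = [min(sq_dist(p, s) for s in seeds) for p in points]
        seeds.append(rng.choices(points, weights=weights, k=1)[0])
    return seeds


def map_step(chunk, centroids):
    """Mapper with a local combiner: per-centroid (coordinate sums, count)."""
    partial = {}
    for p in chunk:
        j = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
        sums, count = partial.get(j, ([0.0] * len(p), 0))
        partial[j] = ([s + x for s, x in zip(sums, p)], count + 1)
    return partial


def reduce_step(partials, centroids):
    """Reducer: merge partial sums and return the updated centroids.

    A centroid that received no points keeps its previous position.
    """
    totals = {}
    for partial in partials:
        for j, (sums, count) in partial.items():
            tsum, tcount = totals.get(j, ([0.0] * len(sums), 0))
            totals[j] = ([a + b for a, b in zip(tsum, sums)], tcount + count)
    return [
        tuple(s / totals[j][1] for s in totals[j][0]) if j in totals else centroids[j]
        for j in range(len(centroids))
    ]


def kmeans(points, k, n_chunks=4, iters=20, seed=0):
    """Run map/reduce rounds until the centroids stop changing."""
    rng = random.Random(seed)
    centroids = careful_seeds(points, k, rng)
    chunks = [points[i::n_chunks] for i in range(n_chunks)]  # stand-in for splits
    for _ in range(iters):
        updated = reduce_step([map_step(c, centroids) for c in chunks], centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


def inertia(points, centroids):
    """Within-cluster sum of squares, the quantity plotted for the elbow method."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
    for k in (1, 2, 3):
        print(k, inertia(data, kmeans(data, k)))
```

Printing `inertia` for increasing `k` gives the curve whose bend the elbow method looks for. On a real cluster the mapper and reducer would run as separate tasks, with the centroids shared between rounds.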
format Thesis
author Ashabi, Ardavan
title Enhancement of parallel K-means algorithm for clustering big datasets
title_sort enhancement of parallel k-means algorithm for clustering big datasets
publishDate 2022
url http://eprints.utm.my/id/eprint/102827/1/ArdavanAshabiPRAZAK2022.pdf.pdf
http://eprints.utm.my/id/eprint/102827/
http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:151605