The Parallel Fuzzy C-Median Clustering Algorithm Using Spark for the Big Data
Main Authors:
Format: Article
Language: English
Published: IEEE, 2024
Subjects:
Online Access: http://irep.iium.edu.my/118900/7/118900_%20The%20Parallel%20Fuzzy%20C-Median.pdf
http://irep.iium.edu.my/118900/8/118900_%20The%20Parallel%20Fuzzy%20C-Median_Scopus.pdf
http://irep.iium.edu.my/118900/
https://ieeexplore.ieee.org/abstract/document/10684199/
Summary: Big data for sustainable development is a global issue due to the explosive growth of data. According to the forecast of the International Data Corporation (IDC), the amount of data in the world will double every 18 months, and the Global DataSphere is expected to more than double in size from 2022 to 2026. The analysis, processing, and storage of big data is a challenging research concern due to data imperfection, massive data size, computational difficulty, and lengthy evaluation time. Clustering is a fundamental technique in data analysis and data mining, and it becomes particularly challenging for big data because of the sheer volume, velocity, and variety of the data. Big data frameworks such as Hadoop MapReduce and Spark are potent tools that provide an effective way to analyze the huge datasets processed by a Hadoop cluster. Apache Spark is one of the most widely used large-scale data processing engines thanks to its speed, low-latency in-memory computing, and powerful analytics. We therefore develop a parallel fuzzy c-median clustering algorithm using Spark for big data that handles large datasets while maintaining high accuracy and scalability. The algorithm employs a distance-based clustering approach to determine the similarity between data points and group them, in combination with sampling and partitioning techniques. In the sampling phase, a representative subset of the dataset is selected. In the partitioning phase, the data is split into smaller subsets that can be clustered in parallel across multiple nodes. The suggested method, implemented on the Databricks cloud platform, provides high clustering accuracy, as measured by evaluation metrics such as the silhouette coefficient, cost function, partition index, and clustering entropy. The experimental results show that c = 5, which is consistent with the cost function at the ideal silhouette coefficient of 1, is the optimal number of clusters for this dataset. A comparative study validates the proposed algorithm by running other contemporary algorithms on the same dataset. The comparison shows that our suggested approach outperforms the others, especially in computational time. The developed approach is benchmarked against existing methods such as MiniBatchKMeans, AffinityPropagation, SpectralClustering, Ward, OPTICS, and BIRCH in terms of silhouette index and cost function.
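The record itself contains no code. Purely as an illustrative sketch of the fuzzy c-median idea the abstract describes (fuzzy memberships as in fuzzy c-means, but cluster centers taken as per-dimension weighted medians under Manhattan distance), a minimal single-node version might look as follows. The function name, parameters, and farthest-point initialisation are assumptions for illustration only, not the paper's Spark/Databricks implementation, which additionally applies sampling and runs the updates in parallel across partitions.

```python
import numpy as np

def fuzzy_c_median(X, c=2, m=2.0, n_iter=50, seed=0):
    """Illustrative fuzzy c-median sketch (single node, not the paper's code).

    Memberships follow the usual fuzzy c-means update, but centers are
    per-dimension *weighted medians*, matching the Manhattan (L1) distance.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Farthest-point initialisation keeps the initial centers spread out.
    idx = [int(rng.integers(n))]
    for _ in range(c - 1):
        dmin = np.min(np.abs(X[:, None, :] - X[idx][None, :, :]).sum(axis=2), axis=1)
        idx.append(int(dmin.argmax()))
    centers = X[idx].astype(float).copy()

    for _ in range(n_iter):
        # Membership update from Manhattan distances (epsilon avoids div-by-zero).
        dist = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2) + 1e-12
        inv = dist ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)   # rows sum to 1
        # Center update: weighted median of each dimension, weights U**m.
        W = U ** m
        for j in range(c):
            for k in range(d):
                order = np.argsort(X[:, k])
                cw = np.cumsum(W[order, j])
                centers[j, k] = X[order[np.searchsorted(cw, cw[-1] / 2.0)], k]
    return centers, U
```

In a Spark setting, the membership and weighted-median steps would be computed per partition and the partial statistics merged on the driver; the sketch above only shows the per-iteration mathematics.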