A web-based implementation of k-means algorithms
Main Author:
Format: Final Year Project / Dissertation / Thesis
Published: 2022
Subjects:
Online Access: http://eprints.utar.edu.my/5010/1/1801846_LEE_QUAN.pdf
http://eprints.utar.edu.my/5010/
Summary: The K-means algorithm has been around for over half a century. While a rather simple and dated algorithm, it remains widely used and taught to this day. The K-means algorithm requires two inputs before it can be applied to a data set: the value of K and a proximity measure. Picking the right inputs, especially the proximity measure, is of utmost importance if one wishes to achieve good results with the algorithm. Many different proximity measures exist, each best suited to different types of applications and data sets. Despite this, most modern data-mining tools offer the user only a handful of proximity measures, the most common being Euclidean distance and Manhattan distance. This scarcity of proximity measures in data-mining tools limits the performance of the algorithm. This is where k-luster comes in. k-luster, the web application developed as a result of this project, implements the K-means and K-means++ algorithms along with ten proximity measures: seven distance measures and three similarity measures. The project was planned using the Kanban development methodology and was built using HTML, CSS, JavaScript, Django, NumPy and pandas. The completed web application is hosted on Heroku. k-luster allows users to upload their own data set, or to choose from one of three samples if they just want to try out the application. Experimenting with different settings and comparing the results makes clear how large an impact the choice of proximity measure can have. In conclusion, this project has accomplished what it first set out to achieve. However, there is still much room for improvement. Firstly, k-luster could incorporate additional clustering algorithms, or even classification algorithms, in the future. Furthermore, the web application could save users' past work, so that they may resume it at a later time without skipping a beat.
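The abstract's central point is that K-means takes two inputs, K and a proximity measure, and that the measure is pluggable. The sketch below illustrates that idea in NumPy (part of the project's stated stack); it is a minimal illustration, not k-luster's actual code, and the function names (`euclidean`, `manhattan`, `kmeans`) are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch of K-means with a pluggable proximity measure.
# Not k-luster's implementation; names and signatures are illustrative.

def euclidean(points, center):
    # Euclidean distance from each row of `points` to `center`.
    return np.linalg.norm(points - center, axis=1)

def manhattan(points, center):
    # Manhattan (city-block) distance, a common alternative measure.
    return np.abs(points - center).sum(axis=1)

def kmeans(data, k, measure=euclidean, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Plain K-means seeding: sample k distinct points as initial centroids
    # (K-means++ would instead spread the seeds out probabilistically).
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid under the chosen measure.
        dists = np.stack([measure(data, c) for c in centroids], axis=1)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new = np.array([data[labels == j].mean(axis=0)
                        if np.any(labels == j) else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```

Because `measure` is just a function argument, swapping Euclidean for Manhattan (or any other distance or similarity measure) requires no change to the clustering loop, which is the design property the abstract argues data-mining tools should expose.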