Feature selection by mutual information: robust ranking on high-dimension low-sample-size data

Feature selection is the process of choosing a group of relevant features, and discarding unnecessary ones, for use in constructing a predictive model. The current benchmark for the data set is obtained by including all the features, redundant and noisy ones included. Therefore, for this research, an opt...

Bibliographic Details
Main Author: Chin, Fung Yuen
Format: Final Year Project / Dissertation / Thesis
Published: 2024
Subjects:
Online Access: http://eprints.utar.edu.my/7067/1/THE_1002128.pdf
http://eprints.utar.edu.my/7067/
id my-utar-eprints.7067
record_format eprints
spelling my-utar-eprints.7067 2025-01-19T01:59:21Z
2024 Final Year Project / Dissertation / Thesis NonPeerReviewed application/pdf http://eprints.utar.edu.my/7067/1/THE_1002128.pdf Chin, Fung Yuen (2024) Feature selection by mutual information: robust ranking on high-dimension low-sample-size data. PhD thesis, UTAR. http://eprints.utar.edu.my/7067/
institution Universiti Tunku Abdul Rahman
building UTAR Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Tunku Abdul Rahman
content_source UTAR Institutional Repository
url_provider http://eprints.utar.edu.my
topic HA Statistics
Q Science (General)
description Feature selection is the process of choosing a group of relevant features, and discarding unnecessary ones, for use in constructing a predictive model. The current benchmark for the data set is obtained by including all the features, redundant and noisy ones included. This research therefore proposes an optimal baseline for the data set using a feature ranking method. Obtaining this optimal baseline also yields a total number of features, which serves as a guideline for the number of features needed in a feature selection method. In addition, high-dimensional data makes feature selection more difficult because of the curse of dimensionality. To overcome this problem, a robust feature selection algorithm named ranked mutual information with support vector machine (rMI-SVM) can be applied to data with missing values regardless of the linearity of the data set, as it requires neither additional parameters nor a preset number of features. The features selected by rMI-SVM can avoid overfitting because each chosen candidate feature provides new information to the predictive model. The receiver operating characteristic curve was plotted to show the sensitivity of the model built by rMI-SVM compared with the regression method under the same number of features. A Z-score graph was also plotted to confirm that the features chosen by rMI-SVM were not selected by chance. The experimental results show that the proposed method can select a compact subset of features that performs better than both the benchmark of the data set and the optimal baseline proposed in this study. The biological meaning of the selected features confirmed that they are related to the relevant disease.
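The ranking idea named in the abstract can be illustrated with a minimal sketch. It is not the thesis's rMI-SVM algorithm, whose details are not given in this record: it only ranks features by their mutual information with the class label and leaves out the support vector machine step entirely. The feature names (`gene_a`, `gene_b`, `gene_c`), the toy data, and the assumption that features are already discretized are all hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information, in bits, between two equal-length discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log2(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


def rank_features(columns, labels):
    """Return (feature names by decreasing MI with labels, MI score per feature)."""
    scores = {name: mutual_information(values, labels)
              for name, values in columns.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


# Hypothetical toy data: 8 samples, binary class label, 3 discretized features.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "gene_a": [0, 0, 0, 0, 1, 1, 1, 1],  # identical to the label
    "gene_b": [0, 0, 0, 1, 1, 1, 1, 0],  # partly informative
    "gene_c": [0, 1, 0, 1, 0, 1, 0, 1],  # independent of the label
}

ranking, scores = rank_features(columns, labels)
print(ranking)
```

On this toy data the feature that matches the label ranks first and the independent feature ranks last. A full method along the lines the abstract describes would presumably also evaluate candidate subsets with a classifier, which this sketch does not attempt.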
format Final Year Project / Dissertation / Thesis
author Chin, Fung Yuen
title Feature selection by mutual information: robust ranking on high-dimension low-sample-size data