Feature selection by mutual information: robust ranking on high-dimension low-sample-size data
Feature selection is the process of selecting a group of relevant features, and removing unnecessary ones, for use in constructing a predictive model. The current benchmark for the data set is obtained by including all the features, redundant and noisy ones among them. Therefore, for this research, an opt...
Main Author: | Chin, Fung Yuen |
---|---|
Format: | Final Year Project / Dissertation / Thesis |
Published: | 2024 |
Subjects: | HA Statistics; Q Science (General) |
Online Access: | http://eprints.utar.edu.my/7067/1/THE_1002128.pdf http://eprints.utar.edu.my/7067/ |
id |
my-utar-eprints.7067 |
---|---|
record_format |
eprints |
spelling |
my-utar-eprints.7067 2025-01-19T01:59:21Z Feature selection by mutual information: robust ranking on high-dimension low-sample-size data Chin, Fung Yuen HA Statistics Q Science (General) 2024 Final Year Project / Dissertation / Thesis NonPeerReviewed application/pdf http://eprints.utar.edu.my/7067/1/THE_1002128.pdf Chin, Fung Yuen (2024) Feature selection by mutual information: robust ranking on high-dimension low-sample-size data. PhD thesis, UTAR. http://eprints.utar.edu.my/7067/ |
institution |
Universiti Tunku Abdul Rahman |
building |
UTAR Library |
collection |
Institutional Repository |
continent |
Asia |
country |
Malaysia |
content_provider |
Universiti Tunku Abdul Rahman |
content_source |
UTAR Institutional Repository |
url_provider |
http://eprints.utar.edu.my |
topic |
HA Statistics Q Science (General) |
description |
Feature selection is the process of selecting a group of relevant features, and removing unnecessary ones, for use in constructing a predictive model. The current benchmark for the data set is obtained by including all the features, redundant and noisy ones among them. Therefore, this research proposes an optimal baseline for the data set using a feature ranking method. In obtaining this optimal baseline, the total number of features is obtained at the same time, to serve as a guideline for the number of features needed in a feature selection method. In addition, high-dimensional data make feature selection more difficult because of the curse of dimensionality. To overcome this problem, a robust feature selection algorithm, named ranked mutual information with support vector machine (rMI-SVM), can be applied to data with missing values regardless of the linearity of the data set, as it does not require additional parameters or a preset number of features. The features selected by rMI-SVM can avoid overfitting, as each chosen candidate feature provides new information to the predictive model. The receiver operating characteristic curve was plotted to show the sensitivity of the model built by rMI-SVM compared with the regression method under the same number of features. The Z-score graph was also plotted to confirm that the features chosen by rMI-SVM were not selected by chance. The experimental results show that the proposed method can select a compact subset of features that performs better than both the benchmark of the data set and the optimal baseline proposed in this study. The biological meaning of the selected features confirmed that they are related to the relevant disease.
|
format |
Final Year Project / Dissertation / Thesis |
author |
Chin, Fung Yuen |
title |
Feature selection by mutual information: robust ranking on high-dimension low-sample-size data |
publishDate |
2024 |
url |
http://eprints.utar.edu.my/7067/1/THE_1002128.pdf http://eprints.utar.edu.my/7067/ |
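
The abstract describes ranking features by their mutual information with the class label. The record does not give the thesis's actual procedure, so what follows is only a minimal sketch of that general idea, not rMI-SVM itself. It assumes features have already been discretised, uses the plain empirical estimate of mutual information, and leaves out the SVM evaluation and the rule for deciding how many features to keep. The function names `mutual_information` and `rank_features`, the feature names and the toy data are all hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y), in bits, of two discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written in terms of counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


def rank_features(columns, labels):
    """Return (feature name, MI with labels) pairs, highest MI first."""
    scores = {name: mutual_information(col, labels) for name, col in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data invented for illustration: 8 samples, 3 discretised features.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "f_indep": [0, 1, 0, 1, 0, 1, 0, 1],  # independent of the label
    "f_noisy": [0, 0, 0, 1, 1, 1, 1, 0],  # agrees with the label on 6 of 8 samples
    "f_copy": [0, 0, 0, 0, 1, 1, 1, 1],   # identical to the label
}

ranking = rank_features(columns, labels)
for name, score in ranking:
    print(f"{name}: {score:.3f} bits")
```

On this toy data the feature that copies the label ranks first and the independent feature ranks last. A ranking of this kind only orders the candidates. Checking whether each added feature brings new information, and scoring the resulting subsets with a classifier such as an SVM, would be separate steps that this sketch does not attempt.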