Information Theoretic-based Feature Selection for Machine Learning
Three major factors that determine the performance of a machine learning system are the choice of a representative set of features, the choice of a suitable machine learning algorithm, and the right selection of the training parameters for that algorithm. This thesis tackles the proble...
Main Author: | Muhammad Aliyu, Sulaiman |
---|---|
Format: | Thesis |
Language: | English |
Published: | Universiti Malaysia Sarawak (UNIMAS), 2018 |
Subjects: | QA75 Electronic computers. Computer science |
Online Access: | http://ir.unimas.my/id/eprint/26595/1/Information%20Theoretic-based%20Feature%2024pgs.pdf http://ir.unimas.my/id/eprint/26595/4/Information%20Theoretic-based%20Feature%20ft.pdf http://ir.unimas.my/id/eprint/26595/ |
id |
my.unimas.ir.26595 |
---|---|
record_format |
eprints |
spelling |
my.unimas.ir.26595 2023-03-23T07:47:33Z http://ir.unimas.my/id/eprint/26595/ Information Theoretic-based Feature Selection for Machine Learning Muhammad Aliyu, Sulaiman QA75 Electronic computers. Computer science Universiti Malaysia Sarawak (UNIMAS) 2018 Thesis NonPeerReviewed text en http://ir.unimas.my/id/eprint/26595/1/Information%20Theoretic-based%20Feature%2024pgs.pdf text en http://ir.unimas.my/id/eprint/26595/4/Information%20Theoretic-based%20Feature%20ft.pdf Muhammad Aliyu, Sulaiman (2018) Information Theoretic-based Feature Selection for Machine Learning. PhD thesis, Universiti Malaysia Sarawak (UNIMAS). |
institution |
Universiti Malaysia Sarawak |
building |
Centre for Academic Information Services (CAIS) |
collection |
Institutional Repository |
continent |
Asia |
country |
Malaysia |
content_provider |
Universiti Malaysia Sarawak |
content_source |
UNIMAS Institutional Repository |
url_provider |
http://ir.unimas.my/ |
language |
English |
topic |
QA75 Electronic computers. Computer science |
spellingShingle |
QA75 Electronic computers. Computer science Muhammad Aliyu, Sulaiman Information Theoretic-based Feature Selection for Machine Learning |
description |
Three major factors determine the performance of a machine learning system: the choice of a
representative set of features, the choice of a suitable machine learning algorithm, and the
right selection of the training parameters for that algorithm. This thesis tackles the problem
of feature selection for supervised machine learning prediction tasks through dependency
information. The feature evaluation strategy is formulated on mutual information (MI) so that
it handles both classification and regression tasks, and the search strategy is a modified
greedy forward strategy designed to manage redundancy between features and to avoid features
that are irrelevant to the predicted output.
The problem with many existing feature selection methods that evaluate features using mutual
information is that they are designed to handle classification tasks only, and the few that
can work for regression tasks were recently found to underestimate the mutual information
between two strongly dependent variables. In addition, the search strategy used with many
existing feature selection methods, usually a heuristic greedy method, lacks a scientifically
sound stopping criterion, and the forward greedy procedure, despite its advantages over the
backward procedure, has been found to yield suboptimal results. This thesis therefore
developed and evaluated a filter-based Information Theoretic-based Feature Selection (IFS)
method for machine learning. Various experiments were carried out to assess and test the
components of the IFS algorithm. The first test evaluated the formulated IFS Selection
Criterion Strategy (MI estimator) by comparing it with six different benchmark MI estimators.
The second test evaluated IFS in a controlled study using simulated datasets.
The third test used ten natural domain datasets obtained from the UCI Repository, in
about fifteen different experiments, using three to four different machine learning
algorithms for performance evaluation. Additional experiments comparing the relative
performance of IFS with five related feature selection algorithms were also carried out using
natural domain datasets. The thesis further developed a hybrid filter method to enhance the
performance of IFS: IFS serves as the filter and, together with an Ant Colony Optimization
System (ACO) as a metaheuristic, forms the hybrid system. In this extended IFS method,
feature selection is formulated as a 0-1 Knapsack Problem (MKP). Specifically,
the thesis developed and evaluated the IFS_BACS (Binary Ant Colony System) hybrid
method. Further experiments were carried out using the natural domain datasets, and
comparisons were made between the IFS and hybrid IFS_BACS methods. In most cases,
IFS and its extended IFS_BACS hybrid method significantly reduced the number of
features and produced competitive accuracy compared with the results obtained on the
full feature set before applying IFS or IFS_BACS. Comparing IFS with its
extended version, the extended version (IFS_BACS) appears more promising for selecting
an optimal feature subset from large datasets. |
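The general idea described in the abstract, scoring features by mutual information and adding them greedily while penalizing redundancy, can be illustrated with a minimal sketch. This is not the thesis's IFS algorithm: the plug-in MI estimator for discrete values, the "relevance minus mean redundancy" score, the stop-when-score-is-not-positive rule, and the names `mutual_information` and `greedy_forward_select` are all assumptions chosen for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits for two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # sum over observed pairs of p(x, y) * log2(p(x, y) / (p(x) * p(y)))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def greedy_forward_select(features, target, max_features=None):
    """Greedy forward selection over a dict of name -> discrete column.

    Illustrative score (an assumption, not the IFS criterion):
    MI(feature, target) minus the mean MI with already selected features.
    """
    relevance = {
        name: mutual_information(col, target)
        for name, col in features.items()
    }
    remaining = list(features)
    selected = []

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and (max_features is None or len(selected) < max_features):
        best = max(remaining, key=score)
        # Illustrative stopping rule: stop once no candidate adds information.
        if score(best) <= 1e-12:
            break
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: one relevant feature, an exact duplicate, and an irrelevant one.
demo_target = [0, 0, 1, 1, 0, 0, 1, 1]
demo_features = {
    "f_rel": [0, 0, 1, 1, 0, 0, 1, 1],
    "f_dup": [0, 0, 1, 1, 0, 0, 1, 1],
    "f_noise": [0, 1, 0, 1, 0, 1, 0, 1],
}
selected = greedy_forward_select(demo_features, demo_target)
print(selected)
```

On this toy data the duplicate and the irrelevant feature both score zero after the first pick, so the loop stops with a single feature. A continuous-valued (regression) setting, which the thesis targets, would need a different MI estimator than the counting estimator assumed here.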
format |
Thesis |
author |
Muhammad Aliyu, Sulaiman |
author_facet |
Muhammad Aliyu, Sulaiman |
author_sort |
Muhammad Aliyu, Sulaiman |
title |
Information Theoretic-based Feature Selection for Machine Learning |
title_short |
Information Theoretic-based Feature Selection for Machine Learning |
title_full |
Information Theoretic-based Feature Selection for Machine Learning |
title_fullStr |
Information Theoretic-based Feature Selection for Machine Learning |
title_full_unstemmed |
Information Theoretic-based Feature Selection for Machine Learning |
title_sort |
information theoretic-based feature selection for machine learning |
publisher |
Universiti Malaysia Sarawak (UNIMAS) |
publishDate |
2018 |
url |
http://ir.unimas.my/id/eprint/26595/1/Information%20Theoretic-based%20Feature%2024pgs.pdf http://ir.unimas.my/id/eprint/26595/4/Information%20Theoretic-based%20Feature%20ft.pdf http://ir.unimas.my/id/eprint/26595/ |