Information Theoretic-based Feature Selection for Machine Learning

Three major factors determine the performance of a machine learning system: the choice of a representative set of features, the choice of a suitable learning algorithm, and the right selection of training parameters for that algorithm. This thesis tackles the problem...

Bibliographic Details
Main Author: Muhammad Aliyu, Sulaiman
Format: Thesis
Language:English
Published: Universiti Malaysia Sarawak (UNIMAS) 2018
Subjects:
Online Access:http://ir.unimas.my/id/eprint/26595/1/Information%20Theoretic-based%20Feature%2024pgs.pdf
http://ir.unimas.my/id/eprint/26595/4/Information%20Theoretic-based%20Feature%20ft.pdf
http://ir.unimas.my/id/eprint/26595/
id my.unimas.ir.26595
record_format eprints
spelling my.unimas.ir.26595 2023-03-23T07:47:33Z http://ir.unimas.my/id/eprint/26595/ Information Theoretic-based Feature Selection for Machine Learning Muhammad Aliyu, Sulaiman QA75 Electronic computers. Computer science Three major factors determine the performance of a machine learning system: the choice of a representative set of features, the choice of a suitable learning algorithm, and the right selection of training parameters for that algorithm. This thesis tackles the problem of feature selection for supervised machine learning prediction tasks through dependency information. The feature evaluation strategy is formulated based on mutual information (MI) to handle both classification and regression tasks, and the search strategy is a modified greedy forward strategy designed to manage redundancy between features and to avoid features that are irrelevant to the predicted output. The problem with many existing feature selection methods that evaluate features based on mutual information is that they are designed to handle classification tasks only, and the few that can work for regression tasks were recently found to underestimate the mutual information between two strongly dependent variables. In addition, the search strategy used with many existing methods, usually a heuristic greedy procedure, lacks a scientifically sound stopping criterion, and the forward greedy procedure, despite its advantages over the backward procedure, is found to yield suboptimal results. Thus, this thesis developed and evaluated a filter-based Information Theoretic-based Feature Selection (IFS) method for machine learning. Various experiments were carried out to assess and test the components of the IFS algorithm. The first test was designed to evaluate the formulated IFS selection criterion strategy (MI estimator) by comparing it with six different MI estimator benchmarks.
The second test evaluated IFS in a controlled study using simulated datasets. The third test used ten natural-domain datasets obtained from the UCI Repository, in about fifteen different experiments, using three to four different machine learning algorithms for performance evaluation. Additional experiments comparing the relative performance of IFS with five related feature selection algorithms were also carried out on the natural-domain datasets. Furthermore, this thesis developed a hybrid filter method to enhance the performance of IFS: IFS serves as the filter, and a Binary Ant Colony System, a metaheuristic based on Ant Colony Optimization (ACO), completes the hybrid system. In this extended method, feature selection was formulated as a 0-1 knapsack problem. Thus, this thesis developed and evaluated the hybrid IFS_BACS (Binary Ant Colony System) method. Further experiments were carried out using the natural-domain datasets, and comparisons were made between IFS and the hybrid IFS_BACS method. In most cases, IFS and its extended IFS_BACS hybrid significantly reduced the number of features and produced competitive accuracy compared with the full feature set before applying IFS or IFS_BACS. Comparing IFS with its extended version, IFS_BACS appears more promising for selecting an optimal feature subset from large datasets. Universiti Malaysia Sarawak (UNIMAS) 2018 Thesis NonPeerReviewed text en http://ir.unimas.my/id/eprint/26595/1/Information%20Theoretic-based%20Feature%2024pgs.pdf text en http://ir.unimas.my/id/eprint/26595/4/Information%20Theoretic-based%20Feature%20ft.pdf Muhammad Aliyu, Sulaiman (2018) Information Theoretic-based Feature Selection for Machine Learning. PhD thesis, Universiti Malaysia Sarawak (UNIMAS).
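The hybrid step described in the record — a binary ant colony metaheuristic searching over feature subsets framed as a 0-1 knapsack problem — can be sketched as a toy. This is an illustrative simplification, not the thesis's IFS_BACS: the function name, parameters, pheromone-update rule, and the penalized knapsack fitness below are all assumptions for demonstration.

```python
import random

def binary_aco_select(n_features, fitness, n_ants=10, n_iters=30,
                      evaporation=0.2, seed=0):
    """Toy binary ant colony search: each feature carries a pheromone
    value interpreted as its inclusion probability; ants sample binary
    subsets, and the best-so-far subset reinforces the pheromones."""
    rng = random.Random(seed)
    pher = [0.5] * n_features          # start undecided on every feature
    best, best_fit = None, float("-inf")
    for _ in range(n_iters):
        for _ in range(n_ants):
            sol = [1 if rng.random() < p else 0 for p in pher]
            fit = fitness(sol)
            if fit > best_fit:
                best, best_fit = sol, fit
        # evaporate, then pull pheromones toward the best-so-far subset
        for i in range(n_features):
            pher[i] = (1 - evaporation) * pher[i] + evaporation * best[i]
            pher[i] = min(0.95, max(0.05, pher[i]))  # keep some exploration
    return best, best_fit

# Example: a small 0-1 knapsack instance (illustrative numbers).
# Infeasible subsets are penalized with a negative score.
values, weights, cap = [6, 5, 4, 3], [4, 3, 2, 1], 6

def knap_fitness(sol):
    w = sum(wi * s for wi, s in zip(weights, sol))
    return sum(vi * s for vi, s in zip(values, sol)) if w <= cap else -w

best, best_fit = binary_aco_select(4, knap_fitness, seed=1)
```

In a feature selection setting, the knapsack "value" would be replaced by a filter score for the candidate subset (e.g. an MI-based criterion), which is what makes the hybrid a filter method rather than a wrapper.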
institution Universiti Malaysia Sarawak
building Centre for Academic Information Services (CAIS)
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Malaysia Sarawak
content_source UNIMAS Institutional Repository
url_provider http://ir.unimas.my/
language English
topic QA75 Electronic computers. Computer science
spellingShingle QA75 Electronic computers. Computer science
Muhammad Aliyu, Sulaiman
Information Theoretic-based Feature Selection for Machine Learning
description Three major factors determine the performance of a machine learning system: the choice of a representative set of features, the choice of a suitable learning algorithm, and the right selection of training parameters for that algorithm. This thesis tackles the problem of feature selection for supervised machine learning prediction tasks through dependency information. The feature evaluation strategy is formulated based on mutual information (MI) to handle both classification and regression tasks, and the search strategy is a modified greedy forward strategy designed to manage redundancy between features and to avoid features that are irrelevant to the predicted output. The problem with many existing feature selection methods that evaluate features based on mutual information is that they are designed to handle classification tasks only, and the few that can work for regression tasks were recently found to underestimate the mutual information between two strongly dependent variables. In addition, the search strategy used with many existing methods, usually a heuristic greedy procedure, lacks a scientifically sound stopping criterion, and the forward greedy procedure, despite its advantages over the backward procedure, is found to yield suboptimal results. Thus, this thesis developed and evaluated a filter-based Information Theoretic-based Feature Selection (IFS) method for machine learning. Various experiments were carried out to assess and test the components of the IFS algorithm. The first test was designed to evaluate the formulated IFS selection criterion strategy (MI estimator) by comparing it with six different MI estimator benchmarks. The second test evaluated IFS in a controlled study using simulated datasets.
The third test used ten natural-domain datasets obtained from the UCI Repository, in about fifteen different experiments, using three to four different machine learning algorithms for performance evaluation. Additional experiments comparing the relative performance of IFS with five related feature selection algorithms were also carried out on the natural-domain datasets. Furthermore, this thesis developed a hybrid filter method to enhance the performance of IFS: IFS serves as the filter, and a Binary Ant Colony System, a metaheuristic based on Ant Colony Optimization (ACO), completes the hybrid system. In this extended method, feature selection was formulated as a 0-1 knapsack problem. Thus, this thesis developed and evaluated the hybrid IFS_BACS (Binary Ant Colony System) method. Further experiments were carried out using the natural-domain datasets, and comparisons were made between IFS and the hybrid IFS_BACS method. In most cases, IFS and its extended IFS_BACS hybrid significantly reduced the number of features and produced competitive accuracy compared with the full feature set before applying IFS or IFS_BACS. Comparing IFS with its extended version, IFS_BACS appears more promising for selecting an optimal feature subset from large datasets.
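The core idea in the abstract — greedy forward selection that scores each candidate feature by its mutual information with the target minus its redundancy with already-selected features — can be sketched as follows. This is a generic mRMR-style illustration under stated assumptions (discrete features, a plug-in MI estimate, a zero-gain stopping rule), not the thesis's actual IFS criterion or estimator.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in nats for paired discrete samples."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        # p_joint / (p_x * p_y), written to avoid dividing tiny floats
        mi += p_joint * math.log(p_joint * n * n / (px[x] * py[y]))
    return mi

def greedy_forward_select(features, target):
    """features: dict name -> list of discrete values; target: list.
    Greedily add the feature with the best relevance-minus-redundancy
    score; stop when no candidate improves on zero (a simple stopping
    rule, standing in for the thesis's criterion)."""
    selected, remaining = [], set(features)
    while remaining:
        def score(f):
            relevance = mutual_information(features[f], target)
            redundancy = (sum(mutual_information(features[f], features[s])
                              for s in selected) / len(selected)
                          if selected else 0.0)
            return relevance - redundancy
        best = max(remaining, key=score)
        if score(best) <= 1e-12:
            break
        selected.append(best)
        remaining.remove(best)
    return selected
```

On a toy dataset where two features duplicate the target and a third is independent noise, this procedure picks exactly one of the duplicates and then stops: the second duplicate's relevance is cancelled by its redundancy with the first, which is the behavior the redundancy term exists to enforce.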
format Thesis
author Muhammad Aliyu, Sulaiman
author_facet Muhammad Aliyu, Sulaiman
author_sort Muhammad Aliyu, Sulaiman
title Information Theoretic-based Feature Selection for Machine Learning
title_short Information Theoretic-based Feature Selection for Machine Learning
title_full Information Theoretic-based Feature Selection for Machine Learning
title_fullStr Information Theoretic-based Feature Selection for Machine Learning
title_full_unstemmed Information Theoretic-based Feature Selection for Machine Learning
title_sort information theoretic-based feature selection for machine learning
publisher Universiti Malaysia Sarawak (UNIMAS)
publishDate 2018
url http://ir.unimas.my/id/eprint/26595/1/Information%20Theoretic-based%20Feature%2024pgs.pdf
http://ir.unimas.my/id/eprint/26595/4/Information%20Theoretic-based%20Feature%20ft.pdf
http://ir.unimas.my/id/eprint/26595/