Filter-Based Gene Selection Method for Tissues Classification on Large Scale Gene Expression Data

DNA microarray technology is a current innovative tool that has offers a new perspective to look sight into cellular systems and measure a large scale of gene expressions at once. Regardless the novel invention of DNA microarray, most of its results relies on the computational intelligence power, wh...

Full description

Saved in:
Bibliographic Details
Main Authors: Kabir Ahmad, Farzana, Yusof, Yuhanis, Yusoff, Nooraini
Format: Article
Published: Science Publishing Corporation Inc 2018
Subjects:
Online Access:http://repo.uum.edu.my/25273/
http://doi.org/10.14419/ijet.v7i2.15.11216
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:DNA microarray technology is a current innovative tool that has offers a new perspective to look sight into cellular systems and measure a large scale of gene expressions at once. Regardless the novel invention of DNA microarray, most of its results relies on the computational intelligence power, which is used to interpret the large number of data. At present, interpreting large scale of gene expression data remain a thought-provoking issue due to their innate nature of “high dimensional low sample size”. Microarray data mainly involved thousands of genes, n in a very small size sample, p. In addition, this data are often overwhelmed, over fitting and confused by the complexity of data analysis. Due to the nature of this microarray data, it is also common that a large number of genes may not be informative for classification purposes. For such a reason, many studies have used feature selection methods to select significant genes that present the maximum discriminative power between cancerous and normal tissues. In this study, we aim to investigate and compare the effectiveness of these four popular filter gene selection methods namely Signal-to-Noise ratio (SNR), Fisher Criterion (FC), Information Gain (IG) and t-Test in selecting informative genes that can distinguish cancer and normal tissues. Two common classifiers, Support Vector Machine (SVM) and Decision Tree (C4.5) are used to train the selected genes. These gene selection methods are tested on three large scales of gene expression datasets, namely breast cancer dataset, colon dataset, and lung dataset. This study has discovered that IG and SNR are more suitable to be used with SVM while IG fit for C4.5. In a colon dataset, SVM has achieved a specificity of 86% with SNR while and 80% for IG. In contract, C4.5 has obtained a specificity of 78% for IG on the identical dataset. These results indicate that SVM performed slightly better with IG pre-processed data compare to C4.5 on the same dataset.