The Impact of Missing Value Methods and Normalization Techniques on the Performance of Data Mining Models
In practice, large datasets contain various types of anomalous records that significantly complicate the analysis problem. In particular, the prevalence of outliers and of missing or incomplete data can completely invalidate the results obtained with standard analysis procedures, often with no indicat...
Main Author: | Munirah, Yahya |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2011 |
Subjects: | QA76 Computer software |
Online Access: | http://etd.uum.edu.my/2499/1/Munirah_Yahya.pdf http://etd.uum.edu.my/2499/2/1.Munirah_Yahya.pdf http://etd.uum.edu.my/2499/ |
id |
my.uum.etd.2499 |
---|---|
record_format |
eprints |
spelling |
my.uum.etd.2499 2016-04-27T07:06:30Z http://etd.uum.edu.my/2499/ The Impact of Missing Value Methods and Normalization Techniques on the Performance of Data Mining Models Munirah, Yahya QA76 Computer software 2011 Thesis NonPeerReviewed application/pdf en http://etd.uum.edu.my/2499/1/Munirah_Yahya.pdf application/pdf en http://etd.uum.edu.my/2499/2/1.Munirah_Yahya.pdf Munirah, Yahya (2011) The Impact of Missing Value Methods and Normalization Techniques on the Performance of Data Mining Models. Masters thesis, Universiti Utara Malaysia. |
institution |
Universiti Utara Malaysia |
building |
UUM Library |
collection |
Institutional Repository |
continent |
Asia |
country |
Malaysia |
content_provider |
Universiti Utara Malaysia |
content_source |
UUM Electronic Theses |
url_provider |
http://etd.uum.edu.my/ |
language |
English |
topic |
QA76 Computer software |
description |
In practice, large datasets contain various types of anomalous records that significantly complicate analysis. In particular, outliers and missing or incomplete data can completely invalidate the results obtained with standard analysis procedures, often with no indication that anything is wrong. High-quality decision making relies on high-quality data, so data preprocessing has become an essential foundation of data mining (DM): without quality data, there are no quality mining results. Data preprocessing consists of interactive steps such as data cleaning, data transformation, data reduction and data discretization. Data mining models have been used extensively in research and data analysis work because they are able to spot subtle relationships and associations. Logistic regression is an important statistical method for modeling and predicting categorical data. Another technique that can be used in data mining tasks is the neural network (NN), which has been successfully applied in a wide range of supervised and unsupervised learning applications. This study explored the use of data preprocessing techniques, namely two missing value treatments: Mean of Attributes and Mean of Target. The experimental results indicate that, for the Logistic Regression models, higher average accuracy was obtained on data whose missing values were treated with Mean of Attributes. For the NN models, however, neither missing value treatment affected performance. Prior to NN training, the data need to be transformed into a form that is acceptable as input to a Multi-Layer Perceptron (MLP) network. Hence, several normalization techniques were explored to compare which is suitable for each of the three datasets. The normalization techniques used in the experimental setup are Min-Max normalization, Z-Score normalization and Sigmoidal normalization. For the Wisconsin Breast Cancer data, Min-Max is preferable. However, for the Pima Indians Diabetes and Thyroid Disease datasets, Sigmoidal normalization is preferable to the other methods. Hence, the experimental results indicate that the performance of DM models depends not only on the missing value and normalization techniques but also on the amount of missing values in the whole dataset. |
format |
Thesis |
author |
Munirah, Yahya |
title |
The Impact of Missing Value Methods and Normalization Techniques on the Performance of Data Mining Models |
publishDate |
2011 |
url |
http://etd.uum.edu.my/2499/1/Munirah_Yahya.pdf http://etd.uum.edu.my/2499/2/1.Munirah_Yahya.pdf http://etd.uum.edu.my/2499/ |
_version_ |
1644276708840308736 |
score |
13.160551 |
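The preprocessing techniques named in the description can be illustrated with a minimal Python sketch. This is not the thesis author's code, and the record does not give the formulas used. Several points are assumptions: "Mean of Target" is read here as the mean of the attribute among records sharing the same target class; the Sigmoidal normalization uses one common textbook form, (1 - e^(-z)) / (1 + e^(-z)) applied to the z-score; the population standard deviation is used; and all function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def impute_attribute_mean(values):
    """Replace None with the mean of the observed values of the attribute."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def impute_class_mean(values, labels):
    """Replace None with the mean of observed values in the same class.

    Assumed reading of "Mean of Target"; falls back to the overall
    attribute mean when a class has no observed value.
    """
    overall = mean(v for v in values if v is not None)
    by_label = {}
    for v, y in zip(values, labels):
        if v is not None:
            by_label.setdefault(y, []).append(v)
    fills = {y: mean(vs) for y, vs in by_label.items()}
    return [fills.get(y, overall) if v is None else v
            for v, y in zip(values, labels)]


def min_max(values, lo=0.0, hi=1.0):
    """Rescale values linearly into the interval [lo, hi]."""
    vmin, vmax = min(values), max(values)
    span = vmax - vmin
    if span == 0:
        return [lo for _ in values]  # constant attribute
    return [lo + (v - vmin) * (hi - lo) / span for v in values]


def z_score(values):
    """Center on the mean and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant attribute
    return [(v - mu) / sigma for v in values]


def sigmoidal(values):
    """Squash z-scores into (-1, 1).

    (1 - e^-z) / (1 + e^-z) is algebraically equal to tanh(z / 2).
    This specific form is an assumption, not taken from the thesis.
    """
    return [math.tanh(z / 2.0) for z in z_score(values)]


# Toy attribute with one missing value and a binary target (made-up data).
raw = [1.0, None, 3.0, 5.0]
target = [0, 0, 1, 1]

by_attribute = impute_attribute_mean(raw)    # [1.0, 3.0, 3.0, 5.0]
by_class = impute_class_mean(raw, target)    # [1.0, 1.0, 3.0, 5.0]
scaled = min_max(by_attribute)               # [0.0, 0.5, 0.5, 1.0]
squashed = sigmoidal(by_attribute)
```

On this toy attribute the two imputation methods fill the gap differently (3.0 versus 1.0). Unlike Min-Max, the Sigmoidal form is bounded regardless of outliers, which is one commonly cited reason for choosing it.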