Supervised optimal decision machine learning approach to class- and method-level data preprocessing towards effective software defect prediction / Ebubeogu Amarachukwu Felix

Software defect prediction provides actionable outputs to software teams while contributing to industrial success. Therefore, predicting the number of defects in a new version of software at both the class and method levels is an important goal of defect prediction studies to assist software team...

Bibliographic Details
Main Author: Ebubeogu Amarachukwu, Felix
Format: Thesis
Published: 2020
Subjects: QA76 Computer software; TA Engineering (General). Civil engineering (General)
Online Access: http://studentsrepo.um.edu.my/14571/2/Ebubegogu.pdf
http://studentsrepo.um.edu.my/14571/1/Ebubeogu.pdf
http://studentsrepo.um.edu.my/14571/
id my.um.stud.14571
record_format eprints
spelling my.um.stud.14571 2023-07-04T23:29:19Z
2020-05 Thesis NonPeerReviewed application/pdf http://studentsrepo.um.edu.my/14571/2/Ebubegogu.pdf application/pdf http://studentsrepo.um.edu.my/14571/1/Ebubeogu.pdf
Ebubeogu Amarachukwu, Felix (2020) Supervised optimal decision machine learning approach to class- and method-level data preprocessing towards effective software defect prediction / Ebubeogu Amarachukwu Felix. PhD thesis, Universiti Malaya. http://studentsrepo.um.edu.my/14571/
institution Universiti Malaya
building UM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Malaya
content_source UM Student Repository
url_provider http://studentsrepo.um.edu.my/
topic QA76 Computer software
TA Engineering (General). Civil engineering (General)
description Software defect prediction provides actionable outputs to software teams and contributes to industrial success. Predicting the number of defects in a new software version at both the class and method levels is therefore an important goal of defect prediction studies, as it helps software teams optimize their test efforts and improve software quality. However, despite remarkable achievements in defect prediction, the quality of the data used in these studies remains a major concern, and data quality issues have led to numerous contradictory findings in machine learning research. In addition, a demonstrated approach for predicting the number of defects in a new software version is lacking. Efforts are therefore required to demonstrate how class- and method-level defect prediction can be achieved for a new software version and to develop an approach for preprocessing the highly imbalanced class- and method-level data available for software defect prediction. To address these issues, a data preprocessing framework is first proposed to overcome some of the challenges associated with typical software datasets, such as irrelevant and redundant features. The framework is developed by following a machine-learning-driven, supervised optimal decision procedure, whose prime advantage is bias-free method- and class-level datasets. Second, a method of predicting the number of software defects in an upcoming product release is proposed, using predictor variables derived from the defect acceleration observed in the existing software defects, namely, the defect density, defect velocity and defect introduction time. The number of defects in the current version of a software product is characterized by this defect acceleration; hence, these derived predictor variables can be used to construct regression models that predict the number of software defects in a new version.
An experiment is reported on 69 open-source ELFF Java projects, containing 131,034 classes and 289,132 methods, and on the NASA datasets, which contain 10 different Java and C++ projects with 22,838 classes. To evaluate the effectiveness of the proposed data preprocessing framework, the average classification performance of six selected state-of-the-art classifiers before and after data preprocessing is compared across multiple projects with data imbalances between the defective and defect-free classes. At both the class and method levels, these classifiers, namely, naïve Bayes, logistic regression, neural network, K-nearest neighbors, support vector machine and random forest, achieve noteworthy performance when applied to the preprocessed datasets. Moreover, for the ELFF projects, the results at the class and method levels respectively show correlation coefficients of 61% and 60% for the defect density, -11% and -4% for the defect introduction time, and 94% and 93% for the defect velocity (consistent results are also obtained for the NASA datasets, as presented in the results section). The proposed approach can serve as a blueprint for program testing to enhance the effectiveness of software development activities.
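The description above mentions correlating each derived predictor with the number of defects and building regression models from those predictors. The following is a minimal sketch of that general idea, not the thesis's actual procedure. It assumes the correlation coefficient is Pearson's and that the regression is a simple one-variable least-squares fit; neither is stated in the abstract. All numeric values are hypothetical toy data and are not results from the thesis, and the function and variable names are illustrative only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data: one value per project. NOT taken from the thesis.
defect_count = [3.0, 5.0, 7.0, 9.0, 11.0]
predictors = {
    "defect_density": [1.0, 1.0, 2.0, 2.0, 3.0],
    "defect_velocity": [1.0, 2.0, 3.0, 4.0, 5.0],
    "defect_introduction_time": [4.0, 6.0, 5.0, 3.0, 4.0],
}

# Correlation of each derived predictor with the observed defect count.
correlations = {name: pearson(values, defect_count)
                for name, values in predictors.items()}
for name, r in correlations.items():
    print(f"{name}: r = {r:.2f}")

# Fit a one-variable regression on the most strongly correlated predictor
# and use it to estimate the defect count for a hypothetical new version.
best = max(correlations, key=lambda name: abs(correlations[name]))
slope, intercept = fit_line(predictors[best], defect_count)
new_version_value = 6.0  # hypothetical predictor value for the next release
predicted_defects = slope * new_version_value + intercept
print(f"best predictor: {best}, predicted defects: {predicted_defects:.1f}")
```

Selecting the single most correlated predictor keeps the sketch short; a multi-variable regression over all three predictors would be an equally plausible reading of the abstract.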
format Thesis
author Ebubeogu Amarachukwu, Felix
title Supervised optimal decision machine learning approach to class- and method-level data preprocessing towards effective software defect prediction / Ebubeogu Amarachukwu Felix
publishDate 2020
url http://studentsrepo.um.edu.my/14571/2/Ebubegogu.pdf
http://studentsrepo.um.edu.my/14571/1/Ebubeogu.pdf
http://studentsrepo.um.edu.my/14571/