A noun-based feature location approach supported by time aware term-weighting technique for facilitating software maintenance / Sima Zamani
Main Author: | Sima Zamani |
---|---|
Format: | Thesis |
Published: | 2016 |
Subjects: | |
Online Access: | http://studentsrepo.um.edu.my/10764/2/Sima.pdf http://studentsrepo.um.edu.my/10764/1/Sima_Zamani_%E2%80%93_Thesis.pdf http://studentsrepo.um.edu.my/10764/ |
Summary: | Feature location is a frequent software maintenance activity that aims to identify the source code location pertinent to a software feature. Most proposed feature location approaches rely, at least in part, on text analysis to determine the similarity between a new feature and the source code data. However, the text analysis methods used in feature location originate from the natural language context. Unlike the typical context in which these methods are applied, text documents in software repositories, such as source code files, have a corresponding set of metadata, including timestamps, developer identifiers, and commit comments. Furthermore, the repositories record the change history of the source code, which leads to a larger dataset size. Because of these differences between software repositories and natural language, text analysis does not reach its full potential for accurately locating software features. Accordingly, the goal of this thesis is to improve feature location by addressing the specific characteristics of the repositories’ text data, i.e. the accompanying metadata and the larger dataset size, within the text analysis process. In this thesis, a new feature location approach is proposed that considers the time and developer metadata and uses only nouns. The proposed approach analyzes and weights the data according to when the data was recorded and which developer recorded it in the repository. First, a time- and developer-based corpus is created from the nouns extracted from the repository’s data. Then, the nouns are weighted using two term-weighting techniques: a time-aware term-weighting technique and a developers-based time-aware term-weighting technique. Next, the weights calculated for each noun are combined to obtain the noun’s total weight. Finally, the source code files are ranked by the sum of the total weights of the nouns that appear in both the given software feature and the source code files. The empirical evaluation of the proposed approach on a set of open-source projects indicates remarkable improvements over the baseline feature location approaches that use VSM (Vector Space Model) and SUM (Smoothed Unigram Model). The proposed approach improves on the accuracy, effectiveness and performance of the baseline approaches by as much as 62%, 43% and 30%, respectively. Within this approach, the time-based analysis and weighting of the data improve on the baseline approaches by up to 38%, 35% and 19%, respectively, whereas the developer-based analysis and weighting improve on them by up to 55%, 39% and 29%, respectively. Furthermore, using only nouns instead of all types of terms improves accuracy, effectiveness and performance by as much as 26%, 49% and 23%, respectively, and reduces the dataset size by up to 60%. The statistical analysis of the experimental results demonstrates that the improvement is significant in all aspects. In general, considering time and developer metadata when analyzing and weighting the data, together with using only nouns, significantly improves feature location. |
---|
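
The ranking step described in the summary (score each source file by summing the weights of the nouns it shares with the feature description) can be illustrated with a small sketch. The summary does not give the thesis's weighting formulas, so everything specific here is an assumption made for illustration: the exponential half-life decay standing in for the time-aware weight, the `half_life_days` parameter, the function names, and the toy records. The developers-based weight is omitted, and nouns are assumed to have been extracted already by a part-of-speech tagger.

```python
from collections import defaultdict


def time_aware_weights(records, now, half_life_days=180.0):
    """Accumulate a recency-decayed weight per (file, noun).

    records: iterable of (noun, file_path, day) tuples, where `day` is the
    day number on which the noun was recorded in the repository.
    The half-life decay is an illustrative assumption, not the thesis formula.
    """
    weights = defaultdict(lambda: defaultdict(float))
    for noun, path, day in records:
        age = max(0.0, now - day)
        # Recent occurrences contribute close to 1, old ones close to 0.
        weights[path][noun] += 0.5 ** (age / half_life_days)
    return weights


def rank_files(feature_nouns, weights):
    """Rank files by the summed weights of nouns shared with the feature."""
    scores = {
        path: sum(w for noun, w in noun_weights.items() if noun in feature_nouns)
        for path, noun_weights in weights.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: (noun, file, day recorded).
records = [
    ("parser", "src/parser.py", 990),
    ("token", "src/parser.py", 400),
    ("parser", "src/cli.py", 100),
    ("option", "src/cli.py", 980),
]
ranking = rank_files({"parser", "token"}, time_aware_weights(records, now=1000))
for path, score in ranking:
    print(f"{path}: {score:.3f}")
```

In this toy data, `src/parser.py` ranks first because its matching nouns were recorded recently, while the only matching noun in `src/cli.py` is 900 days old and contributes little. A fuller version would combine this weight with a developer-based weight before summing, as the summary describes.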