Genetic algorithm based ensemble framework for sentiment analysis

Bibliographic Details
Main Author: Lai, Po Hung
Format: Thesis
Language: English
Published: 2018
Online Access: https://eprints.ums.edu.my/id/eprint/26672/1/Genetic%20algorithm%20based%20ensemble%20framework%20for%20sentiment%20analysis.pdf
https://eprints.ums.edu.my/id/eprint/26672/
Description
Summary: Sentiment analysis is the task of classifying opinion documents as positive or negative. Machine learning classification is commonly used in sentiment analysis, and it requires plain text documents to be transformed into analyzable data through feature extraction and feature selection. Feature extraction produces various representations of plain text documents, whereas feature selection picks the features that are useful and relevant to the classification task. Producing various text representations can enrich the information contained in a dataset and so lead to more accurate results. Selecting relevant features makes classification more efficient, because it reduces the size of the feature set while maintaining its quality. An ensemble classifier is made up of multiple classifiers that work together to produce better results, on the principle that the strengths of some members can compensate for the weaknesses of others. This research extends the ensemble concept to the feature extraction and feature selection steps as well, creating a multilayered ensemble of the three main tasks in machine learning sentiment analysis. Since each layer of the multilayered ensemble involves many methods, a genetic algorithm is added to optimize the overall framework by selecting, in each layer, the combination of methods that produces satisfactory results. The objectives of this work are: to investigate the effects of diversifying the feature selection methods on sentiment analysis results and to estimate suitable thresholds to apply; to investigate the effects of diversifying the feature extraction methods on sentiment analysis results; and to implement and evaluate a multilayer ensemble framework with an optimization algorithm that produces accurate results.
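The genetic algorithm step described above can be pictured with a minimal sketch. It is not the thesis implementation: the method names in `LAYERS`, the bit-string encoding, the tournament selection, the one-point crossover, and `toy_fitness` are all assumptions made for illustration. In practice the fitness function would presumably be the validation accuracy of the ensemble that a chromosome decodes to.

```python
import random

# Hypothetical method pools for the three layers. The names are placeholders,
# not the methods evaluated in the thesis.
LAYERS = {
    "extraction": ["word_unigram", "phrase_bigram", "phrase_trigram"],
    "selection": ["selector_a", "selector_b", "selector_c"],
    "classification": ["classifier_a", "classifier_b", "classifier_c"],
}
# One gene (bit) per method: 1 means the method takes part in the ensemble.
GENES = [(layer, m) for layer, methods in LAYERS.items() for m in methods]


def decode(chromosome):
    """Map a bit list to the methods switched on in each layer."""
    chosen = {layer: [] for layer in LAYERS}
    for bit, (layer, method) in zip(chromosome, GENES):
        if bit:
            chosen[layer].append(method)
    return chosen


def is_valid(chromosome):
    """A usable framework needs at least one active method in every layer."""
    return all(decode(chromosome).values())


def random_chromosome(rng):
    """Draw random bit lists until one is valid."""
    while True:
        candidate = [rng.randint(0, 1) for _ in GENES]
        if is_valid(candidate):
            return candidate


def evolve(fitness, generations=30, pop_size=20, mutation_rate=0.1, seed=0):
    """Return the fittest chromosome found; `fitness` maps a bit list to a number."""
    rng = random.Random(seed)
    population = [random_chromosome(rng) for _ in range(pop_size)]
    best = max(population, key=fitness)

    def pick():
        # Tournament selection of size 2 from the current population.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]  # elitism: carry the best individual over unchanged
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            cut = rng.randrange(1, len(GENES))  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation, applied independently to each gene.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            if is_valid(child):
                children.append(child)
        population = children
        best = max(population, key=fitness)
    return best


def toy_fitness(chromosome):
    """Placeholder objective that prefers small ensembles. NOT a real accuracy."""
    return -sum(chromosome)


best = evolve(toy_fitness)
print(decode(best))
```

Rejecting invalid children keeps every candidate decodable into a framework with at least one method per layer; a repair step would be an equally reasonable design.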
Prior to developing the whole multilayered ensemble framework, two separate experiments were performed to evaluate the different methods of feature extraction and feature selection. Feature extraction methods can be divided into word based and phrase based. Phrase based features generally performed better, but the feature sets they produce are much larger than word based ones, which is where feature selection helps. The feature selection methods involved generally compare how often a feature appears in positive and negative documents, and the methods that compare both the presence and the absence of a feature in documents of different classes produced better results. The final experiment built the complete optimized multilayered ensemble framework and applied it to find a suitable combination of methods in each layer that yields satisfactory sentiment analysis accuracy. The results show that the framework was able to suggest combinations of methods that produced accurate results with reduced feature sets.
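As a rough illustration of the two preliminary experiments, the sketch below extracts word based (n = 1) or phrase based (n > 1) features and ranks them with the chi-square statistic, one well-known score that uses both the presence and the absence of a feature in each class. The abstract does not say which selection methods the thesis used, so chi-square is an assumed stand-in, and the four example documents are invented for the demonstration.

```python
from collections import Counter


def ngrams(text, n):
    """Word based features for n=1, phrase based features for n>1."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def chi_square_scores(docs, labels, n=1):
    """Score each feature from its 2x2 presence/absence table over the classes."""
    doc_features = [set(ngrams(doc, n)) for doc in docs]
    n_pos = sum(1 for y in labels if y == "pos")
    n_neg = len(labels) - n_pos
    pos_df, neg_df = Counter(), Counter()  # document frequency per class
    for feats, y in zip(doc_features, labels):
        (pos_df if y == "pos" else neg_df).update(feats)

    total = len(docs)
    scores = {}
    for f in set(pos_df) | set(neg_df):
        a, b = pos_df[f], neg_df[f]      # present in positive / negative docs
        c, d = n_pos - a, n_neg - b      # absent from positive / negative docs
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[f] = total * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top(scores, k):
    """Keep the k highest scoring features (ties broken alphabetically)."""
    return sorted(scores, key=lambda f: (-scores[f], f))[:k]


# Invented toy corpus, for demonstration only.
docs = ["good movie", "really good plot", "bad movie", "really bad plot"]
labels = ["pos", "pos", "neg", "neg"]

word_scores = chi_square_scores(docs, labels, n=1)
print(select_top(word_scores, 2))
print(sorted(chi_square_scores(docs, labels, n=2)))
```

Features that occur equally often in both classes, such as "movie" in the toy corpus, score zero and are dropped first, which is how selection can shrink a large phrase based feature set.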