Human emotion recognition from videos using spatio-temporal and audio features
In this paper, we present a human emotion recognition system based on audio and spatio-temporal visual features. The proposed system has been tested on an audio-visual emotion data set with different subjects of both genders. Mel-frequency cepstral coefficient (MFCC) and prosodic features are first identified and then extracted from the emotional speech. For the facial expressions, spatio-temporal features are extracted from the visual streams. Principal component analysis (PCA) is applied for dimensionality reduction of the visual features, capturing 97% of the variance. A codebook is constructed for both the audio and visual features using Euclidean distance, and the codeword occurrence histograms are used as input to an SVM classifier per modality to obtain each classifier's judgment. The judgments from the classifiers are then combined using the Bayes sum rule (BSR) as a final decision step. The proposed system is tested on a public data set. Experimental results show that using visual features alone yields an average accuracy of 74.15%, while using audio features alone gives an average accuracy of 67.39%; combining both audio and visual features improves the overall accuracy to 80.27%.
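The abstract outlines a bag-of-features pipeline: PCA-reduced visual descriptors and MFCC/prosodic audio descriptors are quantised against a codebook, the occurrence histograms feed per-modality SVMs, and the SVM posteriors are fused with the Bayes sum rule. The sketch below is a minimal illustration of that general recipe using scikit-learn, not the authors' implementation; the codebook size, RBF kernel, histogram normalisation, equal fusion weights, and helper names (`reduce_visual`, `train_modality`, `bayes_sum_rule`) are all illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.svm import SVC


def reduce_visual(descriptor_sets, variance=0.97):
    """Fit PCA on the pooled visual descriptors, keeping 97% of the variance,
    then map each clip's descriptors into the reduced space."""
    pca = PCA(n_components=variance).fit(np.vstack(descriptor_sets))
    return [pca.transform(d) for d in descriptor_sets]


def build_histograms(descriptor_sets, codebook):
    """Quantise each clip's local descriptors against the codebook and return
    one normalised occurrence histogram per clip (bag-of-features)."""
    hists = []
    for descriptors in descriptor_sets:
        words = codebook.predict(descriptors)  # nearest codeword in Euclidean space
        hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
        hists.append(hist / max(hist.sum(), 1.0))
    return np.vstack(hists)


def train_modality(descriptor_sets, labels, n_words=64):
    """Codebook + occurrence histograms + SVM for one modality (audio or visual)."""
    codebook = KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(
        np.vstack(descriptor_sets))
    clf = SVC(kernel="rbf", probability=True).fit(
        build_histograms(descriptor_sets, codebook), labels)
    return codebook, clf


def bayes_sum_rule(prob_audio, prob_visual):
    """Fuse the per-classifier posteriors with an equal-weight sum rule and
    pick the emotion class with the largest summed posterior."""
    return np.argmax(prob_audio + prob_visual, axis=1)
```

At test time, each clip's audio and visual histograms would go through `predict_proba` of their respective SVMs, and `bayes_sum_rule` would return the fused emotion label.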
Saved in:
Main Authors: | Rashid, M.; Abu-Bakar, S. A. R.; Mokji, M. |
---|---|
Format: | Conference or Workshop Item |
Published: | Visual Computer, 2013 |
Subjects: | TK Electrical engineering. Electronics. Nuclear engineering |
Online Access: | http://eprints.utm.my/id/eprint/51104/ http://dx.doi.org/10.1007/s00371-012-0768-y |
id | my.utm.51104 |
---|---|
record_format | eprints |
institution | Universiti Teknologi Malaysia |
building | UTM Library |
collection | Institutional Repository |
continent | Asia |
country | Malaysia |
content_provider | Universiti Teknologi Malaysia |
content_source | UTM Institutional Repository |
url_provider | http://eprints.utm.my/ |
topic | TK Electrical engineering. Electronics. Nuclear engineering |