Towards efficient recurrent architectures: a deep LSTM neural network applied to speech enhancement and recognition

Long short-term memory (LSTM) has proven effective in modeling sequential data, but it can struggle to capture long-term temporal dependencies accurately. LSTM plays a central role in speech enhancement by modeling and capturing the temporal dependencies in speech signals....


Bibliographic Details
Main Authors: Wang, Jing; Saleem, Nasir; Gunawan, Teddy Surya
Format: Article
Language: English
Published: Springer Nature 2024
Subjects: TK7885 Computer engineering
Online Access:http://irep.iium.edu.my/112153/1/112153_Towards%20efficient%20recurrent%20architectures.pdf
http://irep.iium.edu.my/112153/2/112153_Towards%20efficient%20recurrent%20architectures_SCOPUS.pdf
http://irep.iium.edu.my/112153/3/112153_Towards%20efficient%20recurrent%20architectures_WOS.pdf
http://irep.iium.edu.my/112153/
https://link.springer.com/article/10.1007/s12559-024-10288-y
id my.iium.irep.112153
record_format dspace
spelling my.iium.irep.112153 2024-06-20T06:43:32Z http://irep.iium.edu.my/112153/ Towards efficient recurrent architectures: a deep LSTM neural network applied to speech enhancement and recognition Wang, Jing Saleem, Nasir Gunawan, Teddy Surya TK7885 Computer engineering Long short-term memory (LSTM) has proven effective in modeling sequential data, but it can struggle to capture long-term temporal dependencies accurately. LSTM plays a central role in speech enhancement by modeling and capturing the temporal dependencies in speech signals. This paper introduces a variable-neurons-based LSTM that captures long-term temporal dependencies by reducing the neuron representation across layers with no loss of data. A skip connection between non-adjacent layers is added to prevent vanishing gradients, and an attention mechanism in these connections highlights important features and spectral components. The proposed LSTM is inherently causal, making it well suited to real-time processing without relying on future information. Training uses combined acoustic feature sets for improved performance, and the models estimate two time–frequency masks: the ideal ratio mask (IRM) and the ideal binary mask (IBM). Comprehensive evaluation with the perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) metrics shows that the proposed architecture improves speech intelligibility and perceptual quality. Composite measures of residual noise distortion (Cbak) and speech distortion (Csig) further substantiate this performance. The proposed model achieves a 16.21% improvement in STOI and a 0.69 improvement in PESQ over noisy mixtures on the TIMIT database, with corresponding improvements of 16.41% and 0.71 on LibriSpeech. The proposed architecture outperforms deep neural networks (DNNs) under both stationary and nonstationary background noise. An automatic speech recognition (ASR) system is trained on the enhanced speech using the Kaldi toolkit to evaluate word error rate (WER); with the proposed LSTM as the front end, WER is notably reduced, reaching 15.13% across the different noisy backgrounds. Springer Nature 2024-05 Article PeerReviewed application/pdf en http://irep.iium.edu.my/112153/1/112153_Towards%20efficient%20recurrent%20architectures.pdf application/pdf en http://irep.iium.edu.my/112153/2/112153_Towards%20efficient%20recurrent%20architectures_SCOPUS.pdf application/pdf en http://irep.iium.edu.my/112153/3/112153_Towards%20efficient%20recurrent%20architectures_WOS.pdf Wang, Jing and Saleem, Nasir and Gunawan, Teddy Surya (2024) Towards efficient recurrent architectures: a deep LSTM neural network applied to speech enhancement and recognition. Cognitive Computation, 16 (3). pp. 1221-1236. ISSN 1866-9956 E-ISSN 1866-9964 https://link.springer.com/article/10.1007/s12559-024-10288-y 10.1007/s12559-024-10288-y
institution Universiti Islam Antarabangsa Malaysia
building IIUM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider International Islamic University Malaysia
content_source IIUM Repository (IREP)
url_provider http://irep.iium.edu.my/
language English
topic TK7885 Computer engineering
spellingShingle TK7885 Computer engineering
Wang, Jing
Saleem, Nasir
Gunawan, Teddy Surya
Towards efficient recurrent architectures: a deep LSTM neural network applied to speech enhancement and recognition
description Long short-term memory (LSTM) has proven effective in modeling sequential data, but it can struggle to capture long-term temporal dependencies accurately. LSTM plays a central role in speech enhancement by modeling and capturing the temporal dependencies in speech signals. This paper introduces a variable-neurons-based LSTM that captures long-term temporal dependencies by reducing the neuron representation across layers with no loss of data. A skip connection between non-adjacent layers is added to prevent vanishing gradients, and an attention mechanism in these connections highlights important features and spectral components. The proposed LSTM is inherently causal, making it well suited to real-time processing without relying on future information. Training uses combined acoustic feature sets for improved performance, and the models estimate two time–frequency masks: the ideal ratio mask (IRM) and the ideal binary mask (IBM). Comprehensive evaluation with the perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) metrics shows that the proposed architecture improves speech intelligibility and perceptual quality. Composite measures of residual noise distortion (Cbak) and speech distortion (Csig) further substantiate this performance. The proposed model achieves a 16.21% improvement in STOI and a 0.69 improvement in PESQ over noisy mixtures on the TIMIT database, with corresponding improvements of 16.41% and 0.71 on LibriSpeech. The proposed architecture outperforms deep neural networks (DNNs) under both stationary and nonstationary background noise. An automatic speech recognition (ASR) system is trained on the enhanced speech using the Kaldi toolkit to evaluate word error rate (WER); with the proposed LSTM as the front end, WER is notably reduced, reaching 15.13% across the different noisy backgrounds.
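For readers who want a concrete picture of the architecture described above, the following is a minimal sketch (not the authors' released code) of the ideas in the abstract: a stack of unidirectional, causal LSTM layers whose hidden size shrinks layer by layer ("variable neurons"), a skip connection between non-adjacent layers, and a simple sigmoid gate acting as attention on that skip path, feeding a mask head with one output per frequency bin. The layer widths, the form of the attention gate, and the mask head are assumptions made for illustration only.

# Hypothetical sketch under the assumptions stated above; not the paper's exact model.
import torch
import torch.nn as nn

class VariableNeuronLSTM(nn.Module):
    def __init__(self, n_features=161, hidden_sizes=(1024, 512, 256)):
        super().__init__()
        sizes = [n_features, *hidden_sizes]
        # Three stacked unidirectional (causal) LSTM layers with decreasing width.
        self.layers = nn.ModuleList(
            [nn.LSTM(sizes[i], sizes[i + 1], batch_first=True) for i in range(3)]
        )
        # Skip path from the 1st layer's output to the 3rd layer's input (non-adjacent):
        # a linear projection matches dimensions and a sigmoid gate serves as a simple
        # attention weight on the skipped features.
        self.skip_proj = nn.Linear(hidden_sizes[0], hidden_sizes[1])
        self.skip_gate = nn.Linear(hidden_sizes[0], hidden_sizes[1])
        # Mask head: one value in [0, 1] per frequency bin (an IRM-style target).
        self.mask_head = nn.Linear(hidden_sizes[2], n_features)

    def forward(self, x):                        # x: (batch, frames, n_features)
        h1, _ = self.layers[0](x)
        h2, _ = self.layers[1](h1)
        skip = torch.sigmoid(self.skip_gate(h1)) * self.skip_proj(h1)
        h3, _ = self.layers[2](h2 + skip)        # attention-weighted skip into layer 3
        return torch.sigmoid(self.mask_head(h3)) # estimated time-frequency mask

# Example: a batch of 2 utterances, 100 frames, 161 spectral features per frame.
mask = VariableNeuronLSTM()(torch.randn(2, 100, 161))  # -> shape (2, 100, 161)

Because every layer is a unidirectional LSTM and the skip path only reuses past hidden states, the sketch processes each frame without access to future frames, matching the causality claim in the abstract.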
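The two training targets named in the abstract, the IRM and the IBM, have widely used textbook definitions; the paper's exact parameterization (for example the IRM exponent or the IBM threshold) may differ, but a common form is:

% Standard definitions assumed for illustration; the paper may use different parameters.
\[
\mathrm{IRM}(t,f) = \left(\frac{S^2(t,f)}{S^2(t,f) + N^2(t,f)}\right)^{\beta},
\qquad
\mathrm{IBM}(t,f) =
\begin{cases}
1, & \text{if } \mathrm{SNR}(t,f) > \theta,\\
0, & \text{otherwise,}
\end{cases}
\]

where S(t,f) and N(t,f) are the clean-speech and noise magnitudes in time-frequency bin (t,f), the exponent \beta is commonly 0.5, and \theta is a local SNR threshold (often 0 dB or -5 dB).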
format Article
author Wang, Jing
Saleem, Nasir
Gunawan, Teddy Surya
author_facet Wang, Jing
Saleem, Nasir
Gunawan, Teddy Surya
author_sort Wang, Jing
title Towards efficient recurrent architectures: a deep LSTM neural network applied to speech enhancement and recognition
title_short Towards efficient recurrent architectures: a deep LSTM neural network applied to speech enhancement and recognition
title_full Towards efficient recurrent architectures: a deep LSTM neural network applied to speech enhancement and recognition
title_fullStr Towards efficient recurrent architectures: a deep LSTM neural network applied to speech enhancement and recognition
title_full_unstemmed Towards efficient recurrent architectures: a deep LSTM neural network applied to speech enhancement and recognition
title_sort towards efficient recurrent architectures: a deep lstm neural network applied to speech enhancement and recognition
publisher Springer Nature
publishDate 2024
url http://irep.iium.edu.my/112153/1/112153_Towards%20efficient%20recurrent%20architectures.pdf
http://irep.iium.edu.my/112153/2/112153_Towards%20efficient%20recurrent%20architectures_SCOPUS.pdf
http://irep.iium.edu.my/112153/3/112153_Towards%20efficient%20recurrent%20architectures_WOS.pdf
http://irep.iium.edu.my/112153/
https://link.springer.com/article/10.1007/s12559-024-10288-y