On the use of voice activity detection in speech emotion recognition

Emotion recognition through speech has many potential applications; however, the challenge lies in achieving high recognition accuracy with limited resources or in the presence of interference such as noise. In this paper we explore the possibility of improving speech emotion recognition by utilizing voice activity detection (VAD). The emotional voice data from the Berlin Emotion Database (EMO-DB) and a custom-made database, the LQ Audio Dataset, are first preprocessed by VAD before feature extraction. The extracted features are then passed to a deep neural network for classification, with Mel-frequency cepstral coefficients (MFCC) chosen as the sole feature. Comparing the results obtained with and without VAD, we found that VAD improved the recognition rate of five emotions (happy, angry, sad, fear, and neutral) by 3.7% on clean signals, while applying VAD when training the network on both clean and noisy signals improved our previous results by 50%. © 2019 Institute of Advanced Engineering and Science. All rights reserved.
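The abstract describes a pipeline in which silence is removed by VAD before MFCC features are extracted and classified. The paper does not specify the VAD algorithm used; the following is a minimal illustrative sketch of one common approach, a short-time-energy VAD, assuming numpy and illustrative frame parameters (25 ms frames, 10 ms hop at 16 kHz) — not the authors' implementation.

```python
import numpy as np

def energy_vad(signal, frame_len=400, hop=160, threshold_db=-30.0):
    """Illustrative energy-based voice activity detection.

    Frames whose short-time energy falls below `threshold_db` relative to
    the loudest frame are treated as silence and dropped, so that later
    feature extraction (e.g. MFCC) sees only active speech.
    """
    # Slice the signal into overlapping frames.
    starts = range(0, len(signal) - frame_len + 1, hop)
    frames = np.array([signal[s:s + frame_len] for s in starts])
    # Short-time energy per frame, in dB relative to the loudest frame.
    energy = np.sum(frames ** 2, axis=1)
    energy_db = 10.0 * np.log10(energy / np.max(energy) + 1e-12)
    active = energy_db > threshold_db  # boolean mask of voiced frames
    return frames[active], active

# Demo: 0.5 s silence, 0.5 s tone, 0.5 s silence at 16 kHz.
sr = 16000
t = np.arange(sr // 2) / sr
tone = 0.5 * np.sin(2 * np.pi * 220 * t)
sig = np.concatenate([np.zeros(sr // 2), tone, np.zeros(sr // 2)])
voiced, mask = energy_vad(sig)
print(mask.sum(), len(mask))  # only the tone region is kept
```

Only the frames flagged active would then be passed to MFCC extraction, which is the effect the paper attributes its accuracy gains to.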

Bibliographic Details
Published in: Bulletin of Electrical Engineering and Informatics
Authors: Alghifari M.F.; Gunawan T.S.; Wan Nordin M.A.B.; Qadri S.A.A.; Kartiwi M.; Janin Z.
Format: Article
Language: English
Published: Institute of Advanced Engineering and Science, 2019
Online Access: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85075591444&doi=10.11591%2feei.v8i4.1646&partnerID=40&md5=e0ee362267a15f85350b225970da84e5
Volume: 8
Issue: 4
DOI: 10.11591/eei.v8i4.1646
ISSN: 2089-3191
Access: All Open Access; Gold Open Access