Deep Learning-Based Audio-Visual Speech Recognition for Bosnian Digits
This study presents a deep learning-based solution for audio-visual speech recognition of Bosnian digits. The task posed a challenge due to the lack of an appropriate Bosnian-language dataset, and this study outlines the approach to building a new one. The proposed solution comprises two components: visual speech recognition, which involves lip reading, and audio speech recognition. For visual speech recognition, a combined CNN-RNN architecture was utilised, with two CNN variants, GoogLeNet and ResNet-50, compared on performance: ResNet-50 achieved 72% accuracy and GoogLeNet 63%. The RNN component used LSTM. For audio speech recognition, the FFT is applied to obtain spectrograms from the input speech signal, which are then classified using a CNN architecture; this component achieved an accuracy of 100%. The dataset was split into training, validation, and testing sets, with 80%, 10%, and 10% of the data allocated to each, respectively. Furthermore, combining the predictions of the visual and audio models yielded 100% accuracy on the developed dataset. The findings from this study demonstrate that deep learning-based methods show promising results for audio-visual speech recognition of Bosnian digits, despite the challenge of limited Bosnian-language datasets.
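The abstract states that the audio branch applies the FFT to obtain spectrograms that a CNN then classifies. The paper's exact framing parameters are not given; as a minimal sketch, assuming a Hann window, a 256-sample frame, and a 128-sample hop, the spectrogram step might look like:

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Split a 1-D signal into overlapping windowed frames and apply
    the FFT to each frame, returning a log-magnitude spectrogram of
    shape (n_frames, frame_len // 2 + 1)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    mag = np.abs(np.fft.rfft(frames, axis=1))  # one-sided spectrum
    return np.log1p(mag)                       # compress dynamic range

# Example: one second of a 440 Hz tone sampled at 8 kHz
sr = 8000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (n_frames, frequency bins)
```

The resulting 2-D array can be treated as a single-channel image and fed to any image-classification CNN, which matches the pipeline the abstract describes.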
Published in: JURNAL KEJURUTERAAN
Format: Article
Language: English
Published: UKM PRESS, 2024
Online Access: https://www-webofscience-com.uitm.idm.oclc.org/wos/woscc/full-record/WOS:001157147500024
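The abstract reports an 80/10/10 train/validation/test split and says the visual and audio predictions were "combined", without specifying the fusion rule. A common choice is late fusion, i.e. a weighted average of the two models' per-class probabilities; the sketch below assumes equal weights and a hypothetical dataset size of 100 recordings:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- 80/10/10 split over shuffled sample indices (n is hypothetical) ---
n = 100
idx = rng.permutation(n)
train, val, test = np.split(idx, [int(0.8 * n), int(0.9 * n)])

# --- late fusion: average the two models' class probabilities ---
def fuse(p_audio, p_visual, w=0.5):
    """Weighted average of per-class probabilities from the audio and
    visual models; the predicted digit is the argmax of the average."""
    p = w * p_audio + (1 - w) * p_visual
    return int(np.argmax(p))

p_audio = np.array([0.05, 0.9, 0.05])   # audio model: confident in class 1
p_visual = np.array([0.3, 0.4, 0.3])    # visual model: weakly agrees
print(len(train), len(val), len(test), fuse(p_audio, p_visual))
```

With a perfectly accurate audio model, as reported in the abstract, equal-weight fusion preserves that accuracy whenever the visual model's probabilities do not overrule it, which is consistent with the 100% combined result on the developed dataset.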
Authors: Fazlic, Husein; Abd Almisre, Ali; Tahir, Nooritawati Md
ISSN: 0128-0198; 2289-7526
Volume: 36
Issue: 1
DOI: 10.17576/jkukm-2024-36(1)-14
Subjects: Engineering
Open Access: gold
Record ID: WOS:001157147500024
Collection: Web of Science (WoS)