Deep Learning-Based Audio-Visual Speech Recognition for Bosnian Digits

Bibliographic Details
Published in: JURNAL KEJURUTERAAN
Main Authors: Fazlic, Husein; Abd Almisre, Ali; Tahir, Nooritawati Md
Format: Article
Language: English
Published: UKM PRESS 2024
Subjects: Engineering
Online Access: https://www-webofscience-com.uitm.idm.oclc.org/wos/woscc/full-record/WOS:001157147500024
This study presents a deep learning-based solution for audio-visual speech recognition of Bosnian digits. The task posed a challenge due to the lack of an appropriate Bosnian-language dataset, and this study outlines the approach to building a new one. The proposed solution includes two components: visual speech recognition, which involves lip reading, and audio speech recognition. For visual speech recognition, a combined CNN-RNN architecture was utilised, comparing two CNN variants, GoogLeNet and ResNet-50; ResNet-50 achieved 72% accuracy and GoogLeNet 63%. The RNN component used LSTM. For audio speech recognition, the FFT is applied to obtain spectrograms from the input speech signal, which are then classified using a CNN architecture; this component achieved 100% accuracy. The dataset was split into training, validation, and testing subsets containing 80%, 10%, and 10% of the data, respectively. Furthermore, combining the predictions from the visual and audio models yielded 100% accuracy on the developed dataset. These findings demonstrate that deep learning-based methods show promising results for audio-visual speech recognition of Bosnian digits, despite the challenge of limited Bosnian-language datasets.
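The audio front end described in the abstract (FFT applied to the speech signal to obtain a spectrogram, which a CNN then classifies) can be illustrated with a minimal sketch. The frame length, hop size, and Hann windowing below are illustrative assumptions, not parameters reported by the paper:

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Compute a magnitude spectrogram by applying the FFT to
    overlapping, Hann-windowed frames of the input signal."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequency bins of the real input.
    return np.abs(np.fft.rfft(frames, axis=1))

# Example: one second of a 440 Hz tone sampled at 8 kHz produces a
# spectrogram whose energy concentrates around one frequency bin.
fs = 8000
t = np.arange(fs) / fs
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (n_frames, frame_len // 2 + 1)
```

In a setup like the paper's, each such 2-D magnitude array would be fed to a CNN as an image-like input for digit classification.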
Publisher: UKM PRESS
ISSN: 0128-0198, 2289-7526
Publish Date: 2024
Volume: 36
Issue: 1
DOI: 10.17576/jkukm-2024-36(1)-14
Topic: Engineering
Access Type: gold
ID: WOS:001157147500024
URL: https://www-webofscience-com.uitm.idm.oclc.org/wos/woscc/full-record/WOS:001157147500024
Collection: Web of Science (WoS)