Enhancing Bangla Local Speech-to-Text Conversion Using Fine-Tuning Wav2vec 2.0 with OpenSLR and Self-Compiled Datasets Through Transfer Learning

Hossain, SK Muktadir; Rihan, Md Rahat; Imtiaz, Ahmed; Boni, SK Muktadir; Gomes, Dipta

Please use this identifier to cite or link to this item: http://dspace.aiub.edu:8080/jspui/handle/123456789/2662

Title:	Enhancing Bangla Local Speech-to-Text Conversion Using Fine-Tuning Wav2vec 2.0 with OpenSLR and Self-Compiled Datasets Through Transfer Learning
Authors:	Hossain, SK Muktadir; Rihan, Md Rahat; Imtiaz, Ahmed; Boni, SK Muktadir; Gomes, Dipta
Keywords:	Bangla Speech Recognition; Automatic Speech Recognition (ASR); Speech Technology; wav2vec 2.0; Transfer Learning
Issue Date:	15-Mar-2025
Publisher:	IEOM Society International
Citation:	Hossain, S., Rihan, R., Imtiaz, A., Boni, P., & Gomes, D. (2024, December). Enhancing Bangla Local Speech-to-Text Conversion Using Fine-Tuning Wav2vec 2.0 with OpenSLR and Self-Compiled Datasets Through Transfer Learning. In 7th IEOM Bangladesh International Conference on Industrial Engineering and Operations Management, https://doi.org/10.46254/BA07.20240161.
Abstract:	This work presents an improved method for speech-to-text conversion of standard and local Bangla speech. The wav2vec 2.0 model was fine-tuned using additional datasets collected alongside OpenSLR data. Our findings show gains in transcription accuracy of as much as eleven percent, which is notable given the low-resource setting and the languages involved, and which demonstrates the merits of transfer learning and fine-tuning. The research aims to expand the knowledge base on applying novel deep learning algorithms to low-resource languages in speech technology. The evaluation metrics were Word Error Rate (WER) and Character Error Rate (CER), with the fine-tuned model achieving an overall WER of 11.27% and CER of 6.03%. Comparative analysis with previous work shows a significant improvement over baseline models, highlighting the efficacy of the wav2vec 2.0 model in leveraging large and diverse datasets. The experimental setup was supported by a cluster computing environment with NVIDIA CUDA-compatible GPUs, underscoring the computational resources required for effective Automatic Speech Recognition (ASR) model training. The results demonstrate substantial advancements in ASR performance for Bengali, with the fine-tuned model outperforming previous benchmarks and showcasing the benefits of self-supervised learning approaches.
URI:	https://index.ieomsociety.org/index.cfm/article/view/ID/28375 http://dspace.aiub.edu:8080/jspui/handle/123456789/2662
ISBN:	979-8-3507-4443-9
ISSN:	2169-8767
Appears in Collections:	Publications: Journals
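
The abstract reports Word Error Rate (WER) and Character Error Rate (CER). As a minimal sketch of how these metrics are commonly defined (this is not the authors' evaluation code, which does not appear in this record), both can be computed as the Levenshtein edit distance between reference and hypothesis divided by the reference length. The function names below are hypothetical. The sketch assumes whitespace tokenization for words and Unicode code points for characters; the paper may normalize or segment Bangla text differently.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from reference prefix to empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate, assuming words are separated by whitespace."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate over Unicode code points (an assumption;
    grapheme-cluster segmentation would give different counts for Bangla)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    # Illustrative sentences only, not taken from the paper's datasets.
    ref = "আমি ভাত খাই"
    hyp = "আমি ভাত খাব"
    print(f"WER = {wer(ref, hyp):.4f}, CER = {cer(ref, hyp):.4f}")
```

Under these definitions, a WER of 11.27% corresponds to roughly 11 word-level edits per 100 reference words; note that WER can exceed 1.0 when the hypothesis contains many insertions.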

Files in This Item:

File	Description	Size	Format
Enhancing Bangla Local STT-wav2vec.pdf	First Page Manuscript	151.85 kB	Adobe PDF
