Global-Local Self-Attention-Based Long Short-Term Memory with Optimization Algorithm for Speaker Identification
DOI: https://doi.org/10.31436/iiumej.v26i1.3386
Keywords: Exponential Neighborhood – Grey Wolf Optimization, Global-Local Self-Attention, Long Short-Term Memory, Mel Frequency Cepstral Coefficient, Speaker Identification
Abstract
Speaker identification (SI) involves recognizing a speaker from a group of unknown speakers, while speaker verification (SV) determines whether a given voice sample belongs to a particular person. The main drawbacks of SI are session variability, background noise, and insufficient information. To mitigate these limitations, this research proposes a Global-Local Self-Attention (GLSA)-based Long Short-Term Memory (LSTM) with Exponential Neighborhood – Grey Wolf Optimization (EN-GWO) method for effective speaker identification on the TIMIT and VoxCeleb 1 datasets. The GLSA is incorporated into the LSTM so that the network focuses on the most relevant parts of the input, and the hyperparameters are tuned with EN-GWO, which enhances speaker identification performance. The GLSA-LSTM with EN-GWO method achieves an accuracy of 99.36% on the TIMIT dataset and 93.45% on the VoxCeleb 1 dataset when compared with SincNet and Generative Adversarial Network (SincGAN) and the Hybrid Neural Network – Support Vector Machine (NN-SVM).
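The full GLSA-LSTM pipeline is described in the article itself rather than on this page, so the sketch below only illustrates the idea stated in the abstract: MFCC-style frame features feed an LSTM, whose hidden states pass through a global self-attention branch (over the whole utterance) and a local self-attention branch (within fixed-size windows) before speaker classification. All layer sizes, the window length, the pooling, and the classifier head are illustrative assumptions, not the authors' implementation, and the EN-GWO hyperparameter search is not shown.

```python
# Minimal sketch (assumptions, not the authors' implementation):
# MFCC features -> LSTM -> global + local self-attention -> speaker logits.
import torch
import torch.nn as nn

class GLSALSTMSketch(nn.Module):
    def __init__(self, n_mfcc=40, hidden=256, n_speakers=630, heads=4, window=25):
        super().__init__()
        self.window = window
        self.lstm = nn.LSTM(n_mfcc, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        d = 2 * hidden
        # "Global" attention attends over the entire utterance;
        # "local" attention attends only within fixed-size windows.
        self.global_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.classifier = nn.Linear(2 * d, n_speakers)

    def forward(self, mfcc):                      # mfcc: (B, T, n_mfcc)
        h, _ = self.lstm(mfcc)                    # (B, T, 2*hidden)
        g, _ = self.global_attn(h, h, h)          # global context over all frames
        # Local branch: a block-diagonal mask keeps attention inside each window.
        T = h.size(1)
        idx = torch.arange(T, device=h.device)
        mask = (idx.unsqueeze(0) // self.window) != (idx.unsqueeze(1) // self.window)
        l, _ = self.local_attn(h, h, h, attn_mask=mask)
        pooled = torch.cat([g.mean(dim=1), l.mean(dim=1)], dim=-1)
        return self.classifier(pooled)            # (B, n_speakers)

# Example: a batch of 8 utterances, 300 frames, 40 MFCC coefficients per frame.
# logits = GLSALSTMSketch()(torch.randn(8, 300, 40))
```

In this sketch the number of speakers defaults to 630 (the TIMIT speaker count); for VoxCeleb 1 it would be set to that corpus's speaker count instead, and EN-GWO would search over choices such as the hidden size, window length, and learning rate.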
References
Shah SM, Moinuddin M, Khan RA. (2022) A Robust Approach for Speaker Identification Using Dialect Information. Appl. Comput. Intell. Soft Comput., 2022(1):4980920. https://doi.org/10.1155/2022/4980920
Nassif AB, Alnazzawi N, Shahin I, Salloum SA, Hindawi N, Lataifeh M, Elnagar A. (2022) A Novel RBFNN-CNN Model for Speaker Identification in Stressful Talking Environments. Applied Sciences, 12(10):4841. https://doi.org/10.3390/app12104841
Dua S, Kumar SS, Albagory Y, Ramalingam R, Dumka A, Singh R, Rashid M, Gehlot A, Alshamrani SS, AlGhamdi AS. (2022) Developing a Speech Recognition System for Recognizing Tonal Speech Signals Using a Convolutional Neural Network. Applied Sciences, 12(12):6223. https://doi.org/10.3390/app12126223
Monir M, Kareem M, El-Dolil SM, Saleeb A, El-Fishawy AS, Nassar MAE, Eldin MZA, Abd El-Samie FE. (2022) Cancelable speaker identification based on cepstral coefficients and comb filters. Int. J. Speech Technol., 25(2):471-492. https://doi.org/10.1007/s10772-021-09804-4
Chen YW, Hung KH, Li YJ, Kang ACF, Lai YH, Liu KC, Fu SW, Wang SS, Tsao Y. (2022) CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application. IEEE Access, 10:46082-46099. https://doi.org/10.1109/ACCESS.2022.3153469
Noh K, Jeong H. (2023) Emotion-Aware Speaker Identification with Transfer Learning. IEEE Access, 11:77292-77306. https://doi.org/10.1109/ACCESS.2023.3297715
Nakamura E, Kageyama Y, Hirose S. (2022) LSTM-based Japanese speaker identification using an omnidirectional camera and voice information. IEEJ Trans. Electr. Electron. Eng., 17(5):674-684. https://doi.org/10.1002/tee.23555
De Lima TA, Abreu MCD. (2022) Phoneme analysis for multiple languages with fuzzy-based speaker identification. IET Biom., 11(6):614-624. https://doi.org/10.1049/bme2.12078
Saritha B, Laskar MA, Kirupakaran AM, Laskar RH, Choudhury M, Shome N. (2024) Deep Learning-Based End-to-End Speaker Identification Using Time–Frequency Representation of Speech Signal. Circuits Syst. Signal Process., 43(3):1839-1861. https://doi.org/10.1007/s00034-023-02542-9
Shome N, Saritha B, Kashyap R, Laskar RH. (2023) A robust DNN model for text-independent speaker identification using non-speaker embeddings in diverse data conditions. Neural Comput. Appl., 35(26):18933-18947. https://doi.org/10.1007/s00521-023-08736-1
Al-Karawi KA, Mohammed DY. (2023) Using combined features to improve speaker verification in the face of limited reverberant data. Int. J. Speech Technol., 26:789-799. https://doi.org/10.1007/s10772-023-10048-7
Kuppusamy K, Eswaran C. (2022) Convolutional and Deep Neural Networks based techniques for extracting the age-relevant features of the speaker. J. Ambient Intell. Hum. Comput., 13(12):5655-5667. https://doi.org/10.1007/s12652-021-03238-1
Radha K, Bansal M. (2023) Closed-set automatic speaker identification using multi-scale recurrent networks in non-native children. Int. J. Inf. Technol., 15(3):1375-1385. https://doi.org/10.1007/s41870-023-01224-8
Malek J, Jansky J, Koldovsky Z, Kounovsky T, Cmejla J, Zdansky J. (2022) Target speech extraction: Independent vector extraction guided by supervised speaker identification. IEEE/ACM Trans. Audio Speech Lang. Process., 30:2295-2309. https://doi.org/10.1109/TASLP.2022.3190739
El Shafai W, Elsayed MA, Rashwan MA, Dessouky MI, El-Fishawy AS, Soliman NF, Alhussan AA, Abd El-Samie FE. (2023) Optical Ciphering Scheme for Cancellable Speaker Identification System. Computer Systems Science and Engineering, 45(1):563-578. https://doi.org/10.32604/csse.2023.024375
El-Gazar S, El Shafai W, El Banby G, Hamed HFA, Salama GM, Abd-Elnaby M, Abd El-Samie FE. (2022) Cancelable Speaker Identification System Based on Optical-Like Encryption Algorithms. Computer Systems Science and Engineering, 43(1):87-102. https://doi.org/10.32604/csse.2022.022722
Wei G, Zhang Y, Min H, Xu Y. (2023) End-to-end speaker identification research based on multi-scale SincNet and CGAN. Neural Comput. Appl., 35(30):22209-22222. https://doi.org/10.1007/s00521-023-08906-1
Karthikeyan V, Priyadharsini SS, Balamurugan K, Ramasamy M. (2022) Speaker identification using hybrid neural network support vector machine classifier. Int. J. Speech Technol., 25(4):1041-1053. https://doi.org/10.1007/s10772-021-09902-3
Shahamiri SR. (2023) An optimized enhanced-multi learner approach towards speaker identification based on single-sound segments. Multimedia Tools Appl., 83:24541-24562. https://doi.org/10.1007/s11042-023-16507-2
Barhoush M, Hallawa A, Schmeink A. (2023) Speaker identification and localization using shuffled MFCC features and deep learning. Int. J. Speech Technol., 26(1):185-196. https://doi.org/10.1007/s10772-023-10023-2
Gaurav, Bhardwaj S, Agarwal R. (2023) An efficient speaker identification framework based on Mask R-CNN classifier parameter optimized using hosted cuckoo optimization (HCO). J. Ambient Intell. Hum. Comput., 14(10):13613-13625. https://doi.org/10.1007/s12652-022-03828-7
Al-Dulaimi HW, Aldhahab A, Al Abboodi HM. (2023) Speaker Identification System Employing Multi-resolution Analysis in Conjunction with CNN. International Journal of Intelligent Engineering & Systems, 16(5):350-363. https://doi.org/10.22266/ijies2023.1031.30
Dataset TIMIT: https://www.kaggle.com/datasets/nltkdata/timitcorpus.
Dataset VoxCeleb 1: https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html.
License
Copyright (c) 2025 IIUM Press

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.