Speaker Recognition
At the Signal Processing lab, BIT Mesra, we worked on an automatic speaker recognition project: predicting the speaker from a given speech utterance. Speaker recognition is one of the principal problems in speech processing. The performance of speaker recognition systems can be improved by carefully choosing and computing suitable features, which is an arduous task.
This project used a custom dataset of Hindi digit utterances by 50 speakers. The database consisted of 5000 utterances, 100 for each of the 50 speakers, recorded in both clean and noisy environments, with noise levels of -5 dB, 0 dB, 5 dB, 10 dB, 20 dB and 30 dB.
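The noise levels above can be illustrated with a minimal Python sketch. It assumes the dB values are signal-to-noise ratios and that the noise is additive; the description above does not say how the noisy recordings were produced. The function names, the sine-wave stand-in for speech, and the Gaussian noise are all hypothetical.

```python
import math
import random


def signal_power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add it."""
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    p_noise = signal_power(noise)
    if p_noise == 0:
        raise ValueError("noise must have non-zero power")
    # SNR(dB) = 10 * log10(P_clean / P_noise), solved for the noise power we want.
    target_noise_power = signal_power(clean) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / p_noise)
    return [c + gain * n for c, n in zip(clean, noise)]


def measured_snr_db(clean, noisy):
    """SNR in dB recovered from a clean signal and its noisy version."""
    residual = [y - c for c, y in zip(clean, noisy)]
    return 10 * math.log10(signal_power(clean) / signal_power(residual))


# Hypothetical stand-ins: a 440 Hz tone as "speech" and Gaussian noise.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(8000)]

for snr in (-5, 0, 5, 10, 20, 30):
    noisy = mix_at_snr(clean, noise, snr)
    print(snr, round(measured_snr_db(clean, noisy), 2))
```

Under this reading, lower values mean harsher conditions: at -5 dB the noise carries more power than the speech, while at 30 dB it is barely present.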
The Mel Frequency Cepstral Coefficients (MFCCs) of these utterances were used as features to train and evaluate the neural networks. We performed a comparative analysis of four neural networks for this task: a Single Hidden Layer Neural Network, a Multi Layer Perceptron (Deep Neural Network, DNN), a Radial Basis Function Neural Network (RBFNN) and a Probabilistic Neural Network (PNN). MATLAB was used for the implementation and experiments.
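To give a feel for one of the compared models, here is a minimal Python sketch of a Probabilistic Neural Network in its textbook Parzen-window form. It is not the MATLAB implementation used in the project. The two-dimensional toy vectors stand in for MFCC features, and the speaker labels and the smoothing parameter `sigma` are hypothetical.

```python
import math
from collections import defaultdict


def pnn_predict(train_x, train_y, query, sigma=1.0):
    """Classify `query` with a Parzen-window PNN.

    Each training vector contributes a Gaussian kernel centred on itself.
    The score of a class is the average kernel response of its training
    vectors, and the class with the highest score is returned.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, label in zip(train_x, train_y):
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, query))
        sums[label] += math.exp(-sq_dist / (2 * sigma ** 2))
        counts[label] += 1
    scores = {label: sums[label] / counts[label] for label in sums}
    return max(scores, key=scores.get), scores


# Toy 2-D feature vectors standing in for MFCC frames of two speakers.
train_x = [(0.0, 0.1), (0.2, -0.1), (3.0, 3.1), (2.9, 2.8)]
train_y = ["spk_a", "spk_a", "spk_b", "spk_b"]

label, scores = pnn_predict(train_x, train_y, (0.1, 0.0), sigma=0.5)
print(label, scores)
```

A PNN has no iterative training phase: the training vectors themselves act as the pattern layer, so `sigma` is the main parameter to tune.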
As expected, the accuracy of all four networks was very high (>90%) on clean data, with large variations appearing once noise was introduced and its level changed. RBFNN performed consistently well under all conditions. DNN was the other consistent performer and has the potential to outperform the other techniques if trained on more data.
After peer review, the findings of this project were selected for publication in IEEE Xplore and Scopus and presented in the proceedings of the 3rd IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT), 2019.
We also used the same dataset to analyse neural network performance on two related tasks. For speech recognition (digit recognition), we compared DNN, RBFNN, PNN and Self Organizing Maps (SOM, unsupervised). For speaker verification, we compared Regularized RBFNN, Normalized RBFNN and Deep Neural Networks, verifying the identity of a speaker from a new utterance by nearest neighbour prediction on the extracted representations.
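The nearest neighbour verification step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's method in detail: the Euclidean distance, the `max_distance` rejection threshold, the function names and the toy embeddings are all hypothetical, since the description above does not specify them.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_speaker(enrolled, embedding):
    """Return (speaker_id, distance) for the closest enrolled representation."""
    best_id, best_dist = None, float("inf")
    for speaker_id, vectors in enrolled.items():
        for vector in vectors:
            dist = euclidean(vector, embedding)
            if dist < best_dist:
                best_id, best_dist = speaker_id, dist
    return best_id, best_dist


def verify(enrolled, embedding, claimed_id, max_distance=1.0):
    """Accept the claim only if the nearest neighbour belongs to the claimed
    speaker and lies within `max_distance` (an assumed rejection threshold)."""
    speaker_id, dist = nearest_speaker(enrolled, embedding)
    return speaker_id == claimed_id and dist <= max_distance


# Toy 2-D representations standing in for network-extracted features.
enrolled = {
    "spk_a": [(0.0, 0.0), (0.1, 0.2)],
    "spk_b": [(2.0, 2.0), (2.1, 1.9)],
}

print(verify(enrolled, (0.05, 0.1), "spk_a"))
print(verify(enrolled, (0.05, 0.1), "spk_b"))
```

The threshold is there so that an utterance far from every enrolled speaker is rejected rather than assigned to whichever speaker happens to be closest.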