2-D Attention Based Convolutional Recurrent Neural Network for Speech Emotion Recognition

  • Akalya Devi C
  • Karthika Renuka D
  • Aarshana E Winy
  • P C Kruthikkha
  • Ramya P
  • Soundarya S, PSG College of Technology
Keywords: Attention, Convolutional Recurrent Neural Networks, Speech Emotion Recognition, Spectrogram.


Recognizing emotions in speech is a formidable challenge because emotions are complex. The performance of Speech Emotion Recognition (SER) depends heavily on the emotional features extracted from speech. Most emotional features, however, are sensitive to emotionally irrelevant factors such as the speaker, speaking style, and gender. In this work, we postulate that computing deltas for individual features preserves the information most relevant to emotion while reducing the influence of emotionally irrelevant components, thus leading to fewer misclassifications. Additionally, SER input commonly contains silent and emotionally irrelevant frames. We therefore propose an attention-based two-dimensional convolutional recurrent neural network that learns discriminative features, focuses on the emotion-relevant ones, and predicts the emotion. The Mel-spectrogram is used for feature extraction. The proposed technique is evaluated on the IEMOCAP dataset and shows improved performance, reaching 68% accuracy.
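The two ideas named in the abstract, deltas over per-frame features and attention over frames, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the common regression-based delta formula, a hypothetical half-window `width`, and attention scores supplied by hand. In a trained model the scores would come from a learned layer after the recurrent network. The function names `deltas` and `attention_pool` are illustrative only.

```python
import math


def deltas(frames, width=2):
    """Regression-based deltas along the time axis.

    frames: list of T feature vectors (each a list of F floats),
            for example log-Mel energies per frame.
    width:  assumed half-window N; edges are handled by repeating
            the first and last frame.
    """
    num_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = []
    for t in range(num_frames):
        row = []
        for f in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                ahead = frames[min(t + n, num_frames - 1)][f]
                behind = frames[max(t - n, 0)][f]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def attention_pool(frames, scores):
    """Softmax the per-frame scores, then return the weighted sum of
    frames together with the weights themselves."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    num_feats = len(frames[0])
    pooled = [
        sum(w * frame[f] for w, frame in zip(weights, frames))
        for f in range(num_feats)
    ]
    return pooled, weights


# Toy "log-Mel" input: feature 0 rises linearly, feature 1 is constant.
log_mel = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]]
d = deltas(log_mel)

# Hand-picked scores that favour the last frame (an assumption; a model
# would learn these so that silent frames receive low weight).
pooled, weights = attention_pool(log_mel, [0.0, 0.0, 0.0, 0.0, 10.0])
```

On the toy input, the delta of the constant feature is zero everywhere, while the linearly rising feature has a delta of one in the interior frames. This shows how a fixed offset in a feature drops out of its delta. The attention weights sum to one and concentrate on the highest-scoring frame, which is how low-scoring (for instance silent) frames contribute little to the utterance-level representation.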



How to Cite
A. D. C, K. R. D, A. E. Winy, P. C. Kruthikkha, R. P, and S. S, “2-D Attention Based Convolutional Recurrent Neural Network for Speech Emotion Recognition”, INJIISCOM, vol. 3, no. 2, pp. 21-30, Dec. 2022.