Dimensional Speech Emotion Recognition from Acoustic and Text Features using Recurrent Neural Networks

  • Bagus Tris Atmaja Sepuluh Nopember Institute of Technology, Surabaya, Indonesia
  • Masato Akagi Japan Advanced Institute of Science and Technology, Nomi, Japan
  • Reda Elbarougy Damietta University, New Damietta, Egypt
Keywords: Speech Emotion, Neural Network, LSTM

Abstract

Emotion can be inferred from both tonal and verbal information, and both can be extracted from speech. While most researchers have studied categorical emotion recognition from a single modality, this research presents dimensional emotion recognition combining acoustic and text features. A total of 31 acoustic features are extracted from speech, while word vectors are used as the text features. The initial results on single-modality emotion recognition serve as a cue for combining both feature types to improve recognition performance. The combined results show that fusing acoustic and text features decreases the error of dimensional emotion score prediction by about 5% relative to the acoustic-only system and 1% relative to the text-only system. This smallest error is achieved by modeling the text system with Long Short-Term Memory (LSTM) networks and the acoustic system with bidirectional LSTM networks, and concatenating both systems with dense networks.
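The fusion described in the abstract, an LSTM text branch and a bidirectional-LSTM acoustic branch joined through dense layers, can be sketched at the fusion stage as follows. This is a minimal NumPy sketch, not the authors' implementation: all layer sizes are hypothetical, and the recurrent branch outputs are simulated with random vectors for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b):
    # Fully connected layer: x @ w + b
    return x @ w + b

batch = 4  # hypothetical batch size

# Stand-ins for the recurrent branch outputs: in the full model these
# would be the final hidden states of a bidirectional LSTM over the
# 31 acoustic features and an LSTM over the word vectors.
h_acoustic = rng.standard_normal((batch, 2 * 128))  # BiLSTM -> 2x hidden size
h_text = rng.standard_normal((batch, 128))          # LSTM hidden state

# Late fusion: concatenate both branch summaries, then pass through
# dense layers down to the three dimensional-emotion scores
# (e.g. valence, arousal, dominance).
fused = np.concatenate([h_acoustic, h_text], axis=1)

w1 = rng.standard_normal((fused.shape[1], 64)) * 0.1
b1 = np.zeros(64)
hidden = np.tanh(dense(fused, w1, b1))

w2 = rng.standard_normal((64, 3)) * 0.1
b2 = np.zeros(3)
vad = dense(hidden, w2, b2)  # shape (batch, 3): one score per dimension

print(vad.shape)
```

In practice the random vectors above would be replaced by the outputs of trained recurrent layers, and the dense head would be trained with a regression loss on the dimensional emotion labels.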

Published
2020-12-26
How to Cite
[1]
B. Atmaja, M. Akagi, and R. Elbarougy, “Dimensional Speech Emotion Recognition from Acoustic and Text Features using Recurrent Neural Networks”, INJIISCOM, vol. 1, no. 1, pp. 91-102, Dec. 2020.