Dimensional Speech Emotion Recognition from Acoustic and Text Features using Recurrent Neural Networks
Emotion can be inferred from both tonal and verbal information, and both can be extracted from speech. While most previous studies address categorical emotion recognition from a single modality, this research presents dimensional emotion recognition that combines acoustic and text features. A set of 31 acoustic features is extracted from speech, while word vectors serve as text features. The initial single-modality results provide a cue for combining both feature types to improve recognition. The combined results show that fusing acoustic and text features reduces the error of dimensional emotion score prediction by about 5% relative to the acoustic-only system and 1% relative to the text-only system. The smallest error is achieved by modeling the text features with Long Short-Term Memory (LSTM) networks and the acoustic features with bidirectional LSTM networks, then concatenating the two systems with dense networks.
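The fusion architecture described above (a text LSTM branch and an acoustic bidirectional LSTM branch concatenated through dense layers) can be sketched as follows. This is a minimal illustrative sketch in PyTorch, not the authors' implementation; the hidden sizes, word-vector dimension (300), and the three output dimensions (valence, arousal, dominance are a common choice) are assumptions, with only the 31 acoustic features taken from the abstract.

```python
import torch
import torch.nn as nn

class BimodalEmotionModel(nn.Module):
    """Hypothetical sketch of the fusion described in the abstract:
    text branch = LSTM, acoustic branch = bidirectional LSTM,
    final hidden states concatenated and passed through dense layers
    to regress dimensional emotion scores."""

    def __init__(self, text_dim=300, acoustic_dim=31, hidden=64, n_dims=3):
        super().__init__()
        # Text branch: unidirectional LSTM over word-vector sequences
        self.text_lstm = nn.LSTM(text_dim, hidden, batch_first=True)
        # Acoustic branch: bidirectional LSTM over frame-level features
        self.acoustic_blstm = nn.LSTM(acoustic_dim, hidden,
                                      batch_first=True, bidirectional=True)
        # Dense layers on the concatenated representations
        self.dense = nn.Sequential(
            nn.Linear(hidden + 2 * hidden, 64),
            nn.ReLU(),
            nn.Linear(64, n_dims),  # e.g., valence, arousal, dominance
        )

    def forward(self, text_seq, acoustic_seq):
        # Final hidden state of the text LSTM: shape (1, batch, hidden)
        _, (h_text, _) = self.text_lstm(text_seq)
        # Final forward/backward states of the BLSTM: shape (2, batch, hidden)
        _, (h_ac, _) = self.acoustic_blstm(acoustic_seq)
        fused = torch.cat([h_text[-1], h_ac[0], h_ac[1]], dim=-1)
        return self.dense(fused)

model = BimodalEmotionModel()
text = torch.randn(4, 20, 300)     # batch of 4 utterances, 20 words each
acoustic = torch.randn(4, 100, 31) # 100 frames of 31 acoustic features
scores = model(text, acoustic)     # shape (4, 3)
```

Concatenating the final hidden states is one simple late-fusion choice; the abstract's "concatenated both systems with dense networks" is consistent with this pattern, though the exact layer sizes are not specified.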