In Neural networks : the official journal of the International Neural Network Society
Transformers have achieved great success in many artificial intelligence fields, such as computer vision (CV), audio processing and natural language processing (NLP). In speech emotion recognition (SER), transformer-based architectures usually compute attention in a token-by-token (frame-by-frame) manner, but this approach lacks adequate capacity to capture local emotion information and is easily affected by noise. This paper proposes a novel SER architecture, referred to as block and token self-attention (BAT), that splits a mixed spectrogram into blocks and computes self-attention by combining these blocks with tokens, which can alleviate the effect of local noise while capturing authentic sentiment expressions. Furthermore, we present a cross-block attention mechanism to facilitate information interaction among blocks while integrating a frequency compression and channel enhancement (FCCE) module to smooth the attention biases between blocks and tokens. BAT achieves 73.2% weighted accuracy (WA) and 75.2% unweighted accuracy (UA) on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset, surpassing the results of previously developed state-of-the-art approaches with the same dataset partitioning operation. Further experimental results reveal that our proposed method is also well suited for cross-database and cross-domain tasks, achieving 89% WA and 87.4% UA on Emo-DB and producing a top-1 recognition accuracy of 88.32% with only 15.01 Mb of parameters on the CIFAR-10 image dataset under a scenario with no data augmentation or pretraining.
Lei Jianjun, Zhu Xiangwei, Wang Ying
Self-attention, Speech emotion recognition, Transformer