Crowd emotional sounds: Spectrogram-based analysis using convolutional neural networks
Franzoni V. (Supervision); Biondi G. (Member of the Collaboration Group); Milani A. (Member of the Collaboration Group)
2019
Abstract
In this work, we introduce a methodology for recognizing crowd emotions from crowd speech and sound at mass events. Different emotional categories can be encoded via frequency-amplitude features of emotional crowd speech. The proposed technique applies visual transfer learning to spectrograms of the input sound. Spectrogram images are generated from fixed-length snippets of the original sound clip. The plots are filtered and normalized with respect to frequency and magnitude, then fed to a Convolutional Neural Network (CNN) pre-trained on images (AlexNet) and integrated with domain-specific categorical layers. The integrated CNN is re-trained on the labeled spectrograms of crowd emotion sounds to adapt and fine-tune recognition of the crowd emotional categories. Preliminary experiments were conducted on a dataset of publicly available sound clips of different mass events for each class, including Joy, Anger and Neutral. While transfer learning has been applied in the existing literature to music and speech processing, to the best of our knowledge this is the first application to crowd-sound emotion recognition.
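The preprocessing steps described in the abstract (fixed-length snippets, spectrogram generation, magnitude normalization) might be sketched as follows. This is a minimal illustration using NumPy only, not the authors' implementation: the function names, the 1-second snippet length, the 256-sample Hann frames, and the 80 dB dynamic range are all assumptions chosen for the example. The frequency filtering step and the AlexNet fine-tuning are not shown.

```python
import numpy as np


def split_into_snippets(signal, sample_rate, snippet_seconds=1.0):
    """Cut a 1-D signal into non-overlapping fixed-length snippets.

    Trailing samples that do not fill a whole snippet are dropped.
    The snippet length is an assumed value, not taken from the paper.
    """
    n = int(round(snippet_seconds * sample_rate))
    count = len(signal) // n
    return signal[: count * n].reshape(count, n)


def log_spectrogram(snippet, frame_len=256, hop=128):
    """Return a log-magnitude spectrogram scaled to [0, 1].

    Output shape is (frequency bins, time frames). Frame length, hop
    size and the 80 dB floor are hypothetical parameters.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(snippet) - frame_len) // hop
    frames = np.stack(
        [snippet[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1)).T
    log_mag = 20.0 * np.log10(magnitude + 1e-10)
    # Keep an 80 dB range below the peak so silence does not dominate scaling.
    log_mag = np.clip(log_mag, log_mag.max() - 80.0, None)
    span = log_mag.max() - log_mag.min()
    if span == 0:
        return np.zeros_like(log_mag)
    return (log_mag - log_mag.min()) / span


# Synthetic stand-in for a crowd recording: a 440 Hz tone, slightly over 3 s.
sample_rate = 8000
t = np.arange(3 * sample_rate + 500) / sample_rate
signal = np.sin(2 * np.pi * 440.0 * t)

snippets = split_into_snippets(signal, sample_rate, snippet_seconds=1.0)
images = np.stack([log_spectrogram(s) for s in snippets])
print(snippets.shape, images.shape)
```

Each resulting array would still need to be resized and replicated to three channels before being passed to an image network such as AlexNet. That step depends on the framework used and is left out here.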