Feature extraction for emotion recognition from speech

Gustavo Cid (@_cgustavo) | April 2019
🔬

I wrote these notes while I was working on multimodal emotion recognition during my master's at ETH Zürich. Feel free to check out the presentation, the full report, and the code for the full version of this work.

Features

Feature extraction is the first step in most of the proposed emotion detection schemes based on speech. The objective is to extract, from the raw audio file, a set of features capable of representing the emotional state. Extracting the proper set of features is crucial and strongly influences the models' performance [1]. Even though there has been a significant amount of research on hand-crafting these feature sets, there is no consensus about which ones are the most informative about emotions [2].

These features can be divided into three categories [1], namely excitation source, vocal tract system, and prosodic features. For the task of emotion detection from speech, the most popular features are vocal tract and prosodic features.

  • Vocal tract features: generally extracted from segments of 20-30 ms. Such features are known to be well reflected in the frequency domain of the speech signal, so some of them are extracted directly from the Fourier transform of the signal, such as bandwidth, spectral energy, slope, and formants. The cepstral domain of a speech signal is obtained by taking the Fourier transform of the logarithmic magnitude spectrum, and some of the popular features extracted there are the MFCCs (Mel frequency cepstral coefficients) and LPCCs (linear prediction cepstral coefficients).

  • Prosodic features: prosody is related to the rhythm, intonation, and intensity of speech. It can be observed in acoustic features such as the fundamental frequency (F0), intensity, and duration. A minimal extraction sketch covering both feature types follows this list.
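
As a concrete starting point, here is a minimal sketch of extracting frame-level features of both kinds with librosa (an assumed dependency; the file name is a placeholder): MFCCs for the vocal tract side, and F0 plus frame energy for the prosodic side.

```python
# Minimal frame-level feature extraction sketch (assumes librosa is installed;
# "speech.wav" is a placeholder file name).
import librosa

y, sr = librosa.load("speech.wav", sr=16000)      # mono waveform at 16 kHz

frame_length = int(0.025 * sr)                    # ~25 ms analysis window
hop_length = int(0.010 * sr)                      # ~10 ms hop

# Vocal tract features: 13 MFCCs per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=frame_length, hop_length=hop_length)

# Prosodic features: fundamental frequency (F0) via pYIN and frame energy (RMS).
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"),
    sr=sr, frame_length=2 * frame_length, hop_length=hop_length)
rms = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop_length)

print(mfcc.shape, f0.shape, rms.shape)            # e.g. (13, T), (T,), (1, T)
```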

Other features include voicing probability, zero-crossing rate, harmonics-to-noise ratio, and jitter. Some approaches also apply statistical aggregator functions (mean, variance, min, max, range, median, higher-order moments, etc.) to these lower-level features [2].
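
Building on the sketch above, here is one way such functionals can be applied; the random matrix below is just a stand-in for a real matrix of frame-level descriptors (e.g. 13 MFCCs over a few hundred frames).

```python
import numpy as np
from scipy.stats import kurtosis, skew

def functionals(frames):
    """Collapse a (n_features, n_frames) matrix of frame-level descriptors
    into a single fixed-length, utterance-level vector of statistics."""
    return np.concatenate([
        frames.mean(axis=1),
        frames.var(axis=1),
        frames.min(axis=1),
        frames.max(axis=1),
        np.ptp(frames, axis=1),        # range = max - min
        np.median(frames, axis=1),
        skew(frames, axis=1),          # higher-order moments
        kurtosis(frames, axis=1),
    ])

# Stand-in for a real (13, n_frames) MFCC matrix from the previous sketch.
frames = np.random.randn(13, 300)
print(functionals(frames).shape)       # (104,) = 13 features x 8 functionals
```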

In [3], two sets of features are recommended for the task of emotion recognition. These sets were constructed based on their theoretical significance, past success, and ability to capture the physiological changes in the voice during the affective process. The parameters are frequency-, energy/amplitude-, and spectrum-related (all of them fit in one of the categories previously mentioned here), and a set of statistical aggregator functions is proposed to summarise these parameters over time.
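
In practice these parameter sets are rarely re-implemented by hand; they are typically extracted with openSMILE. A minimal sketch, assuming the audEERING opensmile Python package and a placeholder file name:

```python
import opensmile

# Extended GeMAPS (eGeMAPS) functionals: one utterance-level vector per file.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("speech.wav")   # pandas DataFrame, one row per file
print(features.shape)                         # e.g. (1, 88) for eGeMAPSv02
```

The same object can also return frame-wise low-level descriptors instead of functionals by switching the feature level, which is the form usually fed to recurrent models.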

Most of the deep learning papers considered here use some combination of the features presented above, with MFCCs and the fundamental frequency (F0) being the most common among them; examples are [2, 4, 5].

Besides [1, 3], good overviews of the speech features for emotion recognition are given in [6, 7].

Suspected issues

As reported in [1], various machine learning approaches have been explored to address the task of emotion recognition. These papers use classical machine learning algorithms (both supervised and unsupervised), such as Gaussian mixture models, support vector machines, K-nearest neighbors, and Bayes classifiers, among others.

These papers differ from one another based on the model class chosen and on the set of features extracted from the raw audio to make predictions. What surprised me the most were the accuracies reported: some of them achieved accuracies as high as 92% with relatively simple models.

After further thinking about these results, I began to suspect that some of them might be overfitting to their respective datasets.

First, there is the dataset issue. There is no single dataset available with large amounts of labeled emotional speech from a variety of speakers. Thus, some of the papers use data they collected themselves or intrinsically limited datasets, such as datasets with a single speaker or with scripts that do not reflect the real-world situations where emotion detection could be deployed.

Second, they commit to a set of features believed to faithfully represent emotions. Very often this set is chosen experimentally, and there can be a problem with how it is selected. Generally, the authors start with a very large feature vector and then, because working in such high dimensions is computationally infeasible, select a subset of the features to reduce the dimensionality. The dimensionality reduction is performed in different ways, but at a high level the problem amounts to finding a subset of the original feature vector that is, in the most general sense, informative for predicting the emotion label.

The main issue is that, in a setting where you have little data and very high-dimensional feature vectors, you will find, purely by chance, some features that look correlated with the emotion label. These features may appear to contain valuable information about the label, but in fact they might have no relation to it at all.
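
This is easy to reproduce with pure noise. In the toy sketch below, the features and the labels are independent by construction, yet with many features and few samples the strongest observed correlation still looks informative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 100, 5000          # few utterances, huge feature vector

X = rng.standard_normal((n_samples, n_features))
y = rng.integers(0, 2, size=n_samples)     # labels independent of X by construction

# Pearson correlation of every feature with the (unrelated) label.
yc = y - y.mean()
corr = (X - X.mean(axis=0)).T @ yc / (n_samples * X.std(axis=0) * yc.std())
print(np.abs(corr).max())                  # typically around 0.4: looks "informative"
```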

The authors then preprocess their data and train their models only on this subset of features: they fit the model parameters and tune the hyperparameters based solely on it, and then report the achieved accuracy. Since the feature-selection stage is not included in the cross-validation loop and their datasets are not diverse enough, I suspect that their estimates might be overly optimistic.
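
A sketch of both protocols with scikit-learn, using synthetic noise as a stand-in for real utterance-level features: selecting features once on the full dataset and only then cross-validating the classifier yields an inflated score, while putting the selection step inside a Pipeline, so that it is refit within every fold, brings the estimate back to chance.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5000))     # noise features, no real signal
y = rng.integers(0, 2, size=100)

# Leaky protocol: select features on ALL data, then cross-validate the classifier.
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(SVC(), X_sel, y, cv=5).mean()

# Correct protocol: selection is part of the pipeline, refit inside each fold.
pipe = Pipeline([("select", SelectKBest(f_classif, k=20)), ("clf", SVC())])
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky CV accuracy:  {leaky:.2f}")   # typically well above chance
print(f"honest CV accuracy: {honest:.2f}")  # around 0.5, i.e. chance level
```

With real, informative features the gap is smaller, but the principle is the same: anything fitted to the data, including feature selection, belongs inside the cross-validation loop.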

Deep learning and emotion recognition from speech

The general approach discussed so far relies on collapsing the high-dimensional speech signal into low-dimensional encodings. Unfortunately, this process comes with an inevitable loss of information that could be useful for discriminating between the emotion classes. The question that arises, then, is: how could we do better?

One possibility would be to use better features. For instance, there is a paper by Jaitly and Hinton in which restricted Boltzmann machines are used to learn a better representation of speech sound waves [8]. Their method performed better on phoneme recognition than approaches using traditional MFCC features. The major drawback is that the computational cost of such approaches tends to be high [9].

Following the general trend in machine learning, some papers explore end-to-end emotion recognition. The idea is that the neural network is fed raw, unprocessed audio and learns an intermediate representation suited to the task at hand [10, 11].

These models start with convolutional layers, which are responsible for learning the intermediate signal representation. It is well accepted within the research community that certain acoustic features play an important role in emotion recognition, and what is interesting about this approach is that the model learns some of these important features on its own. For instance, Figure 1 (extracted from [10]) compares the activations of some of the channels with prosodic features extracted from a speech recording unseen by the network.
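
For intuition only, here is a minimal PyTorch sketch of the general shape of such models (not the exact architecture of [10] or [11]): strided 1-D convolutions over the raw waveform learn a frame-level representation, a recurrent layer summarises it over time, and a linear head outputs the emotion logits.

```python
import torch
import torch.nn as nn

class RawWaveformEmotionNet(nn.Module):
    """Toy end-to-end model: conv front-end over raw audio -> LSTM -> classifier."""
    def __init__(self, n_classes=4):
        super().__init__()
        # Learned "feature extraction": strided 1-D convolutions over the waveform.
        self.frontend = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=400, stride=160),  # ~25 ms windows, 10 ms hop at 16 kHz
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # Temporal modelling of the learned frame-level representation.
        self.rnn = nn.LSTM(input_size=128, hidden_size=64, batch_first=True)
        self.head = nn.Linear(64, n_classes)

    def forward(self, waveform):                   # waveform: (batch, samples)
        x = self.frontend(waveform.unsqueeze(1))   # (batch, channels, frames)
        x = x.transpose(1, 2)                      # (batch, frames, channels)
        _, (h, _) = self.rnn(x)                    # h: (1, batch, hidden)
        return self.head(h[-1])                    # (batch, n_classes) logits

model = RawWaveformEmotionNet()
logits = model(torch.randn(2, 16000 * 3))          # two 3-second clips at 16 kHz
print(logits.shape)                                # torch.Size([2, 4])
```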


Fig. 1. From [10]: "A visualisation of three different gate activations vs. different acoustic and prosodic features that are known to affect arousal for an unseen recording to the network. From top to bottom: range of RMS energy (ρ = 0.81), loudness (ρ = 0.73), mean of fundamental frequency (ρ = 0.72)"

References

[1] S. G. Koolagudi, K. S. Rao, "Emotion recognition from speech: a review", International Journal of Speech Technology, 2012.

[2] S. Mirsamadi, E. Barsoum, C. Zhang, "Automatic speech emotion recognition using RNNs with local attention", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 2227-2231.

[3] F. Eyben, K. Scherer, B. Schuller, J. Sundberg, E. André, C. Busso, L. Devillers, J. Epps, P. Laukka, S. Narayanan, K. Truong, "The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing", IEEE Transactions on Affective Computing, 2016, vol. 7, no. 2, pp. 190-202.

[4] Y. Wang, L. Neves, F. Metze, "Audio-based multimedia event detection using deep recurrent neural networks", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 2742-2746.

[5] J. Lee and I. Tashev, "High-level feature representation using recurrent neural networks for speech emotion recognition", INTERSPEECH, 2015.

[6] M. Tahon, L. Devillers, "Towards a small set of robust acoustic features for emotion recognition: challenges", IEEE Transactions on Audio, Speech and Language Processing, 2016, vol. 24, no. 1, pp. 16-28.

[7] B. Schuller, A. Batliner, D. Seppi, S. Steidl, T. Vogt, J. Wagner, L. Devillers, L. Vidrascu, N. Amir, L. Kessous, V. Aharonson, "The relevance of feature type for the automatic classification of emotional user states: low level descriptors and functionals", INTERSPEECH, 2007.

[8] N. Jaitly, G. Hinton, "Learning a better representation of speech sound waves using restricted Boltzmann machines", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011.

[9] A. Graves, N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks", International Conference on Machine Learning (ICML), 2014.

[10] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, B. Schuller, S. Zafeiriou, "Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.

[11] T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, O. Vinyals, "Learning the speech front-end with raw waveform CLDNNs", INTERSPEECH, 2015.