How to install OpenSmile and extract various audio features

This post provides the installation instructions I followed to install OpenSmile [1], along with examples of using it to extract features from audio files.

Installation

I had Ubuntu 18.04 installed on a virtual machine and followed the installation instructions given in the official documentation of OpenSmile.

First, download OpenSmile and open your terminal. In the terminal, change to the directory where OpenSmile was downloaded. For instance, I copied the downloaded archive to the Desktop and then ran the following command to change my directory in the terminal:

cd Desktop

After changing the directory, I ran the following command:

tar -zxvf opensmile-2.3.0.tar.gz

Note: Change the version number (or filename) according to the one you have downloaded.

cd opensmile-2.3.0

Then, I ran the following commands. Note that I had to run bash autogen.sh twice: on the first run I got a "Makefile.in not found" error, but the second run worked perfectly.

bash autogen.sh
./configure
make -j4; make
make install

I got the following error while running the above commands:

src/include/core/vectorTransform.hpp:117:83: error: narrowing conversion of ‘'\37777777756'’ from ‘char’ to ‘unsigned char’ inside { } [-Wnarrowing]
 const unsigned char smileMagic[] = {(char)0xEE, (char)0x11, (char)0x11, (char)0x00};

I found a solution to this problem online. If you run into the same error, open the vectorTransform.hpp file; you can find it in your opensmile directory at src/include/core/vectorTransform.hpp.

On line 117, you will see the following code:

const unsigned char smileMagic[] = {(char)0xEE, (char)0x11, (char)0x11, (char)0x00};

Just delete the unsigned keyword (as in the code below) and save the file:

const char smileMagic[] = {(char)0xEE, (char)0x11, (char)0x11, (char)0x00};

I ran the make command again, and this time it worked like a charm 🙂

Checking the installation

Once the installation is done, you will see a file named SMILExtract in your OpenSmile directory. When you run the following command, you should see a message like the one below:

./SMILExtract -h
 =============================================================== 
   openSMILE version 2.3.0 (Rev. 2014:2043)
   Build date: Feb 20 2019 (Fri Oct 28 21:16:39 CEST 2016)
   Build branch: 'opensmile-2.3.0'
   (c) 2014-2016 by audEERING GmbH
   All rights reserved. See the file COPYING for license terms.
   Lead author: Florian Eyben
 =============================================================== 

Extracting features using OpenSmile

In this section, I will show you how to extract various features using the OpenSmile library.

In the research area of Multimodal Learning Analytics, the OpenSmile library has been used to extract features for building predictive models of various learning/teaching constructs, e.g., collaboration, rapport, and orchestration [2,3,4,5,6]. The following table lists the features used in the cited studies.

Paper | Features
[2]   | Pitch (F0, the fundamental frequency); intensity (normalized intensity); voice quality (local jitter, DDP jitter, shimmer)
[3]   | Feature set of the IS09 emotion challenge
[4]   | MFCC, Mel frequency, linear spectral coefficients, loudness, voicing, fundamental frequency envelope, pitch, jitter, DDP jitter, shimmer, pitch onsets, duration
[5]   | MFCC, energy, spectrum (mean, variance, kurtosis and skewness), RASTA, zero crossing rate and chroma (this work used other tools as well)
[6]   | Low-level features (spectrum, energy), high-level features (emotion detection predictions)

In order to extract features using OpenSmile, we need to provide a configuration file specifying a number of configuration options. Thanks to the OpenSmile team, a number of pre-built configuration files are already included in the library.

The following section will give a brief overview of the configuration file.

A bit about the configuration file

Though the provided configuration files will most likely serve your goal of extracting features from audio, a basic understanding of the configuration file will help you understand it better and customize it to your needs.

The basic structure of a configuration file is a chain of components: we specify a number of components (already available in OpenSmile), starting with one that reads the input file and ending with one that writes the output file. Let's take an example from the OpenSmile documentation: extracting frame-wise energy features from an audio file.

The processing sequence starts with the input component, which reads the file and passes its data to the next component. That component divides the data into frames and passes the frame-wise data to the next component, which computes the energy of each frame and sends the computed energy values onward. Finally, the last component saves the results in an output CSV file.

The configuration file begins with specifying the components needed for the task.

[componentInstances:cComponentManager]
instance[waveSource].type = cWaveSource
instance[framer].type = cFramer
instance[energy].type = cEnergy
instance[csvSink].type = cCsvSink

The first line declares the component manager, which controls the entire processing sequence. Each section header has the form [instanceName:componentType]. The lines below it declare one instance per component, giving only its name and type. In order to connect these components to each other, we will then specify configuration options for each component.

Here, we have four components. The first acts as input and the last one acts as output. The second component divides the data into frames, and the third computes the energy of each frame.

Now, we need to specify the configuration options for each component: for instance, the frame size for the second component, which algorithm should be used, and which component is connected to which.

For each component, we need to specify its configuration options. Here, I am showing only the first two components and their connection. Each component has a reader and a writer, which read from and write to data memory levels (you can use arbitrary names for the levels, as the example below shows). Components are connected to each other through these data memory levels. Let's look at the configuration of our first two components:

[waveSource:cWaveSource]
writer.dmLevel = wave

[framer:cFramer]
reader.dmLevel = wave
writer.dmLevel = waveframes

The first one is an input component that writes to a memory level called wave (this could be any name). The framer component reads from the same memory level, wave, which means it takes the output of the first component as its input.

Similarly, to connect the next component in the sequence after the framer, we specify the waveframes data memory level for that component's reader, as shown in the sketch below.
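
To complete the chain for the energy example, the remaining two components would be wired up the same way. The sketch below is my own completion of the example, not the exact text of any shipped configuration file: the energy component reads the frames and writes energy values to a new level, and the CSV sink reads that level and writes the output file (the filename option name can be checked with ./SMILExtract -H cCsvSink). Note also that the waveSource section needs a filename option pointing to the input wave file.

[energy:cEnergy]
reader.dmLevel = waveframes
writer.dmLevel = energy

[csvSink:cCsvSink]
reader.dmLevel = energy
filename = energy_output.csv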

OpenSmile provides commands to generate configuration files automatically.

SMILExtract -cfgFileTemplate -configDflt cWaveSource,cFramer,cEnergy,cCsvSink -l 1 2> demo1.conf

The above command instructs OpenSmile to generate a configuration template with the four components (cWaveSource, cFramer, cEnergy, cCsvSink) and save it in the demo1.conf file (the template is printed to standard error, which is why it is redirected with 2>).


Time for action: extracting features

1. Chroma features

You can use the configuration file chroma_fft.conf available in the config directory. By default, it uses a frame size of 64 ms with a step size of 10 ms and a Gaussian window function. If you want to change the frame size, you can change the frameSize parameter in the chroma_fft.conf file.
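
As a concrete example, using the general extraction command described later in this post (the file names input.wav and chroma_features.csv are just placeholders, and as far as I remember this configuration writes CSV output):

./SMILExtract -C ./config/chroma_fft.conf -I input.wav -O chroma_features.csv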

2. MFCC features

To extract MFCC features, there are four configuration files available in OpenSmile (see the following table). A frame size of 25 ms with a 10 ms step size is used to compute the MFCC features.

conf file           | description
MFCC12_0_D_A.conf   | 13 MFCC features (MFCC 0–12) with 13 delta and 13 acceleration coefficients
MFCC12_E_D_A.conf   | 13 features (MFCC 1–12 + log-energy) with 13 delta and 13 acceleration coefficients
MFCC12_0_D_A_Z.conf | Mean-normalized version of the MFCC12_0_D_A features
MFCC12_E_D_A_Z.conf | Mean-normalized version of the MFCC12_E_D_A features
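
As a sketch (with placeholder file names), an MFCC extraction command would look like the following; if I remember correctly, these configurations write HTK-format output rather than CSV, so check the sink component in the chosen config file:

./SMILExtract -C ./config/MFCC12_0_D_A.conf -I input.wav -O mfcc_features.htk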

3. Fundamental frequency, voicing probability, loudness features

We have two configuration files for computing fundamental frequency, voicing probability, and loudness features (these are also referred to as prosodic features).

conf file       | description
prosodyAcf.conf | Autocorrelation- and cepstrum-based method used to compute the fundamental frequency
prosodyShs.conf | Sub-harmonic sampling algorithm used to compute the fundamental frequency
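
For example (again with placeholder file names, and assuming these configurations write CSV output):

./SMILExtract -C ./config/prosodyShs.conf -I input.wav -O prosody_features.csv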

4. Emotion feature set

There are also configuration files available for computing several feature sets designed for emotion recognition tasks. However, these features can be worth exploring for other use cases as well.

The IS10_paraling.conf configuration file can be used to extract 1582 features from audio. These features include MFCC, loudness, voicing probability, LPC coefficients, the fundamental frequency envelope, jitter, shimmer, and jitter of jitter.

The emo_large.conf configuration file computes a total of 6552 features from audio files.
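
For example (placeholder file names again), the large emotion feature set could be extracted as below; as far as I recall, these emotion-related configurations write their output in WEKA ARFF format rather than CSV:

./SMILExtract -C ./config/emo_large.conf -I input.wav -O emo_features.arff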


Command for feature extraction

In order to extract features, run the following command in the terminal (assuming the current directory is the opensmile directory):

./SMILExtract -C ./config/config-file-name -I audio-filename-withpath -O output-filename

Some useful OpenSmile commands

To show a list of all available components in OpenSmile, run:

./SMILExtract -L

The above command will list all components with their function.

./SMILExtract -H component-name

The above command will show the documentation of the particular component whose name is provided.
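
For example, to see the documentation of the cFramer component used in the configuration example above:

./SMILExtract -H cFramer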



References

  1. Eyben, F., Wöllmer, M., & Schuller, B. (2010). openSMILE: The Munich versatile and fast open-source audio feature extractor. Proceedings of the International Conference on Multimedia, 1459–1462. ACM.
  2. Lubold, N., & Pon-Barry, H. (2014). Acoustic-Prosodic Entrainment and Rapport in Collaborative Learning Dialogues. Proceedings of the 2014 ACM Workshop on Multimodal Learning Analytics Workshop and Grand Challenge – MLA ’14, 5–12. https://doi.org/10.1145/2666633.2666635
  3. Müller, P., Huang, M. X., & Bulling, A. (2018). Detecting Low Rapport During Natural Interactions in Small Groups from Non-Verbal Behaviour. https://doi.org/10.1145/3172944.3172969
  4. Vanlehn, K. (2018). Using the Tablet Gestures and Speech of Pairs of Students to Classify Their Collaboration. IEEE Transactions on Learning Technologies, 11(2), 230–242. https://doi.org/10.1109/TLT.2017.2704099
  5. Bassiou, N., Tsiartas, A., Smith, J., Bratt, H., Richey, C., Shriberg, E., … Alozie, N. (2016). Privacy-preserving speech analytics for automatic assessment of student collaboration. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 0812Sept, 888–892. https://doi.org/10.21437/Interspeech.2016-1569
  6. Prieto, L. P., Sharma, K., Kidzinski, Rodríguez-Triana, M. J., & Dillenbourg, P. (2018). Multimodal teaching analytics: Automated extraction of orchestration graphs from wearable sensor data. Journal of Computer Assisted Learning, 34(2), 193–203. https://doi.org/10.1111/jcal.12232

 

Pankaj Chejara