Real-time On-edge Acoustic Event Classification

This post is based on the following publications:

  • Vuegen, L., Karsmakers, P., "Real-time On-edge Acoustic Event Classification", Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), 2021.

Why edge AI?

Edge AI is the deployment of AI applications on embedded devices near the sensor, close to where the data is located, rather than centrally in a cloud computing facility or a data center. Processing data at the edge has the advantage that no sensitive data has to be transmitted to the cloud, thereby offering improved user privacy, lower latency, reduced energy consumption, and increased scalability.

Recent advances in the computing power of today’s off-the-shelf microcontroller units (MCUs), together with the improved efficiency of deep learning algorithms, have caused edge AI to gain a lot of interest recently. Edge AI opens opportunities for machines and devices to operate with the ‘intelligence’ of human cognition without the need for expensive computing infrastructure, resulting in a wide range of new real-time applications such as:

  • Predictive maintenance – on-the-fly machine state recognition and anomaly detection in industrial environments.
  • Cognitive homes – real-time domestic person identification, keyword spotting, and acoustic event/scene classification.
  • Smart healthcare – early detection of medical complications at home from vital sign monitoring.
  • Autonomous vehicles – driver assistance by real-time monitoring of the surrounding traffic and environment from built-in sensor data such as radar and/or lidar.

All without any sensor data ever leaving the device without your permission.

Although the use of acoustic sensors for indoor monitoring applications has already been widely examined in the scientific community, most of the proposed solutions make use of complex and computationally demanding deep learning frameworks running in the cloud.

In this post we discuss how the complexity of a convolutional neural network (CNN) can be reduced to enable real-time on-edge deployment for acoustic event classification. The next two videos demonstrate the deployed classifier model in a real-life home environment when embedded on an ARM Cortex-M7 (i.MX RT1064) platform.

Real-life demo 1 Real-life demo 2

Optimized on-edge CNN architectures for audio classification

Despite the increased computing power of today’s off-the-shelf MCUs, deploying CNNs at the edge requires a trade-off between model complexity and classification performance. The complexity of a CNN can be controlled on both the ‘architectural’ level and the ‘arithmetic’ level.

Standard CNN architectures typically use a pooling layer after each convolutional layer to add local translation invariance (LTI), i.e., the ability of the model to ignore positional shifts in the data as shown in Figure 1, and to reduce the dimensionality of the feature maps. The feature map dimensionality can also be reduced without pooling layers by simply increasing the stride length of the convolutional filters. This removes the LTI property of the model but comes with a significantly lower number of operations, which is beneficial for an embedded implementation.
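To make this concrete, the back-of-the-envelope sketch below compares the multiply-accumulate (MAC) count of a unit-stride convolution followed by (1 × 4) pooling with that of a single convolution using a (1 × 4) stride. It is a rough first-layer illustration only, using the ‘model small’ settings introduced later in this post:

```python
# Rough first-layer comparison (single input channel) of two ways to reduce
# the frequency dimension by a factor of four:
#   A) unit-stride convolution with 'same' padding + (1 x 4) max pooling
#   B) convolution with a (1 x 4) stride and no padding (no pooling)
T, F = 50, 64            # input: time frames x frequency bins
kh, kw = 4, 4            # filter size
n_filters = 16

# Variant A: the convolution is evaluated at every input position.
macs_a = T * F * kh * kw * n_filters            # ~819k multiply-accumulates
out_a = (T, F // 4)                             # feature map size after pooling

# Variant B: the convolution is only evaluated at every 4th frequency bin.
out_h = (T - kh) // 1 + 1                       # valid conv, stride 1 in time
out_w = (F - kw) // 4 + 1                       # valid conv, stride 4 in frequency
macs_b = out_h * out_w * kh * kw * n_filters    # ~192k multiply-accumulates
out_b = (out_h, out_w)

print(out_a, out_b, round(macs_a / macs_b, 1))  # comparable output size, ~4x fewer MACs
```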

In this post, we empirically examine whether the LTI property is really required when working on time-frequency data, since we can assume that acoustic patterns linked to a specific sound class always occur in the same frequency region. This is done by comparing the classification results of a traditional ‘pooling’ model architecture with those of a less computationally demanding ‘no-pooling’ model architecture, both learned from self-collected domestic sound event data.


Figure 1: Local translation invariance is the ability of the model to ignore positional shifts of the target in the input data. This property makes sense when working on images, but not in the case of time-frequency representations.

The investigated CNN architectures all consist of three convolutional layers (for feature learning) and two fully connected layers (for classification), and are evaluated with two different input sizes and three different numbers of filters per convolutional layer. More specifically, the feature learning part of the two model architectures is configured as follows:

  • ‘Pooling’ model architecture - three convolutional layers with a filter size of (4 × 4), a filter stride of (1 × 1), padding, ReLU activation, and a pool size of (1 × 4).
  • ‘No-pooling’ model architecture - three convolutional layers with a filter size of (4 × 4), a filter stride of (1 × 4), no padding, ReLU activation, and no pooling.

The classification part of both models is made up of a fully connected layer with 64 output neurons and a ReLU activation, followed by a fully connected layer with 8 output neurons (the number of classes) and a softmax activation. The two examined input shapes are (50 × 64) and (100 × 64), corresponding to an audio input length of 0.5 and 1.0 seconds respectively given a frame shift of 10 ms. The examined numbers of filters per convolutional layer are 16 (model small), 32 (model medium) and 64 (model large).
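For illustration, a minimal Keras sketch of the two model variants is given below, shown for the (50 × 64) input and 16 filters per layer (‘model small’). The layer settings follow the description above; details not specified in the post, such as the use of max pooling and the exact padding mode, are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(pooling: bool, input_shape=(50, 64, 1), n_filters=16, n_classes=8):
    """Sketch of the 'pooling' and 'no-pooling' architectures described above."""
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    for _ in range(3):                       # three convolutional layers
        if pooling:
            # (4 x 4) filters, unit stride, padding, followed by (1 x 4) pooling
            x = layers.Conv2D(n_filters, (4, 4), strides=(1, 1),
                              padding="same", activation="relu")(x)
            x = layers.MaxPooling2D(pool_size=(1, 4))(x)
        else:
            # (4 x 4) filters with a (1 x 4) stride, no padding, and no pooling
            x = layers.Conv2D(n_filters, (4, 4), strides=(1, 4),
                              padding="valid", activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)               # classification part
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

pooling_model = build_model(pooling=True)       # 'model small', T = 50
no_pooling_model = build_model(pooling=False)
```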


Figure 2: Overview of the ‘pooling’ (top) and ‘no-pooling’ (bottom) model architectures.

All models are learned offline in TensorFlow with 32-bit floating-point precision. The learned models are then converted to 8-bit fixed-point precision for embedded deployment using the TensorFlow Lite quantization-aware training procedure. This retraining operation simulates the effects of quantization during inference and is done as follows (a minimal conversion sketch is given after the list):

  • Feed-forward pass – the effects of quantization are simulated by quantizing the weights and activations to 8-bit integers and converting them back to 32-bit floats. This mimics the effects of a less precise data representation, although all computations are still performed in 32-bit floating point.
  • Feed-backward pass – the model weights are updated using backpropagation. The gradients used to update the weights are calculated from the outputs obtained in the feed-forward pass.
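Such a quantization-aware retraining step can be set up with the TensorFlow Model Optimization toolkit as sketched below. The training data is self-collected and not publicly available, so `train_ds`, `val_ds`, the optimizer, and the number of fine-tuning epochs are placeholders rather than the exact recipe from the paper:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Wrap the float model with fake-quantization nodes: the forward pass simulates
# 8-bit weights/activations while the backward pass still updates float weights.
qat_model = tfmot.quantization.keras.quantize_model(no_pooling_model)
qat_model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
qat_model.fit(train_ds, validation_data=val_ds, epochs=5)   # placeholder data/epochs

# Convert to an 8-bit TensorFlow Lite model for the integer-only on-device runtime.
converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
with open("aec_no_pooling_int8.tflite", "wb") as f:
    f.write(tflite_model)
```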

Results

The obtained results are listed in Table 1. By analyzing the classification scores of both the unquantized and quantized models, i.e. the u- and q-rows respectively, it can clearly be seen that the used quantization scheme has practically no impact on the classification performance (< 1%). This implies that these models can be deployed on an embedded platform with an integer-only inference framework without sacrificing classification accuracy. The best and least accurate classification scores are 88.6±0.2% and 71.6±0.3%, obtained with the settings ‘no-pooling, model large, T=100’ and ‘pooling, model small, T=50’ respectively.
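This kind of check can be reproduced on the desktop, before flashing the MCU, by evaluating the converted model with the TensorFlow Lite Python interpreter. The sketch below is illustrative only; `test_segments` stands in for the (not publicly released) self-collected evaluation data:

```python
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

correct = 0
for segment, label in test_segments:   # segment: (50, 64, 1) float32 time-frequency patch
    interpreter.set_tensor(inp["index"], segment[np.newaxis, ...].astype(np.float32))
    interpreter.invoke()
    correct += int(np.argmax(interpreter.get_tensor(out["index"])) == label)
print(f"quantized accuracy: {correct / len(test_segments):.3f}")
```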

By analyzing the quantized weight and buffer sizes, and the corresponding inference times, i.e. the s- and t-rows respectively, it can also be seen that using a more complex model architecture (i.e. a higher number of filters per convolutional layer) or a larger input size has a significant impact on the required memory and inference time. Doubling the number of filters per convolutional layer, or doubling the input size of the network, significantly boosts the classification performance but comes at the cost of increased memory usage and inference time. The reason for the increased performance is simply that a more complex model architecture can learn more details from the data, and that a larger input size provides the CNN with more data to rely on for its prediction.

Another important observation is that the ‘no-pooling’ model architecture yields a higher classification performance, a three to four times faster inference speed, and a four times smaller memory footprint compared to its ‘pooling’ counterpart when deployed on the i.MX RT1064 platform. The reason for the smaller memory footprint is that the larger filter stride in the convolutional layers (i.e. (1 × 4) in this work) directly produces smaller feature maps compared to a unit filter stride. In addition, a larger filter stride also requires fewer arithmetic operations during the convolution, and combined with the fact that no pooling layers are needed to reduce the feature maps, this significantly reduces the inference time of the network.
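The effect on the activation memory can be made concrete with a quick calculation for the first convolutional layer of the small model. This is a rough illustration only; the sizes reported in Table 1 also include the model weights and the remaining layers:

```python
# Largest intermediate feature map of the first convolutional layer, in int8 bytes.
T, F, n_filters = 50, 64, 16

# 'Pooling' model: the unit-stride, padded convolution first produces a full-size
# feature map that must be buffered before the (1 x 4) pooling shrinks it.
bytes_pooling = T * F * n_filters                 # 51,200 bytes

# 'No-pooling' model: the (1 x 4) stride shrinks the feature map immediately.
out_h, out_w = (T - 4) + 1, (F - 4) // 4 + 1      # valid conv, stride (1, 4)
bytes_no_pooling = out_h * out_w * n_filters      # 12,032 bytes

print(bytes_pooling / bytes_no_pooling)           # roughly a 4x smaller buffer
```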


Table 1: Obtained results for the two model architectures. The u- and q-rows are the classification scores for the unquantized and quantized models, the s-rows are the quantized model weight and buffer sizes, and the t-rows are the inference times to classify one input segment on the i.MX RT1064 development board.

Future work

Future research will mainly focus on the development of an on-edge model adaptation strategy using a fixed-point learning scheme. This would allow the model to adapt its parameters and learn new sound events on the fly, in real time, from newly collected samples.

Lode Vuegen
Postdoc

Applications of Artificial Intelligence and Machine Learning.