How we made our ML audio pipeline 37% faster on real hardware

Jan Jongboom

November 11, 2020

At Edge Impulse we enable developers to build and run their machine learning on small devices; think microcontrollers running at 80MHz with 128K of RAM that detect audible events. And because we operate on sensor data, it's important that our algorithms can run in real time; no one wants their device to activate a minute after they say a keyword.

To accomplish that we use a wide variety of optimizations: we use the vector extensions in the hardware (for example, CMSIS-DSP and CMSIS-NN on Arm cores) to make our mathematical operations fast, we use quantization and our EON Compiler on our neural networks to reduce the memory footprint of the network, and we choose sane defaults for parameters to make sure we can always use code paths that are hardware accelerated - like picking 32 filters over the standard 40 filters when calculating an audio spectrogram if your hardware can only do fast power-of-2 FFTs.

But we can always do better! Larry Bank - nicknamed Mr. Performance for a reason - recently joined us on our quest to optimize our embedded SDK even further: reducing our algorithmic complexity, making our code more SIMD friendly, and making our code aware of the target architecture. We'll be sharing more amazing performance improvements over the next weeks (lots more on this in our EON webinar on Dec. 9th), but his work on our audio pipeline has now landed and is available for all users (just re-export your project from the Studio to get the latest SDK release).

Here's Larry's analysis of the project for a detailed look into the optimizations:

-

Last week I was invited to work with the Edge Impulse team to improve the performance of their machine learning code on microcontrollers. The main focus of my work so far has been the audio preprocessing (DSP) code. This code first takes the audio samples and converts them from the time domain into the frequency domain with an FFT. The samples are then treated as a 2D matrix of frequencies changing over a period of time. These 2D values are then put through various mathematical operations. One of the things I noticed in a few places was that the matrices were being transposed (swapping the x and y axes) so that the inner loop of the math code would operate along a row instead of along a column.

The reason for this has to do with memory access on systems with cache memory. On ‘big’ CPUs like x86 and Arm Cortex-A, memory is slow compared to the CPU, so the cache tries to mitigate the latency. If you read memory in the ‘wrong’ direction, such as reading 1 byte from each cache line, you’ll cause tremendous delays compared to reading in the natural direction. This is the reason for the 2D matrix transposes before and after the math operations: they turn the columns into rows so that ‘big’ CPUs with SIMD can read and process groups of values faster. Since this specific code will be running on MCUs with tightly coupled memory and no useful SIMD instructions, the matrix transpose operations burn cycles for no benefit.
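To make the trade-off concrete, here is a minimal sketch of a column mean computed both ways. The function names, the flat row-major layout, and the caller-supplied scratch buffer are assumptions for illustration, not the actual SDK code. Both variants add the values in the same order and produce the same result; the direct one skips the copy.

```c
#include <stddef.h>

/* Transpose-first variant (illustrative): copy the rows x cols matrix into
 * scratch as cols x rows, so each original column becomes a contiguous row,
 * then average each of those rows. Assumes rows > 0. */
void column_means_transposed(const float *m, size_t rows, size_t cols,
                             float *scratch, float *out)
{
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            scratch[c * rows + r] = m[r * cols + c];

    for (size_t c = 0; c < cols; c++) {
        float sum = 0.0f;
        for (size_t r = 0; r < rows; r++)
            sum += scratch[c * rows + r];
        out[c] = sum / (float)rows;
    }
}

/* Direct variant (illustrative): walk down each column of the original
 * matrix with a stride of cols. No scratch buffer and no extra copy, which
 * is the cheaper choice when there is no cache to keep happy. */
void column_means_direct(const float *m, size_t rows, size_t cols, float *out)
{
    for (size_t c = 0; c < cols; c++) {
        float sum = 0.0f;
        for (size_t r = 0; r < rows; r++)
            sum += m[r * cols + c];
        out[c] = sum / (float)rows;
    }
}
```

On a cached CPU the transposed variant can still win for large matrices; on an MCU reading from TCM, the strided reads cost the same as contiguous ones, so the transpose is pure overhead.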

Another area where some time was saved was in the inner loops of the math operations. For example, when calculating the mean value of a column, the sum variable was an array element like this:

sums[row] = 0;
for (col=0; col<columns; col++)
  sums[row] += matrix[col][row]; // each += may be stored back to sums[row]

At first glance it looks perfectly fine, but the compiler usually cannot prove that the sums[] array does not overlap (alias) the memory it reads from the matrix. So even if it is smart about keeping the current sum in a register, it will typically write every change to memory as it occurs, much as if the value were volatile. A faster way to do this would look like this:

float sum = 0.0f; // local accumulator can stay in a register
for (col=0; col<columns; col++)
  sum += matrix[col][row];
sums[row] = sum; // single store after the loop
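For a version that can be compiled and compared, the two loops can be wrapped in functions as sketched below. The function names and the flat `matrix[col * rows + row]` layout are assumptions for illustration; the indexing mirrors the `matrix[col][row]` order used in the snippets above. Both functions compute the same sums; they differ only in how often the output array may be written.

```c
#include <stddef.h>

/* Accumulates directly into sums[row]. Because sums and matrix are both
 * float pointers, the compiler generally has to assume they might overlap,
 * so it may store sums[row] on every iteration of the inner loop. */
void sums_via_array(const float *matrix, size_t columns, size_t rows,
                    float *sums)
{
    for (size_t row = 0; row < rows; row++) {
        sums[row] = 0.0f;
        for (size_t col = 0; col < columns; col++)
            sums[row] += matrix[col * rows + row];
    }
}

/* Accumulates into a local variable and stores once per row. The local
 * cannot alias anything, so it is free to live in a register. */
void sums_via_local(const float *matrix, size_t columns, size_t rows,
                    float *sums)
{
    for (size_t row = 0; row < rows; row++) {
        float sum = 0.0f;
        for (size_t col = 0; col < columns; col++)
            sum += matrix[col * rows + row];
        sums[row] = sum;
    }
}
```

Whether the first form really emits a store per iteration depends on the compiler and optimization level, so it is worth checking the generated assembly for your target before and after the change.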

I tweaked a few other spots in the code, but most of the gains came from these changes. A few things to remember about optimization on the Cortex-M4/M7:

  • The SIMD instructions are only 32 bits wide and of limited benefit.
  • Most Cortex-M MCUs don’t use cache memory, only TCM.
  • Total instruction count is the metric that matters most.

-

And with that, the numbers! Running the MFCC processing block, which we use for recognizing human speech (Larry's work also applies to the MFE and spectrogram processing blocks), on a Cortex-M4F running at 80MHz, analyzing a 1 second window with 32 filters and with filterbank quantization disabled, previously took 254ms. With the new optimizations the same process takes only 161ms - a reduction of 37%, while using the exact same amount of memory and with the exact same accuracy.
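For reference, the arithmetic behind that figure, using only the two timings quoted above, works out as follows. The 37% is the reduction in processing time; expressed as a speed ratio, the same measurements correspond to roughly 1.58x.

```latex
\[
\frac{t_\text{old} - t_\text{new}}{t_\text{old}}
  = \frac{254\,\text{ms} - 161\,\text{ms}}{254\,\text{ms}}
  = \frac{93}{254} \approx 0.366 \approx 37\%,
\qquad
\frac{t_\text{old}}{t_\text{new}} = \frac{254}{161} \approx 1.58
\]
```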

Note: You don't need to calculate the full MFCC features for every window; we have some clever tricks to reduce inferencing time when sampling continuously. See Continuous audio sampling for more information.

Truly amazing work, and we can't wait to share the rest of Larry's performance work with you. To get the latest optimizations just re-export your project from the Deployment tab in the Edge Impulse Studio. Haven't built an embedded ML model yet? See our tutorial on Recognizing sounds in audio! We can't wait to see what you'll build. 🚀

-

Jan Jongboom is the CTO and cofounder of Edge Impulse. When it comes to Embedded Machine Learning he believes that every millisecond saved is a millisecond earned.
