At Edge Impulse we enable developers to build and run their machine learning on small devices: think microcontrollers running at 80MHz with 128K of RAM doing keyword spotting. And because we operate on sensor data, it's important that our algorithms can run in real time; no one wants their device to activate a minute after they say a keyword.
To accomplish that we use a wide variety of optimizations: we use the vector extensions in the hardware (e.g. using CMSIS-DSP and CMSIS-NN on Arm cores) to make our mathematical operations fast, we quantize our neural networks to reduce their memory footprint, and we choose sane defaults for parameters so that we can always use code paths that are hardware accelerated. For example, we pick 32 filters over the standard 40 filters when calculating an audio spectrogram if your hardware can only do fast power-of-2 FFTs.
But apparently, we can always do better! Arjan joined us a month or two ago as a firmware engineer (amongst other things helping us get the most out of Eta Compute's silicon, which has some other very interesting optimizations utilizing an external DSP and voltage and frequency scaling), and he spotted our use of the log() function to convert audio frequencies.
The log() function gets pulled in from libm, which already utilizes hardware optimizations (e.g. if you multiply two floats it will utilize the FPU if you have one), so I figured this was OK. But... this doesn't always hold up for natural logarithms. These are always computed as an approximation, and most implementations tend to value accuracy over computation time. In our case, however, we'd happily trade a small loss in accuracy for speed.
For some other functions (such as sin) Arm provides optimized versions in CMSIS-DSP's Fast Math library, but for a fast log function we need to head to njuffa's excellent answer on Stack Overflow, which calculates a pretty accurate approximation. Plugging it into our inferencing SDK immediately makes our pipeline 6.6% faster on the Arduino Nano 33 BLE Sense, with negligible effect on our accuracy.
With fast log (Arduino Nano 33 BLE Sense)
Predictions (DSP: 464 ms., Classification: 14 ms., Anomaly: 0 ms.): no: 0.01172 noise: 0.03516 yes: 0.94922
Without fast log (Arduino Nano 33 BLE Sense)
Predictions (DSP: 496 ms., Classification: 14 ms., Anomaly: 0 ms.): no: 0.05078 noise: 0.05078 yes: 0.89844
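To give a feel for how this kind of approximation works, here is a minimal sketch in C. It is not the code from njuffa's answer and not what ships in our SDK: the function name fast_logf_sketch is hypothetical, and the polynomial coefficients are an illustrative degree-4 fit. The idea is to split an IEEE-754 float into its exponent e and mantissa m (so that x = m * 2^e with m in [1, 2)), approximate ln(m) with a short polynomial, and add e * ln(2). The sketch assumes a positive, finite, normalized input and does no handling of zero, negative numbers, denormals, infinity or NaN.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper for illustration only.
 * Assumes x is a positive, finite, normalized IEEE-754 single-precision float. */
static inline float fast_logf_sketch(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* read the raw bits without aliasing issues */

    /* Unbiased exponent e, so that x = m * 2^e with m in [1, 2). */
    int32_t e = (int32_t)((bits >> 23) & 0xFFu) - 127;

    /* Keep the mantissa bits and force the exponent field to 127 (i.e. 2^0). */
    bits = (bits & 0x007FFFFFu) | 0x3F800000u;
    float m;
    memcpy(&m, &bits, sizeof m);

    /* Degree-4 polynomial approximating ln(m) on [1, 2), evaluated with
     * Horner's scheme. Coefficients are illustrative, not from the SDK. */
    float ln_m = -1.7417939f
               + (2.8212026f
               + (-1.4699568f
               + (0.44717955f - 0.056570851f * m) * m) * m) * m;

    /* ln(x) = ln(m) + e * ln(2) */
    return ln_m + (float)e * 0.69314718f;
}
```

The whole function is a handful of integer operations plus four multiply-adds on the FPU, with no branches and no table lookups, which is where the speed comes from. The price is accuracy: a fit like this should be good to roughly four decimal places, so compare it against logf on your own data before relying on it.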
What's interesting is that the target seems to make a big difference. An ST device with a similar clock speed (both are Cortex-M4F; the ST runs at 80MHz vs. 64MHz on the Arduino) and the same compiler only yields us a pipeline that is about 1% faster (283 ms with the fast log vs. 285 ms without). Perhaps some FPU ABI settings? Faster flash? We'll look into it! Until then, you'll enjoy faster performance and longer battery life on the same hardware.
Inspired? Here's how you build your own audio model that'll run on a microcontroller, from initial data capture, to building a DSP and machine learning pipeline, to running real-time classification on an MCU.
Jan Jongboom is the co-founder and CTO of Edge Impulse. He has spent too much time optimizing signal processing code.