Voice Command Recognition on Rubik Pi 3: Getting Started Guide

Imagine if every smart speaker streamed audio to the cloud 24/7, waiting for servers to detect "Hey Alexa." At 16 kHz and 16 bits per sample, that is roughly 32 KB of audio per second, about 77 GB per device every month, and that is before transcription costs, which can hit thousands of dollars per device. Add network latency, the privacy concerns of constant audio streaming, and the infrastructure needed to support millions of devices, and the approach simply is not feasible.

The smarter approach, used by Amazon, Google, and many embedded AI vendors, is on-device wake-word detection. Tiny models listen locally, trigger cloud processing only when needed, and dramatically reduce bandwidth, latency, and privacy exposure.

In this guide, we’ll show how to implement a fast, reliable, on-device wake-word detector for the Rubik Pi 3 using Edge Impulse. Just say “Hey Rubik” and watch the device wake up and get to work, like a mini AI genie at your command.

Audio fundamentals for machine learning

Human speech is composed of phonemes (basic sound units) shaped by the vocal tract. Vowels are characterized by resonant frequencies called formants, and consonants by brief bursts or noise. To a machine, speech is a waveform that can be digitized and transformed.

Real-world voice data can be messy. Background noise, cross-talk, and echoes can confuse a detector. Speaker variation (gender, age, accent, dialect) also matters; studies find that 66% of users report accent-related issues with speech systems. Robust datasets must include noise samples and diverse voices so the model learns to ignore irrelevant sounds and handle different accents.

To summarize, we’ll record at 16 kHz/16-bit, convert each clip to Mel-frequency features (like MFCCs or MFE spectrograms), and be mindful of noise and voice variation as we train our model.
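To make that concrete, here is a minimal Python sketch, purely for illustration (Edge Impulse computes these features for you inside its processing block), that loads a 16 kHz clip and extracts 13 MFCCs per frame with the librosa library, assuming a local recording named hey_rubik.wav:

# Illustration only: Edge Impulse's MFCC block does this for you during training and inference.
# Assumes librosa is installed (pip3 install librosa) and hey_rubik.wav is a local recording.
import librosa

# Load the clip and resample to 16 kHz mono, matching our recording settings
signal, sample_rate = librosa.load("hey_rubik.wav", sr=16000, mono=True)

# 13 cepstral coefficients per frame is a common starting point for speech
mfccs = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=13)

print(mfccs.shape)  # (13, number_of_frames): the kind of matrix the classifier sees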

Wake word design strategy

We’ve chosen “Hey Rubik” as our wake word. A good wake word is phonetically distinct and easy to say. Designers recommend short phrases of a few syllables that are not common words, so everyday conversation does not cause false triggers; this is why major voice assistants use phrases like “Hey Siri” rather than single common words. To make our wake word work, we must emphasize uniqueness and handle false positives carefully.

In short, “Hey Rubik” is an ideal wake word: multi-syllabic, phonetically distinctive, and unlikely to appear in normal conversation. Because of these advantages, it reduces the risk of false triggers and simplifies the training process.

Implementation

In this section, we will develop the wake-word model in Edge Impulse. First, we will set up Edge Impulse on the Rubik Pi 3.

Setting up Edge Impulse on the Rubik Pi 3

In our workflow, Rubik Pi 3 will be used to collect microphone data and perform on-device inference. We have attached a simple USB microphone that’s compatible with the Raspberry Pi. Here’s all the equipment used:

The USB microphone (left) and the Rubik Pi 3 single-board computer, showing the connection points for all required components (right).

Below is the setup process for installing the Edge Impulse Linux CLI on the Rubik Pi 3 so it can receive compiled models and run inference locally.

1. Preparing the Rubik Pi 3

If your Rubik Pi 3 is running Ubuntu 24.04, connect a USB keyboard, a USB microphone, and an HDMI display, then power it via the right-side USB-C port and press the front power button. When the console appears, sign in using ubuntu for both the username and the password. 

If your device is not running Ubuntu 24.04, refer to the official Thundercomm Rubik Pi 3 instructions provided in the Edge Impulse documentation.

2. Installing the Edge Impulse Linux CLI

Once your Rubik Pi 3 is online and rebooted, install the required dependencies:

sudo apt update


curl -sL https://deb.nodesource.com/setup_20.x | sudo bash -
sudo apt install -y gcc g++ make build-essential nodejs sox \
  gstreamer1.0-tools gstreamer1.0-plugins-good \
  gstreamer1.0-plugins-base gstreamer1.0-plugins-base-apps


sudo npm install edge-impulse-linux -g --unsafe-perm

This installs Node.js, audio/GStreamer utilities, build tools, and the Edge Impulse CLI itself.

3. Connecting your Rubik Pi 3 to Edge Impulse

Run the device agent:

edge-impulse-linux

A setup wizard will appear, prompting you to log in with your Edge Impulse account, choose the project to connect the device to, select the microphone, and give the device a name. We will come back to this connection after creating the model. If you ever need to reset the configuration or switch projects, run:

edge-impulse-linux --clean

Since our wake-word project only requires audio input, you can start the Edge Impulse agent with the camera disabled:

edge-impulse-linux --disable-camera

Next, verify that your Rubik Pi 3 is using the correct microphone. List all audio capture devices with:

arecord -l

This will display the available sound cards and device numbers, e.g., 

card 1: Device [USB Microphone], device 0: subdevice 0.

Once you know the correct device, update the Edge Impulse configuration to use it:

nano edge-impulse-config.json

In the JSON file, set the audio device entry to match your microphone, for example:

"audio": "hw:1,0"

Save the file and restart the agent. The Edge Impulse runner will now capture audio from the correct microphone for live inference or data acquisition.

4. Verifying device connection

Open your project in Edge Impulse Studio and go to Devices. You should now see your Rubik Pi 3 listed as an active Linux target.

Rubik Pi 3 Device Detected Among Your Devices

Once this is complete, your board can receive impulse bundles, run real-time inference, and act as the deployment target for your wake-word detection pipeline.

Edge Impulse project setup and configuration

First, create a new Edge Impulse project in the Edge Impulse Studio. 

Once set up, open the Data acquisition tab in the Studio. Here, you define how to record samples from your device.

For example, to record a 2-second clip:

  1. Once the device is detected, go to the Data acquisition tab to start sampling audio from the microphone.
Sampling Audio from the Rubik Pi 3
  2. Choose the sensor: built-in microphone or USB microphone, depending on your setup.
  3. Set Label to Rubik (for our wake word) and set Sample length to 2000 (milliseconds).
  4. Click Start sampling and speak “Hey Rubik” clearly for one second.
  5. Once the recording is complete, you can divide the dataset into training and testing sets.
Performing Train/Test Split of the Dataset

After recording, Edge Impulse uploads the clip and displays its waveform in the Data Acquisition tab. You should see a new entry in the Collected data labeled Rubik. Repeat this multiple times to gather many examples. The Live Classification feature in Edge Impulse (EI) also lets you stream continuous audio and cut out windows in real time.

Building your voice dataset

A high-quality dataset is key. We want “Hey Rubik” utterances from many speakers, plus lots of negative examples.

By mixing speakers, venues, and microphone placements, the dataset will cover real-world variation.

Signal processing pipeline design

With data in hand, we configure the Impulse Design. Edge Impulse uses a two-stage model: a processing block (feature extraction) and a learning block (neural network).

The Impulse Design: Input, MFCC Block, Classification Block, and Output Features

The processing block computes Mel-frequency cepstral coefficients (MFCCs) from each audio window; MFCC is the standard choice for human speech. Edge Impulse also offers MFE, which produces a mel-scaled spectrogram and can sometimes work better for non-speech tasks.

The MFCC block will output a matrix (frames × coefficients) for each 2s clip. By default, it uses ~30 ms frames with ~10 ms overlap – good starting values.
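To get a feel for the size of that matrix, here is a quick back-of-the-envelope calculation in Python, assuming a 30 ms frame length, a 20 ms stride (i.e., 10 ms of overlap between consecutive frames), and 13 coefficients per frame; the real values will be whatever the MFCC block (or the autotune step below) settles on:

# Back-of-the-envelope feature-matrix size; frame, stride, and coefficient values are assumptions.
window_ms = 2000       # one 2-second clip
frame_ms = 30          # assumed frame length
stride_ms = 20         # assumed stride (10 ms overlap between consecutive frames)
coefficients = 13      # assumed MFCC coefficients per frame

frames = 1 + (window_ms - frame_ms) // stride_ms
print(f"{frames} frames x {coefficients} coefficients = {frames * coefficients} features")
# -> 99 frames x 13 coefficients = 1287 features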

In our case, we utilized the autotune feature, which automatically searches for the optimal parameters based on the data.

Autotuning the Parameters depending on the Data

Edge Impulse will suggest a neural network architecture. For audio, a 1D CNN on the MFCC frames, or even a dense network, can work. We’ll start with the default model, swapping its convolution layers for 2D convolutions.

Our Neural Network Architecture: Default Settings with 2D Convolution Layer

Together, this impulse (MFCC + NN) will extract voice features optimized for our data.
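For reference, the generated network can be inspected or edited in Keras expert mode in the Studio. The snippet below is only a rough sketch of a comparable small 2D-convolutional classifier over the MFCC matrix; the exact layers, shapes, and hyperparameters Edge Impulse generates for your project will differ:

# A rough sketch of a small 2D-CNN keyword spotter, not the exact model Edge Impulse generates.
# Assumes 99 frames x 13 MFCC coefficients per 2 s window and three classes: Noise, Rubik, Unknown.
import tensorflow as tf

frames, coefficients, num_classes = 99, 13, 3

model = tf.keras.Sequential([
    tf.keras.layers.Reshape((frames, coefficients, 1), input_shape=(frames * coefficients,)),
    tf.keras.layers.Conv2D(8, kernel_size=3, activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Conv2D(16, kernel_size=3, activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()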

Feature engineering for voice recognition

Before training, inspect your features in the Feature explorer, which plots the processed samples so you can see how the classes cluster.

The goal is to have features that cleanly separate “Hey Rubik” from everything else. Good pre-processing ensures the network sees consistent input.

Neural network training and optimization

With the Impulse designed, click Train model.

Saving and Training the Model

Edge Impulse will train the network and show metrics such as accuracy, loss, and a confusion matrix.

As you train, iterate: adjust layers or add more epochs if underfitting. If the model stalls, try changing the learning rate or architecture depth. Edge Impulse’s Retrain model lets you quickly experiment. The confusion matrix can indicate problems (e.g., if “Hey Rubik” is often seen as “unknown,” you may need clearer positive examples). In our case, with the settings we chose, here are the results:

Training Performance of the Audio Spotting Model

Model testing and validation

After training, rigorously test the model. Run the held-out test set in the Model testing tab, and use Live classification to stream audio from the device and check predictions in real time.

Model Test Report Showing One Misclassification

By combining live listening and test-set metrics, we ensure the wake-word detector is reliable across conditions. Regularly repeat testing if you tweak the model, to guard against unintentional overfitting.
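When reviewing test results, two numbers matter most for a wake word: false accepts (the device wakes when it shouldn’t) and false rejects (it misses a genuine “Hey Rubik”). A small helper like the hypothetical one below, fed with (true label, predicted label) pairs taken from your own test log, makes that trade-off easy to track between iterations:

# Hypothetical helper: tally false accepts and false rejects for the wake-word class.
# `results` would come from your own test log as (true_label, predicted_label) pairs.
def wake_word_errors(results, wake_label="Rubik"):
    false_accepts = sum(1 for true, pred in results if true != wake_label and pred == wake_label)
    false_rejects = sum(1 for true, pred in results if true == wake_label and pred != wake_label)
    return false_accepts, false_rejects

results = [("Rubik", "Rubik"), ("Noise", "Noise"), ("Unknown", "Rubik"), ("Rubik", "Unknown")]
print(wake_word_errors(results))  # -> (1, 1)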

Model deployment and integration

Once satisfied, it’s time to run the model on the device:

  1. Download the model: In Edge Impulse Studio, go to Deployment, choose Linux (ARM64 with Qualcomm QNN), and click Build. This generates an .eim file (Edge Impulse Model).

  2. Run with the Edge Impulse CLI: On the Rubik Pi 3 (with the CLI installed), open a terminal and run:

edge-impulse-linux-runner

The runner will automatically pull the latest model from your project, ask you to select an impulse and a model optimization, and then start classification with hardware acceleration. The console will show live predictions as follows:

ubuntu@ubuntu:~$ edge-impulse-linux-runner
Edge Impulse Linux runner v1.18.2
[RUN] Already have model /home/ubuntu/.ei-linux-runner/models/848055/v2-quantized-runner-linux-aarch64-impulse2/model.eim not downloading...
[RUN] Starting the audio classifier for haziqasajid / PI-classification (v2)
[RUN] Parameters freq 16000Hz window length 2000ms. classes [ 'Noise', 'Rubik', 'Unknown' ]
[RUN] Using microphone hw:1,0
Want to see live classification in your browser? Go to http://192.168.100.25:4912
Want to use predictions in your application? Open a websocket to ws://192.168.100.25:4912
classifyRes 2ms. { Noise: 0, Rubik: 0, Unknown: 0 }
classifyRes 2ms. { Noise: 0, Rubik: 0, Unknown: 0 }
classifyRes 4ms. { Noise: 0, Rubik: 0, Unknown: 0 }
classifyRes 10ms. { Noise: 0.7539, Rubik: 0.0078, Unknown: 0.2383 }
classifyRes 11ms. { Noise: 0.7031, Rubik: 0.0039, Unknown: 0.2931 }
classifyRes 11ms. { Noise: 0.7539, Rubik: 0.0313, Unknown: 0.2148 }
classifyRes 11ms. { Noise: 0.875, Rubik: 0.0117, Unknown: 0.1172 }
classifyRes 11ms. { Noise: 0.6006, Rubik: 0.0039, Unknown: 0.4075 }
classifyRes 11ms. { Noise: 0.25, Rubik: 0.0156, Unknown: 0.7383 }
classifyRes 11ms. { Noise: 0.4766, Rubik: 0.0781, Unknown: 0.4531 }
classifyRes 11ms. { Noise: 0.0977, Rubik: 0.0156, Unknown: 0.8906 }
classifyRes 11ms. { Noise: 0.0195, Rubik: 0.293, Unknown: 0.6914 }
---------------------------------------------------------------------
classifyRes 11ms. { Noise: 0.0352, Rubik: 0.8672, Unknown: 0.0977 }
---------------------------------------------------------------------
classifyRes 11ms. { Noise: 0.9375, Rubik: 0, Unknown: 0.0625 }
classifyRes 11ms. { Noise: 0.9375, Rubik: 0, Unknown: 0.0625 }
classifyRes 11ms. { Noise: 0.5625, Rubik: 0.0195, Unknown: 0.418 }

In the terminal output above, the dashed lines mark where we said the wake word: the Rubik score jumps to 0.8672.

  3. Manual .eim: Alternatively, copy the .eim file to the Rubik Pi 3 via scp or USB, then run:

edge-impulse-linux-runner --model-file downloaded-model.eim

This directly loads the model (quantized where possible for the NPU) and begins inference.
  4. Testing on-device: With the model running, speak "Hey Rubik" near the microphone. You should see the Rubik class detected with a high confidence score.

This completes the end-to-end pipeline. The Rubik Pi 3’s hardware NPU (12 TOPS) ensures that even a moderate CNN runs quickly, making real-time recognition smooth.

Testing & validation (optional)

In final testing, treat the system as a black box: say the wake word and a variety of non-wake phrases in realistic conditions (different rooms, distances, speakers, and background noise) and note every hit, miss, and false wake.

Finally, implement a confidence threshold in your application code. For example, ignore “Hey Rubik” detections unless the model confidence is ≥0.7. This avoids acting on low-confidence guesses.
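One way to wire that threshold into your own code is with the Edge Impulse Linux Python SDK (pip3 install edge_impulse_linux), which can load the .eim file directly. The sketch below is modeled on the SDK’s audio classification example; treat the exact field names and the classifier() arguments as version-dependent assumptions rather than a definitive implementation:

# Sketch based on the audio example in the Edge Impulse Linux Python SDK;
# field names and arguments may differ slightly between SDK versions.
from edge_impulse_linux.audio import AudioImpulseRunner

MODEL_PATH = "downloaded-model.eim"  # the .eim built in the Deployment step
THRESHOLD = 0.7                      # ignore wake-word detections below this confidence

runner = AudioImpulseRunner(MODEL_PATH)
try:
    model_info = runner.init()
    print("Loaded model with labels:", model_info['model_parameters']['labels'])

    # classifier() streams audio windows from a microphone and yields results;
    # you may need to pass an audio device id here, as in the SDK's own example.
    for res, audio in runner.classifier():
        score = res['result']['classification'].get('Rubik', 0.0)
        if score >= THRESHOLD:
            print("Wake word detected (confidence %.2f)" % score)
            # hand off to your command handler or speech-to-text step here
finally:
    runner.stop()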

A consistent user experience means reliable recognition for all accents and minimal false wakes.

Conclusion and next steps

In this guide, we built a basic on-device voice keyword spotter on the Rubik Pi 3. Key steps included understanding the audio signal (sampling, spectrograms), designing a distinct wake word, collecting a diverse dataset, and using Edge Impulse to extract MFCC features and train a neural network. We then deployed the model using Edge Impulse Runner on the Rubik Pi 3, achieving real-time voice detection via the Rubik Pi 3's hardware AI accelerator.

The Rubik Pi 3’s 12 TOPS AI engine means it can handle this task with room to spare. Your exact accuracy will depend on data quality, but with a few hundred diverse samples, we typically saw over 90% correct detection in quiet conditions.

Finally, once the wake word is caught, you could feed subsequent audio to an onboard speech-to-text engine or simple command parser for full voice control.
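As a rough sketch of that hand-off, assuming the USB microphone is at hw:1,0 (as configured earlier) and using a hypothetical transcribe_command() placeholder for whichever speech-to-text engine or parser you choose, you could capture a short follow-up clip once the wake word fires:

# Hypothetical follow-up step: record a short command clip after the wake word is detected.
# Assumes the USB microphone is hw:1,0 (see `arecord -l`); transcribe_command() is a placeholder.
import subprocess

def capture_command(seconds=3, device="hw:1,0", path="/tmp/command.wav"):
    # 16 kHz, 16-bit mono, matching the format used for training data
    subprocess.run([
        "arecord", "-D", device, "-f", "S16_LE", "-r", "16000",
        "-c", "1", "-d", str(seconds), path
    ], check=True)
    return path

def transcribe_command(wav_path):
    # Placeholder: plug in your on-device speech-to-text engine or command parser here
    raise NotImplementedError

# After a wake-word detection: clip = capture_command(); text = transcribe_command(clip)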

Get started today and see how easily you can integrate voice control into your projects!
