Building a Gesture-Driven Interface Using YOLO-Pro

The jury is still out on how we will control the devices around us as the age of physical computing matures and integrates more deeply into our daily lives. Touchscreens and voice control are among the most popular options today, but there is plenty of room for more intuitive interfaces. As these systems become more pervasive, the industry must also prioritize a higher standard of accuracy to eliminate the friction of misinterpreted commands. Furthermore, there is an urgent need for robust privacy frameworks that can sense our intentions without making us feel like we are under constant surveillance.

Hand gesture recognition is a good candidate for the next generation of smart devices. Unlike voice control, which often turns into a shouting match when the system does not properly understand a request, gesture systems can be tuned to respond predictably to a predefined set of deliberate, high-precision movements. Moreover, on an edge AI computing platform, all data can stay on the device, which removes the privacy concerns that come with sending camera footage to the cloud.

The RUBIK Pi 3 development board

While this all sounds good on paper, there are some issues with today’s gesture recognition systems. Chief among them is that when using a camera-based solution, the user must be visible to the device for operation. That can be more than a minor annoyance when trying to control a device across the room. Move just slightly out of view, and those hand gestures will all be for naught.

Naveen Kumar thinks that this problem can be overcome with a more intelligent gesture recognition system. If the system could not only recognize gestures but also track the user’s hands and move its camera to follow them, then we would no longer have to worry about staying in view of the device. To test how well this idea works in the real world, Kumar designed a proof-of-concept hand-tracking gesture recognition robot.

High-Performance Hardware for the Edge

The robot is powered by a RUBIK Pi 3 development board featuring the Qualcomm Dragonwing QCM6490 System-on-Chip. This SoC combines an octa-core Qualcomm Kryo CPU, an Adreno 643 GPU, and a Hexagon 770 DSP that includes a dedicated AI accelerator capable of delivering up to 12 trillion operations per second. That onboard neural processing unit allows the system to run advanced machine learning models locally, avoiding the performance bottlenecks and privacy risks associated with cloud-based inference.

The hand-tracking camera

For actuation, the robot relies on servo motors controlled through a SparkFun Servo pHAT over an I2C connection. Two DFRobot DSS-P05 standard servos handle the pan and tilt motion, enabling the camera to physically track a user’s hands in three-dimensional space. A DFRobot pan-tilt bracket assembly provides a sturdy mechanical foundation, while custom 3D-printed mounting components ensure the system remains stable during operation.

Visual input comes from an Elecom 5MP webcam mounted to the pan-tilt assembly. The entire mechanical stack, from base plate to camera mount, has been modeled in a Unified Robot Description Format (URDF) file. This URDF defines the robot’s kinematic chain, joint limits, coordinate frames, and mesh geometry, ensuring accurate transformations between the camera, pan, and tilt axes. The model can be visualized and validated in RViz2, allowing developers to confirm alignment and motion constraints before deploying code to real hardware.
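
As a rough illustration of the kind of pan and tilt transformation such a model encodes, the sketch below computes where the camera points for a given pair of joint angles. The axis conventions and the function name camera_direction are assumptions made for this example and are not taken from Kumar's URDF.

```python
import math

def camera_direction(pan_rad, tilt_rad):
    """Return the unit vector along the camera's optical axis.

    Assumed convention (hypothetical, not from the project's URDF):
    the camera looks along +x when both joints are at zero, pan rotates
    the view about the vertical z axis, and positive tilt raises the
    view toward +z.
    """
    x = math.cos(tilt_rad) * math.cos(pan_rad)
    y = math.cos(tilt_rad) * math.sin(pan_rad)
    z = math.sin(tilt_rad)
    return (x, y, z)

# Example: pan 90 degrees to the left while keeping the camera level.
direction = camera_direction(math.radians(90), 0.0)
```

A real URDF chain also includes the fixed offsets between the base, the pan joint, the tilt joint, and the camera, which this sketch leaves out.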

Teaching a Robot to See

To teach the system how to recognize hands, Kumar turned to Edge Impulse Studio, a platform designed for building and deploying embedded machine learning models. After creating a project and connecting the RUBIK Pi 3 using the Edge Impulse CLI, he collected 357 labeled images under varying lighting conditions, distances, and hand poses.

A sample of the training data

These images were uploaded and annotated through Edge Impulse’s labeling tools, creating a dataset that reflects real-world variability. The goal was to ensure the model could generalize beyond a single environment or user, detecting hands across different scales and backgrounds.

Within Edge Impulse Studio, Kumar configured a custom impulse — a processing pipeline that preprocesses data before passing it into a machine learning model. In this case, images were resized to 224×224 pixels and processed in RGB color space, balancing performance with sufficient spatial detail for object detection.

For the learning block, the project uses Edge Impulse’s latest YOLO-Pro object detection model, chosen for its ability to detect objects at multiple scales while running efficiently on embedded hardware. Basic spatial and color augmentations were applied during training, helping the model become more resilient to variations in lighting, orientation, and background clutter.

An object detection impulse

After training, Edge Impulse Studio generated an optimized, quantized (int8) version of the model that is compatible with Qualcomm’s AI accelerator. Quantization reduces model size and improves inference speed while preserving most of the model’s predictive power, which is essential for edge deployments.
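
To make the idea of int8 quantization concrete, here is a minimal sketch of affine quantization, the general technique of mapping floating-point values onto 8-bit integers with a scale and a zero point. The scale, zero point, and sample weights are hypothetical, and this is not the procedure Edge Impulse Studio runs internally.

```python
def quantize(values, scale, zero_point):
    """Map floats to int8 with q = round(x / scale) + zero_point."""
    quantized = []
    for x in values:
        q = round(x / scale) + zero_point
        quantized.append(max(-128, min(127, q)))  # clamp to the int8 range
    return quantized

def dequantize(quantized, scale, zero_point):
    """Recover approximate floats with x = (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in quantized]

weights = [-1.0, -0.25, 0.0, 0.4, 1.0]   # hypothetical float weights
scale, zero_point = 1.0 / 127, 0          # hypothetical parameters for [-1, 1]
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
```

Each restored value differs from the original by at most half of the scale, which is the small loss in precision that is traded for a model roughly a quarter the size of its 32-bit float counterpart.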

On the training dataset, the quantized model achieved a precision score of approximately 94.4 percent. More importantly, when evaluated on unseen test data, the model demonstrated an accuracy of around 97.1 percent, indicating strong generalization beyond the original training images.
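
Scores like these rest on matching predicted boxes against labeled ones. The sketch below shows two common building blocks, intersection over union (IoU) and precision, as a general illustration. It does not reproduce how Edge Impulse computes its reported scores, and the box format (x, y, width, height) is an assumption.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap along each axis, or zero when the boxes do not intersect.
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def precision(true_positives, false_positives):
    """Fraction of predicted boxes that matched a labeled hand."""
    total = true_positives + false_positives
    return true_positives / total if total else 0.0
```

A prediction is typically counted as a true positive when its IoU with a labeled box exceeds a chosen threshold, such as 0.5.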

Live classification tests inside Edge Impulse confirmed that the model could reliably detect hands in real time. Bounding boxes remained stable across movement, distance changes, and moderate lighting variation, which is exactly the kind of performance needed for a responsive gesture-driven interface.

Results from the test dataset

For deployment, the model was exported as an Edge Impulse Model binary built specifically for Linux on AARCH64 with Qualcomm QNN support. Because Qualcomm’s AI accelerator does not support floating-point models, the quantized int8 variant was selected to ensure compatibility and optimal runtime performance.

Closing the Loop: From Perception to Precision Motion

The Edge Impulse Linux SDK for Python enables on-device inference, allowing ROS 2 nodes to feed camera frames directly into the model and retrieve detection results with minimal overhead. Qualcomm’s AI runtime libraries provide low-level access to the neural accelerator, ensuring that inference remains fast, power-efficient, and fully local.
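
As a sketch of what a node might do with the inference output, the helper below picks the most confident hand from a result dictionary. The layout assumed here (a "result" entry holding "bounding_boxes", each with "label", "value", "x", "y", "width", and "height") follows what the Edge Impulse Linux Python SDK returns for object detection models as far as I know, but treat it as an assumption. The function name, the "hand" label, and the threshold are hypothetical.

```python
def best_detection(result, label="hand", min_confidence=0.5):
    """Return the highest-confidence box with the given label, or None."""
    boxes = result.get("result", {}).get("bounding_boxes", [])
    candidates = [
        box for box in boxes
        if box["label"] == label and box["value"] >= min_confidence
    ]
    return max(candidates, key=lambda box: box["value"], default=None)

# Hand-written sample in the assumed result layout (not real model output).
sample = {
    "result": {
        "bounding_boxes": [
            {"label": "hand", "value": 0.62, "x": 10, "y": 20, "width": 40, "height": 40},
            {"label": "hand", "value": 0.91, "x": 120, "y": 60, "width": 56, "height": 56},
        ]
    }
}
best = best_detection(sample)
```

Tracking only the strongest detection keeps the camera from jumping between two hands, at the cost of ignoring the second one.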

The object detection node processes incoming camera images, runs them through the Edge Impulse model, and publishes detected bounding boxes as ROS 2 messages. A dedicated hand tracker node then interprets those detections, calculating the hand’s position relative to the image center.
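
The position calculation can be sketched as follows, assuming a bounding box given by its top-left corner plus width and height in pixels. The function name and the normalization to the range -1 to 1 are choices made for this example, not details from the project.

```python
def center_offset(box, image_width, image_height):
    """Return the box center's offset from the image center.

    Each component is normalized to [-1, 1]: 0 means centered, and
    +1 means the box center sits on the right or bottom image edge.
    """
    cx = box["x"] + box["width"] / 2
    cy = box["y"] + box["height"] / 2
    err_x = (cx - image_width / 2) / (image_width / 2)
    err_y = (cy - image_height / 2) / (image_height / 2)
    return err_x, err_y

# A hand in the upper-right corner of a 224x224 frame.
offset = center_offset({"x": 168, "y": 0, "width": 56, "height": 56}, 224, 224)
```

Normalizing the error makes the controller gains independent of the image resolution.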

Testing the model using Live Classification

Using a pair of PID controllers — one for pan and one for tilt — the tracker computes incremental adjustments to the camera’s orientation, smoothly following the user’s hand while respecting servo speed limits and mechanical constraints. The control signals are published as joint position commands, which flow through the ros2_control framework to a custom C++ hardware interface.
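
A minimal sketch of one such controller is shown below. The gains, the step limit standing in for the servo speed limit, and the joint range are hypothetical values, and the sign that maps image error to pan direction depends on how the camera is mounted.

```python
class PID:
    """Basic PID controller whose output is limited to +/- max_step."""

    def __init__(self, kp, ki, kd, max_step):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.max_step = max_step      # largest angle change per update (rad)
        self.integral = 0.0
        self.prev_error = None

    def update(self, error, dt):
        self.integral += error * dt
        # No derivative term on the first call, since there is no history yet.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        output = self.kp * error + self.ki * self.integral + self.kd * derivative
        return max(-self.max_step, min(self.max_step, output))

def clamp(value, low, high):
    return max(low, min(high, value))

pan_pid = PID(kp=0.2, ki=0.0, kd=0.0, max_step=0.05)  # hypothetical gains
pan_angle = 0.0
for error in (0.75, 0.5, 0.1):  # normalized horizontal errors over three frames
    # Add the limited step, then respect the (assumed) joint limits.
    pan_angle = clamp(pan_angle + pan_pid.update(error, dt=0.05), -1.57, 1.57)
```

A second instance with its own gains would handle tilt in the same way, driven by the vertical error.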

This hardware interface translates joint angles into precise PWM pulse widths for the servo motors, enabling responsive, low-latency actuation. The result is a closed-loop system in which perception and motion reinforce each other: the camera tracks the hand, improving visibility, which in turn stabilizes detection.
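
The angle-to-pulse conversion can be sketched as below. Kumar's hardware interface is written in C++; Python is used here only for illustration. The 500 to 2500 microsecond pulse range, the 180 degree travel, the 50 Hz frame rate, and the 12-bit counter (as found on PWM controllers such as the PCA9685) are assumptions rather than values from the project.

```python
def angle_to_pulse_us(angle_deg, min_us=500.0, max_us=2500.0, travel_deg=180.0):
    """Linearly map a servo angle to a pulse width in microseconds."""
    angle = max(0.0, min(travel_deg, angle_deg))  # stay within the travel range
    return min_us + (max_us - min_us) * angle / travel_deg

def pulse_to_ticks(pulse_us, frequency_hz=50.0, resolution=4096):
    """Convert a pulse width to counter ticks for one PWM period."""
    period_us = 1_000_000.0 / frequency_hz
    return round(pulse_us / period_us * resolution)

center_pulse = angle_to_pulse_us(90.0)        # midpoint of the assumed range
center_ticks = pulse_to_ticks(center_pulse)   # on-time to write to the controller
```

With these assumed values, one tick corresponds to just under 5 microseconds, which sets the smallest angle step the servo can be commanded to make.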

By combining the Edge Impulse platform with powerful and accessible hardware, the project demonstrates that better gesture recognition systems might be right around the corner. Beyond gesture control, Kumar’s system could support future applications in collaborative robotics, assistive technology, telepresence, and remote manipulation.

Do you have any ideas in mind for a hand-tracking robot? If so, be sure to take a look at the project write-up for more details.
