Willow 5 Benchmarks: The New Action Recognition & Detection Standard
Willow 5 MOGEN’s deterministic physics engine achieves a sustained hot-inference latency of 0.17ms per frame, establishing the new global standard.

By decoupling the Vision Layer (pixel extraction) from the Physics Layer (motion reasoning), Willow 5 executes action recognition 10x to 100x faster than current State-of-the-Art (SOTA) neural networks.
Testing Methodology & Hardware Disclosure
These sub-millisecond latencies were measured from an interpreted language (Python) on a non-real-time consumer operating system, demonstrating the efficiency of our JIT-compiled math kernels.
- Host OS: Windows 11 Home/Pro
- Runtime: Python 3.13 (Virtual Environment)
- Host Hardware: 13th Gen Intel® Core™ i5-13500HX (Performance-Grade Mobile Silicon)
- Execution Profile: Single-threaded CPU (Only one core utilized; No GPU/NPU acceleration required)
- Inference Mode: Hot-Inference (Post-JIT Compilation)
- Benchmark Baseline: 0.17ms sustained latency per frame
Model Complexity & Scalability
The benchmark used a 12-point "Sport-Standard" model (Jumping Jack) that tracks the torso, arms, and legs. The model was sourced from real 2D video captured on a mid-range Samsung Galaxy mobile device, which indicates the engine's resilience to commodity sensor noise. Input fidelity does not affect latency: the per-frame cost is the same whether the coordinates come from a phone camera, a biomechanics lab rig, or perfect synthetic data.
- Joint Complexity: The workload grows quadratically with joint count: n tracked points yield n(n-1)/2 pairwise distances. The 12-point model performs 66 distance calculations per frame, while a maxed-out 75-point "Robotics-Grade" model (including high-fidelity finger tracking) performs 2,775.
- Performance Projection: Even at maximum 75-point complexity, the Willow Engine maintains a sub-millisecond hot-inference latency (~0.7ms), remaining significantly more efficient than traditional AI architectures on the same hardware.
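The pair counts quoted above follow from the standard formula for unordered pairs. A minimal sketch to reproduce them (the function name `pair_count` is illustrative and not part of the Willow SDK):

```python
def pair_count(num_joints: int) -> int:
    """Number of unordered joint pairs, i.e. distances per frame."""
    if num_joints < 0:
        raise ValueError("num_joints must be non-negative")
    return num_joints * (num_joints - 1) // 2


for n in (12, 75):
    print(f"{n} joints -> {pair_count(n)} pairwise distances")
```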
Categorized Benchmarking
Tier 1: Motion Reasoning (Peer-to-Peer: Coordinates-to-Action)
Direct architectural comparison against models that consume skeletal data.
- ST-GCN, Shift-GCN, Tiny-GCN
- The Gap: Even the most optimized "Tiny" academic models require 1.2ms to 3.0ms per frame. At 0.17ms, Willow 5 MOGEN is roughly 7x to 18x faster than the fastest academic models and about 100x faster than industry-standard ST-GCN (~15-20ms).
Tier 2: End-to-End Perception (System-Level: Pixels-to-Action)
Functional comparison against "All-in-One" models that process raw video feeds.
- Meta’s SlowFast, Google’s MoViNet-A0.
- The Gap: These require high-end edge GPUs and run at latencies of 50ms to 150ms. Willow’s total pipeline (Vision + Physics) executes in ~15ms, providing a 3x to 10x speed advantage on commodity hardware.
Tier 3: Strategic Autonomy (Cloud-Tethered vs. Local Edge)
Economic comparison against foundation model APIs.
- Gemini, GPT
- The Gap: Latencies range from 2,000ms to 5,000ms. Willow executes in 0.00017 seconds locally, offline, and at zero per-frame cost.
Benchmark Data Table
| Architecture | Input Type | Complexity (FLOPs) | Edge Latency (Per Frame) |
|---|---|---|---|
| Willow 5 MOGEN | 3D Skeleton | ~3,000 | 0.17 ms |
| Tiny-GCN | 3D Skeleton | ~20,000,000 | 1.2 ms – 3.0 ms |
| ST-GCN | 3D Skeleton | ~16,000,000,000 | 15.0 ms – 50.0 ms |
| MoViNet-A0 | RGB Video | ~600,000,000 | 50.0 ms – 80.0 ms |
The Architecture
1. Relational Distance Matrix (RDM) vs. Neural Weights
Neural networks must multiply millions of weights to "infer" an action. Willow is weightless: it uses a topology of up to 75 points and calculates the 3D distance between joint pairs. A fixed set of distance calculations is far cheaper than evaluating millions of learned weights.
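To make the pairwise-distance idea concrete, here is a minimal pure-Python sketch. The function names and the choice of reference distance for normalization are assumptions made for illustration; the source does not specify Willow's actual data layout or normalization rule.

```python
import math
from itertools import combinations


def relational_distances(joints):
    """Distance for every unordered pair of 3D joints (i < j)."""
    return [math.dist(a, b) for a, b in combinations(joints, 2)]


def normalize(distances, reference_index=0):
    """Divide by one reference distance (hypothetical choice) so the
    result does not change when the whole skeleton is scaled."""
    ref = distances[reference_index]
    if ref == 0:
        raise ValueError("reference distance is zero")
    return [d / ref for d in distances]


# Three toy joints; a 12-point model would yield 66 distances per frame.
joints = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 12.0)]
rdm = relational_distances(joints)
print(rdm)
```

Dividing by a reference length is one common route to scale invariance: doubling every coordinate doubles every distance, so the ratios stay the same.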
2. Complexity Optimization
Willow utilizes Space-Optimized Subsequence DTW. We only store the "Current" and "Previous" columns of the math grid in RAM. Our compute cost is tied to the length of the model, not the length of the video stream.
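A two-column subsequence DTW can be sketched as follows. This is an illustrative version under stated assumptions, not Willow's kernel: each frame is reduced to a single scalar feature, the local cost is an absolute difference, and the class name `StreamingSubsequenceDTW` is hypothetical.

```python
import math


class StreamingSubsequenceDTW:
    """Subsequence DTW that keeps only the previous column in memory."""

    def __init__(self, template):
        self.template = list(template)
        # Column before any frame: row 0 is the free start, the rest unreachable.
        self.prev = [0.0] + [math.inf] * len(self.template)

    def update(self, frame):
        """Consume one frame; return the best cost of a match ending here."""
        m = len(self.template)
        cur = [0.0] + [math.inf] * m  # row 0 = 0: a match may start at any frame
        for i in range(1, m + 1):
            cost = abs(self.template[i - 1] - frame)
            cur[i] = cost + min(self.prev[i], cur[i - 1], self.prev[i - 1])
        self.prev = cur
        return cur[m]


dtw = StreamingSubsequenceDTW([0, 1, 2, 1, 0])
stream = [5, 5, 0, 1, 2, 1, 0, 5]
costs = [dtw.update(x) for x in stream]
best_frame = costs.index(min(costs))
```

Each call to `update` does work proportional to the template length and touches two columns of that length, so memory and per-frame cost stay fixed no matter how long the stream runs.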
3. Numba-Compiled Kernels
Willow 5 MOGEN uses Numba, which compiles the math kernels through the LLVM compiler, to bypass Python's interpreted loop overhead. The kernels are compiled to native machine code that runs directly on the CPU rather than through the Python interpreter.
4. The Zone Bitmask (Hardware-Level Efficiency)
The Willow binary header contains a 32-bit Zone Bitmask in which each bit acts as an "on/off" switch for a body zone. The SDK reads this mask and instructs the camera to disable unused tracking points, conserving CPU cycles and battery life.
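A 32-bit zone mask of this kind could be modeled as below. The zone names, bit positions, and little-endian byte order are all assumptions for illustration; the actual Willow header layout is not described in this document.

```python
import struct
from enum import IntFlag


class Zone(IntFlag):
    """Hypothetical zone-to-bit assignment (not the real Willow layout)."""
    TORSO = 1 << 0
    LEFT_ARM = 1 << 1
    RIGHT_ARM = 1 << 2
    LEFT_LEG = 1 << 3
    RIGHT_LEG = 1 << 4
    HANDS = 1 << 5


def pack_mask(zones: Zone) -> bytes:
    """Serialize the mask as an unsigned 32-bit little-endian integer."""
    return struct.pack("<I", int(zones))


def unpack_mask(raw: bytes) -> Zone:
    """Read a 4-byte mask back into a Zone flag set."""
    return Zone(struct.unpack("<I", raw)[0])


def enabled_zones(mask: Zone):
    """List the zones whose bit is set, so the rest can be skipped."""
    return [zone for zone in Zone if zone in mask]


mask = Zone.TORSO | Zone.LEFT_ARM | Zone.RIGHT_ARM
raw = pack_mask(mask)
active = enabled_zones(unpack_mask(raw))
```

A consumer would check membership (for example `Zone.HANDS in mask`) before requesting those tracking points from the vision layer.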
Engineering Disclosures
- Headless Performance: The 0.17ms figure covers the Willow Inference Engine only. It does not include the Vision Layer (e.g., MediaPipe). Every skeleton-based competitor needs the same upstream pose extraction, so the speed delta remains constant across all of them.
- The Single-Frame Confirmation Standard: To ensure absolute trigger precision and eliminate false positives, the SDK utilizes a "falling edge" peak-detection algorithm. While traditional neural networks require sliding windows of 30–60 frames to verify an action, Willow requires only a single frame of context to confirm a peak. This represents the absolute theoretical minimum for deterministic recognition.
- JIT Cold Start: On the very first execution, the Numba compiler requires a one-time "Cold Start" (~0.8s) to compile the kinetic logic into the CPU's native instruction set. All subsequent frames execute at the 0.17ms "Hot" rate. This pay-once architecture means that, for the remainder of the session, spatial reasoning runs as compiled machine code.
- Left-Biased Normalization: Our RDM math utilizes a left-side kinetic chain hierarchy to ensure scale invariance across users of all sizes.
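The "falling edge" confirmation described above can be sketched as a small state machine: a peak at frame t-1 is confirmed on frame t, the first frame where the signal drops. The signal being tracked (here a generic per-frame score), the threshold, and the class name are assumptions for illustration, not the SDK's actual interface.

```python
class FallingEdgePeakDetector:
    """Confirm a peak one frame after it occurs, on the first drop."""

    def __init__(self, threshold):
        self.threshold = threshold  # minimum peak height (hypothetical parameter)
        self.prev = None
        self.rising = False

    def update(self, value):
        """Return True if the previous frame is now confirmed as a peak."""
        confirmed = False
        if self.prev is not None:
            if value > self.prev:
                self.rising = True
            elif value < self.prev:
                # Falling edge: the previous frame was a local maximum.
                confirmed = self.rising and self.prev >= self.threshold
                self.rising = False
        self.prev = value
        return confirmed


detector = FallingEdgePeakDetector(threshold=0.8)
signal = [0.1, 0.4, 0.9, 0.7, 0.2, 0.3, 0.95, 0.5]
hits = [i for i, v in enumerate(signal) if detector.update(v)]
```

In this toy signal the peaks sit at frames 2 and 6, and each is confirmed exactly one frame later, which is the single frame of context the text refers to.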
Try the Benchmark Yourself
Engineers can clone our reference repository, run the headless stress test, and verify these microsecond latencies on their own hardware.
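For readers who want to separate cold-start cost from hot latency in their own measurements, a harness along these lines may help. It is a generic sketch, not the reference repository's stress test: `benchmark` and the stand-in kernel are hypothetical, and any callable can be substituted. With a Numba-compiled function, the first call includes the JIT compilation time.

```python
import statistics
import time


def benchmark(kernel, frame, hot_runs=1000):
    """Time the first call separately, then summarize the hot calls (ms)."""
    start = time.perf_counter()
    kernel(frame)
    cold_ms = (time.perf_counter() - start) * 1e3

    samples = []
    for _ in range(hot_runs):
        start = time.perf_counter()
        kernel(frame)
        samples.append((time.perf_counter() - start) * 1e3)

    return {
        "cold_ms": cold_ms,
        "hot_median_ms": statistics.median(samples),
        "hot_max_ms": max(samples),
    }


# Stand-in workload: summing 66 values, one per joint pair of a 12-point model.
result = benchmark(sum, [1.0] * 66, hot_runs=200)
print(result)
```

Reporting the median alongside the maximum matters on a non-real-time OS, where occasional scheduler pauses can inflate individual samples.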