Our edge AI compiler outperforms Google and vendor toolchains. | DeepGate<br>Edge AI tooling still lags behind the compilers and runtimes built for large GPU-based models. Most microcontroller deployments rely on Google’s TensorFlow Lite for Microcontrollers (TFLM) or vendor-specific variants – an approach we believe leaves significant performance untapped. At the edge, efficiency determines whether a model fits at all, runs in real time, or meets its power budget. Our goal is to build the leading edge AI compiler for CPUs and AI accelerators, starting with the smallest devices: microcontrollers.<br>We’re releasing the DeepGate compiler (v0.15.0), which compiles quantized .tflite models into optimized inference binaries that use up to 3× less RAM and run up to 2× faster than Google’s TFLM on Arm Cortex-M devices. In our evaluation on MLPerf Tiny, a benchmark suite for machine learning on microcontrollers, it outperformed TFLM across silicon from Analog Devices, Infineon, Silicon Labs, and STMicroelectronics, while also outperforming Infineon’s and Silicon Labs’ own toolchains on their hardware. In some cases, our compiler enabled models to run that otherwise would not fit in memory.<br>Outperforming vendor toolchains on their own hardware<br>We’ve validated the DeepGate compiler (v0.15.0) on the MLPerf Tiny v1.4 benchmark suite, the industry-standard benchmark for machine learning on microcontrollers. We ran it on four boards from four silicon vendors and submitted the results to MLPerf for independent review. The suite includes representative edge AI workloads: keyword spotting, visual wake words, image classification, and anomaly detection. Without any modification to the models, our compiler uses up to 3× less RAM and runs up to 2× faster than Google’s TFLM.
It also outperforms vendor toolchains, delivering up to 3× lower RAM usage and 1.8× faster inference than Silicon Labs’ TFLM Simplicity SDK on the EFR32MG24’s AI accelerator, and up to 2× faster inference than Infineon’s Imagimob on the PSoC 6. These memory savings can decide whether a model fits at all: on Analog Devices’ MAX32655, the Visual Wake Words benchmark ran out of memory under TFLM but compiled and executed successfully with the DeepGate compiler.<br>Explore every comparison below: switch boards, compare frameworks where available, and toggle between latency and RAM usage. RAM is measured as the tensor arena plus peak stack size.<br>[Interactive chart: latency and RAM usage per board (STMicroelectronics, Analog Devices, Silicon Labs, Infineon), comparing Google’s TFLM, DeepGate, and ST Edge AI where available. Example view: STM32H7A3, Cortex-M7 @ 280 MHz; DeepGate runs up to 1.9× faster.]
ST Edge AI from STMicroelectronics remains highly competitive. Against its balanced compilation setting, we deliver 1.1× faster keyword spotting inference and 1.6× lower RAM usage on anomaly detection; the other workloads remain a focus for upcoming releases.<br>How we did it<br>Meaningful efficiency gains require optimization across multiple dimensions, so we optimized across each of them: the compiler produces static binaries rather than relying on a runtime interpreter, plans whole-graph memory allocation at compile time, and applies hardware-aware kernel optimizations beyond Arm’s standard CMSIS-NN kernels, including custom assembly routines tuned through hardware-in-the-loop testing.<br>Setup – Google’s TFLM: manual op registration and arena sizing; DeepGate compiler: automatic<br>Execution – Google’s TFLM: runtime interpreter; DeepGate compiler: statically compiled binary<br>Memory planning – Google’s TFLM: arena manually sized, greedy buffer reuse; DeepGate compiler: arena optimally laid out at compile time<br>Kernels – Google’s TFLM: Arm CMSIS-NN; DeepGate compiler: custom assembly, hardware-in-the-loop tuned<br>What makes the DeepGate compiler different<br>We’re still early in our optimization roadmap, with significant opportunities remaining in areas such as memory planning and kernel optimization. We’re also expanding support for approaches that existing edge AI toolchains often underserve, including sparse networks, lower-bit quantization, and efficient attention mechanisms for Transformer models. Looking further ahead, we are co-designing our compiler around DeepGate’s novel ML building blocks, which reduce reliance on costly matrix multiplications and enable greater use of in-place computation – paving the way for models fundamentally better suited to constrained hardware.<br>What’s next<br>Today, our compiler targets Arm Cortex-M CPUs and selected embedded AI accelerators, and we’re actively expanding that support. We’d love to hear which targets matter most to you.
Sign up for updates, request platform access, or get in touch if there’s a device you’d like us to support next.