Tiny LLM Benchmark: Jetson Orin Nano Super 8GB


Tiny LLM Benchmark: Jetson Orin Nano Super 8GB - Yuvraj Singh Portfolio


Four Power Modes × Eight Models: llama.cpp vs Ollama

Platform: NVIDIA Jetson Orin Nano Super 8GB

CPU: 6-core Arm Cortex-A78AE · GPU: NVIDIA Ampere (1024 CUDA cores, 32 Tensor cores)

Memory: 8 GB LPDDR5 shared CPU+GPU · JetPack: R36.4.7 (L4T 36.4)

Backends: llama.cpp CUDA (-ngl 99, --no-cache-prompt) · Ollama (CUDA, matched quantizations)

Runs: llama.cpp - four full sweeps: 7W, 15W, 25W, MAXN_SUPER · Ollama - 7W, 15W, 25W, and MAXN complete

Sweep: prompt ∈ {128, 512, 1024, 2048} tok × gen ∈ {64, 128, 256} tok × 20 reqs/combo
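To make the size of the sweep concrete, the grid can be enumerated in a few lines. This is a minimal sketch, not the author's harness: the constant and function names are illustrative, and the model count of eight is taken from the executive summary.

```python
from itertools import product

# Values taken from the sweep description; names are illustrative.
PROMPT_TOKENS = (128, 512, 1024, 2048)
GEN_TOKENS = (64, 128, 256)
REQUESTS_PER_COMBO = 20
NUM_MODELS = 8  # "eight models" per the executive summary


def sweep_combos():
    """Return every (prompt_tokens, gen_tokens) pair in the sweep."""
    return list(product(PROMPT_TOKENS, GEN_TOKENS))


combos = sweep_combos()
cells_per_mode = NUM_MODELS * len(combos)          # one cell = one model x one combo
requests_per_mode = cells_per_mode * REQUESTS_PER_COMBO

print(len(combos))         # 12
print(cells_per_mode)      # 96
print(requests_per_mode)   # 1920
```

The 96 cells per power mode match the cell counts listed for each Hugging Face dataset.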

Concurrency: 1 (single-user)

Key metric: output tok/J = OSL ÷ (decode_power_W × p50_decode_s) - decode-phase energy only
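The key metric can be written as a small function. This is a sketch under stated assumptions: the function names are illustrative, the first example uses hypothetical inputs rather than measured ones, and the cross-check assumes the reported output tok/s equals OSL ÷ p50_decode_s, in which case tok/J reduces to tok/s ÷ decode power.

```python
def output_tok_per_joule(osl_tokens, decode_power_w, p50_decode_s):
    """Output tokens per joule, counting decode-phase energy only."""
    if decode_power_w <= 0 or p50_decode_s <= 0:
        raise ValueError("power and decode time must be positive")
    decode_energy_j = decode_power_w * p50_decode_s  # watts x seconds = joules
    return osl_tokens / decode_energy_j


def implied_decode_power_w(tok_per_s, tok_per_j):
    """Decode power implied by a (tok/s, tok/J) pair.

    Assumes tok/s = OSL / p50_decode_s, so tok/J = tok/s / watts.
    """
    return tok_per_s / tok_per_j


# Hypothetical cell: 256 output tokens, 6.0 W decode power, 2.0 s median decode.
example = output_tok_per_joule(256, 6.0, 2.0)
print(round(example, 2))  # 21.33

# Cross-check with figures quoted later in this article for SmolLM2-135M at 25W.
implied = implied_decode_power_w(165.2, 29.62)
print(round(implied, 1))  # 5.6
```

The implied power of about 5.6 W agrees with the power draw quoted for SmolLM2-135M in the executive summary.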

Raw data on Hugging Face - complete per-cell JSON exports (all 33 metrics, 12 prompt×gen combos × 20 requests per cell, profile_export_aiperf.json + tegrastats.log + server logs):

llama.cpp

Mode<br>Dataset<br>Models<br>Cells

7W<br>YuvrajSingh9886/jetson-non-reasoning-benchmark-7w<br>8<br>96

15W<br>YuvrajSingh9886/jetson-non-reasoning-benchmark-15w<br>8<br>96

25W<br>YuvrajSingh9886/jetson-non-reasoning-benchmark-25w<br>8<br>96

MAXN<br>YuvrajSingh9886/jetson-non-reasoning-benchmark-maxn<br>8<br>96

Ollama

Mode<br>Dataset<br>Models<br>Cells

7W<br>YuvrajSingh9886/jetson-non-reasoning-benchmark-ollama-7w<br>8<br>96

15W<br>YuvrajSingh9886/jetson-non-reasoning-benchmark-ollama-15w<br>8<br>96

25W<br>YuvrajSingh9886/jetson-non-reasoning-benchmark-ollama-25w<br>8<br>96

MAXN<br>YuvrajSingh9886/jetson-non-reasoning-benchmark-ollama-maxn<br>8<br>96
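The eight dataset names follow a regular pattern, so they can be generated rather than typed out. The sketch below is one possible way to fetch them with `huggingface_hub.snapshot_download`. The helper names are illustrative, the `huggingface_hub` package is assumed to be installed for the download step, and the file layout inside each repo is not described here, so inspect it after downloading.

```python
HF_USER = "YuvrajSingh9886"
MODES = ("7w", "15w", "25w", "maxn")
BACKENDS = ("llama.cpp", "ollama")


def dataset_repo_id(backend, mode):
    """Build the Hugging Face dataset repo ID for one backend and power mode."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend: {backend}")
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    infix = "ollama-" if backend == "ollama" else ""
    return f"{HF_USER}/jetson-non-reasoning-benchmark-{infix}{mode}"


def download_dataset(backend, mode):
    """Download one dataset repo and return the local snapshot path.

    Needs network access and `pip install huggingface_hub`.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=dataset_repo_id(backend, mode),
        repo_type="dataset",
    )


all_repo_ids = [dataset_repo_id(b, m) for b in BACKENDS for m in MODES]
for repo_id in all_repo_ids:
    print(repo_id)
```

Calling `download_dataset("llama.cpp", "25w")` would pull the 25W llama.cpp export into the local Hugging Face cache.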

The GitHub repo with all code, scripts, and plotting notebooks can be found here.

My mini rack of 3× Jetson Orin Nano Super 8GB - $750 of edge inference crammed into a shoebox.

Version caveat: All results are specific to the tested software versions — llama.cpp build b9292 (commit ef570f630, CUDA backend) and Ollama v0.24.0 (default GPU offload).

Ollama v0.24.0 was the latest supported version that loaded the GGUFs for all eight models (LFM2.5 in particular) without failures on JetPack R36.4.7. It vendors llama.cpp at commit ec98e2002 (Dec 2025, ~5 months older than the standalone b9292 build). The Ollama version was held fixed at v0.24.0 because the tests were conducted in early June.

In particular, Ollama’s GGML CUDA backend for LFM2.5 models may improve in future versions. Re-benchmark before drawing conclusions about current versions.

GPU offloading: Ollama loaded all models with 100 % GPU offload (confirmed via ollama ps). No layers fell back to CPU. The performance gap is not caused by partial GPU offloading — it reflects differences in CUDA kernel efficiency and server overhead between the two backends at identical GPU utilisation.
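The offload check can be automated by reading the PROCESSOR column of `ollama ps`. The sketch below is an assumption-laden illustration: the sample text and model names are made up, the column layout may differ between Ollama versions, and the parser only looks for the literal string `100% GPU` in each row.

```python
import subprocess


def fully_on_gpu(ps_output):
    """Map each model name to True if its `ollama ps` row reports 100% GPU."""
    rows = [line for line in ps_output.splitlines() if line.strip()]
    result = {}
    for row in rows[1:]:  # skip the header row
        name = row.split()[0]
        result[name] = "100% GPU" in row
    return result


def check_running_models():
    """Run `ollama ps` on this machine and parse its output."""
    completed = subprocess.run(
        ["ollama", "ps"], capture_output=True, text=True, check=True
    )
    return fully_on_gpu(completed.stdout)


# Illustrative output only; names, IDs, and sizes are hypothetical.
SAMPLE = """NAME                ID              SIZE      PROCESSOR          UNTIL
example-a:latest    0123456789ab    1.2 GB    100% GPU           4 minutes from now
example-b:latest    ba9876543210    5.6 GB    48%/52% CPU/GPU    4 minutes from now
"""

status = fully_on_gpu(SAMPLE)
print(status)
```

Any model mapped to `False` would indicate a partial CPU fallback and should be excluded from a like-for-like backend comparison.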

Executive Summary

Eight models were benchmarked across all four Jetson Orin Nano Super power modes under llama.cpp CUDA and, for a direct backend comparison, under Ollama (matched quantizations) at all four power modes. Each model ran 12 combinations of prompt × generation length (20 requests per combo) at every power mode where it could load.

Key finding: 25W (nvpmodel -m 1) is the Pareto sweet spot for every model under llama.cpp. It delivers 35-47 % more output tok/s than 15W while pushing output tok/J 1-7 % higher than 15W and 9-23 % higher than MAXN_SUPER across every model (ctx=2048, gen=256, corrected decode-phase tok/J).
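The claim can be spot-checked for one model using the llama.cpp rows of Table 1 (smollm2-135m, ctx=2048, gen=256). This is a minimal sketch with illustrative function names. It covers only that single model, and the 7W tok/J value is the approximated one flagged in the table footnote.

```python
def pct_gain(new, base):
    """Percentage by which `new` exceeds `base`."""
    return (new / base - 1.0) * 100.0


def pareto_front(points):
    """Return names whose (tok/s, tok/J) pair is not dominated by another."""
    front = []
    for name, (speed, eff) in points.items():
        dominated = any(
            s2 >= speed and e2 >= eff and (s2 > speed or e2 > eff)
            for other, (s2, e2) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# (output tok/s, output tok/J) for smollm2-135m under llama.cpp, from Table 1.
MODES = {
    "7W": (53.8, 27.0),
    "15W": (114.7, 27.58),
    "25W": (165.2, 29.62),
    "MAXN": (159.5, 24.72),
}

speed_vs_15w = pct_gain(MODES["25W"][0], MODES["15W"][0])
eff_vs_15w = pct_gain(MODES["25W"][1], MODES["15W"][1])
eff_vs_maxn = pct_gain(MODES["25W"][1], MODES["MAXN"][1])

print(round(speed_vs_15w, 1))  # 44.0
print(round(eff_vs_15w, 1))    # 7.4
print(round(eff_vs_maxn, 1))   # 19.8
print(pareto_front(MODES))     # ['25W']
```

For this model, 25W is the only mode that no other mode beats on both throughput and efficiency.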

Backend finding: llama.cpp outperforms Ollama by 36-74 % on throughput for sub-1B transformer models, with proportionally higher tok/J. Qwen3-0.6B and Llama3.2-1B are the exceptions: they are nearly identical across backends (~1-6 % difference at all four power modes). LFM2.5-350M suffers most under Ollama (3.35× slower than llama.cpp at 15W, 4.2× at 25W).

Sub-1B standouts at 25W llama.cpp:

SmolLM2-135M - 165.2 tok/s, 29.6 output tok/J (best in suite), 101 MB, ~5.6 W: runs on a USB-C power bank

LFM2.5-350M - 115.4 tok/s in only 219 MB: competitive with SmolLM2-360M (369 MB) at 60 % of its size

~1B class at 25W llama.cpp (ctx=2048, gen=256):

LFM2.5-1.2B leads on throughput (54.1 tok/s, 15 % ahead of Llama3.2-1B, 33 % ahead of Gemma3-1B) in the smallest footprint (698 MB)

Gemma3-1B edges ahead on total tok/J (118.5 vs 116.2) thanks to lower power draw (6.82 W vs 8.52 W)

Throughput winner at each mode (ctx=2048, gen=256, highest sweep point):

Table 1: Throughput and efficiency winner at each power mode (ctx=2048, gen=256)

llama.cpp / CUDA:

Mode<br>Fastest model<br>Output Tok/s<br>Output Tok/J†

7W<br>smollm2-135m<br>53.8<br>27.0 ‡

15W<br>smollm2-135m<br>114.7<br>27.58

25W<br>smollm2-135m<br>165.2<br>29.62

MAXN<br>smollm2-135m<br>159.5<br>24.72

Ollama (matched quantizations):

Mode<br>Fastest model<br>Output Tok/s<br>Output Tok/J†

7W<br>smollm2-135m<br>36.4<br>19.21

15W<br>smollm2-135m<br>84.4<br>20.14

25W<br>smollm2-135m<br>120.6<br>21.26

MAXN<br>smollm2-135m<br>132.2<br>18.65

† Output tok/J = OSL ÷ (decode_power_W × p50_decode_s) - decode-phase energy only (corrected method).

‡ 7W llama.cpp approximated: no tegrastats retained for that run; older avg-power method...
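The two halves of Table 1 can be combined to show the backend gap per power mode for smollm2-135m. This is a sketch with illustrative names that uses only the throughput values printed in the table. It says nothing about the other seven models.

```python
# Output tok/s for smollm2-135m at ctx=2048, gen=256, copied from Table 1.
LLAMA_CPP_TOK_S = {"7W": 53.8, "15W": 114.7, "25W": 165.2, "MAXN": 159.5}
OLLAMA_TOK_S = {"7W": 36.4, "15W": 84.4, "25W": 120.6, "MAXN": 132.2}


def throughput_advantage_pct(fast, slow):
    """Percent throughput advantage of `fast` over `slow`, per power mode."""
    return {mode: (fast[mode] / slow[mode] - 1.0) * 100.0 for mode in fast}


advantage = throughput_advantage_pct(LLAMA_CPP_TOK_S, OLLAMA_TOK_S)
for mode, pct in advantage.items():
    print(f"{mode}: llama.cpp is {pct:.1f} % faster")
```

For this model the gap is roughly 36-48 % at 7W through 25W and narrows to about 21 % at MAXN, where Ollama's throughput keeps rising while llama.cpp's dips slightly below its 25W figure.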

