Softmax-free ~354M: tile-skip kernels for long-context VRAM savings (sparse)


Tripstoph/RRT-Foundation · Hugging Face


RRT-355M — softmax-free attention at GPT-2 Medium scale

Headline result: a GPT-2 Medium-shaped checkpoint (~354 M parameters) trained from scratch without softmax, evaluated on a standardized 22-task in-context learning benchmark, with open kernels whose sparse inference is bit-identical to dense inference on this checkpoint.

This Hugging Face repo ships weights, config, and substrate constants only. Inference requires the RRT engine on GitHub (RRT-LLM-FOUNDATION, AGPL-3.0). Stock transformers GPT-2 will produce incorrect outputs.

Training is complete. No additional checkpoints are planned from this repository.

Capability evaluation (22-task CORE)

| Model | CORE | Notes |
|---|---|---|
| GPT-2 124M | 0.1211 | floor reference, same harness |
| GPT-2 medium | 0.1770 | dense softmax foil, matched scale |
| RRT-355M | 0.1558 | softmax-free, this checkpoint |
| Pythia 410M | 0.1895 | modern baseline, same harness |

CORE is the mean centered accuracy across 22 in-context learning tasks (DCLM protocol, Karpathy nanochat eval_bundle). RRT-355M scores 0.021 below the GPT-2 medium foil and 0.035 above the GPT-2 124M floor: a measurable tradeoff, not a capability collapse.
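As a rough illustration of how a CORE-style number is assembled, the sketch below assumes the usual DCLM-style definition of centered accuracy, (accuracy − chance) / (1 − chance), averaged over tasks. The function names and the toy per-task numbers are hypothetical and are not taken from the evaluation harness. Only the four CORE scores are copied from the table above.

```python
from statistics import mean

def centered_accuracy(accuracy: float, chance: float) -> float:
    """Rescale so chance-level accuracy maps to 0 and perfect accuracy to 1.

    Assumption: this is the DCLM-style centering; check the harness for the
    exact per-task baselines it uses.
    """
    return (accuracy - chance) / (1.0 - chance)

def core_score(results):
    """Mean centered accuracy over (accuracy, chance) pairs, one per task."""
    return mean(centered_accuracy(acc, chance) for acc, chance in results)

# Hypothetical per-task numbers, for illustration only (not from the eval).
toy_results = [(0.40, 0.25), (0.55, 0.50), (0.10, 0.00)]
toy_core = core_score(toy_results)

# CORE scores copied from the table above; the gaps quoted in the text
# follow by subtraction.
scores = {
    "GPT-2 124M": 0.1211,
    "GPT-2 medium": 0.1770,
    "RRT-355M": 0.1558,
    "Pythia 410M": 0.1895,
}
gap_to_foil = scores["GPT-2 medium"] - scores["RRT-355M"]
gap_to_floor = scores["RRT-355M"] - scores["GPT-2 124M"]
print(f"toy CORE = {toy_core:.4f}")
print(f"below foil = {gap_to_foil:.4f}, above floor = {gap_to_floor:.4f}")
```

Because each task is centered on its own chance level before averaging, a model that only guesses scores about 0 regardless of how many answer options each task has.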

Task asymmetry (RRT − GPT-2 medium, centered score): gains on multiple-choice reasoning (arc_easy +0.12, agi_eval_lsat_ar +0.09, openbook_qa +0.07); largest regressions on continuation tasks (lambada_openai −0.16, coqa −0.13, squad −0.07).

Not evaluated: MMLU, GSM8K, HumanEval, chat/instruction benchmarks, or fine-tuned downstream tasks. Details: eval/eval_summary.json on this repo; full write-up on GitHub docs/EVALUATION.md.

Mechanism and training

| Metric | Value | Notes |
|---|---|---|
| Structural edge sparsity | 99.66 % | fidelity gate; training measurement |
| Training data | FineWeb-Edu | 11.534 B tokens, 4× H100, 22k iters |
| Best val loss (ckpt) | 2.8001 | iteration 21 000 |
| Weight file | ~1011 MB bf16 | model.safetensors |
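The training row implies a per-iteration token budget. The arithmetic below is only a back-of-the-envelope check. It assumes "22k iters" means exactly 22,000 iterations and that the batch size was constant, neither of which the card states.

```python
import math

total_tokens = 11.534e9  # "11.534 B tokens" from the table (rounded figure)
iterations = 22_000      # assumption: "22k iters" means exactly 22,000

tokens_per_iter = total_tokens / iterations
# Nearest power of two, since token batches are often sized that way.
nearest_pow2 = 2 ** round(math.log2(tokens_per_iter))

print(f"tokens per iteration ~ {tokens_per_iter:,.0f}")
print(f"nearest power of two = {nearest_pow2:,} (2^{int(math.log2(nearest_pow2))})")
print(f"implied total = {nearest_pow2 * iterations / 1e9:.3f} B tokens")
```

Under those assumptions the figures are consistent with a batch of 2^19 tokens per iteration, which reproduces the stated total to three decimals.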

Three metrics — do not conflate: (1) structural sparsity during training, (2) coarse-tile skip at inference (34–55%, long context), (3) CORE behavioral score above.

Each attention edge applies a friction term ln(max(i−j, 1)) and a gate μ = η / (1 + η^n)^(1/n) with n = 1.25. An INT8 pre-pass skips tiles that contain no active edges; the output is bit-identical to dense attention on this checkpoint. The v2 kernel matches v1 on 21 of 22 CORE tasks (Δ CORE −0.0016).
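The two formulas can be evaluated directly. The sketch below is a plain-Python reading of them, not the engine's Triton kernel. The card does not say how η is derived from the attention scores, so it is treated here as a given non-negative input, and the function names are hypothetical.

```python
import math

N_EXPONENT = 1.25  # "n = 1.25" from the text

def friction(i: int, j: int) -> float:
    """Friction term ln(max(i - j, 1)); zero for distances of 0 or 1."""
    return math.log(max(i - j, 1))

def gate(eta: float, n: float = N_EXPONENT) -> float:
    """Gate mu = eta / (1 + eta^n)^(1/n).

    Assumption: eta >= 0. A negative eta raised to a fractional power would
    leave the reals, so this sketch rejects it.
    """
    if eta < 0:
        raise ValueError("this sketch assumes eta >= 0")
    return eta / (1.0 + eta ** n) ** (1.0 / n)

for distance in (0, 1, 8, 1024):
    print(f"friction at distance {distance:4d}: {friction(distance, 0):.4f}")

for eta in (0.0, 0.1, 1.0, 10.0, 100.0):
    print(f"gate({eta:6.1f}) = {gate(eta):.4f}")
```

For non-negative η the gate is close to η when η is small and saturates toward 1 when η is large, so it behaves like a smooth clamp; at η = 1 it equals 2^(−1/n) = 2^(−0.8).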

Systems notes (secondary)

| Metric | Value | Caveat |
|---|---|---|
| INT8 tile skip @ T=2048 / 8192 | 34% / 55% | layer-12 micro-bench, H100 |
| Kernel vs SDPA @ T=2048 | 11.5× | not end-to-end generation |
| Peak attention VRAM @ T=16384 | 5.5 GB | GPT-2 XL reference forward, RTX 3070 |
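To make the tile-skip figure concrete, here is a toy pure-Python sketch of the general idea: a pre-pass marks each tile as active or empty, and the main loop visits only active tiles. It is not the repository's INT8 Triton kernel. The banded mask, the tile size, and the definition of an "active edge" as a nonzero weight are all assumptions chosen for illustration.

```python
def build_toy(T):
    """Toy causal band: an edge (i, j) is active when 0 <= i - j < 2."""
    mask = [[0 <= i - j < 2 for j in range(T)] for i in range(T)]
    weights = [[1.0 / (1 + i + j) if mask[i][j] else 0.0 for j in range(T)]
               for i in range(T)]
    return mask, weights

def active_tiles(mask, tile):
    """Pre-pass: tiles[tr][tc] is True if the tile holds any active edge."""
    n = len(mask) // tile  # assumes T is divisible by the tile size
    return [[any(mask[r][c]
                 for r in range(tr * tile, (tr + 1) * tile)
                 for c in range(tc * tile, (tc + 1) * tile))
             for tc in range(n)]
            for tr in range(n)]

def row_sums_dense(weights):
    """Reference: accumulate every entry of every row, left to right."""
    sums = []
    for row in weights:
        acc = 0.0
        for w in row:
            acc += w
        sums.append(acc)
    return sums

def row_sums_tile_skip(weights, tiles, tile):
    """Same accumulation order, but empty tiles are never visited."""
    sums = []
    for i, row in enumerate(weights):
        acc = 0.0
        for tc, is_active in enumerate(tiles[i // tile]):
            if not is_active:
                continue
            for c in range(tc * tile, (tc + 1) * tile):
                acc += row[c]
        sums.append(acc)
    return sums

T, TILE = 8, 2
mask, weights = build_toy(T)
tiles = active_tiles(mask, TILE)
n_tiles = len(tiles) ** 2
skip_fraction = 1 - sum(map(sum, tiles)) / n_tiles
dense = row_sums_dense(weights)
sparse = row_sums_tile_skip(weights, tiles, TILE)
print(f"skipped {skip_fraction:.2%} of {n_tiles} tiles; identical: {dense == sparse}")
```

In this toy the skipped tiles contain only exact zeros and the remaining additions happen in the same order, so the two results match bit for bit. The skip fraction depends on sequence length and on where the active edges fall, which is why the table reports it per context length.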

Files in this repo

| File | Purpose |
|---|---|
| model.safetensors | bf16 weights |
| config.json | architecture metadata |
| rrt_substrate_constants.json | inference requires n_backbone, C_max only |
| eval/ | CORE summary JSON, comparison CSV, parity notes |
| figures/ | key charts from benchmark report |
| tokenizer_pointer.txt | openai-community/gpt2 BPE |

Reproduce

    git clone https://github.com/tripstoph/RRT-LLM-FOUNDATION.git
    cd RRT-LLM-FOUNDATION
    pip install -e .
    python eval/run_core_eval.py --model rrt:_state/ckpt.pt --snapshot-dir engine --seed 1337
    # Quick smoke (~minutes):
    python eval/smoke_core.py --model rrt:_state/ckpt.pt --snapshot-dir engine

Expected full CORE: 0.1558. Claims ↔ evidence: GitHub docs/CLAIMS.md.

Scope

RRT-355M validates the attention mechanism in isolation. Broader pipeline work is explored separately under Relational Autopoietic Substrate (RAS); this repository commits to no timeline and no additional model releases.

Limitations

Custom Triton engine (Hopper sm_90); not AutoModelForCausalLM

CORE below dense GPT-2 medium at matched scale

Single checkpoint; no scale-up from this repo

Speed/memory figures are kernel benchmarks with stated context

Citation

@misc{rrt-355m-2026,
  author = {Tripstoph},
  title = {RRT-355M: Softmax-free attention at GPT-2 Medium scale},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Tripstoph/RRT-Foundation}},
  note = {Proof-of-mechanism weights; engine at GitHub under AGPL-3.0.},
}

Last updated: 2026-06-21


Paper for Tripstoph/RRT-Foundation: arXiv 2406.11794 (published Jun 17, 2024)
