Orthrus-Qwen3: up to 7.8× tokens/forward on Qwen3, identical output distribution

chiennv2000/orthrus: Fast, lossless LLM inference via dual-view diffusion decoding.

Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

Official implementation and model checkpoints for Orthrus, a dual-architecture framework that combines the exact generation fidelity of autoregressive Large Language Models (LLMs) with the high-speed parallel token generation of diffusion models.

Model Zoo

All models use a Qwen3 backbone and guarantee strictly lossless generation.

| Model | Base Model | HuggingFace | Avg. Speedup |
|---|---|---|---|
| Orthrus-Qwen3-1.7B | Qwen3-1.7B | 🤗 HuggingFace | 4.25× |
| Orthrus-Qwen3-4B | Qwen3-4.0B | 🤗 HuggingFace | 5.20× |
| Orthrus-Qwen3-8B | Qwen3-8.0B | 🤗 HuggingFace | 5.36× |

Installation

    uv pip install -e .
    uv pip install ninja packaging
    uv pip install flash-attn --no-build-isolation  # or: pip install "flash-attn-4[cu13]" if your device supports it

We recommend uv for fast dependency resolution.

Quickstart

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

    # Load the Orthrus checkpoint; trust_remote_code is needed for the custom decoding logic.
    model = AutoModelForCausalLM.from_pretrained(
        "chiennv/Orthrus-Qwen3-8B",
        dtype=torch.bfloat16,
        device_map="cuda",
        attn_implementation="flash_attention_2",  # use flash_attention_4 if your system supports it
        trust_remote_code=True,
    ).eval()
    tokenizer = AutoTokenizer.from_pretrained("chiennv/Orthrus-Qwen3-8B")

    prompt = "Write a program to count the frequency of each word in a paragraph."
    messages = [{"role": "system", "content": ""}, {"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages,
        return_tensors="pt",
        add_generation_prompt=True,
        enable_thinking=False,
        return_dict=True,  # return a BatchEncoding so that .input_ids is available
    ).input_ids

    output_ids = model.generate(
        input_ids=input_ids.to(model.device),
        max_new_tokens=2048,
        use_diffusion_mode=True,
        streamer=TextStreamer(tokenizer, skip_prompt=True),  # enable streaming generation
    )
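To check the speedup on your own hardware, a small timing helper can wrap any generation call. This is a minimal sketch, not part of the Orthrus repository: `tokens_per_second` and its `generate_new_tokens` argument are hypothetical names, and the callable is assumed to run one generation and return the number of newly generated tokens.

```python
import time
from typing import Callable


def tokens_per_second(
    generate_new_tokens: Callable[[], int],
    warmup: int = 1,
    runs: int = 3,
) -> float:
    """Average decoding throughput of a generation callable.

    generate_new_tokens is a hypothetical zero-argument callable that runs
    one generation and returns how many new tokens it produced.
    """
    # Warm-up runs absorb one-time costs (kernel compilation, cache allocation).
    for _ in range(warmup):
        generate_new_tokens()

    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        total_tokens += generate_new_tokens()
    elapsed = time.perf_counter() - start

    # Guard against a zero interval when the callable is trivially fast.
    return total_tokens / max(elapsed, 1e-9)
```

With the Quickstart objects, the callable could run `model.generate(...)` and return `output_ids.shape[-1] - input_ids.shape[-1]`. On a GPU, call `torch.cuda.synchronize()` inside the callable before returning so that asynchronous kernels are included in the measurement. Whether `use_diffusion_mode=False` falls back to plain autoregressive decoding for a baseline is an assumption to verify against the repository.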

Coming soon: native integration with vLLM and SGLang. Stay tuned!

Key Advantages

Significant Inference Acceleration: Breaks the sequential bottleneck of standard autoregressive decoding, delivering up to a $7.8\times$ speedup on generation tasks.

Strictly Lossless Generation: Employs an exact intra-model consensus mechanism to guarantee that the output matches the original base model's exact predictive distribution.

Zero Redundant Memory Overhead: Both the autoregressive and diffusion views attend to the same high-fidelity Key-Value (KV) cache natively, so the additional cache memory is only $O(1)$.

Parameter Efficient: Parallel generation capabilities are injected by fine-tuning only 16% of the total model parameters while keeping the base LLM strictly frozen.
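To make the lossless claim concrete, here is a minimal sketch of the generic draft-and-verify rule under greedy decoding. It is an illustration only: the README does not spell out the consensus mechanism, so the function name `accept_longest_prefix` and the exact acceptance rule are assumptions, borrowed from standard speculative verification. The idea is that tokens proposed in parallel are kept only while they agree with what the autoregressive view would have chosen, so the final sequence equals the one plain autoregressive decoding would produce.

```python
from typing import List, Sequence


def accept_longest_prefix(draft: Sequence[int], verified: Sequence[int]) -> List[int]:
    """Keep the drafted tokens that the autoregressive view agrees with.

    draft:    k tokens proposed in parallel (hypothetical diffusion-view output).
    verified: k + 1 tokens, where verified[i] is the autoregressive view's
              greedy choice at position i given the prompt plus draft[:i].
    """
    if len(verified) != len(draft) + 1:
        raise ValueError("verified must contain exactly one more token than draft")

    accepted: List[int] = []
    for i, token in enumerate(draft):
        if token != verified[i]:
            break  # first disagreement: discard this and all later draft tokens
        accepted.append(token)

    # The verifying pass always contributes one token of its own: the
    # correction at the first mismatch, or a bonus token if all drafts matched.
    accepted.append(verified[len(accepted)])
    return accepted
```

For example, with `draft = [5, 6, 7]` and `verified = [5, 6, 9, 4]`, the first two drafted tokens are accepted and the third is replaced by the autoregressive choice, giving three tokens from a single verification pass. Sampling-based decoding would need a probabilistic acceptance rule instead of exact matching.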

Performance Comparison: Orthrus vs. Speculative Decoding

Orthrus outperforms speculative decoding methods such as EAGLE-3 and DFlash. By natively sharing the exact same KV cache across dual views, Orthrus avoids the redundant memory overhead of draft models, resulting in significantly higher token acceptance rates and faster inference times, especially as context length scales.
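The memory argument can be sized with back-of-the-envelope arithmetic. The sketch below uses the standard KV cache formula (two tensors per layer, one for keys and one for values). The layer counts and head sizes are illustrative assumptions for an 8B-class model and a hypothetical separate drafter, not values read from the Orthrus, EAGLE-3, or DFlash checkpoints.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_elem: int = 2,  # bfloat16
    batch_size: int = 1,
) -> int:
    """Size of a KV cache: keys and values for every layer, head, and position."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem * batch_size


def to_gib(num_bytes: int) -> float:
    return num_bytes / 2**30


# Assumed, illustrative configurations (not taken from any released checkpoint).
TARGET = dict(num_layers=36, num_kv_heads=8, head_dim=128)
DRAFTER = dict(num_layers=4, num_kv_heads=8, head_dim=128)  # hypothetical drafter with its own cache

for seq_len in (4096, 32768, 131072):
    shared = kv_cache_bytes(seq_len=seq_len, **TARGET)
    extra = kv_cache_bytes(seq_len=seq_len, **DRAFTER)
    print(f"{seq_len:>7} tokens: target {to_gib(shared):.2f} GiB, separate drafter adds {to_gib(extra):.2f} GiB")
```

Under these assumptions, a separate drafter's cache grows linearly with context length, whereas two views reading one shared cache add no second per-token cache.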

Left: Average verified tokens per forward pass compared to EAGLE-3 and DFlash. Right: Simulated generation time across scaling context lengths compared to DFlash.
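For intuition about the "verified tokens per forward pass" metric, the following sketch uses the standard simplified model from the speculative decoding literature. It assumes each drafted token is accepted independently with the same probability, that verification stops at the first rejection, and that every forward pass emits one token from the verifying view. These are modeling assumptions for illustration, not Orthrus measurements, and `expected_tokens_per_forward` is a hypothetical helper name.

```python
def expected_tokens_per_forward(accept_prob: float, block_size: int) -> float:
    """Expected tokens emitted per verification pass under i.i.d. acceptance.

    With acceptance probability p and k drafted tokens, the expected count is
    1 + p + p^2 + ... + p^k = (1 - p^(k + 1)) / (1 - p).
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if block_size < 0:
        raise ValueError("block_size must be non-negative")
    if accept_prob == 1.0:
        # Every draft is accepted, plus the one token from the verifying view.
        return float(block_size + 1)
    return (1.0 - accept_prob ** (block_size + 1)) / (1.0 - accept_prob)
```

The formula shows why acceptance rate matters more than block size once blocks are long: at an acceptance probability of 0.5 the expectation can never exceed 2 tokens per pass, no matter how many tokens are drafted.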

Comparison with State-of-the-Art Diffusion Models

While recent diffusion language models (dLLMs) offer parallel decoding, they often suffer from significant conditional drift and severe accuracy degradation on complex reasoning tasks. Orthrus resolves this by decoupling parallel generation from sequential constraints, establishing a new state-of-the-art for parallel generation fidelity.

Throughput vs. Accuracy on MATH-500. Orthrus delivers a ~6x speedup over...
