GitHub - chiennv2000/orthrus: Fast, lossless LLM inference via dual-view diffusion decoding. · GitHub
Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion
Official implementation and model checkpoints for Orthrus, a dual-architecture framework that unifies the exact generation fidelity of autoregressive Large Language Models (LLMs) with the high-speed parallel token generation of diffusion models.
Demo video: demo_orthrus.mp4
Model Zoo
All models use a Qwen3 backbone and guarantee strictly lossless generation.

| Model | Base Model | HuggingFace | Avg. Speedup |
| --- | --- | --- | --- |
| Orthrus-Qwen3-1.7B | Qwen3-1.7B | 🤗 HuggingFace | 4.25× |
| Orthrus-Qwen3-4B | Qwen3-4.0B | 🤗 HuggingFace | 5.20× |
| Orthrus-Qwen3-8B | Qwen3-8.0B | 🤗 HuggingFace | 5.36× |
Installation
```bash
uv pip install -e .
uv pip install ninja packaging
uv pip install flash-attn --no-build-isolation  # or: pip install "flash-attn-4[cu13]" if your device supports it
```
We recommend uv for fast dependency resolution.
Quickstart
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

# Load the Orthrus checkpoint; trust_remote_code is needed for the custom model code.
model = AutoModelForCausalLM.from_pretrained(
    "chiennv/Orthrus-Qwen3-8B",
    dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="flash_attention_2",  # use flash_attention_4 if your system supports it
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained("chiennv/Orthrus-Qwen3-8B")

prompt = "Write a program to count the frequency of each word in a paragraph."
messages = [{"role": "system", "content": ""}, {"role": "user", "content": prompt}]
# return_dict=True makes the call return a mapping with an `input_ids` entry
# regardless of the installed transformers version.
input_ids = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    return_dict=True,
    add_generation_prompt=True,
    enable_thinking=False,
).input_ids

output_ids = model.generate(
    input_ids=input_ids.to(model.device),
    max_new_tokens=2048,
    use_diffusion_mode=True,
    streamer=TextStreamer(tokenizer, skip_prompt=True),  # enable streaming generation
)
```
Coming soon: native integration with vLLM and SGLang. Stay tuned!
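To sanity-check the lossless claim on your own prompts, one option is to generate the same prompt twice and compare token IDs. The helper below is a minimal sketch, not part of the repository: `first_mismatch` and `check_lossless` are hypothetical names, and it assumes that `use_diffusion_mode=False` falls back to plain autoregressive decoding and that decoding is greedy, neither of which this README states explicitly.

```python
def first_mismatch(a, b):
    """Return the index of the first differing token, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One sequence is a strict prefix of the other.
    return None if len(a) == len(b) else min(len(a), len(b))


def check_lossless(generate, input_ids, **gen_kwargs):
    """Run `generate` in both modes and report where the outputs diverge.

    `generate` is any callable that accepts `input_ids`, `use_diffusion_mode`
    and extra keyword arguments, and returns a batch of token sequences
    (a tensor or a list of lists). ASSUMPTION: use_diffusion_mode=False
    selects ordinary autoregressive decoding.
    """
    def to_list(batch):
        first = batch[0]
        return first.tolist() if hasattr(first, "tolist") else list(first)

    reference = generate(input_ids=input_ids, use_diffusion_mode=False, **gen_kwargs)
    parallel = generate(input_ids=input_ids, use_diffusion_mode=True, **gen_kwargs)
    return first_mismatch(to_list(reference), to_list(parallel))
```

With the Quickstart objects loaded, a call such as `check_lossless(model.generate, input_ids.to(model.device), max_new_tokens=256, do_sample=False)` would return `None` when both runs agree token for token. Low-precision numerics can occasionally flip near-ties, so a rare mismatch does not by itself disprove the claim.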
Key Advantages
- Significant Inference Acceleration: Breaks the sequential bottleneck of standard autoregressive decoding, delivering up to a $7.8\times$ speedup on generation tasks.
- Strictly Lossless Generation: Employs an exact intra-model consensus mechanism to guarantee that the output matches the original base model's exact predictive distribution.
- Zero Redundant Memory Overhead: Both the autoregressive and diffusion views attend natively to the same high-fidelity Key-Value (KV) cache, so the additional cache memory is only $O(1)$.
- Parameter Efficient: Parallel generation capabilities are injected by fine-tuning only 16% of the total model parameters while keeping the base LLM strictly frozen.
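To illustrate why parallel drafting can stay lossless, here is a toy sketch of a generic greedy draft-and-verify loop. It is not Orthrus's consensus mechanism, which this README does not detail. `toy_next`, `toy_draft`, and `draft_and_verify` are hypothetical stand-ins: the "models" are simple integer functions, and the verifier loop stands in for what a real model would score in one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int


def toy_next(seq: List[Token]) -> Token:
    """Stand-in for the reference model's greedy next-token choice."""
    return (3 * seq[-1] + len(seq)) % 11


def toy_draft(seq: List[Token], k: int) -> List[Token]:
    """Stand-in parallel drafter: proposes k tokens, with deliberate errors."""
    ctx, out = list(seq), []
    for _ in range(k):
        tok = toy_next(ctx)
        if len(ctx) % 5 == 0:  # inject a wrong guess at some positions
            tok = (tok + 1) % 11
        out.append(tok)
        ctx.append(tok)
    return out


def ar_decode(next_token: Callable, prompt: List[Token], n: int) -> List[Token]:
    """Plain sequential decoding: one reference call per generated token."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def draft_and_verify(next_token: Callable, draft_block: Callable,
                     prompt: List[Token], n: int, block: int = 4
                     ) -> Tuple[List[Token], int]:
    """Accept the longest draft prefix the reference agrees with."""
    seq, passes = list(prompt), 0
    while len(seq) - len(prompt) < n:
        draft = draft_block(seq, block)
        passes += 1  # one draft + verify round
        accepted: List[Token] = []
        for tok in draft:
            target = next_token(seq + accepted)
            accepted.append(target)  # always keep the reference token
            if tok != target:
                break  # first disagreement: drop the rest of the draft
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + n], passes


reference = ar_decode(toy_next, [1, 2], 20)
fast, rounds = draft_and_verify(toy_next, toy_draft, [1, 2], 20)
print(fast == reference, rounds)
```

Because every kept token is the reference model's own choice, the output matches sequential decoding by construction. Any speedup comes only from how many drafted tokens are accepted per round.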
Performance Comparison: Orthrus vs. Speculative Decoding
Orthrus outperforms speculative decoding methods such as EAGLE-3 and DFlash. By natively sharing the exact same KV cache across dual views, Orthrus avoids the redundant memory overhead of draft models, resulting in significantly higher token acceptance rates and faster inference times, especially as context length grows.
Left: Average verified tokens per forward pass compared to EAGLE-3 and DFlash. Right: Simulated generation time across scaling context lengths compared to DFlash.
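To get a feel for why a separate cache matters as context grows, the following back-of-the-envelope sketch computes the size of a standard KV cache. The formula is the usual one for attention caches. The layer counts, head counts, and the draft-model shape are illustrative assumptions, not figures taken from this repository or from EAGLE-3 or DFlash.

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence.

    The leading factor 2 accounts for one key and one value tensor per layer;
    bytes_per_elem=2 corresponds to bf16/fp16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# ASSUMED shapes, for illustration only.
target = dict(num_layers=36, num_kv_heads=8, head_dim=128)  # an 8B-class GQA model
draft = dict(num_layers=4, num_kv_heads=8, head_dim=128)    # hypothetical draft model

for context in (4_096, 32_768, 131_072):
    shared = kv_cache_bytes(context, **target)
    extra = kv_cache_bytes(context, **draft)  # second cache a draft model would keep
    print(f"{context:>7} tokens: target {shared / 2**30:.2f} GiB, "
          f"separate draft cache +{extra / 2**30:.2f} GiB")
```

Under these assumptions the extra draft cache grows linearly with context length, whereas a design in which both views read the same cache adds no per-token cache cost.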
Comparison with State-of-the-Art Diffusion Models
While recent diffusion language models (dLLMs) offer parallel decoding, they often suffer from significant conditional drift and severe accuracy degradation on complex reasoning tasks. Orthrus resolves this by decoupling parallel generation from sequential constraints, establishing a new state-of-the-art for parallel generation fidelity.
Throughput vs. Accuracy on MATH-500. Orthrus delivers a ~6x speedup over...