Show HN: We built an LLM inference engine in pure Python – no PyTorch, no Triton

zyoraclub1 pts0 comments

Release v2.0.0 — The Own Everything Release · Zyora-Dev/zse · GitHub

//releases/show" data-turbo-transient="true" />

Skip to content

Search or jump to...

Search code, repositories, users, issues, pull requests...

-->

Search

Clear

Search syntax tips

Provide feedback

--><br>We read every piece of feedback, and take your input very seriously.

Include my email address so I can be contacted

Cancel

Submit feedback

Saved searches

Use saved searches to filter your results more quickly

-->

Name

Query

To see all available qualifiers, see our documentation.

Cancel

Create saved search

Sign in

//releases/show;ref_cta:Sign up;ref_loc:header logged out"}"<br>Sign up

Appearance settings

Resetting focus

You signed in with another tab or window. Reload to refresh your session.<br>You signed out in another tab or window. Reload to refresh your session.<br>You switched accounts on another tab or window. Reload to refresh your session.

Dismiss alert

{{ message }}

Zyora-Dev

zse

Public

Notifications<br>You must be signed in to change notification settings

Fork

Star<br>151

v2.0.0 — The Own Everything Release

Latest

Latest

Compare

Choose a tag to compare

Sorry, something went wrong.

Filter

Loading

Sorry, something went wrong.

Uh oh!

There was an error while loading. Please reload this page.

No results found

View all tags

zyoraclub

released this

22 May 11:47

&middot;

4 commits

to main<br>since this release

v2.0.0

55a9742

ZSE v2.0.0 — The "Own Everything" Release

A complete rewrite. Zero third-party dependencies. No PyTorch, no Triton, no transformers, no bitsandbytes. Pure-Python kernel compiler emits CUDA C, HIP C, and Metal Shading Language directly.

Install size: ~3 GB → ~5 MB .

Headline numbers (Qwen2.5-14B INT4 vs vLLM AWQ INT4 on A100-80GB)

Metric<br>ZSE<br>vLLM

Cold start<br>6.29s<br>127.02s<br>20.2×

VRAM used<br>12.28 GB<br>71.45 GB<br>5.82× less

Single-seq tok/s<br>37.0<br>26.5<br>1.40×

Validated on 6 platforms

GPU<br>Cold start<br>vs vLLM AWQ INT4 cold

NVIDIA T4 (sm_75)<br>7.25s<br>30.2× faster

NVIDIA L4 (sm_89)<br>5.58s<br>26.0× faster

NVIDIA A10G (sm_86)<br>6.01s<br>32.1× faster

NVIDIA A100-80GB<br>6.29s<br>20.2× faster

AMD MI300X<br>3.14s<br>13.6× faster (vs vLLM-ROCm FP16)

Apple M1<br>E2E vector_add validated, full inference pending

Install

pip install zse-engine<br>zse serve model.zse --port 8000

Or run the kernel compiler standalone:

pip install zse-compiler

What's in this release

ZSE Kernel Compiler — @zse.kernel Python DSL → CUDA / HIP / Metal. Warp primitives, vectorized memory, block reductions, tiling, fusion, WMMA, CDNA3 MFMA matrix cores, auto-tuning.

.zse model format v2 — pre-quantized INT4/INT8/FP16, mmap-friendly, C-accelerated quantization (~600× faster). Adapters for Llama / Mistral / Qwen2 / Gemma2 / Phi3.

Own PagedAttention — adaptive block sizing, token-level eviction, FNV-1a dedup, COW forking.

ZStreamer — continuous batching, disaggregated prefill/decode, chunked prefill, speculative decoding (n-gram + self-draft).

Orchestrator — unified VRAM allocator, 29 GPU kernels on MI300X, CUDA Graphs + HIP Graphs, LoRA hot-swap.

Server — OpenAI-compatible API, API key auth, rate limiting, SQLite store, built-in RAG (/v1/rag/*), web dashboard.

RAG — BM25 + TF-IDF + dense embeddings (via the loaded LLM, zero extra deps) + Reciprocal Rank Fusion + LLM cross-encoder rerank.

Tensor Parallelism — pure-ctypes NCCL/RCCL wrapper, multi-process workers.

Breaking changes

Package rename: zllm-zse → zse-engine on PyPI

Module rename: zse → zse_engine

.zse format v2 is incompatible with 1.x — re-convert with zse convert

bnb / bitsandbytes backend removed

PyTorch / Triton / transformers dependencies removed

Full migration guide and detailed change log: CHANGELOG.md

Acknowledgments

AMD MI300X validation, 32B-parameter benchmarks, and our ROCm wave-64 kernel development were made possible by DigitalOcean's Open Source Sponsorship Program .

447 tests passing. Zero dependencies. Three GPU backends. One package.

Assets

Loading

Uh oh!

There was an error while loading. Please reload this page.

-->

All reactions

You can’t perform that action at this time.

release faster search reload kernel loading

Related Articles